“Fix Verification Isn’t a Code Review. It’s a Risk Decision.”

by Mona Ghadiri

Part 1 of this series ended with a question: for your last ten AI-remediated findings, was the fix scanned before it merged?

If that question made you uncomfortable, this part is for you.

The instinct most security teams reach for is more review. Better review. Senior engineer on the diff. Mandatory approval gate. That instinct is wrong – not because review doesn’t matter, but because it’s answering the wrong question.

A code review tells you whether a fix looks correct. It does not tell you whether the fix is safe.

Those are different questions. Conflating them is where AI-assisted remediation workflows break down.

Two Ways a Fix Can Fail

When an AI proposes a remediation, there are two distinct failure modes. Most verification processes are only built to catch one of them.

Failure mode one: semantic incorrectness.
The fix is syntactically valid and passes tests, but doesn’t actually resolve the root cause. It addresses the symptom. A reviewer with a strong mental model of the code might catch this. A reviewer working with AI-generated code they didn’t write and don’t fully understand probably won’t. The diff looks clean. The underlying logic is still broken.

Failure mode two: fix-introduced vulnerability.
The remediation itself contains new exploitable code. This is the failure mode Part 1 introduced. It doesn’t require the original fix to be wrong – the fix can correctly address the original finding and still introduce something new. A code review is not designed to catch this. Catching it requires the same class of agentic scanning that caught the original vulnerability.

Traditional CTEM programs have implicit controls for failure mode one – code review, peer approval, testing. They have almost no controls for failure mode two, because until recently the remediation was human-written code that went through the same development process as everything else. The assumption that a fix needed no verification of its own was reasonable.

It isn’t anymore.

Why “Have a Human Review It” Isn’t a Control

A control is only as strong as the assumption it rests on.

Code review as a verification mechanism rests on this assumption: the reviewer has sufficient comprehension of the code to independently evaluate the fix. For human-written code in a well-understood codebase, that assumption usually holds.

For AI-generated code fixing AI-introduced vulnerabilities, it holds less often than security teams realize. The gap widens as codebases scale, as AI-generated code increases as a percentage of total volume, and as the vulnerability classes being fixed move from well-understood patterns toward complex business logic.

The reviewer is not failing at their job. They are being asked to verify something using a tool – human judgment – that wasn’t designed for the input they’re receiving.

This is not an argument against code review. It’s an argument that code review is necessary but not sufficient, and that treating it as sufficient is a control gap in your CTEM mobilization stage.

The Tier Model

Not every AI-proposed fix carries the same verification burden. Treating them all the same wastes capacity on low-risk remediations and under-invests in high-risk ones.

The decision rule is straightforward: the verification tier is determined by the combination of fix complexity and asset criticality.

Tier 1: Automated rescan gate

Applies when:

The vulnerability class is well-understood (dependency version bump, known injection pattern, standard cryptographic weakness)
The fix pattern is established and the change is bounded in scope
The asset is not internet-exposed or does not have a path to sensitive data

Verification mechanism: MDASH runs as a post-fix pipeline gate before PR merge. The fix does not merge until it passes a clean scan. No human escalation required beyond standard code review.

This covers the majority of AI-proposed fixes by volume. It’s fast, it’s automated, and it closes failure mode two without adding review overhead.
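
The gate logic described above can be sketched in a few lines. This is a minimal illustration, not a description of how MDASH reports results: the ScanResult shape and the rescan_gate function are hypothetical stand-ins for whatever your scanner and pipeline actually expose. The one design choice worth copying is that the gate fails closed, so a scan that never finished blocks the merge just as a scan with findings does.

```python
from dataclasses import dataclass, field


@dataclass
class ScanResult:
    """Hypothetical summary of a post-fix scan of the PR branch."""
    completed: bool  # did the scan run to completion?
    findings: list[str] = field(default_factory=list)  # anything the scan flagged


def rescan_gate(result: ScanResult) -> tuple[bool, str]:
    """Return (may_merge, reason). Only a completed, clean scan allows a merge."""
    if not result.completed:
        # Fail closed: an incomplete scan is not evidence that the fix is safe.
        return False, "scan did not complete"
    if result.findings:
        return False, f"post-fix scan reported {len(result.findings)} finding(s)"
    return True, "clean post-fix scan"
```

In a pipeline, the boolean would typically map to the exit status of the gate step, with the reason written to the job log so the block is explainable after the fact.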

Tier 2: Senior practitioner review

Applies when any of the following are true:

The fix touches authentication, authorization, or session management code
The underlying code is AI-generated and no human has a complete mental model of the logic
The asset is internet-exposed with a path to sensitive data
The vulnerability class involves business logic rather than a known pattern
The fix scope is broad – multiple files, multiple functions, architectural change

Verification mechanism: Automated rescan gate plus explicit senior practitioner sign-off against defined criteria. The practitioner is not reviewing the diff for correctness alone – they are making a risk judgment about whether the fix complexity exceeds the team’s ability to independently verify semantic correctness.

If it does, that’s an escalation, not a failure. It means the finding stays open until the verification burden can be met.

The Decision Rule in Practice

The tier assignment question is: does a human reviewer have sufficient comprehension of this code to independently evaluate whether the fix is semantically correct?

If yes, Tier 1 with automated rescan is sufficient.

If no, or if the asset criticality means the cost of being wrong is high, Tier 2.

This sounds simple. The operational challenge is that most teams don’t have a defined threshold for “sufficient comprehension.” They have a vague norm that senior engineers review complex changes. That norm doesn’t translate into a repeatable decision rule, and it doesn’t account for AI-generated code specifically.

Define the threshold explicitly. Write it down. Make it part of your CTEM mobilization criteria, not a judgment call made differently by every reviewer.
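
One way to write the threshold down is as code. The sketch below encodes the Tier 2 triggers listed earlier as explicit fields, so the tier is computed the same way for every finding. The field names, the FixContext type, and the assign_tier function are illustrative assumptions. How each field gets answered (by whom, on what evidence) is the part your team still has to define.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AUTOMATED_RESCAN = 1  # Tier 1: rescan gate plus standard code review
    SENIOR_REVIEW = 2     # Tier 2: rescan gate plus senior practitioner sign-off


@dataclass(frozen=True)
class FixContext:
    """Hypothetical answers recorded for one AI-proposed fix."""
    touches_auth_or_session: bool
    ai_generated_without_human_model: bool
    internet_exposed: bool
    path_to_sensitive_data: bool
    business_logic_vulnerability: bool
    broad_scope: bool


def tier_2_reasons(fix: FixContext) -> list[str]:
    """List every Tier 2 trigger that applies; an empty list means Tier 1."""
    reasons = []
    if fix.touches_auth_or_session:
        reasons.append("fix touches authentication, authorization, or session management")
    if fix.ai_generated_without_human_model:
        reasons.append("underlying code is AI-generated with no complete human mental model")
    if fix.internet_exposed and fix.path_to_sensitive_data:
        reasons.append("asset is internet-exposed with a path to sensitive data")
    if fix.business_logic_vulnerability:
        reasons.append("vulnerability class involves business logic")
    if fix.broad_scope:
        reasons.append("fix scope is broad")
    return reasons


def assign_tier(fix: FixContext) -> Tier:
    """Any single trigger is enough to require Tier 2."""
    return Tier.SENIOR_REVIEW if tier_2_reasons(fix) else Tier.AUTOMATED_RESCAN
```

Returning the reasons alongside the tier keeps the decision auditable: the record shows not just that a finding went to Tier 2, but which criterion sent it there.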

What Mobilization Complete Looks Like Under This Model

Under Tier 1: finding resolved, fix deployed, post-fix rescan clean. Three conditions, all automated, documented in the pipeline.

Under Tier 2: finding resolved, fix deployed, post-fix rescan clean, senior practitioner sign-off recorded against defined criteria. Four conditions. The fourth one is a human judgment, but it’s a specific judgment with a specific scope – not “does this look right” but “have the Tier 2 criteria been met.”

The difference matters. A defined judgment is auditable. A vague norm isn’t.
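
The conditions above translate directly into a completeness check. The following is a sketch under stated assumptions: FindingRecord, SignOff, and the criteria names are hypothetical, and a real record would live in your ticketing or pipeline system rather than in a dataclass. The sign-off is modeled as a set of named criteria, each marked met or not, rather than a single approval flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SignOff:
    """Hypothetical record of a senior practitioner's Tier 2 judgment."""
    practitioner: str
    criteria_met: dict[str, bool]  # criterion name -> whether it was satisfied


@dataclass
class FindingRecord:
    tier: int  # 1 or 2
    finding_resolved: bool
    fix_deployed: bool
    rescan_clean: bool
    sign_off: Optional[SignOff] = None


def unmet_conditions(record: FindingRecord) -> list[str]:
    """Return the mobilization conditions that are still open for this finding."""
    unmet = []
    if not record.finding_resolved:
        unmet.append("finding resolved")
    if not record.fix_deployed:
        unmet.append("fix deployed")
    if not record.rescan_clean:
        unmet.append("post-fix rescan clean")
    if record.tier == 2:
        sign_off = record.sign_off
        if sign_off is None:
            unmet.append("senior practitioner sign-off recorded")
        elif not sign_off.criteria_met or not all(sign_off.criteria_met.values()):
            # A sign-off with no criteria, or with any criterion unmet, does not count.
            unmet.append("sign-off recorded against defined criteria")
    return unmet


def mobilization_complete(record: FindingRecord) -> bool:
    return not unmet_conditions(record)
```

A Tier 2 finding with a clean rescan but no recorded sign-off stays open under this check, which matches the escalation rule: the finding is not closed until the verification burden has been met.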

The Staffing Question Nobody Is Asking Yet

Tier 2 requires senior practitioners with both security depth and code comprehension. That is a specific skill profile. It is not the same as a senior SOC analyst. It is not the same as a senior developer who hasn’t worked in security.

Most organizations don’t have enough of these people to run Tier 2 verification at scale. That’s not an argument against the model – it’s a reason to be honest about what you can actually run internally versus what requires a different resourcing approach.

Part 3 of this series covers exactly that question: who owns the second loop, and what does it take to close it reliably?

This is a three-part series on what CTEM’s mobilization stage gets wrong in an AI-assisted development environment, and what to build instead.

Part 1: AI proposed your last fix. Did anyone scan it?
Part 3: Who Closes the Second Loop?