AI Security

Copilot Code Review Security: What It Misses

Copilot's code review is useful. It is also not a security review, and treating it as one is how vulnerabilities ship. Here is what it actually catches.

Shadab Khan
Security Engineer
7 min read

GitHub Copilot's code review feature has matured into something genuinely useful in 2026. It catches real bugs, surfaces maintainability issues, and gives junior engineers a faster feedback loop than waiting on a human reviewer. What it is not, despite occasional marketing suggestions to the contrary, is a security review. Treating it as one is a predictable way to ship vulnerabilities that a purpose-built SAST tool would have caught. This post covers where Copilot helps, where it misses, and how to integrate it without generating false confidence.

What does Copilot code review actually catch well?

Local correctness issues with clear signals in the diff. Null dereferences where the code clearly walked into one, obvious off-by-one errors, unhandled error returns in a pattern the model has seen a hundred thousand times, typos in test assertions. For these, the LLM's pattern-matching is strong and the noise is low. A reviewer who skims Copilot's comments will catch a non-trivial share of real bugs that a careful human would also have caught, just faster.
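To make that concrete, here is a hypothetical, diff-sized snippet with exactly that kind of local signal. Both bugs are invented for illustration, and both are the sort a pattern-matching reviewer flags reliably:

```python
# Hypothetical diff-sized snippet with the kind of local bugs a
# pattern-matching reviewer catches: an off-by-one and a swallowed error.

def last_n_lines(path: str, n: int) -> list[str]:
    with open(path) as f:
        lines = f.readlines()
    # Off-by-one: this drops a line; the correct slice is lines[-n:].
    return lines[-n + 1:]

def parse_port(raw: str) -> int:
    try:
        return int(raw)
    except ValueError:
        # Swallowed error: bad input silently becomes port 0 instead of
        # being surfaced to the caller.
        return 0
```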

It is also reasonable at surface-level security hygiene in contexts it has strong training for: a hardcoded password, an obvious SQL string concatenation, a call to eval on user input. These are the "shallow SAST" findings, and for languages and frameworks Copilot has deep familiarity with, the tool handles them like a competent junior reviewer. Teams report catching a meaningful number of these during PR review, which is better than catching them in production.
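A hypothetical sample of those shallow findings, each with a clear lexical signal right in the diff, which is exactly why a pattern matcher catches them:

```python
# Hypothetical examples of "shallow SAST" findings. Each one has an
# obvious lexical signal in the diff itself.

import sqlite3

DB_PASSWORD = "hunter2"  # hardcoded credential: flagged on sight

def find_user(conn: sqlite3.Connection, username: str):
    # String-concatenated SQL: the classic injection signal.
    query = "SELECT * FROM users WHERE name = '" + username + "'"
    return conn.execute(query).fetchall()

def run_expression(user_input: str):
    # eval on user input: flagged in essentially any context.
    return eval(user_input)
```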

Where does it miss security issues?

Anywhere the bug requires reasoning across files, across services, or across time. Copilot's review context is usually the diff plus some adjacent file content. Vulnerabilities that live in the interaction between files (an authorization check that should have been in the middleware but is missing from one route) or in the interaction between services (an RPC call that trusts input the caller shouldn't be trusted to provide) are mostly invisible to it. This is not a criticism of the tool; it is a limit of the context window and of the review scope.
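A minimal sketch of that cross-file gap, using a hypothetical Flask app (all names invented). Imagine require_admin lives in middleware.py, outside the diff; a reviewer who sees only the new route has no signal that the decorator was required:

```python
# Hypothetical Flask routes illustrating the cross-file miss. In a real
# codebase require_admin would live in middleware.py, outside the diff.

from functools import wraps
from flask import Flask, abort, request

app = Flask(__name__)

def require_admin(view):
    # Imagine this decorator is defined in middleware.py, not in the PR.
    @wraps(view)
    def wrapper(*args, **kwargs):
        if request.headers.get("X-Role") != "admin":
            abort(403)
        return view(*args, **kwargs)
    return wrapper

@app.route("/admin/users")
@require_admin
def list_users():
    return {"users": []}

@app.route("/admin/export")  # the new route added in this PR
def export_users():          # missing @require_admin: a diff-level
    return {"export": []}    # reviewer has no signal it was required
```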

Concretely, the classes of issues Copilot routinely misses: IDOR and broken access control that depend on the caller's identity, SSRF through clever URL construction, XXE in custom XML parsers, path traversal in file-serving logic that uses helper functions defined elsewhere, race conditions in concurrent code, deserialization gadgets that are harmless in isolation but dangerous when chained, and anything that depends on the state of a configuration file that isn't in the diff.
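To pick one from that list, here is a hypothetical path-traversal sketch. The function in the diff looks plausible; the vulnerability lives entirely in a helper that, in a real codebase, would be defined in another module:

```python
# Hypothetical path traversal where the decisive logic is elsewhere.

import os

def resolve_upload(base_dir: str, name: str) -> str:
    # Looks fine in isolation; whether it is safe depends entirely on
    # clean_name, which in a real codebase lives in another module.
    return os.path.join(base_dir, clean_name(name))

def clean_name(name: str) -> str:
    # This version does not strip "../", so
    # resolve_upload("/srv/uploads", "../../etc/passwd") escapes base_dir.
    return name.strip()
```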

It also struggles with security issues in domain-specific code. Cryptographic protocol bugs, nonce reuse, weak random number generation in niche libraries, and side-channel leaks in constant-time code are all things the model has thin training coverage for. A crypto engineer reviewing the same diff will catch things Copilot reliably misses.
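Nonce reuse is a good example of how quiet these bugs are. The sketch below uses the cryptography package's real AESGCM API; the vulnerability is a single constant, and nothing in the diff pattern-matches to "insecure":

```python
# A minimal nonce-reuse sketch using the cryptography package's AESGCM.
# The bug is one constant: a fixed nonce reused across encryptions,
# which breaks AES-GCM's confidentiality and authenticity guarantees.

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
aead = AESGCM(key)

NONCE = b"\x00" * 12  # BUG: must be unique per encryption under a key

def encrypt(msg: bytes) -> bytes:
    # Reusing NONCE lets an attacker XOR ciphertexts to recover plaintext
    # relationships and forge tags. The fix is one line: generate a fresh
    # nonce with os.urandom(12) and prepend it to the ciphertext.
    return aead.encrypt(NONCE, msg, None)
```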

How does it handle vulnerable dependencies?

Poorly, if at all, unless the CVE is famous. The diff-level review does not have a full view of the dependency graph, and even when it flags a dependency update it does so based on prose cues rather than a structured vulnerability database. A Copilot comment saying "this package had a CVE recently" is not a reliable signal; it's a hunch. Production dependency management has to be handled by a real SCA tool that queries a vulnerability database against a full SBOM.
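For contrast, here is a minimal sketch of the structured lookup a real SCA tool performs, against OSV's public query API (a real endpoint; the function name is ours):

```python
# Minimal sketch of a structured vulnerability lookup via OSV's public
# API (https://api.osv.dev). This is what a database query looks like,
# as opposed to a prose hunch about a "recent CVE".

import requests

def known_vulns(name: str, version: str, ecosystem: str = "PyPI") -> list[str]:
    resp = requests.post(
        "https://api.osv.dev/v1/query",
        json={"package": {"name": name, "ecosystem": ecosystem},
              "version": version},
        timeout=10,
    )
    resp.raise_for_status()
    return [v["id"] for v in resp.json().get("vulns", [])]

# Example: a deliberately old requests release with known advisories.
print(known_vulns("requests", "2.19.1"))
```

A production SCA tool runs this kind of query against the full SBOM, with reachability analysis on top; the point is simply that it is a database lookup, not a guess.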

This matters because teams that rely on Copilot code review sometimes under-invest in SCA, assuming the AI will catch it. It won't. The Ultralytics PyPI compromise in late 2024 is a good example: a dependency update that looked ordinary at the diff level but was carrying malicious code. Diff-level review could not catch it. Only provenance-aware scanning could, and even that required the tooling to have visibility into package behavior, not just version numbers.

What about AI-generated code in the PR?

Copilot reviewing Copilot-written code creates a confirmation loop that can be subtly wrong. The generator and the reviewer share training distributions. Patterns that are common but insecure (hardcoded secrets in examples, unsanitized shell commands in quick scripts, weak crypto defaults) can pass review precisely because the reviewer has seen the same patterns in the same contexts during training. The review does not know that the code was AI-generated, but even if it did, the reviewer's priors are not independent of the generator's.

The practical mitigation is that any AI-generated code in a PR should trigger additional review, human or tool-based, rather than less. Some teams run AI-generated changes through a stricter SAST profile, which catches the patterns that both the generator and the reviewer overlooked. Treating "Copilot reviewed, Copilot approved" as sufficient is the pattern most likely to produce 2026's embarrassing bug bounty reports.
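As a toy illustration of what "a stricter SAST profile" means in practice (real teams would use Semgrep or CodeQL rules rather than a custom script), here is a minimal AST check that flags one of the shared blind spots, subprocess calls with shell=True:

```python
# Toy rule-backed check: flag any call passing shell=True. In practice
# this would be a Semgrep or CodeQL rule; the AST walk just shows that
# the check is structural, not pattern-vibes.

import ast
import sys

def flag_shell_true(source: str, filename: str) -> list[str]:
    findings = []
    for node in ast.walk(ast.parse(source, filename)):
        if isinstance(node, ast.Call):
            for kw in node.keywords:
                if (kw.arg == "shell"
                        and isinstance(kw.value, ast.Constant)
                        and kw.value.value is True):
                    findings.append(f"{filename}:{node.lineno}: shell=True")
    return findings

if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path) as fh:
            for finding in flag_shell_true(fh.read(), path):
                print(finding)
```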

How should we integrate it without false confidence?

Treat Copilot code review as a first-pass filter that runs alongside, not instead of, the tools that actually know about security. The stack that holds up: Copilot review for broad coverage and fast feedback, SAST (Semgrep, CodeQL, or equivalent) for rule-backed security coverage, SCA with reachability analysis for dependency risk, and human review on anything that touches auth, data access, or payment flow. Each layer catches different things. Dropping any layer because another layer feels thorough is the cost-optimization that generates incidents.

The other integration rule: do not let Copilot's approval count toward required reviewers on security-sensitive paths. GitHub's CODEOWNERS and required-reviewers configuration is where this gets enforced. If your payment code's CODEOWNERS file lets an AI bot be the approver, you have built a policy hole regardless of how good the AI is.
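A hypothetical CODEOWNERS fragment (team names are placeholders). Combined with branch protection's require-review-from-code-owners setting, a bot approval cannot satisfy the requirement on these paths:

```
# Security-sensitive paths require review from named human teams.
# Team names below are placeholders.
/payments/        @acme/payments-team @acme/appsec
/internal/auth/   @acme/appsec
```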

What about the meta-question: will the tool get better?

Yes, and that is not a reason to change the integration pattern. As context windows grow and as the tools learn to pull in more of the codebase during review, the share of security issues Copilot can catch will grow. The ones listed above as "misses" will shift. But the structural point stays: a general-purpose code reviewer is not a substitute for purpose-built security tooling, and the two have different failure modes. Keep both.

How should the feedback loop for misses be structured?

Capture every production security issue that made it past review and ask whether Copilot flagged it, a SAST rule flagged it, a human reviewer flagged it, or none of the above. Tag the issue with which layer caught it. Over a few quarters, you build a real picture of where each layer is strong and weak, and you make informed investment decisions rather than guessing.
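A minimal sketch of that tagging loop; the schema and layer names are assumptions, and the substance is just a per-layer tally:

```python
# Minimal sketch of the catch-attribution loop. Each post-review issue
# records which layer caught it (or "none" if it reached production).
# Field names and layer labels are assumptions.

from collections import Counter

ISSUES = [
    {"id": "SEC-101", "caught_by": "sast"},
    {"id": "SEC-102", "caught_by": "copilot"},
    {"id": "SEC-103", "caught_by": "human"},
    {"id": "SEC-104", "caught_by": "none"},  # escaped every layer
]

def layer_tally(issues: list[dict]) -> Counter:
    return Counter(i["caught_by"] for i in issues)

print(layer_tally(ISSUES))
# Over a few quarters, the "none" bucket tells you where to invest next.
```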

Teams that do this often discover something counterintuitive: Copilot is actually strong in a few security-adjacent areas where SAST is weak (explanations of tricky code paths, for instance, or spotting patterns that a human reviewer would describe as "smells" but that no rule captures). Other teams discover that Copilot's signal is so noisy in their codebase that it drowns out the real catches. Either conclusion is actionable. Neither is reachable without the tagged issue data, which means the first investment is in the tagging process, not in any particular tool.

How Safeguard.sh Helps

Safeguard.sh's reachability analysis catches exactly the dependency-level issues that Copilot code review structurally cannot, while cutting 60 to 80 percent of the false positives that make SCA tooling miserable to live with. Griffin AI complements code-level review with provenance analysis on every package update, flagging compromises like the Ultralytics PyPI incident or PyTorch nightly-style attacks where the diff looks normal but the supply chain is not. SBOM and TPRM workflows keep visibility into the full dependency graph to 100 levels of transitive depth, so transitive risks do not hide behind a clean top-level review. Container self-healing rebuilds downstream images when fixes land, closing the gap between a CVE disclosure and production rollout.
