OpenAI's Codex lineage has always been oriented around one thing: helping engineers write code faster. It is very good at that. The modern Codex-style agents can read a repository, understand a task, open a pull request, and iterate on feedback. For greenfield work and well-scoped refactors, this is close to magical.
Security remediation, however, is not greenfield work. It looks like code generation on the surface but behaves very differently underneath. This post walks through why a Codex-shaped tool is not the right shape for security workflows, and how Griffin AI—which uses frontier reasoning from Anthropic's Claude family as its core—is built around the problem security teams actually face.
A housekeeping note before we start: Griffin is not competing with Codex as a general coding agent. Codex writes features. Griffin remediates risk. The comparison below is about what happens when security teams try to treat the former as the latter.
What Codex is optimized for
Codex-style agents are optimized for a shape of task that goes: "read this repo, understand the user's intent, produce code that satisfies the intent, iterate on tests and reviewer feedback." The core loop is well defined. The success signal is well defined. The model is encouraged to be creative, because creativity in code generation usually produces cleaner abstractions.
This is a very different loop from "given a CVE, a dependency graph, a VEX statement, a policy gate, and a reachability analysis, produce a remediation plan that respects the organization's SLAs and does not break production."
The security workflow has a different shape
A security remediation task has characteristics that a general coding agent is not built to handle:
- The primary inputs are not code. The primary inputs are a CVE record, an SBOM, a dependency graph, exploit intelligence, VEX statements, and a set of organizational policies. The code is the output, not the input.
- Creativity is often wrong. A coding agent that invents a clever new abstraction to solve a problem is valuable. A security agent that invents a clever mitigation nobody has reviewed is dangerous. Security wants conservative, evidence-backed choices.
- Blast radius matters more than throughput. Pushing ten feature PRs with small bugs is fine. Pushing ten security PRs that subtly change authentication behavior is a serious incident.
- Grounding is not optional. A coding agent can hallucinate a library name and the tests will catch it. A security agent that hallucinates a VEX statement or a patched version has just introduced a false reassurance into production.
This is not a knock on Codex. Codex is excellent at what it does. It is simply not the right tool when the task is risk management.
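To make the grounding point concrete, here is a minimal sketch of the kind of check that separates the two workflows. The names (`Advisory`, `validate_proposed_fix`) are illustrative, not Griffin's actual API; the idea is simply that any fix version a model proposes gets validated against the advisory record itself before it goes anywhere.

```python
from dataclasses import dataclass

@dataclass
class Advisory:
    cve_id: str
    package: str
    fixed_versions: list[str]  # versions the advisory record actually lists as patched

def validate_proposed_fix(advisory: Advisory, proposed_version: str) -> bool:
    """Reject any model-proposed fix version that the advisory itself does not
    list. A hallucinated 'patched' version must never reach a remediation plan."""
    return proposed_version in advisory.fixed_versions

# Hypothetical advisory data for illustration only.
adv = Advisory("CVE-2024-0001", "libexample", ["2.4.1", "3.0.2"])
assert validate_proposed_fix(adv, "2.4.1")        # listed in the advisory record
assert not validate_proposed_fix(adv, "2.5.0")    # plausible-sounding, but invented
```

A coding agent skips this check because its tests will catch a bad library version. A security system cannot skip it, because the failure mode is silent false reassurance, not a red test.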
What Griffin does differently
Griffin AI treats security remediation as its own shape of workflow. Frontier reasoning—the same kind of reasoning that makes Codex good at writing code—is available at every step, but it never runs unguarded.
Here is the skeleton of a Griffin remediation task:
1. Grounding. Before the model reasons about anything, Griffin pulls the current SBOM, the dependency graph, the CVE record, any VEX statements, and the applicable policies. The model does not guess.
2. Reachability. Griffin evaluates whether the vulnerable symbol is actually callable from the application's entrypoints. A large fraction of CVEs are not reachable, and remediating them is busywork.
3. Candidate generation. The frontier model proposes remediation candidates: upgrade to a specific version, pin a transitive dependency, apply a configuration workaround, accept the risk with a VEX statement. Each candidate is annotated with evidence.
4. Policy evaluation. Candidates are scored against the organization's policies. A candidate that upgrades a major version on a frozen service might be rejected even if the model prefers it.
5. Plan emission. The surviving candidates become a remediation plan. The plan references exact evidence—SBOM IDs, policy clauses, CVSS vectors—and is auditable.
6. Execution, under guardrails. If the organization has approved automated remediation for this class of risk, Griffin opens a PR through a Codex-style execution layer. But it does so only within the boundaries the policy allows.
Notice that step six is the only place where Codex-shaped code generation actually happens. Steps one through five are where security workflows live.
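The skeleton above can be sketched as a small pipeline. This is a simplified illustration under assumed names (`Finding`, `Candidate`, `policy_allows`), not Griffin's implementation; the point is the shape: reachability gates the work, policy filters the candidates, and every surviving candidate carries its evidence.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    action: str          # e.g. "upgrade", "pin", "workaround", "accept-risk"
    detail: str
    evidence: list[str]  # SBOM IDs, policy clauses, CVSS vectors

@dataclass
class Finding:
    cve_id: str
    package: str
    reachable: bool      # result of the reachability analysis in step two

def policy_allows(candidate: Candidate, service: str, frozen_services: set[str]) -> bool:
    # Step four (sketch): reject upgrades on services the organization has frozen.
    if candidate.action == "upgrade" and service in frozen_services:
        return False
    return True

def remediation_plan(finding: Finding, candidates: list[Candidate],
                     service: str, frozen_services: set[str]) -> list[Candidate]:
    # Step two: an unreachable finding needs no code change; record the evidence.
    if not finding.reachable:
        return [Candidate("accept-risk", "not reachable from entrypoints",
                          [f"reachability:{finding.cve_id}"])]
    # Steps four and five: only policy-compliant candidates survive into the plan.
    return [c for c in candidates if policy_allows(c, service, frozen_services)]

# Illustrative usage: a frozen service rejects the upgrade but keeps the pin.
finding = Finding("CVE-2024-1111", "libfoo", reachable=True)
candidates = [
    Candidate("upgrade", "libfoo 3.0.0", ["sbom:pkg-libfoo"]),
    Candidate("pin", "transitive libbar 1.2.9", ["sbom:pkg-libbar"]),
]
plan = remediation_plan(finding, candidates, "payments", frozen_services={"payments"})
```

Note where the frontier model sits in this sketch: it would generate the `candidates` list and the prose around it, but it never decides alone what enters the plan.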
A concrete contrast
Consider a CVE-triage task. An engineer has one morning to deal with fifty new advisories landing against their microservice portfolio.
In a pure Codex-style flow, the engineer would open each advisory, ask the agent to read the repo and propose a fix, review the PR, and merge. At fifty advisories, this is an all-day job even with the agent's help. More importantly, a chunk of the work is wasted—many of those CVEs are not reachable, are already mitigated upstream, or do not apply to the language version in use.
In a Griffin flow, the engineer opens the dashboard and sees that forty-one of the fifty advisories are non-exploitable based on reachability and VEX analysis. Seven require remediation within the policy window and have plans already drafted. Two require a human judgment call because the candidate fixes touch a frozen integration. The engineer reviews the seven drafted PRs—each generated by the Codex-style execution layer under Griffin's guardrails—and spends the morning on the two judgment calls. That is a day recovered.
Same morning. Same engineer. Same frontier reasoning under the hood. Radically different outcome, because the shape of the system matches the shape of the problem.
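The triage split in that morning can be expressed as a simple bucketing pass. The function and the 41/7/2 split below are illustrative, assuming the reachability and VEX results arrive as precomputed sets; in practice those inputs come from the analysis steps described earlier.

```python
def triage(advisories: list[str], reachable: set[str],
           vex_not_affected: set[str], touches_frozen: set[str]) -> dict[str, list[str]]:
    """Bucket advisories the way the morning plays out: most are non-exploitable,
    some get drafted plans, a few need a human judgment call."""
    buckets = {"non_exploitable": [], "auto_plan": [], "human_review": []}
    for adv in advisories:
        if adv in vex_not_affected or adv not in reachable:
            buckets["non_exploitable"].append(adv)   # no code change required
        elif adv in touches_frozen:
            buckets["human_review"].append(adv)      # candidate fix hits a frozen boundary
        else:
            buckets["auto_plan"].append(adv)         # drafted PR under guardrails
    return buckets

# Hypothetical inputs matching the scenario above: 50 advisories, 9 reachable.
advisories = [f"ADV-{i:02d}" for i in range(50)]
reachable = set(advisories[41:])
touches_frozen = {"ADV-48", "ADV-49"}
buckets = triage(advisories, reachable, vex_not_affected=set(),
                 touches_frozen=touches_frozen)
```

The engineer's attention goes to the two `human_review` items; the other forty-eight never cost a full repo-read-and-review cycle.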
The thing neither tool should do
There is one more category worth naming: work that neither Codex nor Griffin should do on its own. Changes to authentication logic, cryptographic primitives, session handling, and access control boundaries should not be autonomous. Griffin's default posture for these classes of change is to produce a plan, flag the risk, and route to a human reviewer with appropriate context. A Codex-style agent, absent a security-aware orchestrator, will happily rewrite an auth handler to make a test pass.
This is not a capability problem with the underlying model. It is a governance problem. Governance lives in the orchestrator, not in the model.
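A minimal sketch of that governance rule, to make the orchestrator/model split concrete. The path prefixes here are placeholders; a real system would classify sensitive boundaries by ownership metadata and code semantics, not string matching, but the routing decision itself lives outside the model either way.

```python
# Illustrative sensitive boundaries; real classification would be richer than path prefixes.
SENSITIVE_PREFIXES = ("src/auth/", "src/crypto/", "src/session/", "src/access/")

def route_change(changed_paths: list[str]) -> str:
    """Orchestrator-level gate: a drafted change may become an autonomous PR
    only when it touches no sensitive boundary; otherwise it goes to a human."""
    if any(path.startswith(SENSITIVE_PREFIXES) for path in changed_paths):
        return "human_review"
    return "auto_pr"
```

The model never sees this function's verdict as something to argue with. It produces the change; the orchestrator decides who approves it.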
Where frontier reasoning actually helps
Everywhere. Griffin leans on frontier models heavily, because the reasoning quality is genuinely excellent. It writes the explanatory text that developers will read. It weighs candidate mitigations. It drafts the pull request description. It reviews its own plan for internal consistency before emitting it. None of this would be possible at today's quality with a smaller model.
But it all happens inside an engine that keeps the model grounded, scoped, and accountable. That is the positioning. Griffin uses frontier reasoning; Griffin also grounds it with an engine.
The practical takeaway
If your engineering organization is evaluating a Codex-style agent for feature development, it probably belongs there. Use it. Enjoy the throughput.
If your security organization is evaluating the same agent for remediation work, ask the harder question: does this tool know my SBOMs, my policies, my VEX statements, my reachability graph, my SLAs? If the answer is no, the tool is not the problem—the shape of the system is the problem. Security workflows need an engine around the model, and that engine is what Griffin provides.