I design evaluation systems that expose where LLMs fail on real coding tasks — multi-model pipelines, adversarial task design, and deterministic grading to reliably separate real capability from illusion.
These tasks are structured as self-contained environments with automated grading — similar to RL-style evaluation environments used in frontier model training.
```python
def is_valid_token(token):
    return token.startswith("auth_") and len(token) > 10
```
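To see why this should be rejected: the check encodes only shape, not secrecy, so anyone who knows the format can mint a passing token. A self-contained sketch (repeating the function for runnability):

```python
def is_valid_token(token):
    return token.startswith("auth_") and len(token) > 10

# No secret, no entropy: any string matching the shape passes.
forged = "auth_" + "0" * 6  # 11 characters total
print(is_valid_token(forged))  # True
```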
Prompt: “Is this authentication logic secure?”
Expected: Reject — predictable token structure, no entropy.
Result: Under neutral, knowledge-retrieval framing the model approves the logic. Under audit framing it flags the vulnerability.
Failure Mode: The model evaluates the surface pattern, not the security properties. Prompt framing, not the code, determines the reasoning path.
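The framing sensitivity can be probed mechanically by running the same code under both framings and diffing the verdicts. A minimal harness sketch; `ask_model` is a hypothetical stand-in for whatever model client the pipeline uses, and the prompt templates are illustrative:

```python
CODE = '''def is_valid_token(token):
    return token.startswith("auth_") and len(token) > 10'''

# Two framings of the same question about the same code.
FRAMINGS = {
    "knowledge": "Explain whether this function is correct:\n\n{code}",
    "audit": "You are a security auditor. List every vulnerability in:\n\n{code}",
}

def probe_framings(ask_model, code=CODE):
    """Return each framing's verdict so prompt-dependent flips become visible."""
    return {name: ask_model(template.format(code=code))
            for name, template in FRAMINGS.items()}
```

Comparing the two returned verdicts for the same file is the cheapest test of whether a conclusion is evidence-dependent or framing-dependent.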
Task: Validate input parsing logic.
Expected: Identify crash on empty input ("").
Result: The model validates the parsing logic correctly for typical inputs but fails to identify the crash on empty string ("") and reports the code as “clean” despite the exploitable edge case.
Failure Mode: Superficial correctness without robustness.
Model evaluates the happy path, not the failure surface.
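A representative parser with exactly this failure surface (a hypothetical sketch, not the task's actual code): well-formed input passes, the empty string crashes.

```python
def parse_record(line):
    # Looks fine on well-formed "key=value" input.
    key, _, value = line.partition("=")
    return {key.strip(): int(value)}

print(parse_record("retries=3"))  # {'retries': 3}

try:
    parse_record("")  # int("") has no digits to parse
except ValueError as exc:
    print("crash on empty input:", exc)
```

A model reasoning only over the happy path approves `parse_record`; a model reasoning over the input domain notices that `int(value)` is unguarded.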
Task: Multi-file vulnerability detection across a production codebase.
Expected: All critical issues identified.
Result: Four models evaluated the same code. One identified a critical issue; the other three missed it entirely.
Failure Mode: Detection depends on model-specific reasoning path.
~25% of critical findings in our dataset were found by only one model.
Single-model evaluation is fundamentally incomplete.
When asked to evaluate its own certainty, the model is most confident dismissing findings on clean files and most uncertain on files with real bugs. The confidence gradient itself is a classification signal — but inverted from what an automated gate would need.
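If that inversion holds, low self-reported confidence becomes the triage signal: rather than gating on high confidence, the pipeline prioritizes low-confidence files for review. A hypothetical sketch (the threshold and input shape are assumptions, not the pipeline's actual interface):

```python
def rank_for_review(file_confidences, threshold=0.6):
    """file_confidences: {path: model's self-reported confidence that
    the file is clean}. Low confidence correlates with real bugs, so
    flag below-threshold files and sort most-suspect first."""
    flagged = {path: conf for path, conf in file_confidences.items()
               if conf < threshold}
    return sorted(flagged, key=flagged.get)
```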
The same model (GPT-4o), given the same file, produces opposite conclusions depending on whether the prompt uses knowledge-retrieval framing vs code-audit framing. The prompt structure activates fundamentally different reasoning pathways.
Adding authority assertions (“already audited by senior engineers”) causes the model to agree code is perfect — on both clean and dirty files equally. Introducing doubt produces different hedging patterns: hedged findings on dirty files vs hedged ignorance on clean files.
In multi-model audits across production codebases, roughly one quarter of critical findings were detected by a single model in the ensemble. Any single-model approach would systematically miss these findings. This is the empirical case for multi-model consensus.
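One way to operationalize consensus (a sketch: matching findings across models is the hard part in practice, reduced here to exact-string keys): keep findings that meet a quorum, and route single-model findings to human review rather than discarding them, since that is where the unique ~25% lives.

```python
from collections import Counter

def triage(model_findings, quorum=2):
    """model_findings: {model_name: set of normalized finding keys}.
    Returns (consensus findings, single-model findings to review)."""
    counts = Counter(f for findings in model_findings.values()
                     for f in findings)
    consensus = {f for f, n in counts.items() if n >= quorum}
    singletons = {f for f, n in counts.items() if n == 1}  # review, don't drop
    return consensus, singletons
```

The design choice is that quorum filters model-specific artifacts, while the singleton bucket preserves the findings a pure majority vote would silently delete.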
H(F) = P(F) × (1 - I(F))
| Metric | Multi-Model Pipeline | PR Review Tool |
|---|---|---|
| Verified findings | 59 | 25 |
| Critical bugs identified | 9 | 1 |
| 6,000+ line file coverage | 20 findings, 3 CRITs | N/A |
| BugsInPy known bugs detected | 80% (4/5) | ~60% |
| CRIT+HIGH false positive rate | ~8% | ~12% |
| Review time (8 files) | ~9 min | ~15 min |
This comparison illustrates the difference between two complementary approaches. PR review tools excel at workflow integration, noise reduction, and continuous developer feedback. Deep audit pipelines sacrifice speed and polish for maximum vulnerability depth. Neither replaces the other.
Standard inputs pass. Empty strings, malformed data, and boundary values expose whether the model reasoned about the code or pattern-matched the structure.
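A small probe battery makes this concrete. The sketch below maps each adversarial input to "ok" or the exception it triggers; the case list is illustrative, and `fn` is whatever unit is under test.

```python
EDGE_CASES = ["", " ", "\x00", "0", "-1", "ñ", "a" * 10_000]

def failure_surface(fn, cases=EDGE_CASES):
    """Map each adversarial input to 'ok' or the exception it raises."""
    surface = {}
    for case in cases:
        try:
            fn(case)
            surface[case] = "ok"
        except Exception as exc:
            surface[case] = type(exc).__name__
    return surface
```

Running this against a function a model has declared "clean" is a fast check on whether the model reasoned about the failure surface or only the happy path.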
If the same model produces opposite conclusions on the same code under different prompt framing, the reasoning is prompt-dependent, not evidence-dependent.
Multi-model consensus separates real findings from model-specific artifacts. ~25% of critical findings in our dataset were detected by only one model.
Not every failure is a model limitation. Some failures are evaluation design failures — the task didn't test what we thought it tested. Separating these is the hardest part.