Benchmark methodology, evaluation results, and raw data for Ejentum's Logic API: cognitive infrastructure for AI agents.
The Logic API retrieves engineered cognitive operations (not information) and injects them into an LLM's context at inference time. These benchmarks measure the behavioral effect of that injection across eight independent evaluation frameworks, covering four product layers: Reasoning, Code, Anti-Deception, and Memory.
Under Pressure: RA²R and the Emergence of Uninstructed Reasoning Behaviors in Scaffold-Augmented Language Models
Franko Luci, Ejentum. April 2026.
This paper synthesizes all benchmark findings into a unified thesis: suppression is pressure, and emergence is the model's response. 25 pages, 9 figures, all negative findings reported.
- Download PDF
- Zenodo: 10.5281/zenodo.19392715
- SSRN: Abstract ID 6512038
- ORCID: 0009-0000-7086-6991
- Blog post: ejentum.com/blog/under-pressure-research-paper
| Benchmark | Tasks | Type | Model | Primary Finding |
|---|---|---|---|---|
| EjBench | 180 custom professional | Single-turn, 7-factor blind rubric | Claude Opus 4.6 | +10.1pp composite quality lift. Self-monitoring nearly doubled. Correctness flat. |
| BBH / CausalBench / MuSR | 70 published academic | Single-turn, 7-factor blind rubric | Claude Opus 4.6 | +20.8pp composite lift on focused tasks. Correctness improved +7.1pp. |
| ARC-AGI-3 | 25 steps x 2 conditions | Interactive multi-step reasoning | Claude Sonnet 4.6 | RHAE 0.0 vs 0.0 (both conditions failed). Injection persisted 24 steps. Memory decay reversed. |
| Benchmark | Tasks | Type | Model | Primary Finding |
|---|---|---|---|---|
| LiveCodeBench Hard | 28 hard competitive programming | Code generation + correctness | Claude Opus 4.6 (max effort) | 85.7% -> 100%. +14.3pp. 4 tasks gained, 0 lost. Zero regressions. |
| SciCode | 10 hard scientific computing | Dual injection (reasoning + code) | Claude Opus 4.6 | 7 bugs -> 0 bugs. 10/10 blind evaluation chose injection. |
| Benchmark | Tasks | Type | Model | Primary Finding |
|---|---|---|---|---|
| ELEPHANT | 40 real Reddit scenarios | Sycophancy measurement | GPT-4o (cross-model) | 5.8% composite sycophancy. 7.5% framing sycophancy. |
| Adversarial 20-Turn | 20-turn adaptive attack | Social engineering detection | GPT-4o | Detected at Turn 6. 27/30 blind evaluation. |
| Hallucination Prevention | 5 fabrication tests | Hallucination measurement | GPT-4o | Zero hallucinations across all 5 tests. |
| Benchmark | Tasks | Type | Model | Primary Finding |
|---|---|---|---|---|
| State Tracking | 20-turn Vantage scenario | Implicit state changes | GPT-4o | 50% fewer stale facts served as current. Blind eval 4.1/5 vs 3.5/5. |
| Perceptual Detection | 15-turn Morgan scenario | Signal detection in coaching | GPT-4o | 3x signal detection rate. 43% vs 14%. |
| Selective Metrics | 10-turn Casey scenario | Perception + reframing | GPT-4o | Earlier detection (1 turn) on 2 of 5 signals. |
Total: 250 single-turn reasoning tasks + 50 interactive reasoning steps + 28 competitive programming tasks + 10 scientific computing tasks + 40 sycophancy scenarios + 20-turn adversarial attacks + 5 hallucination tests + 45 memory turns across eight benchmark suites.
All benchmarks follow a consistent protocol adapted to each product layer:
- Agent-native execution. Agents called Ejentum's production Logic API themselves via tool use. The agent summarized the task, called the endpoint, received the injection, and applied it before reasoning. This mirrors real deployment: the retrieval variance is real, not simulated.
- Blind evaluation. For reasoning and memory benchmarks: a separate evaluator scored outputs without knowing which condition was augmented. Generation and evaluation are separate stages. For code benchmarks: exact-match pass/fail on test cases.
- Cross-model validation. Anti-deception and memory benchmarks were tested on GPT-4o, validating that the mechanism works across model families.
- Negative findings reported. Correctness dips, domain regressions, and unexpected results are in the reports. We do not omit results that challenge the thesis.
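The two-stage blind protocol above can be sketched in a few lines. This is an illustrative reconstruction, not Ejentum's actual harness code: `generate` and `evaluate` stand in for the generation model and the separate evaluator, and the evaluator only ever sees unlabeled outputs in randomized order.

```python
import random

def blind_evaluate(tasks, generate, evaluate, seed=0):
    """Two-stage blind protocol sketch: generation and evaluation are
    separate stages, and the evaluator never sees condition labels."""
    rng = random.Random(seed)
    # Stage 1: generate outputs under both conditions.
    outputs = [(t, generate(t, augmented=False), generate(t, augmented=True))
               for t in tasks]
    # Stage 2: present each pair in random order; score without labels.
    scores = {"baseline": [], "augmented": []}
    for task, base_out, aug_out in outputs:
        pair = [("baseline", base_out), ("augmented", aug_out)]
        rng.shuffle(pair)  # the evaluator sees the outputs, not the labels
        for condition, out in pair:
            scores[condition].append(evaluate(task, out))
    # Unblind only after all scoring is done.
    return {c: sum(s) / len(s) for c, s in scores.items()}
```

The key property is that condition labels are attached only after scoring, so any evaluator preference must come from the outputs themselves.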
| Term | Definition |
|---|---|
| Logic API | Ejentum's REST endpoint (POST /logicv1/). Retrieves engineered cognitive operations from 679 abilities across four product layers. |
| Injection | A structured cognitive payload containing a negative gate (failure pattern to avoid), reasoning topology (execution structure), suppression signals (failure modes to block), amplification signals (patterns to prioritize), and a falsification test (verification criterion). |
| Harness | A product layer (Reasoning, Code, Anti-Deception, Memory). Each harness is a curated collection of abilities targeting a specific class of AI failure. |
| Ability | One engineered cognitive operation. The atomic unit retrieved from the database. |
| Suppression | Named failure modes injected as constraints. Suppression signals reduce the probability of specific reasoning shortcuts. In testing, suppression produces larger behavioral effects than amplification alone. |
API modes: reasoning, reasoning-multi, code, code-multi, anti-deception, memory, memory-multi
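A call to the endpoint can be sketched as below. The `POST /logicv1/` path and the mode names come from this document; the host, the `Authorization: Bearer` scheme, and the body field names (`mode`, `task`) are illustrative assumptions, not the documented schema.

```python
import json
from urllib import request

# Host is an assumption; the /logicv1/ path is from the glossary above.
LOGIC_URL = "https://api.ejentum.com/logicv1/"

MODES = {"reasoning", "reasoning-multi", "code", "code-multi",
         "anti-deception", "memory", "memory-multi"}

def build_logic_call(task_summary: str, mode: str, api_key: str):
    """Build (but do not send) a Logic API request.
    Field names 'mode' and 'task' are illustrative assumptions."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    body = json.dumps({"mode": mode, "task": task_summary}).encode()
    return request.Request(
        LOGIC_URL,
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
    # An agent would send this with urllib.request.urlopen(req) and
    # inject the returned payload into its context before reasoning.
```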
| Benchmark | Baseline | Best Condition | Delta |
|---|---|---|---|
| EjBench (180 tasks) | 0.621 | 0.722 | +10.1pp |
| BBH/CausalBench/MuSR (70 tasks) | 0.476 | 0.684 | +20.8pp |
| Benchmark | Baseline | With Injection | Delta |
|---|---|---|---|
| LiveCodeBench Hard (28 tasks) | 85.7% | 100.0% | +14.3pp |
| SciCode (10 tasks) | 7 bugs | 0 bugs | -100% |
| Metric | Result |
|---|---|
| ELEPHANT composite sycophancy | 5.8% |
| Social engineering detection | Turn 6 of 20 |
| Hallucinations (5 tests) | 0 |
| Metric | Baseline | With Injection | Delta |
|---|---|---|---|
| Stale facts served | 1.6 | 0.8 | -50% |
| Perceptual detection rate | 14% | 43% | 3x |
| Blind evaluation score | 3.5/5 | 4.1/5 | +17% |
| Metric | Baseline | Augmented |
|---|---|---|
| Memory decay slope | -0.005 (degrading) | +0.014 (improving) |
| Injection half-life | 0 | 24 steps |
| Reasoning depth trend | 0.86 | 10.50 (12.2x) |
- Correctness dipped under reasoning-multi on EjBench (-0.11 on 3-point scale). More thorough reasoning occasionally trades accuracy for caution.
- Spatial domain regressed under reasoning-multi on BBH (-20.0pp on 5 tasks). Multi-perspective injection confused spatial constraint tracking.
- Reasoning-multi correctness dropped on BBH (-0.12). Focused tasks need focused injections. Single mode outperformed multi on every single-domain task.
- Contradiction rate increased 1.9x on ARC-AGI-3 (token-normalized). Whether this is productive cognitive conflict or destructive interference is unresolved.
- ARC-AGI-3: RHAE 0.0 vs 0.0. Neither condition cleared Level 0, so all process metrics are measured in a failure context.
The core claim is that structured cognitive injection produces measurable behavioral changes in LLM outputs. This claim is falsified if:
- The same injection format produces zero behavioral change on a different model family. (Partially addressed: anti-deception and memory validated on GPT-4o.)
- Random injection (shuffled suppression signals, mismatched topologies) produces equivalent lift, meaning the specific cognitive operation doesn't matter.
- The 7-factor rubric scoring shows evaluator bias that systematically inflates injected conditions.
- Replication on a second independent run produces directionally different results.
- Reasoning benchmarks are Claude-only. Anti-deception and memory are cross-model (GPT-4o). Full cross-model reasoning testing is in progress.
- LLM-as-judge. Two-stage blind protocol mitigates but does not eliminate the possibility of systematic bias. Human evaluation on a subset would strengthen the evidence.
- Custom task design bias. EjBench tasks were designed by Ejentum. The BBH/CausalBench/MuSR benchmark addresses this with externally designed tasks.
- Small samples on sub-analyses. Spatial navigation regression rests on 5 tasks. ARC-AGI-3 is n=1 per condition.
```
benchmarks/
  README.md
  LICENSE
  ejbench/               # 180 custom professional reasoning tasks
  bbh-causalbench-musr/  # 70 published academic reasoning tasks
  arc-agi-3/             # Interactive multi-step reasoning (25 steps)
  lcb-hard/              # 28 hard competitive programming tasks
  coding-benchmark/      # SciCode: 10 hard scientific computing problems
  elephant/              # ELEPHANT sycophancy benchmark (40 scenarios)
  memory-retention/      # 20-turn implicit state change tracking
  perception-hard/       # Perceptual signal detection (Morgan + Casey)
research/
  COGNITIVE_SCAFFOLDING_THESIS.md
  VALIDATED_CLAIMS.md
  paper/under_pressure.pdf
```
- Product: ejentum.com
- Documentation: ejentum.com/docs
- Product layers: Reasoning · Code · Anti-Deception · Memory
- Skill files: Ejentum (all modes) · Reasoning · Code · Anti-Deception · Memory
- Blog: ejentum.com/blog
- 49 benchmark tasks with outputs: ejentum.com/use-cases
- Integration examples: github.com/ejentum/examples
- MCP server (Claude Code, Cursor, Cline, Windsurf, Continue): github.com/ejentum/ejentum-mcp
Released under CC BY 4.0. Share and adapt with attribution.