Neurely is an open-source research harness that empirically validates a simple claim: on structured manipulation tasks (Towers of Hanoi, 8-puzzle), a neuro-symbolic pipeline — CNN perception, BFS symbolic planner, replanning on error — outperforms end-to-end behavioral cloning by up to +100 percentage points, using tiny models (10K–160K parameters) trained on CPU in under 80 seconds.
We also report extensive negative results: cross-N generalization, memory systems, predictive perception validation, test-time augmentation, and LLM-guided self-improvement all failed to improve the simple pipeline. The contribution is both the validation and the falsification.
Inspired by "The Price Is Not Right" (ICRA 2026).
→ Full results at neurely.alexandrudan.com
- Neuro-symbolic (NS) beats behavioral cloning (BC) on long-horizon structured tasks — the advantage scales with horizon length, not noise alone.
- BC matches NS at short horizons with matched training — 10K/90K/357K-param BC all reach 100% on 4-disk Hanoi extreme when given the same noise curriculum NS uses. Training distribution matters, not model scale.
- NS pulls ahead on long horizons — on 8-puzzle hard starts (25+ moves), BC plateaus at 60-79% while NS reaches 93-99% (+17 to +33pp gap). Against naively trained BC, the Hanoi gap is ~100pp.
- Hanoi scaling holds to 127 steps — NS 100% from 3 to 7 disks, with ~1.2× optimal step efficiency.
- Re-planning is the critical mechanism — removing it drops NS from 100% to 10% on Hanoi extreme.
- Simplicity wins — memory systems, predictive perception validation, cross-N generalization, and TTA all either failed to help or actively hurt performance.
All numbers reported with Wilson 95% confidence intervals.
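For reference, a minimal sketch of the Wilson score interval behind those CIs (the project's own tracking lives in metrics.py); for example, 200/200 successes gives a lower bound of ~98.1%:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 -> 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

print(wilson_ci(200, 200))  # (0.981..., 1.0) -> reported as [98.1%, 100%]
```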
| Disks | Optimal | NS Success | Baseline Success | Avg Steps (NS) |
|---|---|---|---|---|
| 3 | 7 | 100% | 100% | 7.4 |
| 4 | 15 | 100% [98.1%, 100%] | 0% | 18.2 |
| 5 | 31 | 100% [98.1%, 100%] | 0% | 37.0 |
| 6 | 63 | 100% [96%, 100%] | — | 74.9 |
| 7 | 127 | 100% [93%, 100%] | — | 154.2 |
| Start Difficulty | NS Success | Baseline Success | Gap |
|---|---|---|---|
| Easy (~5-7 optimal moves) | 98-100% | 89-92% | +8 to +11pp |
| Hard (20+ optimal moves) | 93-99% | 60-79% | +17 to +33pp |
The gap scales with horizon: the easy 8-puzzle result was misleading because a 15-random-move shuffle typically has a much shorter optimal solution path (random moves cancel; see the sketch below). Hard starts reveal the true gap, comparable in magnitude to Hanoi.
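To see concretely why easy shuffles are misleading, here is a quick standalone check (not a Neurely script) that measures the true BFS distance of a 15-move shuffle:

```python
# Standalone illustration: a 15-random-move shuffle of the 8-puzzle is
# usually far fewer than 15 optimal moves from the goal, because moves cancel.
from collections import deque
import random

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # 0 is the blank

def neighbors(s):
    """States reachable by sliding one tile into the blank."""
    i = s.index(0)
    r, c = divmod(i, 3)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 3 and 0 <= nc < 3:
            j = nr * 3 + nc
            t = list(s)
            t[i], t[j] = t[j], t[i]
            yield tuple(t)

def bfs_distance(start, goal=GOAL):
    """Exact optimal path length via breadth-first search."""
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        s, d = frontier.popleft()
        if s == goal:
            return d
        for n in neighbors(s):
            if n not in seen:
                seen.add(n)
                frontier.append((n, d + 1))

state = GOAL
for _ in range(15):                    # 15 random moves out...
    state = random.choice(list(neighbors(state)))
print(bfs_distance(state))             # ...usually far fewer optimal moves back
```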
| Baseline | Training | 4-Disk Extreme Success |
|---|---|---|
| Behavioral Cloning (10K params, naive training) | Clean data, light aug | 0% |
| Behavioral Cloning (10K params, + noise curriculum) | Matched NS training | 100% [96%, 100%] |
| Behavioral Cloning (90K params, + curriculum) | Matched | 100% [96%, 100%] |
| Behavioral Cloning (357K params, + curriculum) | Matched | 100% [96%, 100%] |
| DAgger (10K params, 5 rounds) | 46 min | 52% |
| NS (ours, 10K params) | 77s | 100% |
Honest take: the spectacular 0% → 100% jump for BC with a curriculum shows the naive BC failure was a training-setup artifact, not a structural limitation. On 4-disk Hanoi (15-step horizon), both approaches work when trained properly. The NS structural advantage only emerges at longer horizons, as demonstrated on 8-puzzle hard starts, where BC plateaus at 60-79% even with matched curriculum training.
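For concreteness, here is a minimal sketch of what a noise curriculum can look like; the noise model and schedule are assumptions for illustration, not the exact settings in scripts/train.py:

```python
import numpy as np

def noisy_obs(obs: np.ndarray, sigma: float) -> np.ndarray:
    """Additive Gaussian pixel noise, clipped back to the valid [0, 1] range."""
    return np.clip(obs + np.random.normal(0.0, sigma, obs.shape), 0.0, 1.0)

def curriculum_sigma(epoch: int, n_epochs: int, sigma_max: float = 0.3) -> float:
    """Linear ramp to sigma_max over the first half of training, then hold.
    sigma_max = 0.3 is an assumed value for illustration."""
    return sigma_max * min(1.0, epoch / max(1, n_epochs // 2))

# Inside the training loop:
#   x_batch = noisy_obs(x_batch, curriculum_sigma(epoch, n_epochs))
```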
| Component Removed | Success Rate | Impact |
|---|---|---|
| None (full system) | 100% | baseline |
| Re-planning | 10% | -90pp |
| Optimal planner (replaced with greedy) | 50% | -50pp |
| Noise curriculum training | 54% | -46pp |
| Temporal consistency filter | 54% | -46pp |
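The temporal consistency filter admits a very small implementation. One plausible design (an assumption for illustration; see robust_pipeline.py for the real one): accept a newly perceived state only if it is identical to, or one legal move away from, the last accepted state, and trigger re-observation otherwise.

```python
def filter_state(prev_state, new_state, legal_successors):
    """Temporal consistency check (illustrative design): returns the accepted
    state, or None to signal that the caller should re-observe."""
    if prev_state is None or new_state == prev_state:
        return new_state
    if new_state in legal_successors(prev_state):
        return new_state
    return None  # implausible jump between frames: likely a perception error
```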
We report failed approaches to keep claims calibrated:
| Failed Approach | Result | Why It Fails |
|---|---|---|
| Cross-N generalization (train 3,4 → test 5,6) | 0% | Visual layout too different across N |
| Memory + prediction + temporal features | 100% → 83% | Added complexity actively hurt |
| Prediction validation (override perception) | 100% → 7% | Bad graph predictions suppress correct perception |
| Test-time augmentation (TTA) | 63% → 37% | Adds noise on top of noisy input |
| Qwen 0.5B coding agent (local MLX) | Malformed JSON | Model too small for structured code gen |
| Ultra-tiny perception (<3K params) | 88% on 4-disk extreme | Insufficient capacity |
We built a full system with adaptive-depth perception, Cross-N queries, memory, predictive planning, and LLM-guided self-improvement; the simplest pipeline still performed best:
Winning Architecture (validated)
├── CNN Perception (10K-160K params, noise-curriculum trained)
├── BFS Symbolic Planner (optimal)
├── Noisy Action Primitive
└── Replan on Error
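The whole control loop fits in a few lines. A minimal sketch, with illustrative function names rather than the actual neurely API:

```python
def run_episode(env, perceive, plan, max_steps=500):
    """Perceive -> plan -> execute one move -> repeat. Replanning from the
    freshly perceived state each step is what makes errors recoverable."""
    for _ in range(max_steps):
        state = perceive(env.render())   # CNN: pixels -> symbolic state
        if env.is_goal(state):
            return True
        moves = plan(state, env.goal)    # BFS: optimal move sequence
        if not moves:
            continue                     # implausible state: re-observe
        env.step(moves[0])               # execute only the first move
    return False
```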
git clone https://github.com/danlex/neurely.git
cd neurely
# One command
make full # setup + data + train + experiment
# Or step-by-step
make setup && make data && make train && make experiment
# Reproduce 5-disk extreme (100% success)
make headline
# Cross-task validation
python scripts/puzzle_experiment.py --n-episodes 100
# Run tests (19 tests, ~3 seconds)
pytest tests/ -v
# Build the paper
make paper

neurely/                 # Core library (20 modules, ~2800 LOC)
├── env.py # Towers of Hanoi simulation
├── puzzle.py # 8-puzzle (cross-task validation)
├── planner.py # BFS symbolic planner
├── perception.py # Fixed-N CNN perception
├── baseline.py # End-to-end BC baseline
├── pipeline.py # Validated NS pipeline (100%)
├── novel_arch.py # Cross-N + Adaptive Depth (partial success)
├── memory.py # STM + LTM + State Transition Graph (hurt)
├── prediction.py # Predictive planner (harmful)
├── full_system.py # Everything integrated
├── coding_agent.py # LLM self-programming (tested: Qwen 0.5B failed)
├── robust_pipeline.py # Temporal filter + re-observation
├── failure_mining.py # Hard example mining
├── ensemble.py, mc_dropout.py, tta.py # Alternatives explored
├── harder_tests.py # Adversarial, DAgger, random starts
├── metrics.py # Wilson CI tracking
└── evolve.py # Architecture search
scripts/ # Runnable experiments
├── generate_data.py
├── train.py
├── run_experiment.py
├── puzzle_experiment.py # Cross-task validation
├── autonomous_loop.py # Continuous experimentation
├── evolve_loop.py
├── visualize.py # Main paper figures
├── visualize_extensions.py # Cross-task + scaling figures
└── visualize_harder.py # Horizon-gap figure
paper/
├── main.tex # Publication-ready LaTeX (~1300 lines)
├── references.bib # 20+ references
└── figures/ # 9 PDF + PNG plots
docs/ # GitHub Pages (neurely.alexandrudan.com)
└── index.html
tests/
└── test_pipeline.py # 19 unit tests
results/ # All experiment JSON
├── mindmap.md # Exploration log
├── 4disk_comparison.json
├── 5disk_comparison.json
├── 6disk_scaling.json
├── 7disk_scaling.json
├── puzzle_experiment.json # Easy 8-puzzle
├── puzzle_harder.json # Hard 8-puzzle
├── ablation.json
├── planner_comparison.json
└── ... (14 total JSON files)
| Experiment | Episodes | Key Finding |
|---|---|---|
| 3-disk battery | 250 | Both 100% — task too easy |
| 4-disk battery | 300 | NS 100%, BC 0-10% under noise |
| 5-disk battery | 250 | NS 100% under curriculum |
| 6-disk scaling | 300 | NS 100% at 63-step horizon |
| 7-disk scaling | 150 | NS 100% at 127-step horizon |
| Ablation: no-replan | 100 | 100% → 10-44% |
| Planner: BFS vs greedy | 300 | 100% vs 50% |
| Architecture comparison | 400 | Deep (23K) optimal |
| DAgger baseline | 100 | 52%, 35× slower |
| Adversarial noise | 400 | NS 2× more step-efficient |
| Random starts | 100 | NS 100%, BC 16% |
| Noise curriculum | 200 | Lifts success 22% → 100% |
| Statistical validation | 400 | 100% success; Wilson CI lower bound ≥ 98.1% |
| Full system ablation | 150 | Memory/prediction HURT |
| Cross-N generalization | 240 | Failed (0% on unseen) |
| Self-improving loop | 150 | 0% → 52% after 5 rounds |
| Cross-task: easy 8-puzzle | 400 | NS +9-11pp |
| Cross-task: hard 8-puzzle | 450 | NS +17-33pp (horizon matters) |
Publication-ready LaTeX paper with 20+ references. Build with:
cd paper && pdflatex main && bibtex main && pdflatex main && pdflatex main

The paper includes:
- Main comparison (NS vs BC on Hanoi)
- Ablation study (re-planning, planner optimality)
- Cross-task validation (8-puzzle)
- Scaling analysis (3 → 7 disks)
- Negative results section (Cross-N, memory features, TTA, prediction validation)
- Honest limitations
Q: Does neuro-symbolic AI beat end-to-end approaches? A: Yes, on structured manipulation tasks. Neurely shows the simple neuro-symbolic pipeline achieves 100% success while behavioral cloning collapses to 0% under observation noise on Towers of Hanoi. DAgger (a stronger baseline) reaches only 52% on 4-disk extreme conditions while taking 35× more training time.
Q: What is the key mechanism that makes neuro-symbolic AI robust? A: Re-planning on error. Ablation shows removing replanning drops success from 100% to 10%. Optimal planning (BFS) matters too — greedy planning drops to 50%.
Q: Why does the NS–BC gap scale with horizon? A: Behavioral cloning cannot recover from a single wrong move, and long horizons give more chances to go wrong. On the 8-puzzle, easy starts give a +9-11pp gap and hard starts a +17 to +33pp gap. The NS approach replans at each step, so wrong actions are corrected automatically.
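A back-of-envelope model makes the horizon effect vivid. If a one-shot policy errs on each step with probability ε (the 2% below is illustrative, not measured), it solves an H-step task with probability (1-ε)^H, while a replanning agent only needs to eventually recover from each error:

```python
eps = 0.02                         # assumed per-step error rate (illustrative)
for horizon in (7, 15, 63, 127):   # Hanoi optimal horizons for 3, 4, 6, 7 disks
    print(horizon, round((1 - eps) ** horizon, 3))
# 7 0.868 | 15 0.739 | 63 0.28 | 127 0.077
```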
Q: What does NOT work? A: Cross-N generalization (0%), memory + prediction features (100%→83%), prediction validation (→7%), TTA (63%→37%), Qwen 0.5B coding agent (malformed output). Complexity must be justified empirically.
Q: How does Neurely compare to VLA models like RT-2 or OpenVLA? A: Neurely uses 10K–160K parameters trained in 77s on CPU. The structural advantage of separating perception from planning is what matters, not model scale. Aligns with findings in The Price Is Not Right (ICRA 2026).
Q: What tasks have been validated? A: Towers of Hanoi (3–7 disks, 7–127 step horizons) and the 8-puzzle (181K states). 3,500+ episodes across 18 experiments with Wilson 95% confidence intervals.
Q: How do I reproduce the results?
A: git clone, then make full (complete pipeline) or make headline (5-disk 100% result). Training takes minutes on CPU; no GPU required. The 19 unit tests pass in ~3 seconds.
@software{neurely2026,
author = {Dan, Alexandru},
title = {Neurely: A Rapid-Iteration Neuro-Symbolic AI Harness},
year = {2026},
url = {https://github.com/danlex/neurely},
license = {MIT}
}

MIT License — Copyright (c) 2026 Alexandru Dan