Neurely is an open-source research harness that empirically validates a simple claim: on structured manipulation tasks (Towers of Hanoi, 8-puzzle), a neuro-symbolic pipeline — CNN perception, BFS symbolic planner, replanning on error — outperforms end-to-end behavioral cloning by up to +100 percentage points, using tiny models (10K–160K parameters) trained on CPU in under 80 seconds.
We also report extensive negative results: cross-N generalization, memory systems, predictive perception validation, test-time augmentation, and LLM-guided self-improvement all failed to improve the simple pipeline. The contribution is both the validation and the falsification.
Inspired by "The Price Is Not Right" (ICRA 2026).
→ Full results at neurely.alexandrudan.com
- Neuro-symbolic (NS) beats behavioral cloning (BC) on long-horizon structured tasks — the advantage scales with horizon length, not noise alone.
- BC matches NS at short horizons with matched training — 10K/90K/357K-param BC all reach 100% on 4-disk Hanoi extreme when given the same noise curriculum NS uses. Training distribution matters, not model scale.
- NS pulls ahead on long horizons — on 8-puzzle hard starts (25+ moves), BC plateaus at 60-79% while NS reaches 93-99% (+17 to +33pp gap). Against naively trained BC, the Hanoi gap is ~100pp.
- Hanoi scaling holds to 127 steps — NS 100% from 3 to 7 disks, with ~1.2× optimal step efficiency.
- Re-planning is the critical mechanism — removing it drops NS from 100% to 10% on Hanoi extreme.
- Simplicity wins — memory systems, predictive perception validation, cross-N generalization, and TTA all either failed to help or actively hurt performance.
All numbers reported with Wilson 95% confidence intervals.
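For reference, a minimal sketch of the Wilson score interval behind those CIs (the project's own tracking lives in metrics.py); for example, 200/200 successes gives a lower bound of ~98.1%:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 -> 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

print(wilson_ci(200, 200))  # (0.981..., 1.0) -> reported as [98.1%, 100%]
```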
| Disks | Optimal | NS Success | Baseline Success | Avg Steps (NS) |
|---|---|---|---|---|
| 3 | 7 | 100% | 100% | 7.4 |
| 4 | 15 | 100% [98.1%, 100%] | 0% | 18.2 |
| 5 | 31 | 100% [98.1%, 100%] | 0% | 37.0 |
| 6 | 63 | 100% [96%, 100%] | — | 74.9 |
| 7 | 127 | 100% [93%, 100%] | — | 154.2 |
| Start Difficulty | NS Success | Baseline Success | Gap |
|---|---|---|---|
| Easy (~5-7 optimal moves) | 98-100% | 89-92% | +8 to +11pp |
| Hard (20+ optimal moves) | 93-99% | 60-79% | +17 to +33pp |
The gap scales with horizon: the easy 8-puzzle result was misleading because a 15-random-move shuffle typically has a much shorter optimal solution path (random moves cancel; see the sketch below). Hard starts reveal the true gap, comparable in magnitude to Hanoi.
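To see concretely why easy shuffles are misleading, here is a quick standalone check (not a Neurely script) that measures the true BFS distance of a 15-move shuffle:

```python
# Standalone illustration: a 15-random-move shuffle of the 8-puzzle is
# usually far fewer than 15 optimal moves from the goal, because moves cancel.
from collections import deque
import random

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # 0 is the blank

def neighbors(s):
    """States reachable by sliding one tile into the blank."""
    i = s.index(0)
    r, c = divmod(i, 3)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 3 and 0 <= nc < 3:
            j = nr * 3 + nc
            t = list(s)
            t[i], t[j] = t[j], t[i]
            yield tuple(t)

def bfs_distance(start, goal=GOAL):
    """Exact optimal path length via breadth-first search."""
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        s, d = frontier.popleft()
        if s == goal:
            return d
        for n in neighbors(s):
            if n not in seen:
                seen.add(n)
                frontier.append((n, d + 1))

state = GOAL
for _ in range(15):                    # 15 random moves out...
    state = random.choice(list(neighbors(state)))
print(bfs_distance(state))             # ...usually far fewer optimal moves back
```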
| Baseline | Training | 4-Disk Extreme Success |
|---|---|---|
| Behavioral Cloning (10K params, naive training) | Clean data, light aug | 0% |
| Behavioral Cloning (10K params, + noise curriculum) | Matched NS training | 100% [96%, 100%] |
| Behavioral Cloning (90K params, + curriculum) | Matched | 100% [96%, 100%] |
| Behavioral Cloning (357K params, + curriculum) | Matched | 100% [96%, 100%] |
| DAgger (10K params, 5 rounds) | 46 min | 52% |
| NS (ours, 10K params) | 77s | 100% |
Honest take: the spectacular 0% → 100% jump for BC with a curriculum shows the naive BC failure was a training-setup artifact, not a structural limitation. On 4-disk Hanoi (15-step horizon), both approaches work when trained properly. The NS structural advantage only emerges at longer horizons, as demonstrated on 8-puzzle hard starts, where BC plateaus at 60-79% even with matched curriculum training.
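For concreteness, here is a minimal sketch of what a noise curriculum can look like; the noise model and schedule are assumptions for illustration, not the exact settings in scripts/train.py:

```python
import numpy as np

def noisy_obs(obs: np.ndarray, sigma: float) -> np.ndarray:
    """Additive Gaussian pixel noise, clipped back to the valid [0, 1] range."""
    return np.clip(obs + np.random.normal(0.0, sigma, obs.shape), 0.0, 1.0)

def curriculum_sigma(epoch: int, n_epochs: int, sigma_max: float = 0.3) -> float:
    """Linear ramp to sigma_max over the first half of training, then hold.
    sigma_max = 0.3 is an assumed value for illustration."""
    return sigma_max * min(1.0, epoch / max(1, n_epochs // 2))

# Inside the training loop:
#   x_batch = noisy_obs(x_batch, curriculum_sigma(epoch, n_epochs))
```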
| Component Removed | Success Rate | Impact |
|---|---|---|
| None (full system) | 100% | baseline |
| Re-planning | 10% | -90pp |
| Optimal planner (replaced with greedy) | 50% | -50pp |
| Noise curriculum training | 54% | -46pp |
| Temporal consistency filter | 54% | -46pp |
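The temporal consistency filter admits a very small implementation. One plausible design (an assumption for illustration; see robust_pipeline.py for the real one): accept a newly perceived state only if it is identical to, or one legal move away from, the last accepted state, and trigger re-observation otherwise.

```python
def filter_state(prev_state, new_state, legal_successors):
    """Temporal consistency check (illustrative design): returns the accepted
    state, or None to signal that the caller should re-observe."""
    if prev_state is None or new_state == prev_state:
        return new_state
    if new_state in legal_successors(prev_state):
        return new_state
    return None  # implausible jump between frames: likely a perception error
```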
We report failed approaches to keep claims calibrated:
| Failed Approach | Result | Why It Fails |
|---|---|---|
| Cross-N generalization (train 3,4 → test 5,6) | 0% | Visual layout too different across N |
| Memory + prediction + temporal features | 100% → 83% | Added complexity actively hurt |
| Prediction validation (override perception) | 100% → 7% | Bad graph predictions suppress correct perception |
| Test-time augmentation (TTA) | 63% → 37% | Adds noise on top of noisy input |
| Qwen 0.5B coding agent (local MLX) | Malformed JSON | Model too small for structured code gen |
| Ultra-tiny perception (<3K params) | 88% on 4-disk extreme | Insufficient capacity |
We built a full system with adaptive-depth perception, Cross-N queries, memory, predictive planning, and LLM-guided self-improvement; the simplest pipeline still performed best:
Winning Architecture (validated)
├── CNN Perception (10K-160K params, noise-curriculum trained)
├── BFS Symbolic Planner (optimal)
├── Noisy Action Primitive
└── Replan on Error
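The whole control loop fits in a few lines. A minimal sketch, with illustrative function names rather than the actual neurely API:

```python
def run_episode(env, perceive, plan, max_steps=500):
    """Perceive -> plan -> execute one move -> repeat. Replanning from the
    freshly perceived state each step is what makes errors recoverable."""
    for _ in range(max_steps):
        state = perceive(env.render())   # CNN: pixels -> symbolic state
        if env.is_goal(state):
            return True
        moves = plan(state, env.goal)    # BFS: optimal move sequence
        if not moves:
            continue                     # implausible state: re-observe
        env.step(moves[0])               # execute only the first move
    return False
```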
git clone https://github.com/danlex/neurely.git
cd neurely
# One command
make full # setup + data + train + experiment
# Or step-by-step
make setup && make data && make train && make experiment
# Reproduce 5-disk extreme (100% success)
make headline
# Cross-task validation
python scripts/puzzle_experiment.py --n-episodes 100
# Run tests (19 tests, ~3 seconds)
pytest tests/ -v
# Build the paper
make paper

neurely/                 # Core library (20 modules, ~2800 LOC)
├── env.py # Towers of Hanoi simulation
├── puzzle.py # 8-puzzle (cross-task validation)
├── planner.py # BFS symbolic planner
├── perception.py # Fixed-N CNN perception
├── baseline.py # End-to-end BC baseline
├── pipeline.py # Validated NS pipeline (100%)
├── novel_arch.py # Cross-N + Adaptive Depth (partial success)
├── memory.py # STM + LTM + State Transition Graph (hurt)
├── prediction.py # Predictive planner (harmful)
├── full_system.py # Everything integrated
├── coding_agent.py # LLM self-programming (tested: Qwen 0.5B failed)
├── robust_pipeline.py # Temporal filter + re-observation
├── failure_mining.py # Hard example mining
├── ensemble.py, mc_dropout.py, tta.py # Alternatives explored
├── harder_tests.py # Adversarial, DAgger, random starts
├── metrics.py # Wilson CI tracking
└── evolve.py # Architecture search
scripts/ # Runnable experiments
├── generate_data.py
├── train.py
├── run_experiment.py
├── puzzle_experiment.py # Cross-task validation
├── autonomous_loop.py # Continuous experimentation
├── evolve_loop.py
├── visualize.py # Main paper figures
├── visualize_extensions.py # Cross-task + scaling figures
└── visualize_harder.py # Horizon-gap figure
paper/
├── main.tex # Publication-ready LaTeX (~1300 lines)
├── references.bib # 20+ references
└── figures/ # 9 PDF + PNG plots
docs/ # GitHub Pages (neurely.alexandrudan.com)
└── index.html
tests/
└── test_pipeline.py # 19 unit tests
results/ # All experiment JSON
├── mindmap.md # Exploration log
├── 4disk_comparison.json
├── 5disk_comparison.json
├── 6disk_scaling.json
├── 7disk_scaling.json
├── puzzle_experiment.json # Easy 8-puzzle
├── puzzle_harder.json # Hard 8-puzzle
├── ablation.json
├── planner_comparison.json
└── ... (14 total JSON files)
| Experiment | Episodes | Key Finding |
|---|---|---|
| 3-disk battery | 250 | Both 100% — task too easy |
| 4-disk battery | 300 | NS 100%, BC 0-10% under noise |
| 5-disk battery | 250 | NS 100% under curriculum |
| 6-disk scaling | 300 | NS 100% at 63-step horizon |
| 7-disk scaling | 150 | NS 100% at 127-step horizon |
| Ablation: no-replan | 100 | 100% → 10-44% |
| Planner: BFS vs greedy | 300 | 100% vs 50% |
| Architecture comparison | 400 | Deep (23K) optimal |
| DAgger baseline | 100 | 52%, 35× slower |
| Adversarial noise | 400 | NS 2× more step-efficient |
| Random starts | 100 | NS 100%, BC 16% |
| Noise curriculum | 200 | Lifts success 22% → 100% |
| Statistical validation | 400 | 100% success; Wilson CI lower bound ≥ 98.1% |
| Full system ablation | 150 | Memory/prediction HURT |
| Cross-N generalization | 240 | Failed (0% on unseen) |
| Self-improving loop | 150 | 0% → 52% after 5 rounds |
| Cross-task: easy 8-puzzle | 400 | NS +9-11pp |
| Cross-task: hard 8-puzzle | 450 | NS +17-33pp (horizon matters) |
Publication-ready LaTeX paper with 20+ references. Build with:
cd paper && pdflatex main && bibtex main && pdflatex main && pdflatex main

The paper includes:
- Main comparison (NS vs BC on Hanoi)
- Ablation study (re-planning, planner optimality)
- Cross-task validation (8-puzzle)
- Scaling analysis (3 → 7 disks)
- Negative results section (Cross-N, memory features, TTA, prediction validation)
- Honest limitations
Q: Does neuro-symbolic AI beat end-to-end approaches? A: Yes, on structured manipulation tasks. Neurely shows the simple neuro-symbolic pipeline achieves 100% success while behavioral cloning collapses to 0% under observation noise on Towers of Hanoi. DAgger (a stronger baseline) reaches only 52% on 4-disk extreme conditions while taking 35× more training time.
Q: What is the key mechanism that makes neuro-symbolic AI robust? A: Re-planning on error. Ablation shows removing replanning drops success from 100% to 10%. Optimal planning (BFS) matters too — greedy planning drops to 50%.
Q: Why does the NS–BC gap scale with horizon? A: Behavioral cloning cannot recover from a single wrong move, and long horizons give more chances to go wrong. On the 8-puzzle, easy starts give a +9-11pp gap and hard starts a +17 to +33pp gap. The NS approach replans at each step, so wrong actions are corrected automatically.
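A back-of-envelope model makes the horizon effect vivid. If a one-shot policy errs on each step with probability ε (the 2% below is illustrative, not measured), it solves an H-step task with probability (1-ε)^H, while a replanning agent only needs to eventually recover from each error:

```python
eps = 0.02                         # assumed per-step error rate (illustrative)
for horizon in (7, 15, 63, 127):   # Hanoi optimal horizons for 3, 4, 6, 7 disks
    print(horizon, round((1 - eps) ** horizon, 3))
# 7 0.868 | 15 0.739 | 63 0.28 | 127 0.077
```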
Q: What does NOT work? A: Cross-N generalization (0%), memory + prediction features (100%→83%), prediction validation (→7%), TTA (63%→37%), Qwen 0.5B coding agent (malformed output). Complexity must be justified empirically.
Q: How does Neurely compare to VLA models like RT-2 or OpenVLA? A: Neurely uses 10K–160K parameters trained in 77s on CPU. The structural advantage of separating perception from planning is what matters, not model scale. Aligns with findings in The Price Is Not Right (ICRA 2026).
Q: What tasks have been validated? A: Towers of Hanoi (3–7 disks, 7–127 step horizons) and the 8-puzzle (181K states). 3,500+ episodes across 18 experiments with Wilson 95% confidence intervals.
Q: How do I reproduce the results?
A: git clone, then make full (complete pipeline) or make headline (5-disk 100% result). Training takes minutes on CPU; no GPU required. The 19 unit tests pass in ~3 seconds.
@software{neurely2026,
author = {Dan, Alexandru},
title = {Neurely: A Rapid-Iteration Neuro-Symbolic AI Harness},
year = {2026},
url = {https://github.com/danlex/neurely},
license = {MIT}
}

MIT License — Copyright (c) 2026 Alexandru Dan