Neurely — Neuro-symbolic AI beats end-to-end baselines on structured manipulation


Neurely is an open-source research harness that empirically validates a simple claim: on structured manipulation tasks (Towers of Hanoi, 8-puzzle), a neuro-symbolic pipeline — CNN perception, BFS symbolic planner, replanning on error — outperforms end-to-end behavioral cloning by up to +100 percentage points, using tiny models (10K–160K parameters) trained on CPU in under 80 seconds.

We also report extensive negative results: cross-N generalization, memory systems, predictive perception validation, test-time augmentation, and LLM-guided self-improvement all failed to improve the simple pipeline. The contribution is both the validation and the falsification.

Inspired by "The Price Is Not Right" (ICRA 2026).

Full results at neurely.alexandrudan.com


TL;DR Findings

  1. Neuro-symbolic (NS) beats behavioral cloning (BC) on long-horizon structured tasks — the advantage scales with horizon length, not noise alone.
  2. BC matches NS at short horizons with matched training — 10K/90K/357K-param BC all reach 100% on 4-disk Hanoi extreme when given the same noise curriculum NS uses. Training distribution matters, not model scale.
  3. NS pulls ahead on long horizons — on 8-puzzle hard starts (20+ optimal moves), BC plateaus at 60-79% while NS reaches 93-99% (+17 to +33pp gap). Against naive-trained BC, the Hanoi gap is ~100pp.
  4. Hanoi scaling holds to 127 steps — NS stays at 100% from 3 to 7 disks while taking only ~1.2× the optimal number of steps.
  5. Re-planning is the critical mechanism — removing it drops NS from 100% to 10% on Hanoi extreme.
  6. Simplicity wins — memory systems, predictive perception validation, cross-N generalization, and TTA all either failed to help or actively hurt performance.

Headline Results

All numbers reported with Wilson 95% confidence intervals.
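
For reference, the Wilson interval used throughout these tables fits in a few lines. This is a minimal sketch of the standard formula, not necessarily the exact code in neurely/metrics.py:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (max(0.0, center - half), min(1.0, center + half))

print(wilson_ci(200, 200))  # ~(0.981, 1.0)
```

This is why a perfect run still carries an interval: 200/200 successes has a Wilson lower bound of 98.1%, which is where entries like "100% [98.1%, 100%]" come from.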

Towers of Hanoi (extreme: noise σ=0.25, action failure=15%)

| Disks | Optimal Steps | NS Success | Baseline Success | Avg Steps (NS) |
|-------|---------------|------------|------------------|----------------|
| 3 | 7 | 100% | 100% | 7.4 |
| 4 | 15 | 100% [98.1%, 100%] | 0% | 18.2 |
| 5 | 31 | 100% [98.1%, 100%] | 0% | 37.0 |
| 6 | 63 | 100% [96%, 100%] | n/a | 74.9 |
| 7 | 127 | 100% [93%, 100%] | n/a | 154.2 |
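(The Optimal Steps column is simply 2^N - 1 moves for N disks: 7, 15, 31, 63, 127.)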

Cross-Task Validation: 8-Puzzle

| Start Difficulty | NS Success | Baseline Success | Gap |
|------------------|------------|------------------|-----|
| Easy (~5-7 optimal moves) | 98-100% | 89-92% | +8 to +11pp |
| Hard (20+ optimal moves) | 93-99% | 60-79% | +17 to +33pp |

Gap scales with horizon — the easy 8-puzzle result was misleading because shuffles of 15 random moves typically leave a short optimal path (~5-7 moves). Hard starts reveal the true gap, which is comparable in magnitude to Hanoi.

Baselines on 4-Disk Extreme

| Baseline | Training | 4-Disk Extreme Success |
|----------|----------|------------------------|
| Behavioral Cloning (10K params, naive training) | Clean data, light augmentation | 0% |
| Behavioral Cloning (10K params, + noise curriculum) | Matched NS training | 100% [96%, 100%] |
| Behavioral Cloning (90K params, + curriculum) | Matched NS training | 100% [96%, 100%] |
| Behavioral Cloning (357K params, + curriculum) | Matched NS training | 100% [96%, 100%] |
| DAgger (10K params, 5 rounds) | 46 min training | 52% |
| NS (ours, 10K params) | 77s training | 100% |

Honest take: the dramatic 0% → 100% jump for BC with the curriculum shows the naive BC failure was a training-setup artifact, not a structural limitation. On 4-disk Hanoi (15-step horizon), both approaches work when trained properly. The NS structural advantage only emerges at longer horizons, as demonstrated on 8-puzzle hard starts, where BC plateaus at 60-79% even with matched curriculum training.
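
To make "noise curriculum" concrete, here is a minimal sketch of the idea: ramp observation noise from clean up to the extreme evaluation level (σ = 0.25) over training. The linear schedule and the commented `loader`/`train_step`/`model` hooks are illustrative assumptions, not the exact recipe in scripts/train.py:

```python
import numpy as np

def curriculum_sigma(epoch: int, total_epochs: int, sigma_max: float = 0.25) -> float:
    """Linearly ramp Gaussian observation noise; full strength by mid-training."""
    return sigma_max * min(1.0, epoch / (0.5 * total_epochs))

def corrupt(obs: np.ndarray, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Additive pixel noise matching the extreme test condition (sigma = 0.25)."""
    return obs + rng.normal(0.0, sigma, size=obs.shape)

rng = np.random.default_rng(0)
for epoch in range(20):
    sigma = curriculum_sigma(epoch, total_epochs=20)
    # for obs, label in loader:                  # hypothetical data loader
    #     loss = train_step(model, corrupt(obs, sigma, rng), label)
    print(f"epoch {epoch:2d}: train with sigma = {sigma:.3f}")
```

The point of the curriculum is simply that the perception network sees the test-time noise distribution during training, which is exactly what the naive BC setup lacked.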


Ablation-Proven Mechanisms

| Component Removed | Success Rate | Impact |
|-------------------|--------------|--------|
| None (full system) | 100% | baseline |
| Re-planning | 10% | -90pp |
| Optimal planner (use greedy) | 50% | -50pp |
| Noise curriculum training | 54% | -46pp |
| Temporal consistency filter | 54% | -46pp |
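
The planner row deserves emphasis: BFS over the symbolic state graph returns provably shortest plans, which is what the greedy ablation gives up. Below is a self-contained sketch of BFS planning for Hanoi states; the state representation and function names are illustrative, and planner.py may differ:

```python
from collections import deque

State = tuple[tuple[int, ...], ...]  # three pegs, disks listed bottom-to-top

def legal_moves(state: State):
    """Yield (move, next_state): a disk may move onto an empty peg or a larger disk."""
    for src in range(3):
        if not state[src]:
            continue
        disk = state[src][-1]
        for dst in range(3):
            if dst != src and (not state[dst] or state[dst][-1] > disk):
                pegs = [list(p) for p in state]
                pegs[src].pop()
                pegs[dst].append(disk)
                yield (src, dst), tuple(tuple(p) for p in pegs)

def bfs_plan(start: State, goal: State):
    """Breadth-first search over the state graph: returns a shortest move list."""
    parent = {start: None}
    frontier = deque([start])
    while frontier:
        s = frontier.popleft()
        if s == goal:
            plan = []
            while parent[s] is not None:
                move, s = parent[s]
                plan.append(move)
            return plan[::-1]
        for move, nxt in legal_moves(s):
            if nxt not in parent:
                parent[nxt] = (move, s)
                frontier.append(nxt)
    return None  # goal unreachable

print(len(bfs_plan(((3, 2, 1), (), ()), ((), (), (3, 2, 1)))))  # 7 == 2**3 - 1
```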

What Does NOT Work (Honest Failures)

We report failed approaches to keep claims calibrated:

| Failed Approach | Result | Why It Fails |
|-----------------|--------|--------------|
| Cross-N generalization (train 3,4 → test 5,6) | 0% | Visual layout too different across N |
| Memory + prediction + temporal features | 100% → 83% | Added complexity without benefit |
| Prediction validation (override perception) | 100% → 7% | Bad graph predictions suppress correct perception |
| Test-time augmentation (TTA) | 63% → 37% | Adds noise on top of noisy input |
| Qwen 0.5B coding agent (local MLX) | Malformed JSON | Model too small for structured code generation |
| Ultra-tiny perception (<3K params) | 88% on 4-disk extreme | Insufficient capacity |

Core Finding: Simplicity Wins

Despite building a system with adaptive-depth perception, Cross-N queries, memory, predictive planning, and LLM-guided self-improvement, the simplest pipeline performed best:

Winning Architecture (validated)
├── CNN Perception (10K-160K params, noise-curriculum trained)
├── BFS Symbolic Planner (optimal)
├── Noisy Action Primitive
└── Replan on Error
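
In code, the whole control loop fits in a dozen lines. A minimal sketch of the closed loop follows; `env`, `perceive`, and `plan` are stand-ins for the env.py, perception.py, and planner.py interfaces, whose real signatures may differ:

```python
def run_episode(env, perceive, plan, goal, max_steps=400):
    """Closed-loop neuro-symbolic control. The plan is recomputed from the
    currently *observed* symbolic state, so a failed action (15% chance) or a
    perception slip is corrected on the next step instead of derailing the run."""
    for _ in range(max_steps):
        state = perceive(env.observe())  # CNN: noisy pixels -> symbolic state
        if state == goal:
            return True                  # solved
        path = plan(state, goal)         # BFS: shortest move sequence from here
        env.step(path[0])                # execute one noisy action primitive
    return perceive(env.observe()) == goal
```

Replanning at every step is the strongest form of "replan on error"; the ablation above shows that executing an initial plan open-loop instead drops success from 100% to 10%.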

Quick Start

git clone https://github.com/danlex/neurely.git
cd neurely

# One command
make full     # setup + data + train + experiment

# Or step-by-step
make setup && make data && make train && make experiment

# Reproduce 5-disk extreme (100% success)
make headline

# Cross-task validation
python scripts/puzzle_experiment.py --n-episodes 100

# Run tests (19 tests, ~3 seconds)
pytest tests/ -v

# Build the paper
make paper

Project Structure

neurely/                     # Core library (20 modules, ~2800 LOC)
├── env.py                   # Towers of Hanoi simulation
├── puzzle.py                # 8-puzzle (cross-task validation)
├── planner.py               # BFS symbolic planner
├── perception.py            # Fixed-N CNN perception
├── baseline.py              # End-to-end BC baseline
├── pipeline.py              # Validated NS pipeline (100%)
├── novel_arch.py            # Cross-N + Adaptive Depth (partial success)
├── memory.py                # STM + LTM + State Transition Graph (hurt)
├── prediction.py            # Predictive planner (harmful)
├── full_system.py           # Everything integrated
├── coding_agent.py          # LLM self-programming (tested: Qwen 0.5B failed)
├── robust_pipeline.py       # Temporal filter + re-observation
├── failure_mining.py        # Hard example mining
├── ensemble.py, mc_dropout.py, tta.py  # Alternatives explored
├── harder_tests.py          # Adversarial, DAgger, random starts
├── metrics.py               # Wilson CI tracking
└── evolve.py                # Architecture search

scripts/                     # Runnable experiments
├── generate_data.py
├── train.py
├── run_experiment.py
├── puzzle_experiment.py     # Cross-task validation
├── autonomous_loop.py       # Continuous experimentation
├── evolve_loop.py
├── visualize.py             # Main paper figures
├── visualize_extensions.py  # Cross-task + scaling figures
└── visualize_harder.py      # Horizon-gap figure

paper/
├── main.tex                 # Publication-ready LaTeX (~1300 lines)
├── references.bib           # 20+ references
└── figures/                 # 9 PDF + PNG plots

docs/                        # GitHub Pages (neurely.alexandrudan.com)
└── index.html

tests/
└── test_pipeline.py         # 19 unit tests

results/                     # All experiment JSON
├── mindmap.md               # Exploration log
├── 4disk_comparison.json
├── 5disk_comparison.json
├── 6disk_scaling.json
├── 7disk_scaling.json
├── puzzle_experiment.json   # Easy 8-puzzle
├── puzzle_harder.json       # Hard 8-puzzle
├── ablation.json
├── planner_comparison.json
└── ... (14 total JSON files)

Experiments Conducted (3,500+ total episodes)

| Experiment | Episodes | Key Finding |
|------------|----------|-------------|
| 3-disk battery | 250 | Both 100% — task too easy |
| 4-disk battery | 300 | NS 100%, BC 0-10% under noise |
| 5-disk battery | 250 | NS 100% under curriculum |
| 6-disk scaling | 300 | NS 100% at 63-step horizon |
| 7-disk scaling | 150 | NS 100% at 127-step horizon |
| Ablation: no-replan | 100 | 100% → 10-44% |
| Planner: BFS vs greedy | 300 | 100% vs 50% |
| Architecture comparison | 400 | Deep CNN (23K params) optimal |
| DAgger baseline | 100 | 52%, 35× slower |
| Adversarial noise | 400 | NS 2× more step-efficient |
| Random starts | 100 | NS 100%, BC 16% |
| Noise curriculum | 200 | Lifts success 22% → 100% |
| Statistical validation | 400 | 100% success, [98.1%, 100%] CI confirmed |
| Full system ablation | 150 | Memory/prediction hurt performance |
| Cross-N generalization | 240 | Failed (0% on unseen N) |
| Self-improving loop | 150 | 0% → 52% after 5 rounds |
| Cross-task: easy 8-puzzle | 400 | NS +9-11pp |
| Cross-task: hard 8-puzzle | 450 | NS +17-33pp (horizon matters) |

Research Paper

Publication-ready LaTeX paper with 20+ references. Build with:

cd paper && pdflatex main && bibtex main && pdflatex main && pdflatex main

The paper includes:

  • Main comparison (NS vs BC on Hanoi)
  • Ablation study (re-planning, planner optimality)
  • Cross-task validation (8-puzzle)
  • Scaling analysis (3 → 7 disks)
  • Negative results section (Cross-N, memory features, TTA, prediction validation)
  • Honest limitations

Frequently Asked Questions

Q: Does neuro-symbolic AI beat end-to-end approaches? A: Yes, on structured manipulation tasks. Neurely shows the simple neuro-symbolic pipeline achieves 100% success while behavioral cloning collapses to 0% under observation noise on Towers of Hanoi. DAgger (a stronger baseline) reaches only 52% on 4-disk extreme conditions while taking 35× more training time.

Q: What is the key mechanism that makes neuro-symbolic AI robust? A: Re-planning on error. Ablation shows removing replanning drops success from 100% to 10%. Optimal planning (BFS) matters too — greedy planning drops to 50%.

Q: Why does the NS–BC gap scale with horizon difficulty? A: Behavioral cloning cannot recover from a single wrong move over long horizons. On the 8-puzzle, easy starts give a +9-11pp gap while hard starts give +17 to +33pp. The NS approach replans at every step, so a wrong action is corrected on the next one.
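
A back-of-the-envelope model makes the scaling intuitive (illustrative only, not fitted to the BC baseline): if an open-loop policy must get essentially every step right and is correct ~97% of the time per step, it completes about 0.97^15 ≈ 63% of 15-step episodes but only 0.97^127 ≈ 2% of 127-step ones. A replanning policy instead needs errors to be recoverable rather than absent, so its success rate barely degrades with horizon.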

Q: What does NOT work? A: Cross-N generalization (0%), memory + prediction features (100%→83%), prediction validation (→7%), TTA (63%→37%), Qwen 0.5B coding agent (malformed output). Complexity must be justified empirically.

Q: How does Neurely compare to VLA models like RT-2 or OpenVLA? A: Neurely uses 10K–160K parameters trained in 77s on CPU. The structural advantage of separating perception from planning is what matters, not model scale. This aligns with the findings of "The Price Is Not Right" (ICRA 2026).

Q: What tasks have been validated? A: Towers of Hanoi (3–7 disks, 7–127 step horizons) and the 8-puzzle (181K states). 3,500+ episodes across 18 experiments with Wilson 95% confidence intervals.

Q: How do I reproduce the results? A: git clone, make full (complete pipeline) or make headline (5-disk 100% result). Training in minutes on CPU, no GPU required. 19 unit tests pass in ~3 seconds.


Citation

@software{neurely2026,
  author = {Dan, Alexandru},
  title = {Neurely: A Rapid-Iteration Neuro-Symbolic AI Harness},
  year = {2026},
  url = {https://github.com/danlex/neurely},
  license = {MIT}
}

License

MIT License — Copyright (c) 2026 Alexandru Dan
