A framework for visual intelligence that operates below language. Videos are mapped to numerical valence vectors on domain-specific outcome axes. Zero is homeostasis. Everything above zero is cost.
No words anywhere in the pipeline.
Current visual AI maps pixels to words. Biology does the opposite: a gazelle flees before it knows the word "lion." A firefighter reads a blaze and acts before articulating why. The evaluation is pre-linguistic, operating on outcome-relevant axes (threat, speed, containment, momentum) through threshold dynamics, not categorization.
vScore formalizes this as:
pixels → encoder → valence scores → trajectory projection → threshold trigger → [language, optionally]
Built on V-JEPA 2 (LeCun et al., Meta FAIR) as the frozen visual encoder. If the encoder learns to see without words, what should the evaluation layer look like? Our answer: biological valence scoring.
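The pipeline can be sketched in a few lines. This is an illustrative toy, not the library's actual API: `encode`, `valence_head`, and `threshold_trigger` are stand-ins for the frozen encoder, the trained valence head, and the trigger layer.

```python
import numpy as np

def encode(frames: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen visual encoder (e.g. V-JEPA 2): frames -> embedding."""
    return frames.reshape(frames.shape[0], -1).mean(axis=0)

def valence_head(embedding: np.ndarray, n_axes: int = 6) -> np.ndarray:
    """Linear probe from embedding to per-axis valence scores (0 = homeostasis)."""
    rng = np.random.default_rng(0)           # fixed weights for the sketch
    W = rng.normal(size=(n_axes, embedding.shape[0]))
    return np.abs(W @ embedding)             # every non-zero score is a cost

def threshold_trigger(scores: np.ndarray, limit: float = 5.0) -> bool:
    """Fire when any axis crosses its threshold -- no words involved."""
    return bool((scores > limit).any())

frames = np.random.default_rng(1).random((16, 8, 8))  # 16 tiny fake frames
scores = valence_head(encode(frames))
print(threshold_trigger(scores))
```

Language would attach, optionally, only after the trigger fires.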
213 videos, 13 categories, 2 datasets (Kinetics-mini, HACS). Within-domain MAE: 0.79 on a 0-10 scale.
Cross-domain transfer (train on 12 categories, test on held-out 13th):
| Axis | Mean MAE | Transfers (of 13 domains) | Status |
|---|---|---|---|
| coordination | 1.45 | 11/13 (85%) | Universal |
| impact | 1.47 | 9/13 (69%) | Universal |
| speed | 1.74 | 8/13 (62%) | Universal |
| verticality | 2.16 | 7/13 (54%) | Semi-universal |
| precision | 2.56 | 6/13 (46%) | Semi-universal |
| tension | 3.13 | 5/13 (38%) | Domain-specific |
Coordination, impact, and speed are universal visual primitives that transfer to unseen domains. Tension is genuinely domain-specific.
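The leave-one-domain-out protocol behind the table can be sketched as follows, on synthetic features and labels; a least-squares probe stands in for the trained valence head.

```python
import numpy as np

# 13 fake domains, each with 20 clips of 4-dim features and 0-10 labels
rng = np.random.default_rng(0)
domains = {f"domain_{i}": (rng.random((20, 4)), rng.random((20, 1)) * 10)
           for i in range(13)}

def fit_linear(X, y):
    # least-squares probe standing in for the trained valence head
    w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
    return w

def mae(X, y, w):
    pred = np.c_[X, np.ones(len(X))] @ w
    return float(np.abs(pred - y).mean())

results = {}
for held_out in domains:
    train = [d for d in domains if d != held_out]
    Xtr = np.vstack([domains[d][0] for d in train])
    ytr = np.vstack([domains[d][1] for d in train])
    results[held_out] = mae(*domains[held_out], fit_linear(Xtr, ytr))

print(f"mean held-out MAE: {np.mean(list(results.values())):.2f}")
```

An axis counts as "transferring" to a domain when its held-out MAE stays below a fixed cutoff.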
```bash
pip install torch transformers av numpy
git clone https://github.com/Tennisee-data/vScore.git
cd vScore
python -m vScore.demo
```

```bash
python -m vScore.demo_actions
```

Shows how four vectors with identical magnitude (~10.0) produce four different actions (flee, fight, bond, freeze) depending on direction. Magnitude is arousal. Direction is meaning.

```bash
python -m vScore.demo_multimodal
```

Same scoring mechanism across vision and audio. A locomotive approaching: vision scores fear=4, audio scores fear=9. The word "locomotive" is never needed.

```bash
python -m vScore.demo_dual_system
```

vScore gates the LLM. Clear threats are handled in milliseconds without words. Ambiguous situations escalate to linguistic reasoning. "I flinched before I knew why" = vScore fired before the LLM started.
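The gating logic amounts to a few lines. This is a hedged sketch, not the repo's `dual_system` module; `call_llm` is a hypothetical placeholder for the slow linguistic path.

```python
def call_llm(scores: dict) -> str:
    # hypothetical stand-in; the real system would prompt a language model
    return f"ambiguous scene, max axis {max(scores, key=scores.get)}"

def dual_system(scores: dict, threshold: float = 7.0):
    """Fast path: act directly when an axis is clearly past threshold.
    Slow path: escalate ambiguous readings to linguistic reasoning."""
    peak_axis = max(scores, key=scores.get)
    if scores[peak_axis] >= threshold:
        return ("reflex", peak_axis)          # milliseconds, no words
    return ("deliberate", call_llm(scores))   # words enter only here

print(dual_system({"threat": 9.2, "speed": 3.0}))  # clear threat: reflex
print(dual_system({"threat": 4.1, "speed": 3.0}))  # ambiguous: deliberate
```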
```bash
# Extract features (one-time, cached forever)
python -m vScore.extract_batch

# Train and run cross-domain transfer
python -m vScore.train_v2
```

```
vScore/
├── core/
│   ├── metaclass.py        # ScoredDomain metaclass
│   ├── valence.py          # ValenceVector, ValenceTrajectory
│   ├── threshold.py        # Dynamic threshold triggering
│   ├── action_space.py     # Geometric action inference
│   ├── memory_bayesian.py  # Posterior-aware experience memory
│   ├── multimodal.py       # Cross-modal fusion in valence space
│   ├── dual_system.py      # vScore + LLM integration
│   └── prosody.py          # Voice intonation scoring
├── domains/
│   ├── survival.py         # Panksepp's 7 primal circuits
│   ├── fire.py             # Firefighting
│   ├── hockey.py           # Sport dynamics
│   ├── trading.py          # Market dynamics
│   ├── weather.py          # Atmospheric conditions
│   ├── sound.py            # Auditory threat/safety
│   ├── industrial.py       # Machine monitoring
│   └── music.py            # Affective musical scoring
├── encoder/
│   └── bridge.py           # V-JEPA 2 to valence head
├── projection/
│   └── temporal.py         # Trajectory projection
└── paper/
    └── vscore_paper.md     # Full paper
```
A new domain is 10 lines of Python:

```python
from vScore.core.metaclass import DomainBase

class Surgery(DomainBase):
    axes = [
        "bleeding",               # Hemorrhage severity
        "tissue_exposure",        # Surgical field visibility
        "instrument_proximity",   # Distance to critical structures
        "patient_stability",      # Vital sign deviation
        "time_pressure",          # Urgency of completion
    ]
```

The encoder, valence head, action inference, trajectory projection, threshold triggering, and Bayesian memory all work identically on this new domain without any code changes.
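What the framework does with those declared axes can be sketched without the library: one linear valence head per axis. The names and weights below are illustrative, not the repo's actual internals.

```python
# Hypothetical sketch: a domain's axes drive per-axis scoring.
AXES = ["bleeding", "tissue_exposure", "instrument_proximity",
        "patient_stability", "time_pressure"]

def score_clip(features, weights):
    """One linear valence head per axis; 0 is homeostasis, so scores clamp at 0."""
    return {axis: max(0.0, sum(f * w for f, w in zip(features, ws)))
            for axis, ws in zip(AXES, weights)}

features = [0.2, 0.9, 0.1]                 # toy clip embedding
weights = [[1.0, 2.0, 0.5], [0.1, 0.1, 0.1], [3.0, 0.0, 0.0],
           [0.0, 1.0, 1.0], [0.5, 0.5, 0.5]]
scores = score_clip(features, weights)
print(scores["bleeding"])   # 0.2*1.0 + 0.9*2.0 + 0.1*0.5 = 2.05
```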
Zero is homeostasis. Every non-zero score is a cost. The system exists to return to zero.
Direction, not magnitude. Four vectors with the same energy produce four different actions depending on which axes are activated. A classifier says "high activation" for all four. The valence scorer says flee, fight, bond, or freeze.
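A minimal sketch of that geometry, with an illustrative axis-to-action map (the real action inference lives in `action_space.py`):

```python
import math

# Four valence vectors with the same magnitude but different directions.
ACTIONS = {"threat_distant": "flee", "threat_close": "fight",
           "affiliation": "bond", "threat_inescapable": "freeze"}

def act(vector: dict) -> tuple:
    magnitude = math.sqrt(sum(v * v for v in vector.values()))  # arousal
    direction = max(vector, key=vector.get)                     # meaning
    return ACTIONS[direction], round(magnitude, 1)

for axis in ACTIONS:
    # one axis at 10.0, the rest at 0.0: same energy, four actions
    print(act({a: (10.0 if a == axis else 0.0) for a in ACTIONS}))
```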
Trajectory, not snapshot. The system projects where each axis is heading and acts on the projection. A fire at intensity=5 that is accelerating triggers earlier than a fire at intensity=7 that is stable.
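The fire example reduces to projecting each axis forward and triggering on the projection. A sketch with a simple finite-difference velocity (the real projection lives in `projection/temporal.py`):

```python
def projected_trigger(history, threshold=8.0, horizon=3.0):
    """Trigger on where an axis is heading, not where it is now."""
    velocity = history[-1] - history[-2]      # finite-difference rate of change
    projected = history[-1] + velocity * horizon
    return projected >= threshold

print(projected_trigger([3.0, 4.0, 5.0]))   # intensity 5, rising -> triggers
print(projected_trigger([7.0, 7.0, 7.0]))   # intensity 7, stable -> does not
```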
Bayesian memory. Experiences are stored, recalled, and selectively forgotten based on their statistical contribution to the posterior. Surprising events are kept. Routine ones are discarded. The prior sharpens over time. Replay recomputes all retention scores against the current posterior to eliminate sequential bias.
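The retention rule can be sketched with a running Gaussian standing in for the posterior: keep events that are unlikely under the current estimate, drop routine ones. Names and thresholds here are illustrative, not the `memory_bayesian.py` internals.

```python
import math

class SurpriseMemory:
    def __init__(self, keep_threshold: float = 2.0):
        self.events, self.n, self.mean, self.m2 = [], 0, 0.0, 0.0
        self.keep_threshold = keep_threshold

    def observe(self, x: float):
        # Welford update of the running mean/variance (the "posterior")
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0
        surprise = abs(x - self.mean) / (std or 1.0)
        if surprise > self.keep_threshold:   # surprising -> remember
            self.events.append(x)

mem = SurpriseMemory()
for x in [5.0, 5.1, 4.9, 5.0, 5.05, 12.0]:  # routine values, then an outlier
    mem.observe(x)
print(mem.events)   # only the outlier is retained -> [12.0]
```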
Modality-agnostic. The valence vector is the universal interface. Vision, audio, and any future modality produce vectors in the same space. Fusion happens in valence space, not feature space. Modality conflict (audio says danger, vision says safe) is itself a diagnostic signal.
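A sketch of fusion in valence space, using the locomotive example: each modality emits scores on the same axes, and large disagreement is surfaced as a conflict signal. The pessimistic max-fusion rule is an illustrative choice, not necessarily the repo's.

```python
def fuse(vision: dict, audio: dict, conflict_limit: float = 3.0):
    fused, conflicts = {}, []
    for axis in vision:
        fused[axis] = max(vision[axis], audio[axis])   # pessimistic fusion
        if abs(vision[axis] - audio[axis]) > conflict_limit:
            conflicts.append(axis)                     # disagreement is a signal
    return fused, conflicts

vision = {"fear": 4.0, "speed": 6.0}   # the locomotive looks distant
audio  = {"fear": 9.0, "speed": 6.5}   # but it sounds close
print(fuse(vision, audio))   # fear fused to 9.0; "fear" flagged as a conflict
```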
Language is optional. Words enter at the last layer as a lookup table. The intelligence lives in levels 0-2. Level 3 is serialization for human consumption.
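At its simplest, that last layer is a dictionary. A toy sketch with a hypothetical lexicon:

```python
# Language as a final, optional lookup over the dominant axis.
LEXICON = {"threat": "danger", "speed": "fast", "containment": "contained"}

def verbalize(scores: dict) -> str:
    axis = max(scores, key=scores.get)
    return LEXICON.get(axis, axis)   # serialization for human consumption only

print(verbalize({"threat": 8.0, "speed": 2.0}))   # -> danger
```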
- V-JEPA 2 (Bardes, LeCun et al., Meta FAIR) for visual encoding
- Panksepp (1998) for the primal affective circuits
- LeCun (2022) for the JEPA architecture and the argument that LLMs are insufficient for world understanding
- HACS (Zhao et al., 2019) for action video data
Preliminary research. The approach shows promise but more extensive experiments are needed: larger datasets, real expert annotations, audio encoder integration, and temporal sequence testing. See the paper for full discussion.
MIT