GitHub - RonaldSit/robovet: Vet your robot datasets — 37 checks, repair & scoring for LeRobot data. Know it's UNSAFE TO TRAIN before wasting the run. Vets hf:// repos from 82 KB.

 _____   ____  ____   ______      ________ _______
|  __ \ / __ \|  _ \ / __ \ \    / /  ____|__   __|
| |__) | |  | | |_) | |  | \ \  / /| |__     | |
|  _  /| |  | |  _ <| |  | |\ \/ / |  __|    | |
| | \ \| |__| | |_) | |__| | \  /  | |____   | |
|_|  \_\\____/|____/ \____/   \/   |______|  |_|

vet your robot datasets — before you waste the training run


You spent an evening teleoperating a robot. Before you spend a GPU-day training on those episodes, spend 30 seconds making sure they aren't lying to you. This is what a lying dataset looks like:

$ robovet doctor ./my_dataset

  FAIL DATA-104   1 episode where metadata 'length' disagrees with the parquet
                  row count — the classic signature of a corrupted episode map.
  FAIL STATS-302  1 stat block disagrees with the actual data — every training
                  run normalizes with these numbers.
  WARN TIME-202   Loading this dataset requires tolerance_s ≥ 7.7e-03
                  (77× the default). Worst: episode 2, 7.29 ms off the grid.
  FAIL META-502   Σ episode lengths = 1086 but info.json total_frames = 1037 —
                  the metadata contradicts itself before a single file is read.

  5 fail · 4 warn · 23 pass
  UNSAFE TO TRAIN — fix the FAILs first.        (exit code 1 — CI-gate it)

Quick start (60 seconds, no robot needed)

pip install "robovet[video]"

robovet demo ./demo      # builds a fake dataset with 10 real-world defects
robovet doctor ./demo    # catches all of them, tells you which episode, exits 1
robovet fix ./demo --apply   # repairs the metadata problems (.bak backups)
robovet doctor ./demo    # the metadata FAILs are gone

Want to see what healthy looks like? robovet demo ./d --clean builds the same dataset with zero defects. There's a v3 flavor too: robovet demo ./d3 --v3.

How you'll actually use it

① You just finished recording. Run robovet doctor ./my_task. Green means train. Red means it tells you exactly which episodes are broken and why — in plain English, with the issue number it reproduces. Most metadata problems are one robovet fix ./my_task --apply away (it backs everything up as .bak first).

② You found a dataset on the Hub and don't want to download 4 GB to find out it's broken.

pip install "robovet[hub]"
robovet doctor hf://lerobot/svla_so100_pickplace

This pulls only the meta/ folder (usually under 1 MB) and cross-checks the dataset's own ledger: does the episode↔frame index math add up, do the counters match, are the per-episode stats stale, do the video time windows fit. The nastiest corruption class (lerobot#2401) is visible from metadata alone. To be clear about what this can't see: values, timestamps and video decoding still need the files, so a remote pass says META CLEAN, never CLEAN. (--meta-only also works on local paths when you want a one-second pre-check.)
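To make the "ledger" idea concrete, here is a minimal sketch of that kind of arithmetic. It is an illustration, not robovet's implementation: the function `check_ledger` and the field names `length`, `from_index`, `to_index` and `total_frames` are assumptions picked for the example, and real LeRobot metadata may name or store them differently.

```python
from typing import Dict, List


def check_ledger(episodes: List[Dict[str, int]], total_frames: int) -> List[str]:
    """Return human-readable problems found in an episode ledger.

    Hypothetical layout: each episode dict carries `length`, `from_index`
    (inclusive) and `to_index` (exclusive) into the global frame table.
    """
    problems = []
    expected_start = 0
    for i, ep in enumerate(episodes):
        # The index window must cover exactly `length` frames.
        span = ep["to_index"] - ep["from_index"]
        if span != ep["length"]:
            problems.append(f"episode {i}: length={ep['length']} but index span={span}")
        # Episodes must tile the frame table with no gaps or overlaps.
        if ep["from_index"] != expected_start:
            problems.append(
                f"episode {i}: starts at {ep['from_index']}, expected {expected_start}"
            )
        expected_start = ep["to_index"]
    # The global counter must agree with the per-episode lengths.
    total = sum(ep["length"] for ep in episodes)
    if total != total_frames:
        problems.append(f"sum of episode lengths = {total} but total_frames = {total_frames}")
    return problems


episodes = [
    {"length": 3, "from_index": 0, "to_index": 3},
    {"length": 5, "from_index": 3, "to_index": 7},  # window holds 4 frames, not 5
]
for line in check_ledger(episodes, total_frames=7):
    print(line)
```

None of this needs the parquet or video files, which is why a metadata-only pass can be so small.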

③ You want bad data blocked before it reaches your team's training runs. robovet doctor exits 1 on any FAIL, so CI can gate dataset merges the same way Codecov gates coverage:

name: robovet
on: [push, pull_request]
jobs:
  vet:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install "robovet[video]"
      - run: robovet doctor ./datasets/my_task   # FAIL blocks the merge
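If your CI isn't GitHub Actions, the same gate can be driven from a small script. This sketch relies only on the behavior stated above (a non-zero exit code on any FAIL); the helper name `gate` is hypothetical, and the demo uses the Python interpreter as a stand-in command so it runs without robovet installed.

```python
import subprocess
import sys
from typing import Sequence


def gate(command: Sequence[str]) -> bool:
    """Run a vetting command and report whether it passed (exit code 0)."""
    completed = subprocess.run(command)
    return completed.returncode == 0


# Stand-in commands that exit 0 and 1, mimicking a clean and a failing run.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]
print(gate(passing), gate(failing))
```

In a real pipeline you would call `gate(["robovet", "doctor", "./datasets/my_task"])` and stop the job when it returns `False`.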

④ You want to drop your worst episodes before training.

robovet score ./my_task --worst 10     # the 10 episodes to look at first
robovet score ./my_task --csv scores.csv

Every episode gets a 0–100 score from cheap, fast signals computed in one pass: jerky motion, long idle stretches, gripper chatter, weird durations, saturated actions, exact duplicates. It's a triage list, not a judge — look at the flagged episodes yourself before deleting anything. (The 2026 curation papers — rinse, Demo-SCORE, QoQ — all argue for exactly this kind of cheap smoothness-first pass before any expensive policy-based filtering.)
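For a feel of what "cheap, fast signals" can look like, here is a rough sketch of two of them: jerkiness and idle time on a single joint trajectory. The function names, the idle threshold `eps` and the formulas are assumptions for illustration; this is not robovet's scoring formula, and it does not produce the 0–100 score.

```python
from typing import List


def mean_abs_jerk(positions: List[float], fps: float) -> float:
    """Mean absolute third finite difference of a 1-D trajectory."""
    dt = 1.0 / fps
    jerks = [
        abs(positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i])
        / dt**3
        for i in range(len(positions) - 3)
    ]
    return sum(jerks) / len(jerks) if jerks else 0.0


def idle_fraction(positions: List[float], eps: float = 1e-4) -> float:
    """Fraction of steps where the trajectory moves less than eps."""
    steps = [abs(b - a) for a, b in zip(positions, positions[1:])]
    return sum(s < eps for s in steps) / len(steps) if steps else 0.0


trajectories = {
    "smooth": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "jerky": [0.0, 0.3, 0.0, 0.3, 0.0, 0.3],
    "idle": [0.0, 0.0, 0.0, 0.0, 0.5, 1.0],
}
# Worst-first triage list: highest jerk at the top.
for name in sorted(trajectories, key=lambda n: mean_abs_jerk(trajectories[n], 30.0), reverse=True):
    traj = trajectories[name]
    print(name, round(mean_abs_jerk(traj, 30.0), 1), idle_fraction(traj))
```

Signals like these only rank episodes for a human to inspect, which matches the triage-not-judge stance above.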

And when something goes wrong mid-training, start here:

Saw this error? Run this

You hit	Look at	You get
`ValueError: timestamps … tolerance_s` on load	TIME-202	the exact minimal `tolerance_s`, and which episode is worst
wrong frames / IndexError after a v2→v3 conversion	DATA-104/105 + META-501	which episodes' ledgers lie, cross-checked three ways
TorchCodec/AV1 decode errors	VIDEO-403	per-camera codec tiers and what to re-encode
`loss=NaN` out of nowhere	DATA-107 + STATS-302	NaN/Inf locations and stale normalization stats
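The "exact minimal `tolerance_s`" in the first row can be sketched as follows. This assumes the loader's check compares each gap between consecutive timestamps inside an episode to `1/fps`; that is an assumption about the check, and `minimal_tolerance` is a hypothetical helper, not robovet's API.

```python
from typing import Dict, List, Tuple


def minimal_tolerance(
    timestamps_by_episode: Dict[int, List[float]], fps: float
) -> Tuple[float, int]:
    """Return (largest deviation of a frame gap from 1/fps, episode where it occurs)."""
    period = 1.0 / fps
    worst_dev, worst_ep = 0.0, -1
    for ep, ts in timestamps_by_episode.items():
        # Compare every consecutive gap to the ideal frame period.
        for a, b in zip(ts, ts[1:]):
            dev = abs((b - a) - period)
            if dev > worst_dev:
                worst_dev, worst_ep = dev, ep
    return worst_dev, worst_ep


timestamps = {
    0: [0.0, 0.1, 0.2],   # on the 10 fps grid
    1: [0.0, 0.1, 0.25],  # last frame arrives 50 ms late
}
dev, ep = minimal_tolerance(timestamps, fps=10.0)
print(f"needs tolerance_s >= {dev:.3g} (worst: episode {ep})")
```

Any `tolerance_s` below the returned deviation would reject the dataset under that assumed check; anything at or above it would load.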

Why this exists

Robot learning's bottleneck moved from models to data, and the data is quietly broken. An April 2026 audit of 10 popular open robot datasets found floating-point drift that breaks video decoding after ~45 episodes, a v2.1→v3.0 conversion bug that silently scrambles which frames belong to which episode (training "works" — on jumbled sequences), and datasets that only load with tolerance_s cranked to 100× the default. Hugging Face's own cleanup of community datasets found 111 of 240 failed validation — and that pipeline is internal; you can't run it on yours. Meanwhile everyone agrees a well-curated 500-demo fine-tune beats a sloppy one 10× the size. The missing piece is tooling, and that's what this is.

Every check maps to a documented, real-world failure — the lerobot issue numbers are right there in the table below.

What it checks

Group	Catches	Maps to
`STRUCT-0xx`	missing/invalid metadata, dangling episodes, orphan files	lerobot#761 (no validator for hand-rolled conversions)
`DATA-1xx`	episode↔frame mapping corruption, schema drift, NaN/Inf, dead dims	lerobot#2401 (silent v2.1→v3.0 corruption)
`TIME-2xx`	off-grid timestamps with the exact `tolerance_s` you'd need, non-monotonic time, cumulative FP drift	lerobot#933, lerobot#3177
`STATS-3xx`	stored normalization stats that disagree with the data, broken quantile stats (q01/q99)	HF docs warning; phospho repair post; lerobot#2189
`META-5xx`	the dataset's ledger contradicting itself — works without downloading the data	lerobot#2401 class, caught from metadata alone
`VIDEO-4xx`	video/parquet frame-count desync — including per-episode windows inside shared v3 files, codec tiers (h264 ✓ / AV1 info — it's lerobot's own default / mpeg4-hevc warn), fps mismatch	Correll-lab postmortem; phospho notes
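As an illustration of the `STATS-3xx` idea, the sketch below recomputes simple statistics for one feature dimension and reports which stored values disagree. The helper `stale_stats`, the key names, the tolerance and the use of population standard deviation are all assumptions for the example, not robovet's actual rules.

```python
import math
from typing import Dict, List


def stale_stats(values: List[float], stored: Dict[str, float], rel_tol: float = 1e-3) -> List[str]:
    """Names of stored statistics that disagree with the recomputed ones."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std (assumed)
    actual = {"mean": mean, "std": std, "min": min(values), "max": max(values)}
    return [
        key
        for key, value in actual.items()
        if key in stored
        and not math.isclose(stored[key], value, rel_tol=rel_tol, abs_tol=1e-9)
    ]


values = [1.0, 2.0, 3.0, 4.0]
stored = {"mean": 2.5, "std": 1.118034, "min": 1.0, "max": 9.0}  # max is stale
print(stale_stats(values, stored))
```

A stale `max` or `std` like this is exactly the kind of number a training run would silently normalize with.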

What `fix` will never do to your data

robovet fix is dry-run by default. With --apply it only rewrites metadata — episode lengths, normalization stats, info.json counters. It backs up every file it touches as .bak, it never modifies parquet or video payloads, and it preserves everything it doesn't understand: your quantile keys, image-stat blocks, episode tags. A repair tool must never be the thing that deletes your data, and the test suite enforces every one of these promises. Frame surgery (trimming desynced tails, re-gridding timestamps) is planned under the same rules.
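The promises above (dry-run by default, back up first, keep what you don't understand) can be sketched for a single JSON metadata file. This is a minimal illustration under assumptions, not robovet's code: `patch_json` is a hypothetical helper, and the file contents are made up for the demo.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Any, Dict


def patch_json(path: Path, updates: Dict[str, Any], apply: bool = False) -> Dict[str, Any]:
    """Return the keys that would change; touch the disk only when apply=True."""
    data = json.loads(path.read_text())
    changes = {k: v for k, v in updates.items() if data.get(k) != v}
    if apply and changes:
        # Back up the original next to it before writing anything.
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        # Only the repaired keys change; unknown keys are left untouched.
        data.update(changes)
        path.write_text(json.dumps(data, indent=2))
    return changes


with tempfile.TemporaryDirectory() as tmp:
    info = Path(tmp) / "info.json"
    info.write_text(json.dumps({"total_frames": 1037, "custom_tag": "keep me"}))
    print(patch_json(info, {"total_frames": 1086}))              # dry run, file unchanged
    print(patch_json(info, {"total_frames": 1086}, apply=True))  # writes info.json.bak first
    print(json.loads(info.read_text()))
```

Returning the planned changes from the dry run is what lets you review a repair before committing to it.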

Honest limits

v2.0/v2.1 and v3.x are both fully supported for diagnosis (each has its own fixture and tests; v3 gets per-episode video alignment inside shared files plus per-episode stats checks). fix currently repairs v2.x metadata; v3 stats regeneration is planned.
robovet doesn't merge, split or delete episodes — lerobot does that natively now. This tool does what the official stack doesn't: deep validation, metadata repair, and quality triage.
Local-first. Your data never leaves your disk.

Use it from Python

from robovet import load_dataset, run_doctor, score_dataset

ds  = load_dataset("./my_dataset")
rep = run_doctor(ds)                     # rep.exit_code, rep.results, rep.counts
sc  = score_dataset(ds, scan=rep.scan)   # reuses the same single IO pass

Apache-2.0. Issues and broken-dataset war stories are very welcome — if your dataset breaks in a way robovet doesn't catch, that's exactly the bug report we want.
