This repository contains the official OpenClaw implementation of VideoARM, accepted at CVPR 2026. VideoARM is a coarse-to-fine video reasoning paradigm over hierarchical multimodal memory (HM3) for long-form video understanding. Using the proposed complementary toolsets and HM3, VideoARM progressively localizes, interprets, and abstracts evidence in an adaptive observe–think–act–memorize loop. Extensive experiments on widely used long-video understanding benchmarks show that VideoARM maintains strong performance while significantly reducing token consumption.
    Main Session (clean context)
    └── spawn VideoQA Controller
        ├── videoarm-download → download YouTube / URL
        ├── videoarm-info → get metadata (fps, duration, frames)
        ├── videoarm-audio → transcribe audio segments
        ├── videoarm-extract-frames → extract frame grids
        ├── spawn Image Analyzer → analyze frames in clean context
        └── return final answer
The controller runs in an isolated sub-agent to keep the main session context clean. Frame analysis is delegated to further sub-agents — no vision API keys needed, just OpenClaw's built-in image understanding.
Prerequisites:
git clone https://github.com/qiankemeng/VideoARM-skill.git
cd VideoARM-skill
# Basic install
pip install -e .
# With audio transcription
pip install -e ".[audio]"
# With YouTube download
pip install -e ".[download]"
# Everything
pip install -e ".[all]"

After installing, verify your environment:

videoarm-doctor

This checks Python version, ffmpeg, yt-dlp, Python packages, and Whisper model status.
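The kinds of checks listed above can be approximated in a few lines of Python. This is a minimal sketch, not the `videoarm-doctor` source: the function name `check_environment`, the minimum Python version, and the default binary and module names are assumptions chosen for illustration.

```python
import importlib.util
import shutil
import sys


def check_environment(min_python=(3, 9), binaries=("ffmpeg", "yt-dlp"), modules=("cv2",)):
    """Return a dict mapping each check name to True (found) or False (missing).

    All defaults are illustrative assumptions, not the project's real requirements.
    """
    results = {"python": sys.version_info >= min_python}
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH.
        results[f"binary:{name}"] = shutil.which(name) is not None
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        results[f"module:{name}"] = importlib.util.find_spec(name) is not None
    return results


if __name__ == "__main__":
    for check, ok in check_environment().items():
        print(f"{'OK     ' if ok else 'MISSING'} {check}")
```

A script like this is only useful as a quick pre-check; `videoarm-doctor` remains the authoritative diagnostic, since it also reports Whisper model status.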
# Install
git clone https://github.com/qiankemeng/VideoARM-skill.git
cd VideoARM-skill && pip install -e ".[all]"
# Verify installation
videoarm-doctor
# Try it out
videoarm-info /path/to/video.mp4
videoarm-audio /path/to/video.mp4 --start 0 --end 60
videoarm-extract-frames --video /path/to/video.mp4 \
--ranges '[{"start_frame":0,"end_frame":1500}]' --num-frames 20

VideoARM works with any video source that yt-dlp supports, including:
- YouTube
- Bilibili (哔哩哔哩)
- Twitter/X
- TikTok / 抖音
- Local files (.mp4, .mkv, .avi, etc.)
For region-restricted content, set HTTPS_PROXY in your .env file.
Audio transcription works out of the box — faster-whisper is included as a default dependency. No API keys needed.
On first run, the Whisper base model (~150MB) is automatically downloaded.
Whisper auto-detects the language — no configuration needed. It supports 99+ languages. For better accuracy with non-English videos, use a larger model:
# In .env
WHISPER_MODEL=small # better for non-English, accents, dialects

If you prefer faster transcription or better accuracy, set a cloud API in .env:
cp .env.example .env

| Option | Setup | Speed |
|---|---|---|
| Local (default) | Nothing to configure | ~1x realtime on CPU |
| Groq | Set `WHISPER_API_KEY` | Very fast, free tier |
| OpenAI | Set `WHISPER_API_KEY` | Fast, ~$0.006/min |

Set WHISPER_MODEL in .env to choose model size:
| Model | Parameters | RAM | Accuracy |
|---|---|---|---|
| `tiny` | 39M | ~1GB | Basic |
| `base` | 74M | ~1GB | Good (default) |
| `small` | 244M | ~2GB | Better |
| `medium` | 769M | ~5GB | Great |
| `large-v3` | 1550M | ~10GB | Best |

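
As an illustration of how the table above might guide the choice of `WHISPER_MODEL`, the sketch below picks the largest model whose approximate RAM requirement fits a given budget. The helper names `pick_whisper_model` and `env_line` are hypothetical and not part of this package; the RAM figures are copied from the table.

```python
# Approximate RAM needs in GB, copied from the model table, smallest to largest.
MODEL_RAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]


def pick_whisper_model(available_ram_gb, fallback="tiny"):
    """Return the largest model that fits the RAM budget (hypothetical helper)."""
    choice = None
    for name, ram_gb in MODEL_RAM_GB:
        if ram_gb <= available_ram_gb:
            choice = name  # later entries are larger, so keep overwriting
    # If nothing fits, fall back to the smallest model rather than failing.
    return choice or fallback


def env_line(available_ram_gb):
    """Format the matching line for the .env file."""
    return f"WHISPER_MODEL={pick_whisper_model(available_ram_gb)}"


if __name__ == "__main__":
    print(env_line(4))
```

RAM is only one consideration; on a CPU-only machine a larger model also transcribes more slowly, so a smaller model may still be the better choice for long videos.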
Place this directory in your OpenClaw workspace. When you ask a video question, the agent reads SKILL.md and spawns a controller sub-agent automatically.
User: Analyze this video and tell me who opened the watermelon
https://www.youtube.com/watch?v=...
Agent: [reads SKILL.md → spawns controller → returns answer]
Each tool works independently from the command line:
# Get video metadata
videoarm-info video.mp4
# Extract 30 frames from a time range (returns grid image path)
videoarm-extract-frames --video video.mp4 \
--ranges '[{"start_frame":0,"end_frame":1500}]' \
--num-frames 30
# Transcribe audio (start/end in seconds)
videoarm-audio video.mp4 --start 0 --end 300
# Download from YouTube
videoarm-download "https://www.youtube.com/watch?v=..."

| Command | Description | Output |
|---|---|---|
| `videoarm-info <video>` | Video metadata | JSON: fps, total_frames, duration, has_audio |
| `videoarm-extract-frames` | Extract frame grid image | JSON: image_path, frame_ranges |
| `videoarm-audio <video>` | Transcribe audio segment | JSON: transcript, segments[] |
| `videoarm-download <url>` | Download video | JSON: path, cached |
| `videoarm-doctor` | Check all dependencies | Human-readable or `--json` |
| `videoarm-clean` | Clean temporary files | Human-readable or `--json`; supports `--dry-run`, `--downloads` |

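
Because `videoarm-extract-frames` takes frame indices while `videoarm-audio` takes seconds, a small conversion step is handy when scripting the tools together. The sketch below turns time spans into the `--ranges` JSON using the `fps` field reported by `videoarm-info`. The helper names are hypothetical, the sample metadata values are made up, and rounding to the nearest frame is an assumption about how boundaries should be handled.

```python
import json


def seconds_to_ranges(fps, spans):
    """Convert (start_sec, end_sec) pairs into start_frame/end_frame dicts."""
    return [
        {"start_frame": round(start * fps), "end_frame": round(end * fps)}
        for start, end in spans
    ]


def extract_frames_argv(video, fps, spans, num_frames):
    """Build the argument list for videoarm-extract-frames (flags as documented above)."""
    ranges = json.dumps(seconds_to_ranges(fps, spans), separators=(",", ":"))
    return [
        "videoarm-extract-frames",
        "--video", video,
        "--ranges", ranges,
        "--num-frames", str(num_frames),
    ]


if __name__ == "__main__":
    # Sample metadata in the shape listed for videoarm-info; values are invented.
    info = json.loads(
        '{"fps": 30.0, "total_frames": 9000, "duration": 300.0, "has_audio": true}'
    )
    argv = extract_frames_argv("video.mp4", info["fps"], [(0, 50)], 30)
    print(" ".join(argv))
    # To execute for real, pass argv to subprocess.run(argv, capture_output=True, text=True).
```

Passing the arguments as a list avoids shell-quoting problems with the JSON string.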
VideoARM-skill/
├── SKILL.md # OpenClaw skill instructions (the brain)
├── videoarm_cli/ # CLI tools
│ ├── videoarm_info.py
│ ├── videoarm_extract_frames.py
│ ├── videoarm_audio.py
│ ├── videoarm_download.py
│ ├── videoarm_doctor.py # Dependency checker
│ └── videoarm_clean.py # Cache cleaner
├── videoarm_lib/ # Shared library
│ ├── config.py # Paths and database
│ ├── frames.py # Frame extraction logic
│ ├── resolve.py # Video path resolution
│ ├── video_meta.py # Metadata extraction
│ └── logger.py # Tool tracer
├── videoarm_local_whisper/ # Local Whisper server
│ ├── server.py
│ └── setup.py
├── examples/ # Usage examples
├── .env.example # Configuration template
├── pyproject.toml # Package config
└── LICENSE # MIT
- Download & Inspect — Get video metadata (duration, fps, audio availability)
- Strategy — Choose audio-first (dialogue questions) or frames-first (visual questions)
- Extract & Analyze — Use tools iteratively, writing findings to memory
- Cross-verify — Confirm with a second modality if confidence is low
- Answer — Return answer with evidence chain and confidence score
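
The control flow of these steps can be sketched as follows. In practice the agent follows SKILL.md rather than fixed code, so this is only a schematic: the function name, the `(answer, confidence)` return shape, the 0.7 threshold, and the rule of keeping the more confident answer are all assumptions made for illustration.

```python
def answer_question(kind, tools, threshold=0.7):
    """Schematic of the strategy, cross-verify, and answer steps.

    kind:  "dialogue" (audio-first) or anything else (frames-first).
    tools: maps "audio" and "frames" to zero-argument callables that
           return an (answer, confidence) tuple. Hypothetical interface.
    """
    first, second = ("audio", "frames") if kind == "dialogue" else ("frames", "audio")
    memory = []  # evidence chain, in the order findings were made

    answer, confidence = tools[first]()
    memory.append({"modality": first, "answer": answer, "confidence": confidence})

    if confidence < threshold:
        # Cross-verify with the other modality only when the first pass is unsure.
        other_answer, other_confidence = tools[second]()
        memory.append(
            {"modality": second, "answer": other_answer, "confidence": other_confidence}
        )
        if other_confidence > confidence:
            answer, confidence = other_answer, other_confidence

    return {"answer": answer, "confidence": confidence, "evidence": memory}
```

For example, with stub tools where audio is unsure and frames are confident, a dialogue question consults both modalities and returns the frame-based answer along with both evidence entries.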
All cached data is stored under ~/.videoarm/:
- video_database/temp/downloads/: Downloaded videos
- video_database/temp/processing/: Temporary processing files
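
To see how much space the cache occupies before cleaning it, a short script can total the file sizes in the two directories named above. This is a minimal sketch that assumes the layout is exactly as listed; `cache_usage` is a hypothetical helper, not part of the package, and `videoarm-clean --dry-run` is the supported way to preview a cleanup.

```python
from pathlib import Path

CACHE_ROOT = Path.home() / ".videoarm"
CACHE_DIRS = ("video_database/temp/downloads", "video_database/temp/processing")


def cache_usage(root=CACHE_ROOT):
    """Return {relative_dir: total_bytes} for each cache directory under root."""
    usage = {}
    for rel in CACHE_DIRS:
        directory = Path(root) / rel
        if directory.is_dir():
            # Walk the tree and count regular files only.
            usage[rel] = sum(p.stat().st_size for p in directory.rglob("*") if p.is_file())
        else:
            usage[rel] = 0  # a missing directory simply means nothing is cached
    return usage


if __name__ == "__main__":
    for rel, size in cache_usage().items():
        print(f"{rel}: {size / 1e6:.1f} MB")
```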
- VideoARM — The research paper this skill is based on
- OpenClaw — The agent platform this skill runs on
| Problem | Solution |
|---|---|
| `yt-dlp` not found | `pip install yt-dlp` |
| `ffmpeg` not found | `brew install ffmpeg` (macOS) or `apt install ffmpeg` (Linux) |
| `opencv-python` not found | `pip install opencv-python` |
| Download timeout | Set `HTTPS_PROXY=http://...` in .env |
| Poor transcription accuracy | Use larger model: `WHISPER_MODEL=small` in .env |
| `videoarm-doctor` shows issues | Follow the suggested fix for each item |

Run videoarm-doctor to diagnose most issues automatically.
MIT — see LICENSE