Two-stage pipeline. Stage 1 (CNN) and feature extraction run once and feed
both temporal models. MS-TCN and TeCNO differ only in the --causal flag.
All commands assume the dedicated venv. Either source .venv/bin/activate first,
or prefix every command with .venv/bin/python.
cd /home/KHUser/cholec80_phase
PY=.venv/bin/python

The Cholec80 dataset is not in this repo: it is ~70 GB and license-restricted, so
request access from the CAMMA group. We keep
our copy as cholec80.zip in a private Cloudflare R2 bucket.
# 1. install rclone
curl https://rclone.org/install.sh | sudo bash
# 2. configure an R2 remote (fill in YOUR own credentials — keep them secret)
rclone config create r2 s3 \
provider=Cloudflare \
access_key_id=<YOUR_R2_ACCESS_KEY_ID> \
secret_access_key=<YOUR_R2_SECRET_ACCESS_KEY> \
endpoint=https://<YOUR_ACCOUNT_ID>.r2.cloudflarestorage.com \
region=auto no_check_bucket=true
# 3. download (~70 GB)
rclone copy r2:<YOUR_BUCKET>/cholec80.zip ./data/ --progress

- Use an R2 API token with Object Read & Write; it cannot ListBuckets or CreateBucket, so pass --s3-no-check-bucket on upload.
- For a 70 GB file add --s3-disable-checksum; otherwise rclone hashes the whole file first and appears to hang at 0 B/s (it is busy in s3.prepareUpload).
- Tune large uploads with --s3-chunk-size 64M --s3-upload-concurrency 4.
- Keep the bucket private (disable the public r2.dev URL); Cholec80's license does not permit public redistribution.
Then unzip into data/:
cd data && python -c "import zipfile; zipfile.ZipFile('cholec80.zip').extractall('.')"

Expected layout:
data/videos/ video01.mp4 ... video80.mp4 (+ videoXX-timestamp.txt)
data/phase_annotations/ video01-phase.txt ...
data/tool_annotations/ video01-tool.txt ... (used by train_cnn_mtl.py)
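Before starting the long frame-extraction step, it can help to confirm that the unzipped archive matches the layout listed above. The following is a minimal sketch, not part of the repo: `check_layout` is a hypothetical helper, and it assumes the file names follow the `videoXX` pattern shown above. It does not check the optional `videoXX-timestamp.txt` files.

```python
from pathlib import Path


def check_layout(data_dir, n_videos=80):
    """Return the paths that are missing from the expected Cholec80 layout.

    Hypothetical helper: file names are assumed to follow the videoXX
    pattern from the layout listing (video01 ... video80).
    """
    data = Path(data_dir)
    missing = []
    for i in range(1, n_videos + 1):
        vid = f"video{i:02d}"
        expected = [
            data / "videos" / f"{vid}.mp4",
            data / "phase_annotations" / f"{vid}-phase.txt",
            data / "tool_annotations" / f"{vid}-tool.txt",
        ]
        # Collect every expected file that is not present on disk.
        missing.extend(str(p) for p in expected if not p.is_file())
    return missing


if __name__ == "__main__":
    gaps = check_layout("data")
    if gaps:
        print(f"{len(gaps)} files missing, e.g. {gaps[0]}")
    else:
        print("layout OK")
```

Run it from the repo root; an empty result means all 80 videos and both annotation sets are in place.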
$PY extract_frames.py --videos data/videos --out data/frames
# resumable; ~minutes per video. Produces data/frames/videoXX/00000000.jpg ...

$PY train_cnn.py --frames data/frames --anno data/phase_annotations \
--epochs 5 --bs 64 --out checkpoints/resnet50.pt
# T4: ~10-20 min/epoch. Saves best-val-acc checkpoint.

$PY extract_features.py --frames data/frames --anno data/phase_annotations \
--ckpt checkpoints/resnet50.pt --out features
# Writes features/videoXX.pt = {feats:(T,2048), labels:(T,)} for all 80 videos.

$PY train_tcn.py --features features --out checkpoints/mstcn.pt
$PY train_tcn.py --features features --out checkpoints/tecno.pt --causal

Stage 2 is fast (features are tiny): tens of seconds per epoch on the T4.
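To make the effect of --causal concrete, here is a minimal pure-Python sketch of a dilated 1-D convolution with the two padding schemes. It illustrates the general idea only: how train_tcn.py actually implements the flag is an assumption, and `dilated_conv1d` is a hypothetical name. With causal padding the output at frame t depends only on frames up to t (online); with symmetric padding it also sees future frames (offline).

```python
def dilated_conv1d(x, w, dilation=1, causal=False):
    """Dilated 1-D convolution whose output has the same length as x.

    causal=True  pads only on the left, so y[t] depends on x[<= t].
    causal=False pads on both sides, so y[t] also depends on future frames.
    """
    k = len(w)
    span = (k - 1) * dilation            # total padding needed
    left = span if causal else span // 2
    right = span - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [
        sum(w[j] * padded[t + j * dilation] for j in range(k))
        for t in range(len(x))
    ]


# A single impulse at frame 3 shows which outputs can "see" it.
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
kernel = [1.0, 1.0, 1.0]
online = dilated_conv1d(impulse, kernel, causal=True)
offline = dilated_conv1d(impulse, kernel, causal=False)
```

In this toy case the offline response is already non-zero at frame 2, one frame before the impulse, while the online response is zero until frame 3.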
$PY evaluate.py --features features --ckpt checkpoints/mstcn.pt
$PY evaluate.py --features features --ckpt checkpoints/tecno.pt

Reports frame accuracy, video-averaged accuracy, and per-phase precision/recall/Jaccard.
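The reported metrics can be sketched as follows. This is a simplified illustration, not the code in evaluate.py: the function names are hypothetical, labels are assumed to be integer phase ids (Cholec80 has 7 phases), and evaluate.py may aggregate per-phase scores differently (for example per video before averaging).

```python
def frame_accuracy(videos):
    """videos: list of (pred, gt) label sequences. Pools all frames together."""
    correct = sum(p == g for pred, gt in videos for p, g in zip(pred, gt))
    total = sum(len(gt) for _, gt in videos)
    return correct / total


def video_averaged_accuracy(videos):
    """Accuracy computed per video, then averaged, so every video weighs the same."""
    accs = [sum(p == g for p, g in zip(pred, gt)) / len(gt) for pred, gt in videos]
    return sum(accs) / len(accs)


def per_phase_scores(pred, gt, n_phases=7):
    """Precision, recall and Jaccard for each phase id over one label sequence."""
    nan = float("nan")
    scores = {}
    for c in range(n_phases):
        tp = sum(p == c and g == c for p, g in zip(pred, gt))
        fp = sum(p == c and g != c for p, g in zip(pred, gt))
        fn = sum(p != c and g == c for p, g in zip(pred, gt))
        scores[c] = {
            "precision": tp / (tp + fp) if tp + fp else nan,
            "recall": tp / (tp + fn) if tp + fn else nan,
            "jaccard": tp / (tp + fp + fn) if tp + fp + fn else nan,
        }
    return scores
```

Frame accuracy favors long videos because all frames are pooled, which is why the two accuracy numbers can differ on the same predictions.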
Run the whole pipeline on a few videos (e.g. ids 1,2,33,41) before the full run
to confirm everything is wired correctly, using --only on extract_frames and
small --epochs.
| Temporal model | Frame acc | Mean Jaccard |
|---|---|---|
| MS-TCN (offline) | 90.81% | 76.3 |
| TeCNO (online) | 88.95% | 71.4 |
| LoViT-style (offline) | 83.22% | 61.8 |
| LoViT-style (online) | 86.19% | 65.0 |
Stage-1 ResNet50 per-frame val accuracy: 81.75%; the temporal model lifts this by up to ~9 percentage points (81.75% to 90.81% with MS-TCN). TeCNO's 88.95% reproduces the original paper (~88.6%).
Best model: an ensemble over diverse features (ResNet50 + fine-tuned EndoViT, with fusion) reaches video-averaged accuracy 91.53% / Jaccard 77.5, beating the single-feature baseline (90.80 / 76.3) on the standard per-video metric. The gain came from better, complementary features plus ensembling, not from a fancier temporal head: our from-scratch LoViT-style Transformer overfit the 32-video set and lost to the TCNs. The full analysis, ablation, and the "features > architecture on small data" lesson are in RESULTS.md.
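As a rough illustration of ensembling, the sketch below averages per-frame class probabilities from several models and takes the argmax. This is one common late-fusion scheme, shown under assumptions: the fusion actually used for the numbers above is described in RESULTS.md and may differ, and `ensemble_predict` is a hypothetical name.

```python
def ensemble_predict(prob_sets, weights=None):
    """Fuse per-frame class probabilities from several models.

    prob_sets: one entry per model, each a list of T frames, each frame a
               list of C class probabilities.
    weights:   optional per-model weights; defaults to a plain average.
    Returns the fused phase id for each of the T frames.
    """
    n_models = len(prob_sets)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_frames = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])
    preds = []
    for t in range(n_frames):
        fused = [
            sum(w * probs[t][c] for w, probs in zip(weights, prob_sets))
            for c in range(n_classes)
        ]
        # Pick the class with the highest fused probability.
        preds.append(max(range(n_classes), key=fused.__getitem__))
    return preds
```

Averaging helps most when the models make different mistakes, which fits the observation that complementary features mattered more than the temporal head.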
lovit.py is a long-short causal Transformer (attention + conv, multi-stage),
a drop-in replacement for the TCN head:
./run_lovit.sh # trains LoViT causal+offline on the same features, compares all four