Cholec80 Phase Recognition — Run Guide

This is a two-stage pipeline. Stage 1 (CNN training) and feature extraction run once and feed both temporal models. MS-TCN and TeCNO differ only in the --causal flag.

All commands assume the dedicated venv. Either source .venv/bin/activate first, or prefix every command with .venv/bin/python.

cd /home/KHUser/cholec80_phase
PY=.venv/bin/python

0. Getting the dataset

The Cholec80 dataset is not in this repo (it is ~70 GB and license-restricted — request access from the CAMMA group). We keep our copy as cholec80.zip in a private Cloudflare R2 bucket.

Download from R2 (with rclone)

# 1. install rclone
curl https://rclone.org/install.sh | sudo bash

# 2. configure an R2 remote (fill in YOUR own credentials — keep them secret)
rclone config create r2 s3 \
  provider=Cloudflare \
  access_key_id=<YOUR_R2_ACCESS_KEY_ID> \
  secret_access_key=<YOUR_R2_SECRET_ACCESS_KEY> \
  endpoint=https://<YOUR_ACCOUNT_ID>.r2.cloudflarestorage.com \
  region=auto no_check_bucket=true

# 3. download (~70 GB)
rclone copy r2:<YOUR_BUCKET>/cholec80.zip ./data/ --progress

Upload notes (lessons learned)

Use an R2 API token with Object Read & Write; it cannot ListBuckets or CreateBucket, so pass --s3-no-check-bucket on upload.
For a 70 GB file add --s3-disable-checksum — otherwise rclone hashes the whole file first and appears to hang at 0 B/s (it is busy in s3.prepareUpload).
Tune large uploads with --s3-chunk-size 64M --s3-upload-concurrency 4.
Keep the bucket private (disable the public r2.dev URL) — Cholec80's license does not permit public redistribution.

Then unzip into data/:

cd data && python -c "import zipfile; zipfile.ZipFile('cholec80.zip').extractall('.')"
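The one-liner above extracts the archive without checking it first. For a download of this size it may be worth verifying the zip before unpacking. The following is a minimal sketch using only the standard library; the helper name extract_verified is hypothetical and not part of this repo.

```python
import zipfile
from pathlib import Path


def extract_verified(zip_path, dest):
    """Check every member's CRC, then extract the archive into dest.

    Hypothetical helper (not part of this repo). Returns the number of
    members in the archive.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        # testzip() reads every member and returns the name of the first
        # corrupt one, or None if all CRCs match.
        bad = zf.testzip()
        if bad is not None:
            raise zipfile.BadZipFile(f"corrupt member in archive: {bad}")
        zf.extractall(dest)
        return len(zf.namelist())
```

From the repo root this would be called as extract_verified("data/cholec80.zip", "data"). Note that testzip() reads the whole archive once before extraction, so expect the check to roughly double the I/O time.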

Data layout (after unzip)

data/videos/             video01.mp4 ... video80.mp4  (+ videoXX-timestamp.txt)
data/phase_annotations/  video01-phase.txt ...
data/tool_annotations/   video01-tool.txt ...         (used by train_cnn_mtl.py)
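Before starting the long-running stages, it can help to confirm that the layout above is complete. This is a rough sketch, assuming zero-padded ids video01 to video80 and exactly the three file patterns listed; the function name missing_dataset_files is hypothetical, and the timestamp files are not checked.

```python
from pathlib import Path


def missing_dataset_files(root="data", n_videos=80):
    """Return the expected Cholec80 files that are absent under root.

    Assumes the layout listed above and zero-padded ids (video01, ...).
    """
    root = Path(root)
    expected = []
    for i in range(1, n_videos + 1):
        vid = f"video{i:02d}"
        expected.append(root / "videos" / f"{vid}.mp4")
        expected.append(root / "phase_annotations" / f"{vid}-phase.txt")
        expected.append(root / "tool_annotations" / f"{vid}-tool.txt")
    # Keep only the paths that do not exist on disk.
    return [str(p) for p in expected if not p.exists()]
```

An empty list from missing_dataset_files("data") means all 240 expected files are present.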

1. Extract frames at 1 fps (Stage 0)

$PY extract_frames.py --videos data/videos --out data/frames
# resumable; ~minutes per video. Produces data/frames/videoXX/00000000.jpg ...
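Because extraction is resumable, a quick per-video frame count shows how far an interrupted run got. The sketch below assumes the data/frames/videoXX/*.jpg layout described above; count_frames is a hypothetical helper, not a script in this repo.

```python
from pathlib import Path


def count_frames(frames_root="data/frames"):
    """Map each videoXX directory to the number of .jpg frames it holds."""
    counts = {}
    for video_dir in sorted(Path(frames_root).glob("video*")):
        if video_dir.is_dir():
            counts[video_dir.name] = sum(1 for _ in video_dir.glob("*.jpg"))
    return counts
```

At 1 fps the count for a finished video should be close to its duration in seconds, so a directory with zero or very few frames is a likely sign of an interrupted extraction.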

2. Train ResNet50 (Stage 1)

$PY train_cnn.py --frames data/frames --anno data/phase_annotations \
   --epochs 5 --bs 64 --out checkpoints/resnet50.pt
# T4: ~10-20 min/epoch. Saves best-val-acc checkpoint.

3. Extract 2048-d features (Stage 1.5)

$PY extract_features.py --frames data/frames --anno data/phase_annotations \
   --ckpt checkpoints/resnet50.pt --out features
# Writes features/videoXX.pt = {feats:(T,2048), labels:(T,)} for all 80 videos.

4a. Train MS-TCN (non-causal)

$PY train_tcn.py --features features --out checkpoints/mstcn.pt

4b. Train TeCNO (causal)

$PY train_tcn.py --features features --out checkpoints/tecno.pt --causal

Stage 2 is fast (features are tiny): tens of seconds per epoch on the T4.
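To make the causal versus non-causal distinction concrete, here is a small pure-Python illustration of a 3-tap dilated convolution in both modes. It is a sketch of the general idea only; how train_tcn.py implements --causal is not shown in this guide, and the function name dilated_conv1d is hypothetical.

```python
def dilated_conv1d(x, weights, dilation=1, causal=False):
    """Length-preserving 1-D dilated convolution with a 3-tap kernel.

    causal=False: taps at t-d, t, t+d (looks both ways, offline).
    causal=True:  taps at t-2d, t-d, t (past and present only, online).
    Positions outside the sequence are treated as zero padding.
    """
    assert len(weights) == 3
    offsets = (-2, -1, 0) if causal else (-1, 0, 1)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for w, k in zip(weights, offsets):
            idx = t + k * dilation
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


# An impulse at t=3 shows which output frames can "see" that input.
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
offline = dilated_conv1d(impulse, [1.0, 1.0, 1.0], causal=False)
online = dilated_conv1d(impulse, [1.0, 1.0, 1.0], causal=True)
print(offline)  # [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
print(online)   # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
```

In the offline output, frame 2 already responds to the input at frame 3, meaning it uses future context. In the causal output, nothing before frame 3 changes, which is what makes a causal model usable online.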

5. Evaluate & compare

$PY evaluate.py --features features --ckpt checkpoints/mstcn.pt
$PY evaluate.py --features features --ckpt checkpoints/tecno.pt

Reports frame accuracy, video-averaged accuracy, and per-phase precision/recall/jaccard.
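The three kinds of metric differ in how frames are pooled. The sketch below shows one common way to compute them from per-video prediction and label lists; it is an assumption about the definitions, not a copy of evaluate.py, and all function names are hypothetical.

```python
def frame_accuracy(videos):
    """Accuracy pooled over all frames of all videos."""
    correct = sum(p == g for pred, gt in videos for p, g in zip(pred, gt))
    total = sum(len(gt) for _, gt in videos)
    return correct / total


def video_averaged_accuracy(videos):
    """Mean of per-video accuracies, so every video counts equally."""
    per_video = [
        sum(p == g for p, g in zip(pred, gt)) / len(gt) for pred, gt in videos
    ]
    return sum(per_video) / len(per_video)


def phase_scores(pred, gt, phase):
    """Precision, recall and Jaccard for one phase within one video."""
    tp = sum(p == phase and g == phase for p, g in zip(pred, gt))
    fp = sum(p == phase and g != phase for p, g in zip(pred, gt))
    fn = sum(p != phase and g == phase for p, g in zip(pred, gt))
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else float("nan")
    return precision, recall, jaccard


# Two toy videos as (predictions, ground truth) pairs.
videos = [([0, 0, 1, 1], [0, 0, 0, 1]), ([1, 1], [1, 0])]
print(frame_accuracy(videos))           # 4 of 6 frames correct
print(video_averaged_accuracy(videos))  # (0.75 + 0.5) / 2 = 0.625
print(phase_scores(*videos[0], phase=1))  # (0.5, 1.0, 0.5)
```

Frame accuracy weights long videos more heavily, while the video-averaged figure treats each video equally, so the two numbers can differ on the same predictions.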

Smoke test first (recommended)

Before the full run, run the whole pipeline on a few videos (e.g. ids 1, 2, 33, 41) to confirm everything is wired correctly: pass --only to extract_frames.py and use a small --epochs value.
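The example ids appear chosen to touch every partition of the 32/8/40 split named in the Results heading. The sketch below assumes that split is contiguous (videos 1-32 train, 33-40 validation, 41-80 test); only the test range is stated in this guide, and split_of is a hypothetical helper rather than the repo's own split logic.

```python
def split_of(video_id):
    """Return the partition of a Cholec80 video id (1..80).

    Assumption: the 32/8/40 split is contiguous. Only the test range
    (videos 41-80) is stated in this guide.
    """
    if not 1 <= video_id <= 80:
        raise ValueError(f"video id out of range: {video_id}")
    if video_id <= 32:
        return "train"
    if video_id <= 40:
        return "val"
    return "test"


smoke_ids = [1, 2, 33, 41]
coverage = {split_of(v) for v in smoke_ids}
print(sorted(coverage))  # ['test', 'train', 'val']
```

Under that assumption a smoke run on these four videos exercises training, validation, and test code paths in a single pass.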

Results (test = videos 41-80, 32/8/40 split)

Temporal model          Frame acc   Mean Jaccard
MS-TCN (offline)        90.81%      76.3
TeCNO (online)          88.95%      71.4
LoViT-style (offline)   83.22%      61.8
LoViT-style (online)    86.19%      65.0

Stage-1 ResNet50 per-frame val accuracy: 81.75%; the temporal model lifts this by up to ~9 percentage points (to 90.81% with MS-TCN). TeCNO's 88.95% closely matches the original paper (~88.6%).

Best model: an ensemble over diverse features (ResNet50 + fine-tuned EndoViT, with fusion) reaches video-averaged accuracy 91.53% / Jaccard 77.5, beating the single-feature baseline (90.80 / 76.3) on the standard per-video metric. The gain came from better, complementary features plus ensembling, not from a fancier temporal head: our from-scratch LoViT-style Transformer overfit the 32-video training set and lost to the TCNs. RESULTS.md has the full analysis, the ablation, and the "features > architecture on small data" lesson.

Transformer temporal head (LoViT-style)

lovit.py is a long-short causal Transformer (attention + conv, multi-stage), a drop-in replacement for the TCN head:

./run_lovit.sh   # trains LoViT causal+offline on the same features, compares all four

