
DVEFormer: Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers

This repository contains the code for our paper "Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers" (IROS 2025 – IEEE Xplore, arXiv).

DVEFormer builds upon EMSAFormer and EMSANet to efficiently predict dense visual embeddings for scene understanding. DVEFormer uses a Swin Transformer encoder and employs knowledge distillation from Alpha-CLIP to learn dense (pixel-wise) visual embeddings. DVEFormer's dense, text-aligned pixel embeddings support flexible text-based querying, enable semantic segmentation, and can be integrated into existing 3D mapping pipelines such as PanopticNDT.

(Figure: model architecture)

The repository includes code for training, evaluating, and applying our network. We also provide code for exporting the model to the ONNX format as well as measuring the inference time with TensorRT.

License and Citations

The source code is published under the Apache 2.0 license; see the license file for details.

If you use the source code or the network weights, please cite the following paper (IEEE Xplore, arXiv):

Fischedick, S., Seichter, D., Stephan, B., Schmidt, R., Gross, H.-M. Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers, in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2400-2407, 2025.

BibTeX
@inproceedings{dveformer2025iros,  
  title     = {{Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers}},
  author    = {Fischedick, S{\"o}hnke and Seichter, Daniel and Stephan, Benedict and Schmidt, Robin and Gross, Horst-Michael},
  booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  pages     = {2400--2407},
  year      = {2025}
}


Installation

  1. Clone repository:

    # do not forget the '--recursive'
    git clone --recursive https://github.com/TUI-NICR/DVEFormer
    
    # navigate to the cloned directory (required for installing some dependencies and to run the scripts later)
    cd DVEFormer
  2. Create conda environment and install all dependencies:
    Option 1: Updated environment from 2026 (Python 3.12, PyTorch 2.10.0):

    conda env create -f env_dveformer2026.yaml   # linux with cuda (sm_70 - sm_120)
    
    conda activate dveformer2026

    Option 2: Create your own conda environment:

    conda create --name "dveformer2026" python=3.12
    conda activate dveformer2026
    
    # important dependencies
    python -m pip install numpy opencv-python matplotlib tqdm
    python -m pip install torch torchvision
    python -m pip install torchmetrics
    python -m pip install wandb
    python -m pip install gdown
    
    # install a patched version of AlphaCLIP that works with setuptools >= 82, which removed pkg_resources
    # for more details, see: https://github.com/SunzeY/AlphaCLIP/issues/75
    python -m pip install "git+https://github.com/Tripton/AlphaCLIP.git"

    Option 3: Environment from 2025, used for the original publication (Python 3.12.12, PyTorch 2.8.0 (sm_70 - sm_120); see env_dveformer2025.yaml for reference): go back to ef73eb4 and follow the instructions given there.

  3. Install submodule packages:

    # dataset package
    python -m pip install -e "./lib/nicr-scene-analysis-datasets[withpreparation, withauxiliarydata]"
    
    # multitask scene analysis package
    python -m pip install -e "./lib/nicr-multitask-scene-analysis"
  4. Prepare datasets:
    We trained our networks on NYUv2, SUNRGB-D, ScanNet, Hypersim, and ADE20K.

    Please follow the instructions in ./lib/nicr-scene-analysis-datasets to prepare the datasets. Each dataset's README describes how to generate the required Alpha-CLIP embeddings for knowledge distillation as well as the DepthAnythingV2-based depth images required for ADE20K. In the following, we assume that the datasets are stored at ./datasets.
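
    The remaining commands in this README then assume a layout like the following (directory names taken from the --dataset-path arguments used below):

    ./datasets
    ├── ade20k
    ├── hypersim
    ├── nyuv2
    ├── scannet
    └── sunrgbd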

Results & Weights

We provide weights for our DVEFormer model trained on a dataset combination (NYUv2 + Hypersim + SUNRGB-D + ScanNet + ADE20K) using the modified SwinV2-T-128 backbone from EMSAFormer. The weights are available in two configurations: a full resolution decoder that preserves the input resolution and a throughput-optimized variant that outputs 1/4 of the input resolution. We evaluated the closed-set semantic segmentation performance of our model on three common indoor RGB-D datasets (NYUv2, SUNRGB-D, and ScanNet).

Full-resolution model

Download and extract the checkpoint to ./trained_models/. The model keeps the output resolution at the input resolution, reaching 26.3 FPS on an NVIDIA Jetson AGX Orin 64 GB with TensorRT, and can be evaluated directly for text-based and visual mean-based segmentation. Linear probing attaches a dataset-specific single linear layer that runs on top of the frozen base model. Download the weights from the "Linear Probing Weights" column, extract them to ./trained_models/, and evaluate them by reusing main.py with the additional --enable-linear-probing and --linear-probing-weights-path arguments. These layers are optional and can be used to get the best segmentation performance on each dataset. See the notes in the training section below.

To download all full-resolution checkpoints at once, run:

gdown 1FQzXfP08pYI7RGlYjODPQRpCxDBI2eDX -O trained_models/dveformer_fullres_mixed.pth.tar.gz
gdown 1OGCOq5QDcF8e7UCFF-pK5c2HfOVyxvqX -O trained_models/dveformer_fullres_linear_probing_nyuv2.pth.tar.gz
gdown 12mCDbn1s5VQsnMMdf6FZMxLCrl9rKkQy -O trained_models/dveformer_fullres_linear_probing_sunrgbd.pth.tar.gz
gdown 1BqsVabSwFDNYRMv20eWB4yXu07FazEAO -O trained_models/dveformer_fullres_linear_probing_scannet.pth.tar.gz
gdown 1dSnkJh-Ker9CQZNQTjVj3IOcoj1zqmWk -O trained_models/dveformer_fullres_linear_probing_scannet20.pth.tar.gz

find ./trained_models -type f -name "dveformer_fullres*.tar.gz" -exec tar -xzf {} -C ./trained_models \;

Dataset    Split  #Classes  Text-based mIoU  Visual Mean-based mIoU  Linear Probing mIoU  Linear Probing Weights
NYUv2      test   40        44.07            50.31                   57.05                Download
SUNRGB-D   test   37        44.56            46.25                   51.28                Download
ScanNet    valid  40        33.77            39.59                   49.06                Download
ScanNet    valid  20        50.16            56.57                   67.76                Download
ScanNet    test   20        -                -                       62.6                 same as above

Note

The reported metrics slightly differ from the numbers reported in the paper due to refactoring-induced numerical changes and minor bug fixes. However, the results remain close to the originally published values.

Note

As described in the paper, we subtract an alpha-scaled full-image embedding from each Alpha-CLIP segment embedding to suppress scene context. The --dense-visual-embedding-diff-factor argument can be used to modify alpha (default 0.65). Leave it unchanged unless you want to reproduce ablations such as alpha = 0.
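
For illustration, a minimal sketch of this context suppression when preparing the distillation targets (the function name and the final re-normalization are assumptions, not the repository's exact code):

import torch
import torch.nn.functional as F

def suppress_scene_context(segment_emb: torch.Tensor,
                           full_image_emb: torch.Tensor,
                           alpha: float = 0.65) -> torch.Tensor:
    # subtract the alpha-scaled full-image embedding from the Alpha-CLIP
    # segment embedding to suppress shared scene context
    target = segment_emb - alpha * full_image_emb
    # re-normalization is assumed here so cosine similarities stay comparable
    return F.normalize(target, dim=-1)

# usage with dummy 512-dimensional embeddings
target = suppress_scene_context(torch.randn(512), torch.randn(512))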

Reduced-resolution model

Download and extract the checkpoint to ./trained_models/. The model keeps the output resolution at 1/4 of the original input resolution, reaching 77.0 FPS on an NVIDIA Jetson AGX Orin 64 GB with TensorRT, and can be evaluated directly for text-based and visual mean-based segmentation. Evaluation can be done in the same way as for the full resolution model, but requires --dense-visual-embedding-decoder-n-upsamplings 0 as an additional argument.

To fetch all reduced-resolution checkpoints at once, run:

gdown 1tkhkEA6RsvjznY-WlWqL3LtAGYU3wse_ -O trained_models/dveformer_reducedres_mixed.pth.tar.gz
gdown 1Pdm1l8YLg7Spy9rEffFOwrfWVJHEwCiY -O trained_models/dveformer_reducedres_linear_probing_nyuv2.pth.tar.gz
gdown 19FUiUXhzC1o0CuEcJJwLyDbgp-9ypKD9 -O trained_models/dveformer_reducedres_linear_probing_sunrgbd.pth.tar.gz
gdown 13LplKyltdVLHaALY9NmmFk3Ib1w1veDO -O trained_models/dveformer_reducedres_linear_probing_scannet.pth.tar.gz
gdown 1vbg0MnH6VRAg-tiMGaLy1P5uX0blX5sw -O trained_models/dveformer_reducedres_linear_probing_scannet20.pth.tar.gz

find ./trained_models -type f -name "dveformer_reducedres*.tar.gz" -exec tar -xzf {} -C ./trained_models \;

Dataset    Split  #Classes  Text-based mIoU  Visual Mean-based mIoU  Linear Probing mIoU  Linear Probing Weights
NYUv2      test   40        43.45            50.02                   56.28                Download
SUNRGB-D   test   37        44.01            45.71                   50.01                Download
ScanNet    valid  40        33.58            39.19                   48.75                Download
ScanNet    valid  20        50.32            56.75                   67.44                Download

Note

As ScanNet test evaluation can only be done using the official evaluation server, which allows only a single submission, we only provide validation results for this model.

Evaluation

We support three evaluation modes:

  • Text-based mIoU embeds each class name with the CLIP text encoder and assigns each pixel the class whose text embedding has the highest cosine similarity to the pixel embedding (see the sketch after this list).
  • Visual mean-based mIoU uses the per-class mean embedding of the teacher and assigns the class with the highest cosine similarity to each pixel embedding.
  • Linear probing mIoU keeps the encoder frozen and trains an additional 1x1 convolution that maps embeddings to class logits. See Linear Probing for details.
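
For intuition, the text-based assignment could look like the following minimal sketch; tensor shapes and names are illustrative assumptions, and the class text embeddings are assumed to be precomputed with the CLIP text encoder:

import torch
import torch.nn.functional as F

def assign_classes(pixel_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # pixel_emb: (H, W, 512) dense embeddings, text_emb: (C, 512) class embeddings
    pixel = F.normalize(pixel_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    sim = torch.einsum('hwd,cd->hwc', pixel, text)  # cosine similarities
    return sim.argmax(dim=-1)                       # (H, W) predicted class indices

# usage with dummy data: 40 classes, 8x8 embedding map
pred = assign_classes(torch.randn(8, 8, 512), torch.randn(40, 512))

Visual mean-based assignment works the same way, with the text embeddings replaced by the per-class mean teacher embeddings.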

The evaluation assumes the datasets are stored at ./datasets. The examples include --enable-tf32, --encoder-amp bfp16, --context-module-amp bfp16, and --decoder-amp bfp16, which were used for faster training. Drop these flags if your GPU does not support TensorFloat-32/bfloat16 execution; however, doing so may slightly change the reported metrics.

Text-based and visual mean-based evaluation require no additional training. Run main.py as done for training with --validation-only.

Optionally, if a dataset-specific linear probing head was trained, the --enable-linear-probing and --linear-probing-weights-path arguments can be added to evaluate the linear probing layer as well. All commands below include the linear probing head so that all reported numbers are reproduced; if this evaluation is not of interest, the arguments can be dropped.

The commands below reproduce the semantic mIoU for text-based, visual mean-based, and linear probing segmentation for both the full resolution and the reduced resolution (1/4 of input resolution) model.

NYUv2 (test split, 40 classes)

Note

Evaluation with --validation-batch-size 2 requires at least 21.9 GB of GPU memory; --validation-batch-size 1 reduces the requirement to roughly 16.6 GB.

To evaluate the full resolution model, run:

python main.py \
    --dataset nyuv2 \
    --dataset-path ./datasets/nyuv2 \
    --tasks dense-visual-embedding \
    --raw-depth \
    --enable-tf32 \
    --encoder-amp bfp16 \
    --context-module-amp bfp16 \
    --decoder-amp bfp16 \
    --weights-filepath ./trained_models/mixed/dveformer_fullres_mixed.pth \
    --dense-visual-embedding-decoder-n-channels-out 512 \
    --enable-linear-probing \
    --linear-probing-weights-path ./trained_models/nyuv2/dveformer_fullres_linear_probing_nyuv2.pth \
    --validation-batch-size 2 \
    --validation-only \
    --skip-sanity-check \
    --wandb-mode disabled
Validation results:
{
  ...,
  'valid_dense_visual_embedding_text_based_miou': tensor(0.4407),
  ...,
  'valid_dense_visual_embedding_visual_mean_based_miou': tensor(0.5031),
  ...,
  'valid_linear_probing_miou': tensor(0.5705),
  ...
}

To evaluate the reduced resolution model, add --dense-visual-embedding-decoder-n-upsamplings 0 and adjust the checkpoint files:

Note

Evaluation with --validation-batch-size 2 requires at least 3.1 GB of GPU memory; --validation-batch-size 1 reduces the requirement to roughly 1.9 GB.

python main.py \
    --dataset nyuv2 \
    --dataset-path ./datasets/nyuv2 \
    --tasks dense-visual-embedding \
    --raw-depth \
    --enable-tf32 \
    --encoder-amp bfp16 \
    --context-module-amp bfp16 \
    --decoder-amp bfp16 \
    --weights-filepath ./trained_models/mixed/dveformer_reducedres_mixed.pth \
    --dense-visual-embedding-decoder-n-channels-out 512 \
    --dense-visual-embedding-decoder-n-upsamplings 0 \
    --enable-linear-probing \
    --linear-probing-weights-path ./trained_models/nyuv2/dveformer_reducedres_linear_probing_nyuv2.pth \
    --validation-batch-size 2 \
    --validation-only \
    --skip-sanity-check \
    --wandb-mode disabled
Validation results:
{
  ...,
  'valid_dense_visual_embedding_text_based_miou': tensor(0.4345),
  ...,
  'valid_dense_visual_embedding_visual_mean_based_miou': tensor(0.5002),
  ...,
  'valid_linear_probing_miou': tensor(0.5628),
  ...
}

SUNRGB-D (test split, 37 classes)

To evaluate the full resolution model, run:

python main.py \
    --dataset sunrgbd \
    --dataset-path ./datasets/sunrgbd \
    --tasks dense-visual-embedding \
    --raw-depth \
    --enable-tf32 \
    --encoder-amp bfp16 \
    --context-module-amp bfp16 \
    --decoder-amp bfp16 \
    --weights-filepath ./trained_models/mixed/dveformer_fullres_mixed.pth \
    --dense-visual-embedding-decoder-n-channels-out 512 \
    --enable-linear-probing \
    --linear-probing-weights-path ./trained_models/sunrgbd/dveformer_fullres_linear_probing_sunrgbd.pth \
    --validation-batch-size 2 \
    --validation-only \
    --skip-sanity-check \
    --wandb-mode disabled
Validation results:
{
  ...,
  'valid_dense_visual_embedding_text_based_miou': tensor(0.4456),
  ...,
  'valid_dense_visual_embedding_visual_mean_based_miou': tensor(0.4625),
  ...,
  'valid_linear_probing_miou': tensor(0.5128),
  ...
}

To evaluate the reduced resolution model, run:

python main.py \
    --dataset sunrgbd \
    --dataset-path ./datasets/sunrgbd \
    --tasks dense-visual-embedding \
    --raw-depth \
    --enable-tf32 \
    --encoder-amp bfp16 \
    --context-module-amp bfp16 \
    --decoder-amp bfp16 \
    --weights-filepath ./trained_models/mixed/dveformer_reducedres_mixed.pth \
    --dense-visual-embedding-decoder-n-channels-out 512 \
    --dense-visual-embedding-decoder-n-upsamplings 0 \
    --enable-linear-probing \
    --linear-probing-weights-path ./trained_models/sunrgbd/dveformer_reducedres_linear_probing_sunrgbd.pth \
    --validation-batch-size 2 \
    --validation-only \
    --skip-sanity-check \
    --wandb-mode disabled
Validation results:
{
  ...,
  'valid_dense_visual_embedding_text_based_miou': tensor(0.4401),
  ...,
  'valid_dense_visual_embedding_visual_mean_based_miou': tensor(0.4571),
  ...,
  'valid_linear_probing_miou': tensor(0.5001),
  ...
}

ScanNet (validation split, 40 classes)

To evaluate the full resolution model, run:

python main.py \
    --dataset scannet \
    --dataset-path ./datasets/scannet \
    --tasks dense-visual-embedding \
    --raw-depth \
    --enable-tf32 \
    --encoder-amp bfp16 \
    --context-module-amp bfp16 \
    --decoder-amp bfp16 \
    --weights-filepath ./trained_models/mixed/dveformer_fullres_mixed.pth \
    --dense-visual-embedding-decoder-n-channels-out 512 \
    --scannet-semantic-n-classes 40 \
    --enable-linear-probing \
    --linear-probing-weights-path ./trained_models/scannet/dveformer_fullres_linear_probing_scannet.pth \
    --validation-split valid \
    --validation-batch-size 2 \
    --validation-only \
    --skip-sanity-check \
    --wandb-mode disabled
Validation results:
{
  ...,
  'valid_dense_visual_embedding_text_based_miou': tensor(0.3377),
  ...,
  'valid_dense_visual_embedding_visual_mean_based_miou': tensor(0.3959),
  ...,
  'valid_linear_probing_miou': tensor(0.4906),
  ...
}

To evaluate the reduced resolution model, run:

python main.py \
    --dataset scannet \
    --dataset-path ./datasets/scannet \
    --tasks dense-visual-embedding \
    --raw-depth \
    --enable-tf32 \
    --encoder-amp bfp16 \
    --context-module-amp bfp16 \
    --decoder-amp bfp16 \
    --weights-filepath ./trained_models/mixed/dveformer_reducedres_mixed.pth \
    --dense-visual-embedding-decoder-n-channels-out 512 \
    --dense-visual-embedding-decoder-n-upsamplings 0 \
    --scannet-semantic-n-classes 40 \
    --enable-linear-probing \
    --linear-probing-weights-path ./trained_models/scannet/dveformer_reducedres_linear_probing_scannet.pth \
    --validation-split valid \
    --validation-batch-size 2 \
    --validation-only \
    --skip-sanity-check \
    --wandb-mode disabled
Validation results:
{
  ...,
  'valid_dense_visual_embedding_text_based_miou': tensor(0.3358),
  ...,
  'valid_dense_visual_embedding_visual_mean_based_miou': tensor(0.3919),
  ...,
  'valid_linear_probing_miou': tensor(0.4875),
  ...
}

ScanNet (validation split, 20 classes)

To evaluate the full resolution model, run:

python main.py \
    --dataset scannet \
    --dataset-path ./datasets/scannet \
    --tasks dense-visual-embedding \
    --raw-depth \
    --enable-tf32 \
    --encoder-amp bfp16 \
    --context-module-amp bfp16 \
    --decoder-amp bfp16 \
    --weights-filepath ./trained_models/mixed/dveformer_fullres_mixed.pth \
    --dense-visual-embedding-decoder-n-channels-out 512 \
    --scannet-semantic-n-classes 20 \
    --enable-linear-probing \
    --linear-probing-weights-path ./trained_models/scannet/dveformer_fullres_linear_probing_scannet20.pth \
    --validation-split valid \
    --validation-batch-size 2 \
    --validation-only \
    --skip-sanity-check \
    --wandb-mode disabled
Validation results:
{
  ...,
  'valid_dense_visual_embedding_text_based_miou': tensor(0.5016),
  ...,
  'valid_dense_visual_embedding_visual_mean_based_miou': tensor(0.5657),
  ...,
  'valid_linear_probing_miou': tensor(0.6776),
  ...
}

To evaluate the reduced resolution model, run:

python main.py \
    --dataset scannet \
    --dataset-path ./datasets/scannet \
    --tasks dense-visual-embedding \
    --raw-depth \
    --enable-tf32 \
    --encoder-amp bfp16 \
    --context-module-amp bfp16 \
    --decoder-amp bfp16 \
    --weights-filepath ./trained_models/mixed/dveformer_reducedres_mixed.pth \
    --dense-visual-embedding-decoder-n-channels-out 512 \
    --dense-visual-embedding-decoder-n-upsamplings 0 \
    --scannet-semantic-n-classes 20 \
    --enable-linear-probing \
    --linear-probing-weights-path ./trained_models/scannet/dveformer_reducedres_linear_probing_scannet20.pth \
    --validation-split valid \
    --validation-batch-size 2 \
    --validation-only \
    --skip-sanity-check \
    --wandb-mode disabled
Validation results:
{
  ...,
  'valid_dense_visual_embedding_text_based_miou': tensor(0.5032),
  ...,
  'valid_dense_visual_embedding_visual_mean_based_miou': tensor(0.5675),
  ...,
  'valid_linear_probing_miou': tensor(0.6744),
  ...
}

Inference

We provide scripts for inference both on samples drawn from one of the datasets we used (main.py with additional arguments) and on samples located in ./samples (inference_samples.py).

Dataset Inference

To run inference on a dataset with dense visual embedding prediction, use main.py together with --validation-only and --visualize-validation. By default, the visualized outputs are written next to the checkpoint; override the destination with --visualization-output-path if required.

python main.py \
    --dataset nyuv2 \
    --dataset-path ./datasets/nyuv2 \
    --tasks dense-visual-embedding \
    --raw-depth \
    --enable-tf32 \
    --encoder-amp bfp16 \
    --context-module-amp bfp16 \
    --decoder-amp bfp16 \
    --weights-filepath ./trained_models/mixed/dveformer_fullres_mixed.pth \
    --dense-visual-embedding-decoder-n-channels-out 512 \
    --enable-linear-probing \
    --linear-probing-weights-path ./trained_models/nyuv2/dveformer_fullres_linear_probing_nyuv2.pth \
    --validation-batch-size 2 \
    --validation-only \
    --visualize-validation \
    --visualization-output-path ./results/visualized_outputs/nyuv2 \
    --skip-sanity-check \
    --wandb-mode disabled

The same setup applies to SUNRGB-D and ScanNet after adjusting --dataset, --dataset-path, the checkpoint file, and the linear probing file. inference_dataset.py can be used to generate ScanNet submissions in the official evaluation format.

Sample Inference

inference_samples.py applies a trained model to the Kinect v2 example stored in ./samples.

python inference_samples.py \
    --dataset nyuv2 \
    --dataset-path ./datasets/nyuv2 \
    --tasks dense-visual-embedding \
    --raw-depth \
    --weights-filepath ./trained_models/mixed/dveformer_fullres_mixed.pth \
    --enable-linear-probing \
    --linear-probing-weights-path ./trained_models/nyuv2/dveformer_fullres_linear_probing_nyuv2.pth \
    --dense-visual-embedding-decoder-n-channels-out 512 \
    --depth-max 8000 \
    --depth-scale 8 \
    --show-results

Note

The dataset argument is required to determine the correct dataset configuration (classes, colors, ...). The --dataset-path argument is optional; it provides the reference embeddings for text-based and visual mean-based evaluation, and if it is not given, both predictions are skipped. Linear probing is optional as well.


Time Inference

We timed the inference on an NVIDIA Jetson AGX Orin with Jetpack 6.2 (TensorRT 10.3, PyTorch 2.7.0/torchvision 0.22.0).

Reproducing the timings on an NVIDIA Jetson AGX Orin further requires:

  • installing PyTorch and TorchVision via pip3 install --index-url https://pypi.jetson-ai-lab.io/jp6/cu126 torch==2.7.0 torchvision==0.22.0
  • enabling MAXN power mode
  • installing the DVEFormer dependencies

Subsequently, you can run ./inference_time.bash (set DATASET_PATH if you want to use real samples) to reproduce the reported timings.
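
For example, a hypothetical invocation that times inference on real NYUv2 samples (assuming the script picks up DATASET_PATH from the environment):

DATASET_PATH=./datasets/nyuv2 ./inference_time.bash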

Training

Use main.py to train DVEFormer on various dataset combinations or on any other dataset that you implement following the provided dataset implementations. Evaluating every dataset during training adds significant overhead, so we only enable evaluation on NYUv2 while training the model.

Note

Training our DVEFormer with the selected SwinV2-T-128 encoder requires pretrained weights. You can download our ImageNet-pretrained weights directly with:

gdown 1YzwHYuKfyBX4AD6f8snFKYtId3shvDx3 -O trained_models/imagenet_swin_multi_t_v2_128.tar.gz

tar -xzf trained_models/imagenet_swin_multi_t_v2_128.tar.gz -C trained_models

Note

We trained all models on NVIDIA A100-SXM4-40GB GPUs with a batch size of 4. Training the model on a GPU with less VRAM might not work.

Example: Train DVEFormer on the mixed dataset combination:

python main.py \
    --results-basepath ./results \
    --dataset nyuv2:hypersim:sunrgbd:scannet:ade20k^depthanything_v2__indoor_large \
    --dataset-path ./datasets/nyuv2:./datasets/hypersim:./datasets/sunrgbd:./datasets/scannet:./datasets/ade20k \
    --raw-depth \
    --subset-train 1.0:0.1:1.0:0.25:1.0 \
    --split train:train:train:train:train_panoptic_2017 \
    --validation-split test:none:none:none:none \
    --input-modalities rgbd \
    --tasks dense-visual-embedding \
    --rgbd-encoder-backbone swin-multi-t-v2-128 \
    --rgbd-encoder-backbone-pretrained-weights-filepath ./trained_models/imagenet/dveformer_swin_multi_t_v2_128.pth \
    --dense-visual-embedding-decoder-n-channels-out 512 \
    --validation-skip 0.9 \
    --checkpointing-skip 0.9 \
    --checkpointing-best-only \
    --checkpointing-metrics valid_total_loss valid_dense_visual_embedding_text_based_miou valid_dense_visual_embedding_visual_mean_based_miou \
    --compile-model \
    --enable-tf32 \
    --encoder-amp bfp16 \
    --context-module-amp bfp16 \
    --decoder-amp bfp16 \
    --batch-size 4 \
    --validation-batch-size 4 \
    --learning-rate 5e-5 \
    --n-epochs 250 \
    --wandb-mode disabled

Tip

To train the reduced resolution model, append --dense-visual-embedding-decoder-n-upsamplings 0 to the command.

Note

Dataset concatenation is achieved using : as the delimiter. The --subset-train parameter controls the fraction of each dataset sampled per epoch. For example, setting hypersim to 0.1 draws roughly 10% of its samples each epoch, with the subset potentially changing between epochs.
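
For example, a hypothetical two-dataset combination that draws all NYUv2 samples but only half of SUNRGB-D per epoch would change only the dataset-related arguments (keep the remaining flags from the command above):

--dataset nyuv2:sunrgbd \
--dataset-path ./datasets/nyuv2:./datasets/sunrgbd \
--subset-train 1.0:0.5 \
--split train:train \
--validation-split test:none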

For more options, we refer to ./dveformer/args.py or simply run:

python main.py --help

Linear probing (NYUv2 example)

To adapt the full resolution checkpoint to NYUv2 while keeping the backbone frozen, activate linear probing in main.py and initialize the head from the mean visual embeddings:

python main.py \
    --results-basepath ./results \
    --dataset nyuv2 \
    --dataset-path ./datasets/nyuv2 \
    --tasks dense-visual-embedding \
    --raw-depth \
    --weights-filepath ./trained_models/mixed/dveformer_fullres_mixed.pth \
    --dense-visual-embedding-decoder-n-channels-out 512 \
    --enable-linear-probing \
    --linear-probing-weights-init mean \
    --checkpointing-best-only \
    --checkpointing-metrics valid_linear_probing_miou \
    --enable-tf32 \
    --encoder-amp bfp16 \
    --context-module-amp bfp16 \
    --decoder-amp bfp16 \
    --batch-size 4 \
    --validation-batch-size 4 \
    --learning-rate 1e-2 \
    --n-epochs 50 \
    --validation-split test \
    --skip-sanity-check \
    --wandb-mode disabled
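
Conceptually, the linear probing head is a single 1x1 convolution that maps the 512-dimensional embeddings to class logits. The following minimal sketch of the mean-based initialization uses hypothetical names and random stand-in data, not the repository's exact code:

import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes, emb_dim = 40, 512
head = nn.Conv2d(emb_dim, n_classes, kernel_size=1, bias=False)

# initialize each class filter from the normalized per-class mean teacher
# embedding, mirroring --linear-probing-weights-init mean
mean_emb = torch.randn(n_classes, emb_dim)  # stand-in for the real class means
with torch.no_grad():
    head.weight.copy_(F.normalize(mean_emb, dim=-1)[..., None, None])

# class logits for a batch of dense embeddings (B, 512, H, W)
logits = head(torch.randn(2, emb_dim, 8, 8))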

For more options, we refer to ./dveformer/args.py or simply run:

python main.py --help

Changelog

Apr 21, 2026

  • update citation
  • bump lib/nicr-multitask-scene-analysis to version 0.3.1:
    • fixes rare crash in DenseVisualEmbeddingTaskHelper when data augmentation led to having no valid embedding index at all for the whole batch
  • bump lib/nicr-scene-analysis-datasets to version 0.9.0
  • fix off-by-one issue for --validation-force-interval argument and force validation at the end of training
  • force dynamo = False in torch.onnx.export for now
  • use new github markdown alerts
  • add more recent environment (env_dveformer2026.yaml) with Python 3.12 and latest tested PyTorch 2.10.0

Oct 15, 2025

  • initial release of DVEFormer: Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers
