Cross-sensory retrieval requires the model to take one sensory modality as input and retrieve the corresponding data in another modality. For instance, given the sound of striking a mug, an "audio2vision" model needs to retrieve the corresponding image of the mug from a pool of images of hundreds of objects. In this benchmark, each sensory modality (vision, audio, touch) can serve as either input or output, leading to 9 sub-tasks.
For each object, given modality A and modality B (where A and B can each be vision, touch, or audio), the goal of cross-sensory retrieval is to minimize the distance between representations of sensory observations from the same object while maximizing the distance between those from different objects.
Specifically, we sample 100 instances from each modality of each object, resulting in two instance sets (one per modality).
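The objective above can be illustrated with a simple margin-based loss over paired embeddings. This is a minimal sketch, not the training objective of any particular baseline below; the function names and the margin value are our own:

```python
import numpy as np

def pairwise_cosine(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def cross_sensory_margin_loss(emb_a, emb_b, margin=0.5):
    """Hinge-style loss: same-object pairs (the diagonal) should be more
    similar than cross-object pairs (off-diagonal) by at least `margin`."""
    sim = pairwise_cosine(emb_a, emb_b)    # (N, N): query i vs. object j
    pos = np.diag(sim)                     # similarity of matching pairs
    # Penalize any negative that comes within `margin` of the positive.
    hinge = np.maximum(0.0, margin + sim - pos[:, None])
    np.fill_diagonal(hinge, 0.0)           # never penalize the positive itself
    return hinge.mean()

# Toy check: well-separated modality embeddings yield (near-)zero loss.
rng = np.random.default_rng(0)
emb_a = np.eye(4) + 0.01 * rng.standard_normal((4, 4))
emb_b = np.eye(4) + 0.01 * rng.standard_normal((4, 4))
loss = cross_sensory_margin_loss(emb_a, emb_b)
```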
The dataset used to train the baseline models can be downloaded from here.
Start the training process, then test the best model on the test set after training:
```
# Train DSCMR as an example
python main.py --model DSCMR --config_location ./configs/DSCMR.yml \
    --epochs 10 --weight_decay 1e-2 --modality_list vision touch \
    --exp DSCMR_vision_touch --batch_size 32
```

Evaluate the best model in DSCMR_vision_touch:
```
# Evaluate DSCMR as an example
python main.py --model DSCMR --config_location ./configs/DSCMR.yml \
    --modality_list vision touch \
    --exp DSCMR_vision_touch --batch_size 32 \
    --eval
```

To train and test your new model on the ObjectFolder Cross-Sensory Retrieval Benchmark, you only need to modify a few files in `models`. You may follow these simple steps:
- Create a new model directory:

  ```
  mkdir models/my_model
  ```

- Design the new model:

  ```
  cd models/my_model
  touch my_model.py
  ```

- Build the new model and its optimizer by adding the following code to `models/build.py`:

  ```python
  elif args.model == 'my_model':
      from my_model import my_model
      model = my_model.my_model(args)
      optimizer = optim.AdamW(model.parameters(), lr=args.lr,
                              weight_decay=args.weight_decay)
  ```

- Add the new model into the pipeline. Once the new model is built, it can be trained and evaluated similarly:

  ```
  python main.py --model my_model --config_location ./configs/my_model.yml \
      --epochs 10 --modality_list vision touch \
      --exp my_model_vision_touch --batch_size 32
  ```
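As a starting point, `models/my_model/my_model.py` might look like the following minimal sketch: one small encoder per modality projecting into a shared embedding space. The encoder sizes, the dict-based input interface, and the way `args.modality_list` is read are all assumptions to adapt to the benchmark's actual data loaders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class my_model(nn.Module):
    """Hypothetical baseline: one MLP encoder per modality that projects
    inputs into a shared embedding space for cross-sensory retrieval."""
    def __init__(self, args=None, input_dim=128, embed_dim=64):
        super().__init__()
        modalities = getattr(args, "modality_list", None) or ["vision", "touch"]
        self.encoders = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                             nn.Linear(256, embed_dim))
            for m in modalities
        })

    def forward(self, inputs):
        # inputs: dict mapping modality name -> (batch, input_dim) tensor.
        # L2-normalized outputs make cosine-similarity retrieval a dot product.
        return {m: F.normalize(self.encoders[m](x), dim=1)
                for m, x in inputs.items()}

model = my_model()
out = model({"vision": torch.randn(4, 128), "touch": torch.randn(4, 128)})
```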
In our experiments, we randomly split the objects from ObjectFolder into train/val/test splits of 800/100/100 objects, and split the 10 instances of each object from ObjectFolder Real into 8/1/1. In the retrieval process, we set each instance in the input sensory modality as the query, and the instances from the other sensory modality are retrieved by ranking them according to cosine similarity. Next, the Average Precision (AP) is computed by considering the retrieved instances from the same object as positive and all others as negative. Finally, the model performance is measured by the mean Average Precision (mAP) score, a widely used metric for evaluating retrieval performance.
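The evaluation procedure above can be sketched as follows. This is a minimal illustration of cosine-similarity ranking plus mAP; the function name and array layout are ours, not the benchmark code's:

```python
import numpy as np

def mean_average_precision(query_emb, gallery_emb, query_labels, gallery_labels):
    """mAP for cross-sensory retrieval: each query ranks the whole gallery by
    cosine similarity; gallery items from the same object count as positives."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T
    aps = []
    for i in range(len(q)):
        order = np.argsort(-sims[i])                 # best match first
        hits = gallery_labels[order] == query_labels[i]
        if not hits.any():
            continue                                 # no positives for this query
        ranks = np.flatnonzero(hits) + 1             # 1-based ranks of positives
        precision_at_hit = np.arange(1, len(ranks) + 1) / ranks
        aps.append(precision_at_hit.mean())
    return float(np.mean(aps))

# Toy check: perfectly aligned embeddings retrieve the right object first.
labels = np.array([0, 1, 2])
emb = np.eye(3)
perfect_map = mean_average_precision(emb, emb, labels, labels)
```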
Results on ObjectFolder (mAP, %):

| Input | Retrieved | RANDOM | CCA | PLSCA | DSCMR | DAR |
| --- | --- | --- | --- | --- | --- | --- |
| Vision | Vision (different views) | 1.00 | 55.52 | 82.43 | 82.74 | 89.28 |
| Vision | Audio | 1.00 | 19.56 | 11.53 | 9.13 | 20.64 |
| Vision | Touch | 1.00 | 6.97 | 6.33 | 3.57 | 7.03 |
| Audio | Vision | 1.00 | 20.58 | 13.37 | 10.84 | 20.17 |
| Audio | Audio (different vertices) | 1.00 | 70.53 | 80.77 | 75.45 | 77.80 |
| Audio | Touch | 1.00 | 5.27 | 6.96 | 5.30 | 6.91 |
| Touch | Vision | 1.00 | 8.50 | 6.25 | 4.92 | 8.80 |
| Touch | Audio | 1.00 | 6.18 | 7.11 | 6.15 | 7.77 |
| Touch | Touch (different vertices) | 1.00 | 28.06 | 52.30 | 51.08 | 54.80 |
Results on ObjectFolder Real (mAP, %):

| Input | Retrieved | RANDOM | CCA | PLSCA | DSCMR | DAR |
| --- | --- | --- | --- | --- | --- | --- |
| Vision | Vision (different views) | 3.72 | 30.60 | 60.95 | 81.27 | 81.00 |
| Vision | Audio | 3.72 | 12.05 | 27.12 | 68.34 | 66.92 |
| Vision | Touch | 3.72 | 6.29 | 9.77 | 64.91 | 39.46 |
| Audio | Vision | 3.72 | 12.41 | 30.54 | 67.16 | 64.35 |
| Audio | Audio (different vertices) | 3.72 | 27.40 | 55.75 | 72.59 | 68.79 |
| Audio | Touch | 3.72 | 5.38 | 11.66 | 54.55 | 33.00 |
| Touch | Vision | 3.72 | 6.40 | 11.46 | 64.86 | 41.18 |
| Touch | Audio | 3.72 | 5.57 | 13.89 | 55.37 | 37.30 |
| Touch | Touch (different vertices) | 3.72 | 21.16 | 27.97 | 66.09 | 41.42 |