Cross-sensory retrieval requires the model to take one sensory modality as input and retrieve the corresponding data in another modality. For instance, given the sound of striking a mug, an "audio2vision" model needs to retrieve the corresponding image of the mug from a pool of images of hundreds of objects. In this benchmark, each sensory modality (vision, audio, touch) can serve as either input or output, leading to 9 sub-tasks.
For each object, given modality A and modality B (where A and B can each be vision, touch, or audio), the goal of cross-sensory retrieval is to minimize the distance between representations of sensory observations from the same object while maximizing the distance between those from different objects.
Specifically, we sample 100 instances from each modality of each object, resulting in two instance sets (one per modality).
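The objective above can be illustrated with a simple margin-based loss over paired embeddings. This is a minimal sketch, not the training objective of any particular baseline below; the function names and the margin value are our own:

```python
import numpy as np

def pairwise_cosine(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def cross_sensory_margin_loss(emb_a, emb_b, margin=0.5):
    """Hinge-style loss: same-object pairs (the diagonal) should be more
    similar than cross-object pairs (off-diagonal) by at least `margin`."""
    sim = pairwise_cosine(emb_a, emb_b)    # (N, N): query i vs. object j
    pos = np.diag(sim)                     # similarity of matching pairs
    # Penalize any negative that comes within `margin` of the positive.
    hinge = np.maximum(0.0, margin + sim - pos[:, None])
    np.fill_diagonal(hinge, 0.0)           # never penalize the positive itself
    return hinge.mean()

# Toy check: well-separated modality embeddings yield (near-)zero loss.
rng = np.random.default_rng(0)
emb_a = np.eye(4) + 0.01 * rng.standard_normal((4, 4))
emb_b = np.eye(4) + 0.01 * rng.standard_normal((4, 4))
loss = cross_sensory_margin_loss(emb_a, emb_b)
```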
The dataset used to train the baseline models can be downloaded from here.
Start the training process, then test the best model on the test set after training:
```
# Train DSCMR as an example
python main.py --model DSCMR --config_location ./configs/DSCMR.yml \
    --epochs 10 --weight_decay 1e-2 --modality_list vision touch \
    --exp DSCMR_vision_touch --batch_size 32
```

Evaluate the best model in DSCMR_vision_touch:
```
# Evaluate DSCMR as an example
python main.py --model DSCMR --config_location ./configs/DSCMR.yml \
    --modality_list vision touch \
    --exp DSCMR_vision_touch --batch_size 32 \
    --eval
```

To train and test your new model on the ObjectFolder Cross-Sensory Retrieval Benchmark, you only need to modify a few files in `models`. You may follow these simple steps:
- Create a new model directory:

  ```
  mkdir models/my_model
  ```

- Design the new model:

  ```
  cd models/my_model
  touch my_model.py
  ```

- Build the new model and its optimizer by adding the following code to `models/build.py`:

  ```python
  elif args.model == 'my_model':
      from my_model import my_model
      model = my_model.my_model(args)
      optimizer = optim.AdamW(model.parameters(), lr=args.lr,
                              weight_decay=args.weight_decay)
  ```

- Add the new model into the pipeline. Once the new model is built, it can be trained and evaluated similarly:

  ```
  python main.py --model my_model --config_location ./configs/my_model.yml \
      --epochs 10 --modality_list vision touch \
      --exp my_model_vision_touch --batch_size 32
  ```
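As a starting point, `models/my_model/my_model.py` might look like the following minimal sketch: one small encoder per modality projecting into a shared embedding space. The encoder sizes, the dict-based input interface, and the way `args.modality_list` is read are all assumptions to adapt to the benchmark's actual data loaders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class my_model(nn.Module):
    """Hypothetical baseline: one MLP encoder per modality that projects
    inputs into a shared embedding space for cross-sensory retrieval."""
    def __init__(self, args=None, input_dim=128, embed_dim=64):
        super().__init__()
        modalities = getattr(args, "modality_list", None) or ["vision", "touch"]
        self.encoders = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                             nn.Linear(256, embed_dim))
            for m in modalities
        })

    def forward(self, inputs):
        # inputs: dict mapping modality name -> (batch, input_dim) tensor.
        # L2-normalized outputs make cosine-similarity retrieval a dot product.
        return {m: F.normalize(self.encoders[m](x), dim=1)
                for m, x in inputs.items()}

model = my_model()
out = model({"vision": torch.randn(4, 128), "touch": torch.randn(4, 128)})
```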
In our experiments, we randomly split the objects from ObjectFolder into train/val/test splits of 800/100/100 objects, and split the 10 instances of each object from ObjectFolder Real into 8/1/1. In the retrieval process, we set each instance in the input sensory modality as the query, and the instances from the other sensory modality are retrieved by ranking them according to cosine similarity. Next, the Average Precision (AP) is computed by considering the retrieved instances from the same object as positive and all others as negative. Finally, the model performance is measured by the mean Average Precision (mAP) score, a widely used metric for evaluating retrieval performance.
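The evaluation procedure above can be sketched as follows. This is a minimal illustration of cosine-similarity ranking plus mAP; the function name and array layout are ours, not the benchmark code's:

```python
import numpy as np

def mean_average_precision(query_emb, gallery_emb, query_labels, gallery_labels):
    """mAP for cross-sensory retrieval: each query ranks the whole gallery by
    cosine similarity; gallery items from the same object count as positives."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T
    aps = []
    for i in range(len(q)):
        order = np.argsort(-sims[i])                 # best match first
        hits = gallery_labels[order] == query_labels[i]
        if not hits.any():
            continue                                 # no positives for this query
        ranks = np.flatnonzero(hits) + 1             # 1-based ranks of positives
        precision_at_hit = np.arange(1, len(ranks) + 1) / ranks
        aps.append(precision_at_hit.mean())
    return float(np.mean(aps))

# Toy check: perfectly aligned embeddings retrieve the right object first.
labels = np.array([0, 1, 2])
emb = np.eye(3)
perfect_map = mean_average_precision(emb, emb, labels, labels)
```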
Results on ObjectFolder (mAP, %):

| Input | Retrieved | RANDOM | CCA | PLSCA | DSCMR | DAR |
| --- | --- | --- | --- | --- | --- | --- |
| Vision | Vision (different views) | 1.00 | 55.52 | 82.43 | 82.74 | 89.28 |
| Vision | Audio | 1.00 | 19.56 | 11.53 | 9.13 | 20.64 |
| Vision | Touch | 1.00 | 6.97 | 6.33 | 3.57 | 7.03 |
| Audio | Vision | 1.00 | 20.58 | 13.37 | 10.84 | 20.17 |
| Audio | Audio (different vertices) | 1.00 | 70.53 | 80.77 | 75.45 | 77.80 |
| Audio | Touch | 1.00 | 5.27 | 6.96 | 5.30 | 6.91 |
| Touch | Vision | 1.00 | 8.50 | 6.25 | 4.92 | 8.80 |
| Touch | Audio | 1.00 | 6.18 | 7.11 | 6.15 | 7.77 |
| Touch | Touch (different vertices) | 1.00 | 28.06 | 52.30 | 51.08 | 54.80 |
Results on ObjectFolder Real (mAP, %):

| Input | Retrieved | RANDOM | CCA | PLSCA | DSCMR | DAR |
| --- | --- | --- | --- | --- | --- | --- |
| Vision | Vision (different views) | 3.72 | 30.60 | 60.95 | 81.27 | 81.00 |
| Vision | Audio | 3.72 | 12.05 | 27.12 | 68.34 | 66.92 |
| Vision | Touch | 3.72 | 6.29 | 9.77 | 64.91 | 39.46 |
| Audio | Vision | 3.72 | 12.41 | 30.54 | 67.16 | 64.35 |
| Audio | Audio (different vertices) | 3.72 | 27.40 | 55.75 | 72.59 | 68.79 |
| Audio | Touch | 3.72 | 5.38 | 11.66 | 54.55 | 33.00 |
| Touch | Vision | 3.72 | 6.40 | 11.46 | 64.86 | 41.18 |
| Touch | Audio | 3.72 | 5.57 | 13.89 | 55.37 | 37.30 |
| Touch | Touch (different vertices) | 3.72 | 21.16 | 27.97 | 66.09 | 41.42 |