# Cross-Sensory Retrieval

Cross-sensory retrieval requires the model to take one sensory modality as input and retrieve the corresponding data of another modality. For instance, given the sound of striking a mug, the "audio2vision" model needs to retrieve the corresponding image of the mug from a pool of images of hundreds of objects. In this benchmark, each sensory modality (vision, audio, touch) can be used as either input or output, leading to 9 sub-tasks.

For each object, given modality A and modality B (where A and B can each be vision, touch, or audio), the goal of cross-sensory retrieval is to minimize the distance between the representations of sensory observations from the same object while maximizing the distance between representations from different objects.
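The benchmark's baselines (CCA, PLSCA, DSCMR, DAR) realize this objective in different ways; as a minimal illustration only, a pairwise contrastive loss over cross-modal embeddings might look like the sketch below. The function name, signature, and margin formulation are illustrative assumptions, not the benchmark's actual code.

```python
import numpy as np

def pairwise_contrastive_loss(emb_a, emb_b, labels, margin=1.0):
    """Toy contrastive objective: pull together embeddings of the same
    object across modalities, push apart embeddings of different objects.

    emb_a, emb_b: (N, D) arrays of modality-A and modality-B embeddings.
    labels: length-N object indices; the pair (i, j) is positive
    iff labels[i] == labels[j].
    """
    total, count = 0.0, 0
    for i in range(len(emb_a)):
        for j in range(len(emb_b)):
            d = np.linalg.norm(emb_a[i] - emb_b[j])
            if labels[i] == labels[j]:
                total += d ** 2                     # same object: minimize distance
            else:
                total += max(0.0, margin - d) ** 2  # different objects: push apart up to the margin
            count += 1
    return total / count
```

When same-object embeddings coincide and different-object embeddings are farther apart than the margin, the loss is zero, which matches the stated objective.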

Specifically, we sample 100 instances from each modality of each object, yielding two instance sets $S_A$ and $S_B$. We then pair the instances from the two modalities by Cartesian product:

$$P(i) = S_A(i) \times S_B(i),$$

where $i$ is the object index and $P(i)$ is the set of instance pairs for object $i$.
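The pairing step above maps directly onto `itertools.product`. The helper below is a sketch; its name is an assumption, not code from the repository:

```python
from itertools import product

def build_instance_pairs(instances_a, instances_b):
    """Pair every modality-A instance of an object with every
    modality-B instance of the same object (Cartesian product)."""
    return list(product(instances_a, instances_b))

# With 100 instances sampled per modality, each object yields
# 100 x 100 = 10,000 cross-modal instance pairs.
```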

## Usage

### Data Preparation

The dataset used to train the baseline models can be downloaded from here.

### Training & Evaluation

Start the training process; after training finishes, the best model is evaluated on the test set:

```sh
# Train DSCMR as an example
python main.py --model DSCMR --config_location ./configs/DSCMR.yml \
               --epochs 10 --weight_decay 1e-2 --modality_list vision touch \
               --exp DSCMR_vision_touch --batch_size 32
```

Evaluate the best model from the `DSCMR_vision_touch` experiment:

```sh
# Evaluate DSCMR as an example
python main.py --model DSCMR --config_location ./configs/DSCMR.yml \
               --modality_list vision touch \
               --exp DSCMR_vision_touch --batch_size 32 \
               --eval
```

## Add your own model

To train and test your own model on the ObjectFolder Cross-Sensory Retrieval Benchmark, you only need to modify a few files under `models/`. Follow these steps:

1. Create a new model directory:

   ```sh
   mkdir models/my_model
   ```

2. Design the new model:

   ```sh
   cd models/my_model
   touch my_model.py
   ```

3. Build the new model and its optimizer by adding the following code to `models/build.py`:

   ```python
   elif args.model == 'my_model':
       from my_model import my_model
       model = my_model.my_model(args)
       optimizer = optim.AdamW(model.parameters(), lr=args.lr,
                               weight_decay=args.weight_decay)
   ```

4. Add the new model to the pipeline. Once built, it can be trained and evaluated like the existing baselines:

   ```sh
   python main.py --model my_model --config_location ./configs/my_model.yml \
                  --epochs 10 --modality_list vision touch \
                  --exp my_model_vision_touch --batch_size 32
   ```

## Results on ObjectFolder Cross-Sensory Retrieval Benchmark

In our experiments, we randomly split the objects from ObjectFolder into train/val/test splits of 800/100/100 objects, and split the 10 instances of each object from ObjectFolder Real into 8/1/1. During retrieval, each instance of the input sensory modality serves as the query, and the instances of the other sensory modality are ranked by cosine similarity. Average Precision (AP) is then computed by treating retrieved instances from the same object as positives and all others as negatives. Finally, model performance is measured by the mean Average Precision (mAP) score, a widely used metric for retrieval.
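The evaluation protocol above can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation code; the function name and signature are assumptions:

```python
import numpy as np

def mean_average_precision(queries, gallery, query_labels, gallery_labels):
    """Rank gallery items by cosine similarity to each query and compute
    mean Average Precision, treating same-object items as positives."""
    query_labels = np.asarray(query_labels)
    gallery_labels = np.asarray(gallery_labels)

    # L2-normalize so the dot product equals cosine similarity
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T                                   # (num_queries, num_gallery)

    aps = []
    for i in range(len(q)):
        order = np.argsort(-sims[i])                 # most similar first
        rel = (gallery_labels[order] == query_labels[i]).astype(float)
        if rel.sum() == 0:
            continue                                 # no positives for this query
        precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append((precision_at_k * rel).sum() / rel.sum())
    return float(np.mean(aps))
```

A perfect ranking (all positives ahead of all negatives) gives an AP of 1.0 per query; random ranking approaches the positive rate, which is why the RANDOM baselines in the tables below sit near the chance level.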

### Results on ObjectFolder

| Input  | Retrieved                  | RANDOM | CCA   | PLSCA | DSCMR | DAR   |
|--------|----------------------------|--------|-------|-------|-------|-------|
| Vision | Vision (different views)   | 1.00   | 55.52 | 82.43 | 82.74 | 89.28 |
| Vision | Audio                      | 1.00   | 19.56 | 11.53 | 9.13  | 20.64 |
| Vision | Touch                      | 1.00   | 6.97  | 6.33  | 3.57  | 7.03  |
| Audio  | Vision                     | 1.00   | 20.58 | 13.37 | 10.84 | 20.17 |
| Audio  | Audio (different vertices) | 1.00   | 70.53 | 80.77 | 75.45 | 77.80 |
| Audio  | Touch                      | 1.00   | 5.27  | 6.96  | 5.30  | 6.91  |
| Touch  | Vision                     | 1.00   | 8.50  | 6.25  | 4.92  | 8.80  |
| Touch  | Audio                      | 1.00   | 6.18  | 7.11  | 6.15  | 7.77  |
| Touch  | Touch (different vertices) | 1.00   | 28.06 | 52.30 | 51.08 | 54.80 |

### Results on ObjectFolder Real

| Input  | Retrieved                  | RANDOM | CCA   | PLSCA | DSCMR | DAR   |
|--------|----------------------------|--------|-------|-------|-------|-------|
| Vision | Vision (different views)   | 3.72   | 30.60 | 60.95 | 81.27 | 81.00 |
| Vision | Audio                      | 3.72   | 12.05 | 27.12 | 68.34 | 66.92 |
| Vision | Touch                      | 3.72   | 6.29  | 9.77  | 64.91 | 39.46 |
| Audio  | Vision                     | 3.72   | 12.41 | 30.54 | 67.16 | 64.35 |
| Audio  | Audio (different vertices) | 3.72   | 27.40 | 55.75 | 72.59 | 68.79 |
| Audio  | Touch                      | 3.72   | 5.38  | 11.66 | 54.55 | 33.00 |
| Touch  | Vision                     | 3.72   | 6.40  | 11.46 | 64.86 | 41.18 |
| Touch  | Audio                      | 3.72   | 5.57  | 13.89 | 55.37 | 37.30 |
| Touch  | Touch (different vertices) | 3.72   | 21.16 | 27.97 | 66.09 | 41.42 |
