All objects are labeled with one of seven material types: ceramic, glass, wood, plastic, iron, polycarbonate, and steel. The task is formulated as a single-label classification problem: given an RGB image, an impact sound, a tactile image, or their combination, the model must predict the correct material label for the target object.
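As a minimal illustration of the single-label setup, the seven material names can be mapped to integer class indices. The names come from the text above; the specific index order here is an assumption, not necessarily the dataset's actual encoding.

```python
# Hypothetical label encoding for the seven material classes.
# The index order is an assumption for illustration only.
MATERIALS = ["ceramic", "glass", "wood", "plastic", "iron", "polycarbonate", "steel"]
MATERIAL_TO_IDX = {name: i for i, name in enumerate(MATERIALS)}

def encode_label(material: str) -> int:
    """Map a material name to its integer class index."""
    return MATERIAL_TO_IDX[material]
```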
The dataset used to train the baseline models can be downloaded from here.
Start the training process, and test the best model on the test set after training:

```bash
# Train FENet as an example
python main.py --model FENet --config_location ./configs/FENet.yml \
    --modality_list vision touch audio --batch_size 256 \
    --lr 1e-3 --weight_decay 1e-2 --exp FENet_vision_touch_audio
```

Evaluate the best model in FENet_vision_touch_audio:

```bash
# Evaluate FENet as an example
python main.py --model FENet --config_location ./configs/FENet.yml \
    --modality_list vision touch audio --batch_size 256 \
    --lr 1e-3 --weight_decay 1e-2 --exp FENet_vision_touch_audio \
    --eval
```

To train and test your new model on the ObjectFolder Material Classification Benchmark, you only need to modify several files under models. You may follow these simple steps:
- Create new model directory

  ```bash
  mkdir models/my_model
  ```

- Design new model

  ```bash
  cd models/my_model
  touch my_model.py
  ```

- Build the new model and its optimizer

  Add the following code into models/build.py:

  ```python
  elif args.model == 'my_model':
      from my_model import my_model
      model = my_model.my_model(args)
      optimizer = optim.AdamW(model.parameters(), lr=args.lr,
                              weight_decay=args.weight_decay)
  ```

- Add the new model into the pipeline

  Once the new model is built, it can be trained and evaluated similarly:

  ```bash
  python main.py --model my_model --config_location ./configs/my_model.yml \
      --modality_list vision touch audio --batch_size 256 \
      --lr 1e-3 --weight_decay 1e-2 --exp my_model_vision_touch_audio
  ```
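To make the steps above concrete, here is a hedged sketch of what models/my_model/my_model.py might look like, assuming the interface implied by the build.py snippet: a PyTorch module whose constructor takes the parsed args and whose forward fuses one feature per requested modality. The per-modality linear heads and the late-fusion-by-averaging scheme are illustrative assumptions, not the benchmark's actual baseline design.

```python
# Hypothetical sketch of models/my_model/my_model.py.
# Assumes args has a modality_list attribute, as used in the training command.
import torch
import torch.nn as nn

NUM_CLASSES = 7  # ceramic, glass, wood, plastic, iron, polycarbonate, steel

class my_model(nn.Module):
    def __init__(self, args, feat_dim=128):
        super().__init__()
        # One head per modality; real encoders (ResNet, FENet, ...) would go here.
        self.heads = nn.ModuleDict({
            m: nn.Linear(feat_dim, NUM_CLASSES) for m in args.modality_list
        })

    def forward(self, inputs):
        # inputs: dict mapping modality name -> feature tensor of shape (B, feat_dim).
        # Late fusion: average the per-modality class logits.
        logits = [self.heads[m](x) for m, x in inputs.items()]
        return torch.stack(logits, dim=0).mean(dim=0)
```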
The 1,000 objects are randomly split into train/validation/test = 800/100/100, and the model must generalize to new objects at test time. Furthermore, we also conduct a cross-object experiment on ObjectFolder Real to test the Sim2Real transfer ability of the models, in which the 100 real objects are randomly split into train/validation/test = 60/20/20.
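The cross-object split above can be sketched as follows. The seed and the use of integer object IDs are assumptions for illustration; only the 800/100/100 proportions come from the text.

```python
# Sketch of the random cross-object split (800/100/100 over 1,000 object IDs).
# The seed and ID scheme are illustrative assumptions.
import random

def split_objects(n_objects=1000, n_train=800, n_val=100, seed=0):
    ids = list(range(n_objects))
    random.Random(seed).shuffle(ids)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

train_ids, val_ids, test_ids = split_objects()
```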
| Method | Vision | Touch | Audio | Fusion |
| --- | --- | --- | --- | --- |
| ResNet | 91.89 | 74.36 | 94.91 | 96.28 |
| FENet | 92.25 | 75.89 | 95.80 | 96.60 |
| Method | Accuracy |
| --- | --- |
| ResNet w/o pretrain | 45.25 |
| ResNet | 51.02 |