Given an RGB image of an object, a sequence of tactile readings from the object’s surface, or a sequence of impact sounds of striking its surface locations, the task is to reconstruct the point cloud of the target object given combinations of these multisensory observations.
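As a rough illustration of the task setup only (every name below is hypothetical and not from the benchmark code), late fusion of per-modality features into an N × 3 point cloud can be sketched in plain Python:

```python
# Toy sketch of multisensory point-cloud prediction: encode each
# modality, fuse by concatenation, decode to N x 3 points.
# All encoders/decoders here are stand-ins, not the benchmark models.
import random

random.seed(0)

def encode(observation, dim=8):
    """Stand-in encoder: fold raw values into a fixed-size feature."""
    feat = [0.0] * dim
    for i, v in enumerate(observation):
        feat[i % dim] += v
    return feat

def fuse(features):
    """Late fusion by concatenating per-modality features."""
    fused = []
    for f in features:
        fused.extend(f)
    return fused

def decode(fused, num_points=4):
    """Stand-in decoder: a fixed random linear map to 3-D points."""
    weights = [[random.uniform(-1, 1) for _ in fused]
               for _ in range(num_points * 3)]
    flat = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    return [flat[i:i + 3] for i in range(0, len(flat), 3)]

vision = [0.2, 0.5, 0.1]      # e.g. pooled RGB features
touch = [0.9, 0.3]            # e.g. pooled tactile features
audio = [0.4, 0.8, 0.6, 0.2]  # e.g. pooled impact-sound features

points = decode(fuse([encode(m) for m in (vision, touch, audio)]))
```

The real baselines replace each stand-in with learned networks, but the overall shape of the pipeline (per-modality encoding, fusion, point-set decoding) is the same.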
The dataset used to train the baseline models can be downloaded from here.
Start the training process, and test the best model on the test set after training:

```sh
# Train PCN as an example
python main.py --modality_list vision touch audio \
    --model PCN \
    --batch_size 8 \
    --epochs 10 \
    --local_gt_points_location ../DATA_new/local_gt_points_down_sampled \
    --lr 1e-4 --exp pcn/vision_touch_audio \
    --config_location ./configs/PCN.yml \
    --normalize
```

Evaluate the best model in pcn/vision_touch_audio:
```sh
# Evaluate PCN as an example
python main.py --modality_list vision touch audio \
    --model PCN \
    --batch_size 8 \
    --epochs 10 \
    --local_gt_points_location ../DATA_new/local_gt_points_down_sampled \
    --lr 1e-4 --exp pcn/vision_touch_audio \
    --config_location ./configs/PCN.yml \
    --normalize --eval
```

To train and test your new model on the ObjectFolder 3D Shape Reconstruction Benchmark, you only need to modify several files in `models`. You may follow these simple steps.
- Create a new model directory:

  ```sh
  mkdir models/my_model
  ```

- Design the new model:

  ```sh
  cd models/my_model
  touch my_model.py
  ```

- Build the new model and its optimizer.

  Add the following code into `models/build.py`:

  ```python
  elif args.model == 'my_model':
      from my_model import my_model
      model = my_model.my_model(args)
      optimizer = optim.AdamW(model.parameters(), lr=args.lr,
                              weight_decay=args.weight_decay)
  ```

- Add the new model into the pipeline.

  Once the new model is built, it can be trained and evaluated similarly:

  ```sh
  python main.py --modality_list vision touch audio \
      --model my_model \
      --batch_size 8 \
      --epochs 10 \
      --local_gt_points_location ../DATA_new/local_gt_points_down_sampled \
      --lr 1e-4 --exp my_model/vision_touch_audio \
      --config_location ./configs/my_model.yml \
      --normalize
  ```
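A minimal `my_model.py` skeleton compatible with the `build.py` snippet in the steps above might look like the following. The layer sizes, the assumed `(batch, 256)` fused-feature input, and the `args.num_points` field are all illustrative assumptions, not part of the benchmark's API; adapt them to your design.

```python
# models/my_model/my_model.py -- minimal skeleton.
# Hypothetical sizes: a 256-d fused multisensory feature in,
# args.num_points x 3 points out.
import torch
import torch.nn as nn

class my_model(nn.Module):
    def __init__(self, args):
        super().__init__()
        # Output resolution of the predicted point cloud (assumed field).
        self.num_points = getattr(args, 'num_points', 1024)
        # Placeholder encoder: fused features -> latent code.
        self.encoder = nn.Sequential(
            nn.Linear(256, 512),
            nn.ReLU(),
        )
        # Placeholder decoder: latent code -> flattened N x 3 points.
        self.decoder = nn.Linear(512, self.num_points * 3)

    def forward(self, x):
        # x: (batch, 256) fused multisensory features (assumed shape).
        latent = self.encoder(x)
        points = self.decoder(latent)
        return points.view(-1, self.num_points, 3)
```

Because the class exposes `parameters()` via `nn.Module`, the `optim.AdamW` call in `build.py` works unchanged.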
For the visual RGB images, tactile RGB images, and impact sounds used in this task, we sample 100 instances per object: camera viewpoints around the object for vision, and locations on its surface for touch and audio.
In all, given the 1,000 objects, we obtain 1,000 × 100 = 100,000 instances for each of the vision, touch, and audio modalities. In the experiments, we randomly split the 1,000 objects into train/validation/test = 800/100/100, meaning that the models need to generalize to new objects during testing. Furthermore, we also test model performance on ObjectFolder Real by similarly splitting its 100 objects into train/validation/test = 60/20/20.
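The object-level 800/100/100 split described above can be sketched with a simple shuffle. The seed and the placeholder object IDs here are illustrative; the benchmark's actual split may use a different seed and naming.

```python
# Object-level split: test objects are never seen during training.
import random

object_ids = [f"obj_{i:04d}" for i in range(1000)]  # 1,000 objects

random.seed(42)  # illustrative seed only
random.shuffle(object_ids)

train_ids = object_ids[:800]
val_ids = object_ids[800:900]
test_ids = object_ids[900:]
```

Splitting by object (rather than by instance) is what forces the models to generalize to unseen shapes at test time.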
| Method | Vision | Touch | Audio | V+T | V+A | T+A | V+T+A |
| ------ | ------ | ----- | ----- | ---- | ---- | ---- | ----- |
| MDN | 4.02 | 3.88 | 5.04 | 3.19 | 4.05 | 3.49 | 2.91 |
| PCN | 2.36 | 3.81 | 3.85 | 2.30 | 2.48 | 3.27 | 2.25 |
| MRT | 2.80 | 4.12 | 5.01 | 2.78 | 3.13 | 4.28 | 3.08 |
| Method | Vision | Touch | Audio | V+T | V+A | T+A | V+T+A |
| ------ | ------ | ----- | ----- | ---- | ---- | ---- | ----- |
| MRT | 1.17 | 1.04 | 1.04 | 0.96 | 1.50 | 1.12 | 0.95 |