A curated list of different papers and datasets in various areas of audio-visual processing
Updated Jan 30, 2024
Sparse Fuse Dense: Towards High Quality 3D Detection with Depth Completion (CVPR 2022, Oral)
Multimodal Transformer for Korean Sentiment Analysis with Audio and Text Features
PyTorch Implementation of Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
[Paper][IJCNN2023] Modality-Aware Negative Sampling for Multi-modal Knowledge Graph Embedding
Multi-modal AI agent that extracts information from PDFs, images, and documents to answer questions. Combines vision models with RAG architecture for intelligent document understanding. Upload any file and chat with your documents. Built with LangChain, vision APIs, and vector embeddings.
A multi-language invoice data extractor tool using Google Gemini Pro and Streamlit with Prompt Engineering.
This repo reproduces key findings from Masked Autoencoders Are Scalable Vision Learners (MAE) on CIFAR-10: self-supervised pretraining improves downstream classification compared with training from scratch. It also studies how decoder depth and decoder width affect MAE pretraining and downstream results.