This project demonstrates how to build a backend for a Multimodal Retrieval-Augmented Generation (RAG) system using:
- Node.js and Express for the backend
- Google Vertex AI for generating text, image, and video embeddings
- Pinecone for storing and querying vector representations
## Features

- Upload and vectorize content via `/upload`
  - Supports text, images, and videos
  - Video is split into segments and each segment is vectorized individually by Vertex AI (see the sketch after this list)
  - Vectors are stored with metadata in Pinecone
- Search via the `/search` endpoint using a natural language query
  - Returns the top 3 most relevant items with metadata
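As a rough illustration of the video step, here is a minimal sketch of segment-wise embedding against Vertex AI's `multimodalembedding@001` model. The function name `embedVideoSegments`, the use of `google-auth-library`, and the 16-second interval are assumptions for this example, not the project's actual `vertex.js`:

```js
// Hypothetical sketch: one embedding per fixed-length video segment
// via Vertex AI's multimodalembedding@001 model (Node 18+, global fetch).
import { GoogleAuth } from 'google-auth-library';

const LOCATION = process.env.GCP_LOCATION; // e.g. us-central1
const auth = new GoogleAuth({
  scopes: 'https://www.googleapis.com/auth/cloud-platform',
});

async function embedVideoSegments(base64Video) {
  const projectId = await auth.getProjectId();
  const token = await auth.getAccessToken();
  const url =
    `https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${projectId}` +
    `/locations/${LOCATION}/publishers/google/models/multimodalembedding@001:predict`;

  const res = await fetch(url, {
    method: 'POST',
    headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      instances: [{
        video: {
          bytesBase64Encoded: base64Video,
          // One embedding per 16-second segment (intervalSec accepts 4-120).
          videoSegmentConfig: { intervalSec: 16 },
        },
      }],
    }),
  });

  const { predictions } = await res.json();
  // Each entry: { startOffsetSec, endOffsetSec, embedding: number[] }
  return predictions[0].videoEmbeddings;
}
```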
## Installation

```bash
git clone https://github.com/ym-empat/multimodal-rag-nodejs.git
cd multimodal-rag-nodejs
npm install
```

Create a `.env` file:

```env
PORT=3000
GCP_LOCATION=us-central1
GOOGLE_APPLICATION_CREDENTIALS=./service-account.json
PINECONE_API_KEY=example_api_key
PINECONE_INDEX_NAME=example_index
```

Run the server:

```bash
node app.js
```

## API

### `/upload`

Upload content to be vectorized and stored in Pinecone.
Send a `content` field in `multipart/form-data`:

- If it's text → processed as text
- If it's an image → embedded using Vertex AI
- If it's a video → embedded using Vertex AI
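For instance, assuming the server is running locally on the `.env` port and the endpoint accepts a `POST`, an image upload could look like this (the file name and response handling are illustrative):

```js
// Upload an image as the `content` field (Node 18+, global fetch/FormData/Blob).
import { readFile } from 'node:fs/promises';

const form = new FormData();
form.append('content', new Blob([await readFile('./photo.jpg')]), 'photo.jpg');

const res = await fetch('http://localhost:3000/upload', {
  method: 'POST',
  body: form,
});
console.log(await res.text()); // response format depends on the server
```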
### `/search`

Query Pinecone using a natural language string.

```json
{
  "query": "How do I change the theme color?"
}
```

Returns metadata for the top 3 most similar results.
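Assuming the endpoint accepts a `POST` with the JSON body shown above, a minimal client call might look like:

```js
// Query the /search endpoint (Node 18+, global fetch).
const res = await fetch('http://localhost:3000/search', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ query: 'How do I change the theme color?' }),
});
console.log(await res.json()); // metadata for the top 3 matches
```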
## Project Structure

```
.
├── app.js       # Main Express server
├── vertex.js    # Vertex AI embedding logic
├── pinecone.js  # Pinecone vector storage & search
├── .env         # Environment variables
└── README.md
```
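As a sketch of what `pinecone.js` might contain, using the official `@pinecone-database/pinecone` SDK (the exported function names and call shapes are illustrative, not the project's actual code):

```js
// Hypothetical pinecone.js: store vectors with metadata, query top 3.
import { Pinecone } from '@pinecone-database/pinecone';

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pc.index(process.env.PINECONE_INDEX_NAME);

// Store one embedding together with its metadata.
export async function storeVector(id, values, metadata) {
  await index.upsert([{ id, values, metadata }]);
}

// Return metadata for the 3 nearest neighbours of a query embedding.
export async function search(queryValues) {
  const { matches } = await index.query({
    vector: queryValues,
    topK: 3,
    includeMetadata: true,
  });
  return matches.map((m) => m.metadata);
}
```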
## Tech Stack

- Node.js + Express
- Google Vertex AI
- Pinecone Vector DB

## Roadmap

- Conversational RAG using Gemini
## License

MIT. Use and adapt freely.

## Contributing

Feel free to open issues or submit pull requests if you'd like to collaborate or extend the project.

This solution was developed by Empat.tech.