MediVoxAI is a next-generation Medical Chatbot powered by a Multimodal Large Language Model (LLM) with both Vision and Voice capabilities. MediVoxAI can converse with patients, understand spoken questions, analyze medical images, and respond with empathetic and informative answers — making healthcare assistance more accessible and interactive. LINK: https://huggingface.co/spaces/jv456/MediVoxAI
- Multimodal LLM: Handles both medical images and text inputs.
- Speech-to-Text (STT): Records and transcribes patient voice input.
- Text-to-Speech (TTS): Responds with realistic doctor voice output.
- Intuitive UI: User-friendly interface using Gradio.
- Configure GROQ API key for fast AI inference.
- Prepare images in required format.
- Integrate the Llama 3 Vision model for image and text understanding.
- Set up audio recording using
ffmpegandportaudio. - Implement speech-to-text transcription with OpenAI Whisper.
- Integrate TTS using
gTTSand ElevenLabs. - Convert model-generated text responses into human-like voice.
- Build an interactive UI with Gradio for seamless conversation.
- User speaks or uploads an image via the Gradio UI.
- Voice input is transcribed to text using Whisper.
- Image and text are processed by the multimodal LLM (Llama 3 Vision).
- AI generates response as text.
- Doctor's response is converted to voice using TTS and played back.
- All interaction happens in a web UI (Gradio).
- Groq for AI Inference
- OpenAI Whisper for transcription
- Llama 3 Vision for multimodal understanding
- gTTS & ElevenLabs for speech synthesis
- Gradio for UI
- Python, VS Code
If you like this project, please give it a star!