This API uses CrewAI with three sequential AI agents and Google Cloud Vertex AI (Gemini 2.5 Flash) to generate personalised patient questionnaires, convert questions to speech, transcribe patient answers, and describe patient-submitted photos.
One-time setup:
./setup_conda.sh

Then copy the environment template and fill in your GCP details:
cp .env.example .env
# edit .env with your values

Start the server:
./start_api.sh
# or: python api.py

conda create -n rx-ai python=3.11
conda activate rx-ai
pip install -r requirements.txt

You need a GCP service account with the following roles:
| Role | Purpose |
|---|---|
| Vertex AI User | CrewAI LLM calls + /analyze-image |
| Cloud Speech-to-Text ServiceAgent | /stt endpoint |
| Cloud Text-to-Speech Editor | /tts endpoint |
Steps:
- Open GCP Console → IAM → Service Accounts
- Create (or select) the rx-ai-backend service account
- Assign the three roles above
- Click Keys → Add Key → JSON and save the file to a path outside the repo, e.g. ~/.gcp/rx-ai-sa.json
Enable these APIs in your project:
cp .env.example .env

Edit .env (never commit it):
# Google Cloud — Vertex AI
GOOGLE_APPLICATION_CREDENTIALS=/absolute/path/to/rx-ai-sa.json
GOOGLE_CLOUD_PROJECT=your-gcp-project-id
GOOGLE_CLOUD_LOCATION=us-central1
# Speech-to-Text v2 location (used for both the regional endpoint and recognizer path).
# chirp_2 is available in specific regions (e.g., us-central1).
GOOGLE_CLOUD_STT_LOCATION=us-central1
# Gemini model identifiers
GEMINI_MODEL=gemini-2.5-flash
GEMINI_TTS_VOICE=en-US-Chirp3-HD-Aoede
# Evaluation log output directory (relative to project root)
EVAL_LOG_DIR=eval/logs

python api.py

The server starts on http://localhost:8000.
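
The variables listed in .env could be read at startup along the following lines. This is a minimal sketch, not the project's actual loader: the `Settings` class and `load_settings` function are hypothetical names, the defaults simply mirror the example values above, and the fallback of the STT location to the Vertex AI location is an assumption. It reads the process environment only, so it assumes .env has already been loaded by the start script or a dotenv helper.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in .env."""
    credentials_path: str
    project: str
    location: str
    stt_location: str
    gemini_model: str
    tts_voice: str
    eval_log_dir: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, failing fast on missing required ones."""
    env = os.environ if env is None else env
    required = ("GOOGLE_APPLICATION_CREDENTIALS", "GOOGLE_CLOUD_PROJECT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    location = env.get("GOOGLE_CLOUD_LOCATION", "us-central1")
    return Settings(
        credentials_path=env["GOOGLE_APPLICATION_CREDENTIALS"],
        project=env["GOOGLE_CLOUD_PROJECT"],
        location=location,
        # Assumption: STT falls back to the Vertex AI location when unset.
        stt_location=env.get("GOOGLE_CLOUD_STT_LOCATION", location),
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
        tts_voice=env.get("GEMINI_TTS_VOICE", "en-US-Chirp3-HD-Aoede"),
        eval_log_dir=env.get("EVAL_LOG_DIR", "eval/logs"),
    )
```

Failing fast on the two required variables surfaces a misconfigured .env at startup rather than on the first Google Cloud call.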
Returns all patients in the system.
curl http://localhost:8000/patients

Generates a personalised questionnaire based on current visit context.
Request body:
{
  "patient_id": "P001",
  "visit_id": "P001_V3",
  "conditions": ["Diabetes Type 2", "Hypertension"],
  "medications": ["Metformin 1000mg BID", "Lisinopril 10mg QD"],
  "allergies": ["Penicillin"],
  "issues_detected": ["Elevated blood pressure", "Foot numbness"],
  "clinical_provider_note": "Patient reports occasional dizziness..."
}

Response:
{
  "questions": [
    {
      "id": "q1",
      "question": "How often do you experience dizziness?",
      "type": "radio",
      "source": "Clinical notes follow-up",
      "rationale": "Monitor reported symptom severity",
      "required": true,
      "options": ["Never", "Rarely", "Sometimes", "Often", "Always"]
    },
    {
      "id": "q2",
      "question": "On a scale of 1–10, how would you rate the numbness in your feet?",
      "type": "scale",
      "source": "Diabetic neuropathy screening",
      "rationale": "Assess peripheral neuropathy progression",
      "required": true,
      "min": 1,
      "max": 10
    }
  ],
  "patient_id": "P001",
  "visit_id": "P001_V3"
}

Single-pass Gemini baseline questionnaire generator used to compare against the 3-agent CrewAI pipeline.
Uses the same request body as /generate-questionnaire and returns the same response shape.
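
Because both endpoints return the same response shape, a client or comparison script could check that shape before using the questions. The sketch below is illustrative only: `validate_questionnaire` is a hypothetical helper, it treats every field shown in the example response as required, and it knows only the two question types that appear there (`radio` and `scale`); any other type passes through unchecked.

```python
from typing import Any, Dict, List

# Assumption: every field in the example response is mandatory.
REQUIRED_QUESTION_FIELDS = ("id", "question", "type", "source", "rationale", "required")


def validate_questionnaire(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the payload matches the example shape."""
    problems: List[str] = []
    for key in ("patient_id", "visit_id"):
        if not isinstance(payload.get(key), str):
            problems.append(f"{key}: expected a string")
    questions = payload.get("questions")
    if not isinstance(questions, list):
        return problems + ["questions: expected a list"]
    for index, question in enumerate(questions):
        label = f"questions[{index}]"
        if not isinstance(question, dict):
            problems.append(f"{label}: expected an object")
            continue
        for field in REQUIRED_QUESTION_FIELDS:
            if field not in question:
                problems.append(f"{label}: missing field {field}")
        qtype = question.get("type")
        # Radio questions need choices to render.
        if qtype == "radio" and not question.get("options"):
            problems.append(f"{label}: radio question needs a non-empty options list")
        # Scale questions need a valid integer range.
        if qtype == "scale":
            low, high = question.get("min"), question.get("max")
            if not (isinstance(low, int) and isinstance(high, int) and low < high):
                problems.append(f"{label}: scale question needs integer min < max")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to compare how often each endpoint produces malformed output.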
Which is used by default?
- Default (React frontend): /generate-questionnaire (CrewAI 3-agent sequential pipeline)
- Baseline comparison endpoint: /generate-questionnaire-singlepass (single Gemini call)
Observed trade-offs:
- Latency: single-pass is typically faster (1 LLM call vs 3 sequential calls).
- Quality:
  - CrewAI tends to be more robust when the visit context is sparse (asks safer intake questions, covers more domains).
  - Single-pass tends to be more direct and condition-focused, but can make implicit assumptions if key structured fields are empty.
Recommendation (current):
- Keep CrewAI as default for now, and use single-pass when you need lower latency and can tolerate a higher assumption risk.
Generates a questionnaire from stored patient data only (no current visit context).
{ "patient_id": "P001" }

Converts question text to speech. Returns an mp3 audio stream.
Request body:
{
  "text": "How are you feeling compared to your last visit?",
  "voice": "en-US-Chirp3-HD-Aoede"
}

voice is optional; it defaults to the GEMINI_TTS_VOICE env var.
Verify:
curl -X POST http://localhost:8000/tts \
-H "Content-Type: application/json" \
-d '{"text":"Hello, how are you feeling today?"}' \
--output test.mp3 && open test.mp3

Transcribes patient speech to text. Accepts multipart/form-data with an audio file field named audio.
Supported codecs: audio/webm;codecs=opus (browser MediaRecorder default), audio/mp4.
Response:
{
  "transcript": "I have been feeling some dizziness in the morning.",
  "confidence": 0.962
}

Verify:
curl -X POST http://localhost:8000/stt \
-F "audio=@test.webm"

Describes a patient-submitted photo in clinical terms using Gemini 2.5 Flash multimodal.
Request body:
{
  "image_base64": "<standard base64, no data-URI prefix>",
  "question": "Can you show us the affected area on your foot?",
  "patient_id": "P001",
  "question_id": "q3"
}

Response:
{
  "description": "The image shows the lateral aspect of the left foot with a 2–3 cm area of reddened, slightly raised skin near the fifth metatarsal. No open wound or discharge is visible."
}

Verify:
IMAGE_B64=$(base64 -i your_photo.jpg | tr -d '\n')
curl -X POST http://localhost:8000/analyze-image \
-H "Content-Type: application/json" \
-d "{\"image_base64\":\"$IMAGE_B64\",\"question\":\"Show us the area of concern.\"}"

To evaluate workflow combinations (e.g., question generation + TTS + STT + image analysis within the same patient check-in), send a workflow id header on every request:
X-RxAI-Workflow-Id: <uuid>
The backend logs this value under input.workflow_id in JSONL/BigQuery logs.
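
A client can generate one workflow id per check-in and reuse it across calls. The sketch below shows one way to do that with the standard library; `new_workflow_headers`, `build_json_request`, and `image_payload` are hypothetical helper names, and the requests are only prepared, not sent. The base URL is the default local address from this README.

```python
import base64
import json
import urllib.request
import uuid
from typing import Any, Dict

BASE_URL = "http://localhost:8000"  # default local server address from this README


def new_workflow_headers() -> Dict[str, str]:
    """Create one header set to reuse for every request in a single check-in."""
    return {
        "Content-Type": "application/json",
        "X-RxAI-Workflow-Id": str(uuid.uuid4()),
    }


def image_payload(image_bytes: bytes, question: str) -> Dict[str, Any]:
    """Encode an image as standard base64 with no data-URI prefix."""
    return {
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
        "question": question,
    }


def build_json_request(path: str, body: Dict[str, Any], headers: Dict[str, str]) -> urllib.request.Request:
    """Prepare (but do not send) a POST request that carries the shared workflow id."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method="POST")
```

Passing the same `headers` dict to both a `/tts` and an `/analyze-image` request ties their log entries to one workflow; `urllib.request.urlopen(request)` would then send each prepared request.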
- Medical Data Deduplicator — removes duplicate information across visits and clinical notes
- Healthcare Data Summarizer — identifies key problems and risk factors requiring patient input
- Patient Questionnaire Generator — creates 3–8 targeted, validated questions
Typical pipeline latency: 8–15 seconds (3 sequential LLM calls).
- /tts: Google Cloud Text-to-Speech with Chirp3 HD voices (highest quality, low latency)
- /stt: Cloud Speech-to-Text v2 with Chirp 2 model; AutoDetectDecodingConfig handles webm/opus natively; medical vocabulary hints are included
- /analyze-image: same Gemini 2.5 Flash model as the LLM, called with an inline image part + clinical prompt; returns 2–4 sentences of plain text
Every AI call (question generation, TTS, STT, image analysis) writes a JSONL entry to eval/logs/<date>.jsonl via the log_ai_call async context manager in eval/eval_logger.py.
Log schema:
{
  "session_id": "uuid4",
  "feature": "stt | tts | image_analysis | question_generation",
  "model": "model-name-or-voice",
  "input": {},
  "output": {},
  "latency_ms": 1234,
  "timestamp": "2026-04-14T10:00:00Z",
  "patient_id": "P001",
  "question_id": "q2",
  "error": null
}

These logs feed the evaluation pipeline described in eval/README.md.
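
To illustrate how an async context manager can produce entries with this schema, here is a minimal sketch. It is not the code in eval/eval_logger.py: the name `log_ai_call_sketch`, its parameters, and the choice to record `repr(exc)` in the `error` field are assumptions made for the example. It also writes the file synchronously, which is simple but briefly blocks the event loop.

```python
import json
import time
import uuid
from contextlib import asynccontextmanager
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, AsyncIterator, Dict, Optional


@asynccontextmanager
async def log_ai_call_sketch(
    feature: str,
    model: str,
    input_data: Dict[str, Any],
    log_dir: str = "eval/logs",
    patient_id: Optional[str] = None,
    question_id: Optional[str] = None,
) -> AsyncIterator[Dict[str, Any]]:
    """Time the wrapped AI call and append one JSONL entry, even if the call fails."""
    entry: Dict[str, Any] = {
        "session_id": str(uuid.uuid4()),
        "feature": feature,
        "model": model,
        "input": input_data,
        "output": {},
        "latency_ms": None,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "patient_id": patient_id,
        "question_id": question_id,
        "error": None,
    }
    started = time.perf_counter()
    try:
        yield entry  # the caller fills entry["output"] inside the with-block
    except Exception as exc:
        entry["error"] = repr(exc)  # assumption: errors are stored as a string
        raise
    finally:
        entry["latency_ms"] = int((time.perf_counter() - started) * 1000)
        path = Path(log_dir) / f"{datetime.now(timezone.utc):%Y-%m-%d}.jsonl"
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(entry) + "\n")
```

A handler would wrap its SDK call in `async with log_ai_call_sketch("tts", voice, {"text": text}) as entry:` and assign `entry["output"]` before leaving the block. Writing in `finally` means failed calls are logged too, with `error` set.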
- Check the server is running: python api.py
- Verify http://localhost:8000 is reachable
- Confirm .env has a valid GOOGLE_APPLICATION_CREDENTIALS pointing to the service account JSON
The API allows http://localhost:5173 (Vite) and http://localhost:3000. Update allow_origins in api.py if using a different port.
The service account is missing a required role. Check the three roles listed in the setup section above.
Chirp3 HD voices may need to be enabled in your GCP project. Verify availability at: GCP Console → Text-to-Speech → Voice list → filter by "Chirp3 HD"
- CrewAI pipeline: 8–15 seconds is normal (3 sequential LLM calls to Gemini)
- TTS/STT: 1–3 seconds round-trip to Google Cloud APIs
- If event loop blocking is observed under load, the sync SDK calls in /tts and /stt can be wrapped in asyncio.get_event_loop().run_in_executor(None, ...); defer this to Week 2
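
The executor wrapping mentioned in the last bullet could look roughly like the sketch below. `run_blocking` is a hypothetical helper, and `fake_synthesize_speech` is a stand-in that sleeps instead of calling a Google Cloud SDK. The sketch uses `asyncio.get_running_loop()`, which the asyncio documentation prefers over `get_event_loop()` inside coroutines; the `run_in_executor(None, ...)` call itself is the same.

```python
import asyncio
import functools
import time
from typing import Any, Callable, TypeVar

T = TypeVar("T")


async def run_blocking(func: Callable[..., T], *args: Any, **kwargs: Any) -> T:
    """Run a synchronous call in the default thread pool so the event loop stays free."""
    loop = asyncio.get_running_loop()
    # run_in_executor accepts positional arguments only, so bind kwargs with partial.
    return await loop.run_in_executor(None, functools.partial(func, *args, **kwargs))


def fake_synthesize_speech(text: str, delay_s: float = 0.05) -> bytes:
    """Hypothetical stand-in for a blocking SDK call such as synthesize_speech."""
    time.sleep(delay_s)
    return text.encode("utf-8")


async def synthesize_many(texts: list) -> list:
    """Issue several blocking calls concurrently without stalling other requests."""
    return await asyncio.gather(*(run_blocking(fake_synthesize_speech, t) for t in texts))
```

Inside an async endpoint the call becomes `audio = await run_blocking(client.synthesize_speech, ...)`, where `client` is whatever SDK client the handler already holds.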
- Database: replace in-memory patient JSON with a proper database
- Caching: cache generated questionnaires to reduce LLM calls
- Async SDK calls: move synthesize_speech and recognize to a thread pool executor
- BigQuery eval sink: update eval_logger.py to stream to BigQuery (see Week 2 plan)
- Rate limiting: add per-patient request throttling
- Authentication: add proper auth/authorization before patient data is accessible
- HTTPS: getUserMedia (camera/mic) requires HTTPS in production; localhost is exempt