Skip to content

padmanabhan-r/Beacon

Repository files navigation

Beacon — a guiding light for everyday freedom. Voice-first sight, navigation, reading, web search, and smart-home — hands-free.

Beacon

A guiding light for everyday freedom.

Built for ElevenHacks 2026.

Backend ElevenLabs Gemini Firecrawl SerpAPI Google Maps Tuya Cloud Next.js FastAPI Leaflet Railway Vercel ElevenHacks

There are hundreds of millions of visually impaired people around the world.

Things we do without thinking aren't that simple for them.

Every day takes courage.

Beacon is a voice-first, hands-free mobile app built to help visually impaired users better understand and interact with the world around them. It runs on a phone mounted to a specialized lightweight chest harness and pairs with a Bluetooth remote that lives in the user's pocket to instantly start the assistant, turning the system into wearable AI. No screens, no menus, no touching the display. Users simply talk, and Beacon sees the world and talks back in real time.

Beacon delivers real-time scene narration, hazard detection, object recognition, and live OCR for signs, menus, books, and labels, all spoken aloud naturally. It also supports walking turn-by-turn navigation, live web search, location-aware weather, and smart-home control entirely through conversation. The entire experience is fully hands-free with natural speech, sub-second barge-in, and no keyboards, menus, or screen interaction.

Built with Cursor for rapid agentic development, ElevenLabs (Conversational AI Agents, Voice Design, Text-to-Speech, Sound Effects), Gemini as the vision intelligence layer, Firecrawl and SerpAPI for web search, Google Maps + OSRM for navigation, Open-Meteo for weather, and Tuya for smart-home control.

Beacon. A guiding light for everyday freedom.

Beacon — voice-first hands-free assistant


Screenshots

The app

Beacon landing — a guiding light for everyday freedom, with feature list and Enter button
Landing — the one screen a sighted caretaker sees during install.
Beacon idle screen — 'Let the Beacon guide you' with sample voice prompts and a big start FAB
Idle / Start — three voice example prompts then the FAB.
Beacon setup screen — name, microphone / camera / location permissions, smart-light test
Setup — one-time wizard for permissions, identity, smart-light pairing.
Beacon dashboard — sessions saved, weather card, location card, session history with timestamps
Dashboard — saved sessions + live weather + GPS card.

Real scenes

Phone clipped to a chest harness, voice in, voice out. Left half is the Beacon PWA; right half is the user.

Beacon describing a garden path — user asks 'Can you look what's in front of me?' and hears the paved red path, tree trunks, hedge, and house ahead read aloud.

Walking outside. "Can you look what's in front of me?" → scene narration in real time.

Beacon identifying croissants at a bakery display case — user asks 'I want some croissants. Do you have any here?' and Beacon confirms the golden-brown crescents on the right side.

At the bakery. "I want croissants. Do you have any?" → outlet-aware vision answer.

Beacon controlling a smart bulb and finding a water bottle in the bedroom — user says 'Can you turn on the lights' then 'Can you look for my bottle?' and Beacon describes the bottle on the side table.

At home. "Turn on the lights" → control_light via Tuya cloud. "Look for my bottle" → vision find-object.


Table of Contents


What is Beacon?

A hands-free PWA designed to be opened on a phone clipped to a chest harness. Once started, the user never touches the screen — the loop is voice-only:

  • See for me — describe what's in front of the camera. Read text on signs, packaging, screens. Refuses to hallucinate on dark or blank frames.
  • Search for me — three-tier dispatcher: SerpAPI Google engine for live data (scores, weather, breaking), Firecrawl news for "latest" / "this week", Firecrawl general for evergreen. Queries auto-rewrite to inject resolved city ("restaurants near me" → "restaurants in Chennai").
  • Review for me — stand in front of a restaurant, ask "is this place good?". place_reviews uses GPS + Google Places to pin the exact outlet, returns its star rating + address + real reviewer snippets — not generic chain reviews.
  • Walk me somewhere — turn-by-turn walking navigation. Geocodes "nearest X" / category queries via Google Places Text Search (Nominatim fallback), routes via OSRM (default, no key) or Google Maps Directions (MAPS_PROVIDER=gmaps). Live watchPosition (3 s throttle) auto-advances waypoints at 15 m via haversine; AR turn-arrow chip + leaflet route overlay (with × dismiss) update on every cross. Voice "stop" or the dismiss button clears the route end-to-end.
  • Tell me where I am — reverse-geocoded location read aloud, mini-map shown inline in the transcript.
  • Help me at home — voice control for a Tuya cloud-paired smart bulb. Reachable from any network including Railway production.

The PWA installs on iOS Safari and Android Chrome. The whole interaction model is one big tap-to-start, then voice for everything that follows. The camera feed is the canvas — the assistant orb is corner-pinned, not center-stage.


How It Works

1. Open + Start. User taps Enter on the landing screen, then Start. The browser asks for mic, camera, and geolocation. The PWA mounts an AudioWorklet recorder, opens a WebSocket to the backend, and the backend mints a fresh signed URL via GET /v1/convai/conversation/get-signed-url and connects the WS upstream to the (private) ElevenLabs ConvAI agent.

2. Talk. Microphone PCM @ 16 kHz streams over the WS. ConvAI's STT transcribes; the agent's LLM (Gemini Flash) plans; ElevenLabs TTS streams audio back; the worklet plays it. Barge-in fires interruption events that flush the audio buffer in <200 ms.

3. See. When the user asks a vision question, the agent emits client_tool_call('describe_scene'). The backend reads the latest buffered camera frame (the frontend streams JPEGs over the WebSocket on a low cadence into a per-session ring buffer) and hands it to Gemini 3.1 Flash Lite, which returns 1–2 sentences. Cold call ~2.3 s, warm ~1.2 s — Gemini is pre-warmed on lifespan startup with the bundled backend/fixtures/coffee.jpg. A native-multimodal POST /files + multimodal_message pipeline (Path A) is wired in backend/elevenlabs_client.py and verified via the /dev/vision-test endpoint, but kept off the hot path (race between agent-decided tool calls and our multimodal inject; hard 10-file cap per conversation).

5. Search. Agent calls web_search. Tiered dispatcher classifies intent:

  • Live (live | breaking | right now | scores | currently) → SerpAPI Google engine (answer_box, sports_results, knowledge_graph, organic enrichment) — actual numbers, not just headlines.
  • News (latest | recent | news | this week | update) → Firecrawl /v2/search with sources=news, tbs=qdr:w.
  • General → Firecrawl web tier, no time filter — keeps evergreen content like reviews and how-tos.

"near me / nearby / this location / this place" rewrite to the user's resolved city via Nominatim before dispatch. Top-3 results summarized into 800 chars; agent speaks 1–2 sentences.

6. Outlet-precise reviews. When the user asks reviews of a named place they're at ("reviews for this Popeyes"), agent skips web_search and calls place_reviews(name). Backend uses session GPS + Google Places findplacefromtext (50 m → 200 m → 500 m progressive radius) to lock on the specific outlet, then place/details to pull rating, address, and 2 real reviewer snippets. Returns the actual outlet — not generic chain reviews.

7. Tools render inline. Every tool result is forwarded to the frontend as a tool_result WebSocket envelope and rendered as a persistent card in the transcript — chips for control_light / time_now / describe_scene, a glass card for web_search with a "from web · in {city}" pill, a 120 px Leaflet mini-map for where_am_i. The top-left tool chip is the during indicator; the inline card is the after record.

8. Smart light (cloud). control_light dispatches to Tuya OpenAPI via tinytuya.Cloud, toggling a Havells F8 bulb on switch_led. HMAC-SHA256 signed REST, 5-second timeout. Works from any network — phone on cellular, backend on Railway, bulb behind home WiFi all interoperate.


Key Features

Feature Description
Voice-first hands-free loop One tap to start, then PCM audio streams both ways over a single WebSocket. No keyboard, no on-screen taps required.
Wireless remote integration Pairs with any cheap Bluetooth shutter remote (the kind bundled with selfie sticks) via Web Bluetooth in the Setup flow. The OS sees the remote as a standard HID device. End-to-end zero-touch operation (intercepting the remote's AudioVolumeUp / AudioVolumeDown DOM events) is on the roadmap; in the current build, pairing serves as the on-ramp and the on-screen Start/Stop FAB stays pixel-stable so a sighted caretaker (or the user themselves) can rely on muscle memory.
Camera-first UI Live video fills the phone frame as the canvas. Reticle, AR turn-arrow slot, and corner orb sit on top.
Vision via describe_scene Client tool Agent emits client_tool_call('describe_scene') → backend hands the buffered frame to Gemini 3.1 Flash Lite (cold 2.3 s, warm 1.2 s, pre-warmed on lifespan startup). Covers scene description, OCR for signs / books / menus / labels, and object identification — no separate tools needed. Refuses to hallucinate on dark / blank frames — explicit refusal string instead of made-up shapes.
Tiered web search Live tier (scores, breaking, right now) → SerpAPI Google engine with answer_box / sports_results / knowledge_graph parsing. News tier → Firecrawl /v2/search sources=news past-week. General → Firecrawl web. Auto-rewrites "near me / this location / this place" to resolved city via Nominatim.
Outlet-precise reviews (place_reviews) GPS + Google Places findplacefromtext finds the exact outlet the user is standing at, then place/details returns its star rating, address, and 2 real reviewer snippets. Beats generic chain reviews from web search. Progressive radius 50 → 200 → 500 m.
Inline tool transcript Every tool call leaves a persistent visual record: chips for time/light/scene, glass card for search, Leaflet mini-map for location.
Smart light control (cloud) `control_light(action: on
Signed-URL private agent Agent locked enable_auth=true. Backend mints a fresh signed URL per connect. API key never reaches the browser.
PWA + offline Service worker caches shell. AudioWorklet processors served from public/. Installs on iOS/Android home screen.
Setup wizard Name + mic/cam/geo permission test + bulb cloud reachability check. Persisted to localStorage under beacon.<domain>.v1 keys.
Session history Every session saved client-side (beacon.sessions.v1) with transcript, frames count, GPS track. Dashboard shows past sessions.
Camera-first dashboard Live session timer or "{N} sessions saved" hero; no telemetry leakage; sessions list grouped by day.
Walking navigation OSRM + Google Maps Directions abstraction (MAPS_PROVIDER env). navigate_to / next_turn / stop_navigation tools, live watchPosition (3 s throttle), 15 m waypoint auto-advance via haversine. Floating leaflet route overlay + AR turn arrow chip. Forward geocode falls back from Google Places Text Search → Nominatim so category queries ("nearest Domino's", "closest fish market") resolve.
Where am I where_am_i reverse-geocodes session GPS via Nominatim (zoom 18) → "street, neighborhood, city, country" line. Disambiguated from describe_scene in the system prompt so location queries no longer fall back to the camera.
UI sound feedback (ElevenLabs Sound Effects) 4 short cues baked once via the ElevenLabs Sound Effects API (POST /v1/sound-generation) by backend/generate_sfx.py and served from frontend/public/sfx/: tool plays on every agent tool dispatch, nav on bottom-tab navigation, start / stop on the session FAB. Cached automatically by the service worker; played from lib/sfx.ts with iOS audio-unlock warm on first user gesture.
Spoken welcome line (ElevenLabs TTS) backend/bake_welcome.py bakes a static welcome line via ElevenLabs Text-to-Speech to frontend/public/welcome.mp3. Voice + wording configurable via ELEVENLABS_WELCOME_VOICE_ID / ELEVENLABS_WELCOME_TEXT / ELEVENLABS_WELCOME_MODEL.
Idle-screen quick-start prompts The pre-session screen surfaces 3 voice example prompts ("What do you see?", "Where am I?", "Take me to the nearest coffee shop.") so first-time users know what to say without reading docs.
Smooth barge-in transcript Mid-utterance interruption is handled end-to-end: backend forwards agent_response_correction with replace_last: true; frontend overwrites the last assistant bubble in place. No duplicate text after a "stop". Audio buffer also flushes in <200 ms via the interruption event.

Wireless remote integration (zero-touch on-ramp)

Beacon is built to run from a chest-mounted phone with as little screen interaction as possible. The cheapest way to get there: a generic Bluetooth selfie-stick shutter remote (~$3 on any marketplace).

  • The remote pairs through Web Bluetooth in the Setup flow (frontend/app/setup/page.tsx) — Beacon stores a paired flag + the device name in localStorage under beacon.remote.v1. Pairing is the user-facing confirmation that the chest-harness input device is connected.
  • The remote shows up to Android as a standard HID device at the OS level — no driver, no app permissions, no SDK.
  • Roadmap (not yet wired): intercepting the remote's AudioVolumeUp / AudioVolumeDown DOM events to map one button to "enter app" and the other to "start / stop the session." Android Chrome does not reliably forward hardware volume keys to a backgrounded PWA, so this needs a real-device test pass before shipping.
  • The on-screen Start / Stop FAB is deliberately pixel-stable across states (bottom: 16, left: 50%, transform: translateX(-50%), 56×56) — critical for muscle memory on a blind-user workflow, and a fallback for situations where the remote is out of reach.

Realtime Architecture

┌─────────────────┐         audio + JSON envelopes        ┌─────────────────────────┐
│  Frontend (PWA) │◄─────────────────────────────────────►│ ElevenLabs ConvAI       │
│  mic + camera   │         WebSocket (signed URL)        │ STT + agent LLM         │
│  + GPS          │                                       │ (gemini-2.5-flash) + TTS│
└────────┬────────┘                                       └──────────┬──────────────┘
         │ JPEG frames + GPS pushed                                  │ client_tool_call
         │ over the same WebSocket                                   ▼
         ▼                                                  ┌────────────────────┐
   ┌──────────────┐                                         │  Tool dispatcher   │
   │ Backend      │                                         │  (FastAPI)         │
   │ frame buffer │                                         └────┬───────────────┘
   │ + GPS store  │                                              │
   └──────┬───────┘                                              │
          │                                                      ▼
          └──► describe_scene │ where_am_i │ get_weather │ web_search  │ place_reviews │ navigate_to        │ control_light
              (Gemini 3.1 FL) │ (Nominatim) │ (Open-Meteo)│ (SerpAPI +  │ (Google       │ (OSRM | gmaps)     │ (Tuya cloud)
                                                          │  Firecrawl) │  Places)      │ + next_turn        │ + time_now
                                                                                        │ + stop_navigation

Vision path. When the agent emits client_tool_call('describe_scene'), the backend pulls the latest frame from the per-session ring buffer, calls Gemini 3.1 Flash Lite (cold 2.3 s, warm 1.2 s, pre-warmed on lifespan startup with backend/fixtures/coffee.jpg), and returns the answer via client_tool_result. A Path A pipeline (ConvAI POST /v1/convai/conversations/{id}/files + multimodal_message) is implemented in backend/elevenlabs_client.py and verified via the /dev/vision-test endpoint, but kept off the hot path — the F2-hybrid spike (see plan.md:128) hit a race between the agent's own tool-call decision and our multimodal frame inject, and the hard 10-file cap per conversation makes it unsuitable for long sessions.

Latency tactics:

  • Pre-warm Gemini 3.1 Flash Lite on backend lifespan startup with the bundled backend/fixtures/coffee.jpg.
  • Transitional phrase ("let me look") in agent system prompt — first audio plays before the tool round-trip completes.
  • TTS streamed at 16 kHz to match the worklet player; no resample.
  • Signed URL fetched per connect (no long-lived public agent ID exposed).

ElevenLabs Integration

Product Where Used
Conversational AI (/v1/convai/conversation) The whole audio loop. STT + agent LLM (gemini-2.5-flash) + TTS, all integrated. WebSocket schema: client_tool_call, client_tool_result, interruption, agent_response, user_transcript.
Multimodal POST /files (implemented, reserved) backend/elevenlabs_client.py has the full POST /v1/convai/conversations/{id}/files upload + multimodal_message inject path. Exercised by the /dev/vision-test endpoint. Not on the production hot path — production vision goes through the describe_scene Client tool because of the agent-vs-inject tool-decision race and the hard 10-file cap per conversation.
Signed-URL flow (GET /v1/convai/conversation/get-signed-url) Backend mints a short-lived wss:// URL per WebSocket connect. Agent locked private (platform_settings.auth.enable_auth=true). Client never sees agent ID or API key.
Console-provisioned tools Ten Client tools live on the agent. Eight are registered via API by backend/register_tools.py (web_search, get_weather, control_light, where_am_i, navigate_to, next_turn, stop_navigation, place_reviews); time_now and describe_scene are dashboard-created during initial agent setup (see AGENT_SETUP_INSTRUCTIONS.md). Other backend ops scripts: fix_tool_timeouts.py, fix_describe_scene_param.py, fix_system_prompt.py, fix_web_search_desc.py. System-prompt source-of-truth lives between <!-- prompt:start/end --> markers in AGENT_SETUP_INSTRUCTIONS.md and is patched onto the agent via API.
Voice Design (POST /v1/text-to-voice/designPOST /v1/text-to-voice) Beacon's guide voice — a calm, caring archetype crafted for hands-free use, not a stock library voice. Set as the agent's default in the ConvAI dashboard.
Sound Effects (POST /v1/sound-generation) UI feedback cues — 4 short MP3s (tool, nav, start, stop) baked offline via backend/generate_sfx.py, served from frontend/public/sfx/, cached by the service worker, played from lib/sfx.ts.
Text-to-Speech (POST /v1/text-to-speech/{voice_id}) Static welcome audio baked once by backend/bake_welcome.py to frontend/public/welcome.mp3. Voice + text configurable via ELEVENLABS_WELCOME_VOICE_ID / ELEVENLABS_WELCOME_TEXT.

Audio format pinned to pcm @ 16000 in the dashboard (default pcm_44100 mismatches the AudioWorklet player and produces chipmunk audio). Tool response_timeout_secs patched from the dashboard default of 1 s up to 5–20 s depending on tool.


Tools

The agent has ten tools registered in the ElevenLabs console (all Client tools, dispatched via client_tool_call to the FastAPI backend):

Tool Params What it does Dispatched in
time_now Returns local ISO time + day. Frontend renders a 🕒 chip with parsed human time. backend/main.py _handle_tool_call
describe_scene question?: string Reads the buffered camera frame, calls Gemini 3.1 Flash Lite. Returns 1–2 spoken sentences describing what's visible. Hard-refuses on dark / blank frames with an explicit string instead of hallucinating. Also covers OCR ("read this") and object identification — no separate tools needed. backend/vision.py
where_am_i Reverse-geocodes session GPS via Nominatim (zoom 18) → "street, neighborhood, city, country". Disambiguated from describe_scene in the system prompt so the agent never falls back to the camera for location queries. City is also pre-warmed in the background on first GPS fix for the session, so the answer arrives in <5 ms after the user asks. backend/main.py
get_weather Current weather at the user's GPS via Open-Meteo (free, no key, native Celsius). Pre-warmed on first GPS fix and cached for 10 min, so repeat calls return in <10 ms; cold path ~150 ms. Returns Current weather in <city>: <X>°C, <conditions>, humidity <Y>%, wind <Z> km/h. Replaces SerpAPI for weather — Open-Meteo takes raw lat/lon so the answer always matches the user's actual location (no Google geo-IP guessing). backend/main.py + backend/weather.py
web_search query: string Three-tier dispatcher. Live (regex match on live|breaking|right now|score|currently|happening now) → SerpAPI Google engine, parses answer_box / sports_results / knowledge_graph / organic enrichment. News (latest|recent|news|update) → Firecrawl /v2/search sources=news, tbs=qdr:w. General → Firecrawl web tier. Rewrites "near me / this location / this place" to resolved city via Nominatim before dispatch. Top-3 summarized into 800 chars. Weather keywords short-circuit to get_weather before SerpAPI (safety net for agent confusion). backend/main.py
place_reviews name: string Outlet-precise reviews of a place the user is at. Pulls session GPS, calls Google Places findplacefromtext (50 m → 200 m → 500 m progressive radius) to lock the specific outlet, then place/details for star rating, address, and up to 2 reviewer snippets. Falls back gracefully when GPS missing or no candidate. backend/main.py
control_light action: "on"|"off" Toggles a Havells F8 bulb via Tuya OpenAPI (tinytuya.Cloud), switch_led command. HMAC-SHA256 signed REST, 5-second timeout. Reachable from anywhere — backend on Railway controls a bulb behind home WiFi. Returns "light is on" / "light is off". backend/smart_bulb.py
navigate_to destination: string Walking nav. Geocodes via Google Places Text Search (handles "nearest X" / category queries) → Nominatim fallback. Plans route via OSRM (default) or Google Directions (MAPS_PROVIDER=gmaps). Stores nav_session[session_id], returns ETA + first turn cue. Frontend renders the polyline + AR arrow. backend/navigation.py + backend/main.py
next_turn Bumps current_idx server-side, returns the upcoming step instruction. Used when the user asks "what's next". backend/main.py
stop_navigation Clears nav_session[session_id], fires nav_end envelope to frontend. Polyline + AR arrow disappear. Also reachable via the × button on the route overlay (frontend sends stop_nav directly). backend/main.py

System prompt rules + per-tool docs live between <!-- prompt:start/end --> markers in AGENT_SETUP_INSTRUCTIONS.md. Push edits with uv run python fix_system_prompt.py from backend/.


Tech Stack

Category Technology
Frontend framework Next.js 16 App Router, React 19, TypeScript strict
Styling Tailwind CSS 4, custom design tokens (Syne / DM Sans / Geist Mono), glassmorphism
Realtime audio AudioWorklet (PCM 16 kHz both ways), pcm-recorder-processor.js, pcm-player-processor.js
Maps Leaflet (no react-leaflet wrapper) — MapPinPreview (read-only mini), GpsMapClient (interactive dashboard)
PWA Service worker (public/sw.js), manifest with any + maskable icons at 192/512, Apple touch icon, iOS standalone display
Backend framework FastAPI on Python 3.12, Uvicorn
Agents / Voice ElevenLabs Conversational AI (elevenlabs Python SDK server-side, signed-URL flow), client_tool_call / client_tool_result schema, gemini-2.5-flash as the agent LLM
Vision Google Gemini gemini-3.1-flash-lite via the describe_scene Client tool (backend/vision.py). The native ConvAI multimodal POST /files + multimodal_message pipeline is implemented in backend/elevenlabs_client.py but reserved for the /dev/vision-test endpoint only.
Search SerpAPI Google engine for live tier (/search.json); Firecrawl /v2/search for news + general; raw httpx, no SDK
Weather Open-Meteo /v1/forecast (free, no key, native Celsius) — backend/weather.py mirrors frontend/lib/weather.ts
Places / Reviews Google Places API (findplacefromtext + details) — outlet pinning + reviews
Geocoding OpenStreetMap Nominatim (reverse-geocode "near me" → city); IP-fallback when GPS denied
Smart home tinytuya.Cloud — Tuya OpenAPI (HMAC-SHA256 signed REST), switch_led command — Havells F8 bulb. Works from any network.
Routing OSRM walking profile (default, no key) + Google Maps Directions (MAPS_PROVIDER=gmaps); 15 m haversine waypoint advance, polyline decode for Google's encoded format
Sound effects ElevenLabs Sound Effects API (POST /v1/sound-generation); baked offline by backend/generate_sfx.py into frontend/public/sfx/{tool,nav,start,stop}.mp3
Static welcome audio ElevenLabs TTS (POST /v1/text-to-speech/{voice_id}); baked offline by backend/bake_welcome.py into frontend/public/welcome.mp3
Package management uv for Python (lock + dev), pip + exported requirements.txt for the production Docker image; pnpm for frontend
Validation Pydantic at FastAPI boundaries, Zod planned at Next.js API routes
Deployment Railway (backend Dockerfile, repo-root build context, $PORT shell-expanded), Vercel (frontend, NEXT_PUBLIC_WS_URL=wss://<railway>/ws)

Running Locally

Prerequisites: Node 20+, pnpm, Python 3.12, uv. API keys for ElevenLabs, Firecrawl, Gemini.

# Clone
git clone https://github.com/<you>/Beacon.git && cd Beacon

# Backend deps (uv)
cd backend && uv sync && cd ..

# Frontend deps
cd frontend && pnpm install && cd ..

# Configure env (do NOT commit)
cp backend/.env.example backend/.env
cp frontend/.env.example frontend/.env.local
# fill in keys — see "Environment Variables" below

# Boot backend :8000 + frontend :3000 + tunnels (when present)
./dry-run.sh
# Or run them separately:
#   cd backend  && uv run uvicorn main:app --reload --port 8000
#   cd frontend && pnpm dev

Open http://localhost:3000 on a phone (use a tunnel — Cloudflare, ngrok — since getUserMedia requires HTTPS on real devices).

Environment Variables

Templates live per-app: backend/.env.example and frontend/.env.example.

Backend (backend/.env):

# ElevenLabs (server-only — never inlined in client components)
ELEVENLABS_API_KEY=...
ELEVENLABS_AGENT_ID=...        # console-provisioned ConvAI agent (private, signed-URL only)
ELEVENLABS_REGION=default      # default for hobby/pro; us|eu|in only on Enterprise data-residency workspaces

# Vision Path B
GEMINI_API_KEY=...

# Web search
FIRECRAWL_API_KEY=...           # news + general (/v2/search)
SERPAPI_KEY=...                 # live tier only — scores/breaking/right-now (100 credits/mo)

# Smart light (Tuya cloud — works from any network)
TUYA_CLIENT_ID=...
TUYA_CLIENT_SECRET=...
TUYA_REGION=in                  # us | eu | cn | in
HAVELLS_DEVICE_ID=...

# Maps (optional — OSRM is the default and needs no key)
GOOGLE_MAPS_API_KEY=...
MAPS_PROVIDER=osrm              # osrm | gmaps

Frontend (frontend/.env.local):

NEXT_PUBLIC_WS_URL=ws://localhost:8000/ws

Setting up the ConvAI agent

The agent is provisioned manually in the ElevenLabs console — full step-by-step runbook with system prompt, tool schemas, voice/LLM config, and the enable_auth=true switch lives in AGENT_SETUP_INSTRUCTIONS.md.

Quick reference once the agent exists:

# Probe which region is hosting your agent
uv run python backend/probe_region.py

# Inspect the live agent's tool list, prompts, timeouts (run BEFORE guessing tool failures)
uv run python backend/inspect_agent.py

# Push the latest system prompt + turn timeouts to the agent
uv run python backend/fix_system_prompt.py

# Idempotently register all 8 client tools (web_search, get_weather, control_light, where_am_i,
# navigate_to, next_turn, stop_navigation, place_reviews) on the live agent
uv run python backend/register_tools.py

# Patch every tool's response_timeout_secs (dashboard default is 1 s — too short for Gemini)
uv run python backend/fix_tool_timeouts.py

# Lock the agent private (enable_auth=true) so only signed URLs can connect
uv run python backend/lock_agent.py

# Bake the static welcome line (ElevenLabs TTS) → frontend/public/welcome.mp3
uv run python backend/bake_welcome.py

# Bake the 4 UI SFX cues (ElevenLabs Sound Effects) → frontend/public/sfx/{tool,nav,start,stop}.mp3
uv run python backend/generate_sfx.py

Deployment

Backend → Railway

  1. Push the repo to GitHub.
  2. Railway → New ProjectDeploy from GitHub repo.
  3. Railway auto-detects railway.json. Builder = Dockerfile, build context = repo root, Dockerfile path = backend/Dockerfile. The Dockerfile uses pip install -r requirements.txt (the requirements.txt is exported from uv.lockuv export --no-dev --no-hashes).
  4. Set Variables on Railway:
    • Required: ELEVENLABS_API_KEY, ELEVENLABS_AGENT_ID, ELEVENLABS_REGION, GEMINI_API_KEY.
    • Search: FIRECRAWL_API_KEY (news + general), SERPAPI_KEY (live tier — scores, breaking, right-now; 100 credits/mo).
    • Places: GOOGLE_MAPS_API_KEY (Places API enabled in Google Cloud Console; findplacefromtext + place/details are the consumed endpoints).
    • Smart light (cloud): TUYA_CLIENT_ID, TUYA_CLIENT_SECRET, TUYA_REGION (us\|eu\|cn\|in), HAVELLS_DEVICE_ID. Cloud Tuya works from Railway — bulb does not need to share a network with the backend.
  5. Networking → Generate Domain. Verify https://<your-domain>/healthz returns {"ok": true, "agent_id_set": true, ...}.

Frontend → Vercel

  1. Vercel → Add NewProject → import the repo.
  2. Root Directory: frontend/ (critical — set in the import screen).
  3. Framework auto-detects Next.js.
  4. Environment Variable: NEXT_PUBLIC_WS_URL=wss://<your-railway-domain>/ws (note wss://, trailing /ws, no slash after).
  5. Deploy. Vercel issues a *.vercel.app URL.

Open the Vercel URL on a phone, grant mic + camera + location, tap Enter → Start. Frontend opens a WS to wss://<railway>/ws; backend mints a signed URL and connects to the private ConvAI agent. Railway logs should show [ElevenLabs] WS open (signed URL) followed by your first transcript.

The smart-light demo works directly from Railway production — Tuya cloud is reachable from anywhere, no LAN constraint.


License

MIT License

Licensed under the MIT License.

Built by Limb for ElevenHacks 2026.


⚠️ Use with caution. Beacon is an AI-powered assistant — outputs may be incorrect, delayed, or incomplete. Always cross-check with a cane, guide, or sighted assistance before acting. Production deployment as a certified assistive device would require accessibility audits, regulatory clearance, and formal safety validation beyond the scope of this build.