The Voice Server is a dedicated microservice in the SignalZero AI ecosystem, providing high-quality Speech-to-Text (ASR) and Text-to-Speech (TTS) capabilities optimized for CPU environments (e.g., macOS via Docker).
- ASR (Recognition): Powered by OpenAI Whisper `medium.en` for high accuracy.
- TTS (Synthesis): Powered by `Kokoro-ONNX` for high-quality, lightweight neural voice output.
- Voice Activity Detection (VAD): Intelligent silence suppression with a tunable RMS noise gate to prevent hallucinations.
- Wake Word: Mandatory "Axiom" wake word detection.
- Remote Control: FastAPI endpoints to toggle the microphone state ("ears") from the Chat UI.
- Audio Routing: Uses PulseAudio over TCP to bridge Docker audio to the host system (e.g., Bose Flex speakers).
The service runs a background recording loop that:
- Filters noise using an RMS threshold.
- Detects speech using WebRTC VAD.
- Transcribes valid utterances with Whisper.
- Identifies the speaker using SpeechBrain.
- Routes the message to the SignalZero Kernel if the "Axiom" wake word is present.
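The gating steps above can be sketched as follows. The threshold value and helper names are illustrative assumptions, not the service's actual code; in the real loop, frames that pass the RMS gate go to WebRTC VAD and then to Whisper for transcription:

```python
import math
import struct

# Assumed tunable threshold for 16-bit PCM samples; the real value is configurable.
NOISE_GATE_RMS = 500

def frame_rms(frame: bytes) -> float:
    """RMS energy of a 16-bit little-endian mono PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def passes_noise_gate(frame: bytes, threshold: float = NOISE_GATE_RMS) -> bool:
    # Frames quieter than the gate are dropped before VAD and transcription,
    # which is what keeps Whisper from hallucinating on near-silence.
    return frame_rms(frame) >= threshold

def has_wake_word(transcript: str) -> bool:
    # Only utterances containing "Axiom" are forwarded to the Kernel.
    return "axiom" in transcript.lower()
```

Gating on cheap RMS energy first means the heavier VAD and Whisper stages only ever see frames that plausibly contain speech.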
It also provides a `/speak` API used by the Kernel to generate and play neural responses.
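A minimal client call from the Kernel side might look like the following. The JSON payload shape (`{"text": ...}`) and the host/port are assumptions for illustration, since the actual request schema isn't documented here:

```python
import json
import urllib.request

def build_speak_request(text: str, base_url: str = "http://voice-server:8000"):
    """Build a POST to the /speak endpoint.

    The payload shape and base URL are illustrative assumptions,
    not the service's documented schema.
    """
    return urllib.request.Request(
        f"{base_url}/speak",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it (requires the Voice Server to be running):
# with urllib.request.urlopen(build_speak_request("Hello from the Kernel")) as r:
#     print(r.status)
```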
This service is designed to be run via the SignalZero-Docker composition.
- `KERNEL_URL`: URL of the SignalZero LocalNode (default: `http://localnode:3001/api`).
- `REDIS_HOST`: Redis host for state locking (default: `redis`).
- `PULSE_SERVER`: PulseAudio server address (e.g., `tcp:host.docker.internal:4713`).
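In the SignalZero-Docker composition, these variables would typically be supplied through the service's `environment` block. The snippet below is illustrative only (the service name and build settings are assumptions), using the defaults listed above:

```yaml
# Illustrative compose fragment, not the actual SignalZero-Docker file.
services:
  voice-server:
    build: .
    environment:
      KERNEL_URL: http://localnode:3001/api
      REDIS_HOST: redis
      PULSE_SERVER: tcp:host.docker.internal:4713
```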
The service is currently patched to run efficiently on CPU (ARM64/Apple Silicon) by forcing Float32 precision and explicit ONNX execution providers.
This project is licensed under the MIT License.