Voice Endpoint Format
Cedar OS provides two approaches for handling voice, depending on your provider configuration:
- Mastra/Custom backends: Direct voice endpoint handling
- AI SDK/OpenAI providers: Automatic transcription and speech generation
Provider-Specific Voice Handling
Mastra and Custom Backends
When using Mastra or custom backends, Cedar OS sends voice data directly to your voice endpoint. You have full control over:
- Audio transcription
- Response generation
- Text-to-speech synthesis
- Response format
AI SDK and OpenAI Providers
When using AI SDK or OpenAI providers, Cedar OS automatically:
- Transcribes audio using OpenAI’s Whisper model
- Generates a text response using the configured LLM
- Optionally generates speech using OpenAI’s TTS model (when `useBrowserTTS` is false)
Request Format (Mastra/Custom)
Cedar OS sends voice data to your endpoint as a multipart form data request.

Voice Settings Structure
The `settings` field contains a JSON object with the following structure:
- `language`: Language code for speech recognition/synthesis
- `voiceId`: Voice identifier for TTS (provider-specific)
- `pitch`, `rate`, `volume`: Voice modulation parameters
- `useBrowserTTS`: Whether to use the browser’s built-in TTS
- `autoAddToMessages`: Whether to add voice interactions to chat history
Context
The `context` field contains stringified additional context from the Cedar state, which may include:
- Current chat messages
- Application state
- User-defined context
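Concretely, the multipart body Cedar OS sends could be assembled like this sketch. The `audio` field name is an assumption (check your Cedar OS version); `settings` and `context` are the fields described above:

```typescript
// Sketch of the multipart body sent to the voice endpoint. The "audio" field
// name is assumed; `settings` and `context` match the documented fields.
function buildVoiceForm(audioBlob: Blob): FormData {
  const form = new FormData();
  form.append('audio', audioBlob); // recorded audio from the microphone
  form.append(
    'settings',
    JSON.stringify({ language: 'en-US', useBrowserTTS: false, autoAddToMessages: true })
  );
  form.append('context', JSON.stringify({ chatMessages: [] }));
  return form;
}
```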
Response Format (All Providers)
Your endpoint can return different types of responses:

1. JSON Response (Recommended)
- `text`: The text response from the assistant
- `transcription`: The transcribed user input
- `audioData`: Base64-encoded audio response
- `audioUrl`: URL to an audio file
- `audioFormat`: MIME type of the audio
- `usage`: Token usage statistics
- `object`: Structured response for actions
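As a sketch, this shape can be typed as follows. The field names come from the list above; the nested `usage` shape is an assumption:

```typescript
// Sketch of the recommended JSON response shape. Only `text` is treated as
// required here; the nested `usage` shape is an assumption.
interface VoiceResponse {
  text: string;            // the assistant's text reply
  transcription?: string;  // the transcribed user input
  audioData?: string;      // base64-encoded audio response
  audioUrl?: string;       // alternative: URL to an audio file
  audioFormat?: string;    // MIME type, e.g. 'audio/mpeg'
  usage?: { promptTokens?: number; completionTokens?: number };
  object?: unknown;        // structured response for actions
}

const example: VoiceResponse = {
  text: 'It is sunny today.',
  transcription: 'What is the weather like?',
  audioFormat: 'audio/mpeg',
};
```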
2. Audio Response
Return raw audio data with the appropriate content type (for example, audio/mpeg).

3. Plain Text Response
Implementation Example (Mastra)
Here’s an example of implementing a voice endpoint in a Mastra backend:
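The following is a framework-agnostic sketch written as a fetch-style handler (Request → Response). In a real Mastra backend you would mount this on your registered voice route, and replace the stubs with actual speech-to-text and LLM calls; the `audio` form field name is an assumption:

```typescript
// Stub — call Whisper or another speech-to-text service here.
async function transcribe(audio: Blob): Promise<string> {
  return `[${audio.size}-byte utterance]`;
}

// Stub — call your LLM with the transcription plus Cedar context.
async function generateReply(transcription: string, context: string): Promise<string> {
  return `You said: ${transcription}`;
}

export async function handleVoice(req: Request): Promise<Response> {
  const form = await req.formData();
  const audio = form.get('audio') as Blob; // field name assumed
  const settings = JSON.parse((form.get('settings') as string) ?? '{}');
  const context = (form.get('context') as string) ?? '';

  const transcription = await transcribe(audio);
  const text = await generateReply(transcription, context);

  if (settings.useBrowserTTS === false) {
    // A real implementation could synthesize speech here and include it as
    // base64 `audioData` with a matching `audioFormat`.
  }

  // Return the JSON response shape described above.
  return new Response(JSON.stringify({ text, transcription }), {
    status: 200,
    headers: { 'Content-Type': 'application/json' },
  });
}
```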
Voice Response Handling
Cedar OS provides a unified `handleLLMVoice` function that processes voice responses consistently across all providers:
- Audio Playback: Handles base64 audio data, audio URLs, or browser TTS
- Message Integration: Automatically adds transcriptions and responses to chat history
- Action Execution: Processes structured responses to trigger state changes
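Conceptually, the playback decision can be sketched like this. It mirrors the behaviors listed above, not Cedar's actual `handleLLMVoice` source:

```typescript
// Which playback path a voice response takes: base64 audio data wins, then an
// audio URL, then browser TTS as a fallback when enabled.
type PlaybackPlan =
  | { kind: 'audio'; src: string }
  | { kind: 'browser-tts'; text: string }
  | { kind: 'none' };

function planPlayback(
  res: { text: string; audioData?: string; audioUrl?: string; audioFormat?: string },
  useBrowserTTS: boolean
): PlaybackPlan {
  if (res.audioData) {
    // Base64 payloads become data: URLs suitable for an <audio> element.
    return { kind: 'audio', src: `data:${res.audioFormat ?? 'audio/mpeg'};base64,${res.audioData}` };
  }
  if (res.audioUrl) {
    return { kind: 'audio', src: res.audioUrl };
  }
  if (useBrowserTTS) {
    // Fall back to the browser's speechSynthesis API with the text reply.
    return { kind: 'browser-tts', text: res.text };
  }
  return { kind: 'none' };
}
```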
Structured Responses
Cedar OS supports structured responses that can trigger actions in your application:

SetState Response
To execute a state change, return a setState structured response; Cedar OS will call `myCustomState.setValue(42)` in your Cedar state.
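A sketch of what such a structured response might look like. The field names (`type`, `stateKey`, `setterKey`, `args`) are assumptions to verify against your Cedar OS version:

```typescript
// Hypothetical setState structured response — the intent is to have Cedar
// invoke the setValue setter registered on myCustomState with the argument 42.
const setStateResponse = {
  type: 'setState',
  stateKey: 'myCustomState',
  setterKey: 'setValue',
  args: [42],
};

// Carried back to Cedar inside the JSON response's `object` field.
const voiceReply = {
  text: 'Updated the value to 42.',
  object: setStateResponse,
};
```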
Error Handling
Return appropriate HTTP status codes:
- 200 OK: Successful response
- 400 Bad Request: Invalid request format
- 401 Unauthorized: Missing or invalid API key
- 500 Internal Server Error: Server-side error
Voice Configuration
Configure voice settings when initializing Cedar:
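For illustration, a settings object using the fields from the Voice Settings Structure might look like this; how it is passed to Cedar depends on your setup, so check the Cedar OS configuration docs for the exact wiring:

```typescript
// Illustrative voice settings — field names follow the Voice Settings
// Structure documented above; 'alloy' is one of OpenAI's TTS voices.
const voiceSettings = {
  language: 'en-US',
  voiceId: 'alloy',        // provider-specific voice identifier
  pitch: 1.0,
  rate: 1.0,
  volume: 1.0,
  useBrowserTTS: false,    // false → use provider TTS / server audio
  autoAddToMessages: true, // record voice turns in chat history
};
```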
Provider-Specific Notes

OpenAI/AI SDK
- Transcription: Uses the Whisper model (`whisper-1`)
- Speech: Uses the TTS model (`tts-1`) with configurable voices
- Audio format: MP3 (audio/mpeg)
Mastra/Custom
- Full control over transcription and TTS services
- Can integrate with any speech service (Google, Azure, AWS, etc.)
- Flexible audio format support