Cedar’s voice system sends audio data to your backend for processing and expects either audio responses or structured JSON responses. This page covers how to implement the backend endpoint to handle voice interactions.

Endpoint Configuration

Automatic Configuration with Mastra

When using the Mastra provider, you can automatically configure the voice endpoint through the provider configuration:
<CedarCopilot
	llmProvider={{
		provider: 'mastra',
		baseURL: 'http://localhost:3000/api',
		voiceRoute: '/chat/voice-execute', // Automatically sets voice endpoint to: http://localhost:3000/api/chat/voice-execute
	}}>
	<YourApp />
</CedarCopilot>
This eliminates the need to manually call setVoiceEndpoint() and ensures consistency between your chat and voice endpoints.

Manual Configuration

For other providers or custom setups, configure the endpoint manually:
const voice = useCedarStore((state) => state.voice);
voice.setVoiceEndpoint('https://your-backend.com/api/voice');

Request Format

The voice system sends a multipart/form-data POST request to your configured endpoint with the following fields:
FormData {
  audio: Blob,      // WebM format with Opus codec
  settings: string, // JSON string of voice settings
  context: string   // JSON string of additional context
}
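
For reference, the request Cedar sends is roughly equivalent to the following browser-side sketch. This is illustrative only; Cedar builds and sends this FormData for you, and recordedBlob stands in for the captured audio:
// Rough equivalent of the request Cedar sends to your endpoint
const formData = new FormData();
formData.append('audio', recordedBlob); // WebM/Opus blob from the recorder
formData.append('settings', JSON.stringify({ language: 'en-US', rate: 1.0 }));
formData.append('context', JSON.stringify({ files: [], state: {} }));

await fetch(voiceEndpoint, {
	method: 'POST',
	body: formData,
});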

Audio Data

  • Format: WebM container with Opus codec
  • Type: Binary blob captured with the MediaRecorder API (see the sketch after this list)
  • Quality: Optimized for speech recognition
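
For context, this is roughly how such a blob is produced in the browser. It is a minimal sketch of what Cedar's recorder does, not the exact implementation:
// Minimal sketch: capture microphone audio as WebM/Opus
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm;codecs=opus' });
const chunks = [];

recorder.ondataavailable = (event) => chunks.push(event.data);
recorder.onstop = () => {
	const audioBlob = new Blob(chunks, { type: 'audio/webm' });
	// audioBlob is what arrives in the audio field of the FormData request
};
recorder.start();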

Voice Settings

{
  language: string;     // e.g., 'en-US'
  voiceId?: string;     // Optional voice ID for TTS
  pitch?: number;       // 0.5 to 2.0
  rate?: number;        // 0.5 to 2.0
  volume?: number;      // 0.0 to 1.0
  useBrowserTTS?: boolean;
  autoAddToMessages?: boolean;
}
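
Because settings arrives as a JSON string, parse it and apply defaults on the server. A minimal sketch (the field names match the shape above; the default values are assumptions):
// Parse the settings form field and fall back to sensible defaults
const parseVoiceSettings = (raw) => {
	const parsed = JSON.parse(raw || '{}');
	return {
		language: parsed.language || 'en-US',
		voiceId: parsed.voiceId, // optional; provider default is used if omitted
		pitch: parsed.pitch ?? 1.0,
		rate: parsed.rate ?? 1.0,
		volume: parsed.volume ?? 1.0,
	};
};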

Additional Context

The context includes Cedar’s additional context data (file contents, state information, etc.) that can be used to provide better responses.
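
A common pattern is to fold this context into your system prompt. The exact shape depends on what you have registered in Cedar, so treat the snippet below as illustrative:
// Illustrative only: inject Cedar's additional context into the system prompt
const contextData = JSON.parse(req.body.context || '{}');
const systemMessage = [
	'You are a helpful assistant.',
	`Additional context: ${JSON.stringify(contextData)}`,
].join('\n');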

Response Formats

Your backend can respond in several ways:

1. Direct Audio Response

Return audio data directly with the appropriate content type:
// Response headers
Content-Type: audio/mpeg
// or audio/wav, audio/ogg, etc.

// Response body: Raw audio data
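
In Express, for example, this amounts to setting the header and sending the raw buffer (a minimal sketch; the full examples below show where audioBuffer comes from):
// Minimal sketch of a direct audio response
res.set('Content-Type', 'audio/mpeg');
res.send(audioBuffer);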

2. JSON Response with Audio URL

{
  "audioUrl": "https://example.com/response.mp3",
  "text": "Optional text transcript",
  "transcription": "What the user said"
}
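
This shape is useful when generated audio is hosted on a CDN or object store. The uploadAudioToStorage helper below is hypothetical; substitute your own storage client:
// uploadAudioToStorage is a hypothetical helper that returns a public URL
const audioUrl = await uploadAudioToStorage(audioBuffer, 'response.mp3');

res.json({
	audioUrl,
	text: responseText,
	transcription: transcription.text,
});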

3. JSON Response with Base64 Audio

{
  "audioData": "base64-encoded-audio-data",
  "audioFormat": "mp3", // or "wav", "ogg"
  "text": "Response text",
  "transcription": "User input transcription"
}
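
In Node, the base64 payload can be produced directly from the TTS result (speech here is the OpenAI TTS response used in the implementation examples below):
// Encode the generated audio as base64 so it fits inside a JSON response
const audioBuffer = Buffer.from(await speech.arrayBuffer());

res.json({
	audioData: audioBuffer.toString('base64'),
	audioFormat: 'mp3',
	text: responseText,
	transcription: transcription.text,
});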

4. Structured Response with Actions

{
  "transcription": "Show me the user dashboard",
  "text": "I'll show you the user dashboard",
  "object": {
    "type": "action",
    "stateKey": "ui",
    "setterKey": "navigateTo",
    "args": ["/dashboard"]
  },
  "audioUrl": "https://example.com/response.mp3"
}
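
The object field lets your agent trigger state setters you have registered in Cedar. Below is a sketch of a handler returning such a response; the intent check is deliberately naive, and a real agent would derive the action from structured LLM output:
// Naive illustration: map a recognized intent to a Cedar state-setter action
if (/dashboard/i.test(transcription.text)) {
	return res.json({
		transcription: transcription.text,
		text: "I'll show you the user dashboard",
		object: {
			type: 'action',
			stateKey: 'ui',
			setterKey: 'navigateTo',
			args: ['/dashboard'],
		},
	});
}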

Implementation Examples

Node.js with Express

import express from 'express';
import multer from 'multer';
import OpenAI from 'openai';

const app = express();
const upload = multer();
const openai = new OpenAI();

app.post(
	'/api/chat/voice',
	upload.fields([
		{ name: 'audio', maxCount: 1 },
		{ name: 'settings', maxCount: 1 },
		{ name: 'context', maxCount: 1 },
	]),
	async (req, res) => {
		try {
			const audioFile = req.files.audio[0];
			const settings = JSON.parse(req.body.settings);
			const context = JSON.parse(req.body.context);

			// 1. Transcribe audio to text
			const transcription = await openai.audio.transcriptions.create({
				file: new File([audioFile.buffer], 'audio.webm', {
					type: 'audio/webm',
				}),
				model: 'whisper-1',
				language: settings.language?.split('-')[0] || 'en',
			});

			// 2. Process with your AI agent
			const messages = [
				{ role: 'system', content: 'You are a helpful assistant.' },
				{ role: 'user', content: transcription.text },
			];

			const completion = await openai.chat.completions.create({
				model: 'gpt-4',
				messages: messages,
			});

			const responseText = completion.choices[0].message.content;

			// 3. Convert response to speech
			const speech = await openai.audio.speech.create({
				model: 'tts-1',
				voice: settings.voiceId || 'alloy',
				input: responseText,
				speed: settings.rate || 1.0,
			});

			const audioBuffer = Buffer.from(await speech.arrayBuffer());

			// 4. Return audio response
			res.set({
				'Content-Type': 'audio/mpeg',
				'Content-Length': audioBuffer.length,
			});
			res.send(audioBuffer);
		} catch (error) {
			console.error('Voice processing error:', error);
			res.status(500).json({ error: 'Failed to process voice request' });
		}
	}
);

Python with FastAPI

from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import JSONResponse, Response
from openai import OpenAI
import json
import io

app = FastAPI()
client = OpenAI()

@app.post("/api/chat/voice")
async def handle_voice(
    audio: UploadFile = File(...),
    settings: str = Form(...),
    context: str = Form(...)
):
    try:
        # Parse settings and context
        voice_settings = json.loads(settings)
        additional_context = json.loads(context)

        # Read audio data
        audio_data = await audio.read()

        # 1. Transcribe audio
        audio_file = io.BytesIO(audio_data)
        audio_file.name = "audio.webm"

        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            language=voice_settings.get("language", "en")[:2]
        )

        # 2. Process with AI
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": transcription["text"]}
            ]
        )

        response_text = response.choices[0].message.content

        # 3. Generate speech
        speech_response = client.audio.speech.create(
            model="tts-1",
            voice=voice_settings.get("voiceId", "alloy"),
            input=response_text,
            speed=voice_settings.get("rate", 1.0)
        )

        # 4. Return audio
        return Response(
            content=speech_response.content,
            media_type="audio/mpeg"
        )

    except Exception as e:
        return JSONResponse(status_code=500, content={"error": str(e)})

Mastra Agent Integration

When using Cedar-OS with the Mastra provider and voiceRoute configuration, your Mastra backend should handle requests at the specified route. Here’s how to implement the voice handler:
import { Agent } from '@mastra/core';
import { openai } from '@ai-sdk/openai';
import OpenAI from 'openai';

// Separate OpenAI SDK client for transcription (STT) and speech (TTS)
const openaiClient = new OpenAI();

const agent = new Agent({
	name: 'voice-assistant',
	instructions: 'You are a helpful voice assistant.',
	model: openai('gpt-4o'),
});

export async function handleVoiceRequest(
	audioBlob: Blob,
	settings: VoiceSettings,
	context: string
) {
	// 1. Transcribe audio
	const transcription = await openaiClient.audio.transcriptions.create({
		file: new File([audioBlob], 'audio.webm', { type: 'audio/webm' }),
		model: 'whisper-1',
		language: settings.language?.split('-')[0] || 'en',
	});

	// 2. Add context to the conversation
	const contextData = JSON.parse(context);
	const systemMessage = `Additional context: ${JSON.stringify(contextData)}`;

	// 3. Generate response with agent
	const response = await agent.generate([
		{ role: 'system', content: systemMessage },
		{ role: 'user', content: transcription.text },
	]);

	// 4. Convert to speech
	const speech = await openaiClient.audio.speech.create({
		model: 'tts-1',
		voice: settings.voiceId || 'alloy',
		input: response.text,
		speed: settings.rate || 1.0,
	});

	return {
		transcription: transcription.text,
		text: response.text,
		audioData: Buffer.from(await speech.arrayBuffer()).toString('base64'), // base64 so it survives res.json()
		audioFormat: 'mp3',
	};
}

// Example Mastra route setup
// If you configured voiceRoute: '/chat/voice-execute' in Cedar-OS,
// your Mastra backend should handle POST requests to this route:

app.post(
	'/chat/voice-execute',
	upload.fields([
		{ name: 'audio', maxCount: 1 },
		{ name: 'settings', maxCount: 1 },
		{ name: 'context', maxCount: 1 },
	]),
	async (req, res) => {
		const audioFile = req.files.audio[0];
		const settings = JSON.parse(req.body.settings);
		const context = req.body.context;

		const result = await handleVoiceRequest(
			new Blob([audioFile.buffer]),
			settings,
			context
		);

		// Return the structured response
		res.json(result);
	}
);

Error Handling

Your backend should handle various error cases:
app.post('/api/chat/voice', async (req, res) => {
	// Captured during processing so the catch block can fall back to a text-only response
	let responseText;
	let userInput;

	try {
		// ... processing logic (sets userInput from the transcription, responseText from the LLM)
	} catch (error) {
		console.error('Voice processing error:', error);

		// Return appropriate error response
		if (error.code === 'AUDIO_TRANSCRIPTION_FAILED') {
			return res.status(400).json({
				error: 'Could not understand the audio. Please try again.',
				code: 'TRANSCRIPTION_ERROR',
			});
		}

		if (error.code === 'TTS_FAILED') {
			return res.status(500).json({
				error: 'Failed to generate speech response.',
				code: 'TTS_ERROR',
				// Fallback to text-only response
				text: responseText,
				transcription: userInput,
			});
		}

		// Generic error
		res.status(500).json({
			error: 'Internal server error processing voice request',
			code: 'INTERNAL_ERROR',
		});
	}
});

CORS Configuration

Ensure your backend allows CORS for the frontend domain:
import cors from 'cors';

app.use(
	cors({
		origin: ['http://localhost:3000', 'https://your-app-domain.com'],
		methods: ['POST'],
		allowedHeaders: ['Content-Type', 'Authorization'],
	})
);

Performance Considerations

Audio Processing

  • Use streaming transcription for real-time responses
  • Implement audio compression to reduce bandwidth
  • Cache TTS responses for common phrases (see the sketch below)
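
A TTS cache can be as simple as an in-memory map keyed by text and voice. A sketch (swap in Redis or similar for production):
// Simple in-memory TTS cache keyed by voice + text
const ttsCache = new Map();

async function getSpeech(text, voiceId) {
	const key = `${voiceId || 'alloy'}:${text}`;
	if (ttsCache.has(key)) return ttsCache.get(key);

	const speech = await openai.audio.speech.create({
		model: 'tts-1',
		voice: voiceId || 'alloy',
		input: text,
	});
	const buffer = Buffer.from(await speech.arrayBuffer());
	ttsCache.set(key, buffer);
	return buffer;
}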

Response Optimization

  • Stream audio responses when possible
  • Use CDN for serving generated audio files
  • Implement request queuing for high-traffic scenarios
// Example streaming response
app.post('/api/chat/voice-stream', async (req, res) => {
	res.writeHead(200, {
		'Content-Type': 'audio/mpeg',
		'Transfer-Encoding': 'chunked',
	});

	// generateAudioStream is your own streaming TTS helper; transcription comes from the STT step
	const audioStream = await generateAudioStream(transcription);

	audioStream.on('data', (chunk) => {
		res.write(chunk);
	});

	audioStream.on('end', () => {
		res.end();
	});
});

Security Best Practices

  1. Rate Limiting: Prevent abuse of voice endpoints
  2. Authentication: Verify user permissions
  3. Input Validation: Sanitize audio data and settings
  4. Content Filtering: Screen transcriptions for inappropriate content (see the sketch after the rate-limiting example)
import rateLimit from 'express-rate-limit';

const voiceRateLimit = rateLimit({
	windowMs: 15 * 60 * 1000, // 15 minutes
	max: 50, // 50 requests per window
	message: 'Too many voice requests, please try again later.',
});

app.post('/api/chat/voice', voiceRateLimit, handleVoiceRequest);
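
For content filtering (item 4), one option is to screen the transcription before it reaches your agent, for example with OpenAI's moderation endpoint. A sketch, to be adapted to your own policy:
// Screen the transcribed text before generating a response
const moderation = await openai.moderations.create({
	input: transcription.text,
});

if (moderation.results[0].flagged) {
	return res.status(400).json({
		error: 'This request cannot be processed.',
		code: 'CONTENT_FILTERED',
		transcription: transcription.text,
	});
}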

Testing Your Integration

Test your voice endpoint with curl:
# Send a pre-recorded test file (e.g. test-audio.webm) to your voice endpoint
curl -X POST http://localhost:3456/api/chat/voice \
  -F "audio=@test-audio.webm" \
  -F "settings={\"language\":\"en-US\",\"rate\":1.0}" \
  -F "context={\"files\":[],\"state\":{}}" \
  -o response.mp3
Or use the Cedar voice system directly for end-to-end testing:
// With Mastra provider (automatic configuration)
<CedarCopilot
	llmProvider={{
		provider: 'mastra',
		baseURL: 'http://localhost:3456/api',
		voiceRoute: '/chat/voice-execute',
	}}>
	<YourApp />
</CedarCopilot>;

// Or manually configure the endpoint
const voice = useCedarStore((state) => state.voice);
voice.setVoiceEndpoint('http://localhost:3456/api/chat/voice');
await voice.requestVoicePermission();
voice.startListening();