The Unified Text-to-Speech SDK
The SpeechSDK is a free, open-source toolkit for building AI audio applications with multiple voice providers.
import { generateSpeech } from '@speech-sdk/core';
const result = await generateSpeech({
model: 'openai/gpt-4o-mini-tts',
text: 'Hello from SpeechSDK!',
voice: 'alloy',
});
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy)
result.audio.mediaType; // "audio/mpeg"

import { generateSpeech } from '@speech-sdk/core';
const result = await generateSpeech({
model: 'elevenlabs/eleven_multilingual_v2',
text: 'Hello from SpeechSDK!',
voice: 'EXAVITQu4vr4xnSDxMaL',
});
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy)
result.audio.mediaType; // "audio/mpeg"

import { generateSpeech } from '@speech-sdk/core';
const result = await generateSpeech({
model: 'deepgram/aura-2',
text: 'Hello from SpeechSDK!',
voice: 'thalia-en',
});
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy)
result.audio.mediaType; // "audio/mpeg"

import { generateSpeech } from '@speech-sdk/core';
const result = await generateSpeech({
model: 'cartesia/sonic-3',
text: 'Hello from SpeechSDK!',
voice: 'a0e99841-438c-4a64-b679-ae501e7d6091',
});
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy)
result.audio.mediaType; // "audio/mpeg"

import { generateSpeech } from '@speech-sdk/core';
const result = await generateSpeech({
model: 'google/gemini-2.5-flash-preview-tts',
text: 'Hello from SpeechSDK!',
voice: 'Kore',
});
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy)
result.audio.mediaType; // "audio/wav"

import { generateSpeech } from '@speech-sdk/core';
const result = await generateSpeech({
model: 'hume/octave-2',
text: 'Hello from SpeechSDK!',
voice: 'Dacher',
});
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy)
result.audio.mediaType; // "audio/wav"

import { generateSpeech } from '@speech-sdk/core';
const result = await generateSpeech({
model: 'mistral/voxtral-mini-tts-2603',
text: 'Hello from SpeechSDK!',
voice: { audio: 'base64-encoded-audio...' },
});
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy)
result.audio.mediaType; // "audio/wav"

Multi-Provider
One interface across OpenAI, ElevenLabs, Deepgram, Cartesia, Google, Mistral, Hume, and more. Unified model strings, consistent response format, BYO API keys.
Cross-Platform
Runs everywhere — Node.js, Edge runtimes, and the browser. Same API, zero platform-specific code.
Minimal Dependencies
Lightweight by design. Built-in retries, typed errors, and lazy base64 encoding. No heavy frameworks.
Why SpeechSDK?
Locking into a single TTS provider's SDK means rewriting code when a better or less expensive model ships.
The SpeechSDK integrates all major providers into an easy-to-use, unified interface so you can swap models without breaking your application code.
import { generateSpeech } from '@speech-sdk/core';
import { createOpenAI } from '@speech-sdk/core/openai';
const myOpenAI = createOpenAI();
const result = await generateSpeech({
model: myOpenAI('gpt-4o-mini-tts'),
text: 'Hello from SpeechSDK!',
voice: 'alloy',
});
result.audio.uint8Array; // Uint8Array
result.audio.base64; // string (lazy)
result.audio.mediaType; // "audio/mpeg"

AI Engineering
For Production Voice Applications
Lazy base64 conversion
Only computes the format you access — uint8Array or base64 — and caches it. No unnecessary encoding or wasted memory.
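The idea can be sketched in a few lines. This is a minimal illustration, not the SDK's actual implementation: the class name LazyAudio is hypothetical, and the sketch assumes the raw bytes are already available and only the base64 string is derived on demand.

```typescript
// Hypothetical sketch: LazyAudio is an illustrative name, not an SDK export.
class LazyAudio {
  readonly uint8Array: Uint8Array;
  readonly mediaType: string;
  private cachedBase64: string | undefined;

  constructor(uint8Array: Uint8Array, mediaType: string) {
    this.uint8Array = uint8Array;
    this.mediaType = mediaType;
  }

  // Encoded on first access only, then served from the cache.
  get base64(): string {
    if (this.cachedBase64 === undefined) {
      let binary = '';
      for (let i = 0; i < this.uint8Array.length; i++) {
        binary += String.fromCharCode(this.uint8Array[i]);
      }
      this.cachedBase64 = btoa(binary);
    }
    return this.cachedBase64;
  }
}

const audio = new LazyAudio(new Uint8Array([72, 105]), 'audio/mpeg');
console.log(audio.base64); // "SGk="
```

Callers that only stream or write the bytes never pay for the encoding step.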
Content-type awareness
The mediaType is read directly from each provider's response headers, so you always know the actual audio format: MP3 from OpenAI, WAV from Google, and so on.
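One place this helps is choosing a file extension when saving audio. The helper below is a hypothetical sketch: fileNameFor and its mapping table are assumptions for illustration, covering only the two media types that appear in the examples on this page.

```typescript
// Assumed mapping for illustration; extend it for the formats your providers return.
const EXTENSIONS = new Map<string, string>([
  ['audio/mpeg', 'mp3'],
  ['audio/wav', 'wav'],
]);

function fileNameFor(baseName: string, mediaType: string): string {
  // Drop any parameters (text after ";") before the lookup.
  const essence = (mediaType.split(';')[0] ?? '').trim().toLowerCase();
  // Unknown types fall back to a generic extension instead of guessing.
  const extension = EXTENSIONS.get(essence) ?? 'bin';
  return `${baseName}.${extension}`;
}

console.log(fileNameFor('hello', 'audio/mpeg')); // "hello.mp3"
```

In application code, the second argument would be result.audio.mediaType.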
Custom fetch & Base URL
Every provider accepts a custom fetch and baseURL. Point at OpenAI-compatible proxies, Azure OpenAI, LiteLLM, or local models. Swap in undici, a proxy-aware fetch, or a mock.
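As a rough sketch of what a swapped-in fetch might look like, the wrapper below logs each request and delegates to an inner fetch. withRequestLog is a hypothetical helper, and the commented usage assumes the provider factory takes an options object with fetch and baseURL keys; the option names come from the text above, but the exact signature and the proxy URL are assumptions.

```typescript
// withRequestLog is a hypothetical helper, not part of the SDK.
type FetchFn = typeof fetch;

function withRequestLog(inner: FetchFn, log: (line: string) => void): FetchFn {
  return async (input, init) => {
    const url =
      typeof input === 'string' ? input : input instanceof URL ? input.href : input.url;
    const response = await inner(input, init);
    // Method is taken from init only; it defaults to GET when init omits it.
    log(`${init?.method ?? 'GET'} ${url} -> ${response.status}`);
    return response;
  };
}

// Assumed usage (factory signature not confirmed by this page):
// const myOpenAI = createOpenAI({
//   baseURL: 'https://my-proxy.example.com/v1',
//   fetch: withRequestLog(fetch, console.log),
// });
```

The same shape works for a mock in tests: pass a stub as the inner fetch and assert on the logged lines.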
Smart retries
Built-in retry with exponential backoff via p-retry. Retries 5xx and network errors automatically. 4xx errors (auth failures, bad requests) abort immediately — no wasted time.
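The policy described above can be illustrated with a hand-rolled loop. This is a sketch of the behavior only, not the SDK's code, which is described as using p-retry; the names withRetry and statusOf, the default retry count, and the base delay are all assumptions.

```typescript
// Hand-rolled illustration of the retry policy; names and defaults are assumed.
function statusOf(error: unknown): number | undefined {
  if (typeof error === 'object' && error !== null && 'status' in error) {
    const status = (error as { status: unknown }).status;
    return typeof status === 'number' ? status : undefined;
  }
  return undefined; // network errors and similar carry no HTTP status
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  task: () => Promise<T>,
  retries = 3,
  baseDelayMs = 250,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      const status = statusOf(error);
      const isClientError = status !== undefined && status >= 400 && status < 500;
      // 4xx responses will not succeed on retry, so fail fast.
      if (isClientError || attempt >= retries) throw error;
      // Exponential backoff: baseDelayMs, then 2x, then 4x, and so on.
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```

Errors without a status, such as network failures, are treated as retryable in this sketch, matching the behavior described above.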
One runtime dependency
The only runtime dependency is p-retry. The SDK uses raw fetch and Uint8Array, with no heavy audio libraries and no provider SDK wrappers. It works anywhere fetch works.
Works seamlessly with Speech Gateway
Speech Gateway adds production infrastructure — queuing, quality processing, voice management, and analytics. One config change to connect. Coming Soon.
PROVIDERS
Every model, one interface
| Provider | Model String | Default* |
|---|---|---|
| OpenAI | openai/gpt-4o-mini-tts | Yes |
| OpenAI | openai/tts-1 | — |
| OpenAI | openai/tts-1-hd | — |
| ElevenLabs | elevenlabs/eleven_multilingual_v2 | Yes |
| ElevenLabs | elevenlabs/eleven_v3 | — |
| ElevenLabs | elevenlabs/eleven_flash_v2_5 | — |
| ElevenLabs | elevenlabs/eleven_flash_v2 | — |
| Deepgram | deepgram/aura-2 | Yes |
| Cartesia | cartesia/sonic-3 | Yes |
| Hume | hume/octave-2 | Yes |
| Google | google/gemini-2.5-flash-preview-tts | Yes |
| Google | google/gemini-2.5-pro-preview-tts | — |
| Fish Audio | fish-audio/s2-pro | Yes |
| Unreal Speech | unreal-speech/default | Yes |
| Murf | murf/GEN2 | Yes |
| Resemble | resemble/default | Yes |
| fal | fal-ai/* | — |
| Mistral | mistral/voxtral-mini-tts-2603 | Yes |
* Pass just the provider name to use its default model — e.g. model: 'openai' resolves to openai/gpt-4o-mini-tts.
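The default lookup can be pictured roughly as follows. resolveModel is a hypothetical helper written for illustration, not the SDK's internal resolver; the defaults are copied from the table above, and fal is omitted because the table lists no default for it.

```typescript
// Defaults copied from the providers table; resolveModel is a hypothetical helper.
const DEFAULT_MODELS = new Map<string, string>([
  ['openai', 'gpt-4o-mini-tts'],
  ['elevenlabs', 'eleven_multilingual_v2'],
  ['deepgram', 'aura-2'],
  ['cartesia', 'sonic-3'],
  ['hume', 'octave-2'],
  ['google', 'gemini-2.5-flash-preview-tts'],
  ['fish-audio', 's2-pro'],
  ['unreal-speech', 'default'],
  ['murf', 'GEN2'],
  ['resemble', 'default'],
  ['mistral', 'voxtral-mini-tts-2603'],
]);

function resolveModel(model: string): string {
  // A full "provider/model" string is already resolved.
  if (model.includes('/')) return model;
  const fallback = DEFAULT_MODELS.get(model);
  if (fallback === undefined) {
    throw new Error(`No default model for provider "${model}"`);
  }
  return `${model}/${fallback}`;
}

console.log(resolveModel('openai')); // "openai/gpt-4o-mini-tts"
```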
Frequently asked questions
Each provider has its own SDK, request format, auth pattern, and response shape. SpeechSDK gives you one interface for all of them — same function call, same result type, same error handling. Switch providers by simply changing a model string.
One SDK, every provider. Add text-to-speech to your app in minutes with a unified, open-source interface.