The Unified Text-to-Speech SDK

The SpeechSDK is a free, open-source toolkit for building AI audio applications with multiple voice providers.

$ npm install @speech-sdk/core

import { generateSpeech } from '@speech-sdk/core';

const result = await generateSpeech({
  model: 'openai/gpt-4o-mini-tts',
  text: 'Hello from SpeechSDK!',
  voice: 'alloy',
});

result.audio.uint8Array;  // Uint8Array
result.audio.base64;      // string (lazy)
result.audio.mediaType;   // "audio/mpeg"

12 Providers
25+ Models
Built for Production
Open Source (MIT License)

Multi-Provider

One interface across OpenAI, ElevenLabs, Deepgram, Cartesia, Google, Mistral, Hume, and more. Unified model strings, a consistent response format, and bring-your-own API keys.

Cross-Platform

Runs everywhere — Node.js, Edge runtimes, and the browser. Same API, zero platform-specific code.

Node.js · Edge · Browser

Minimal Dependencies

Lightweight by design. Built-in retries, typed errors, and lazy base64 encoding. No heavy frameworks.

Why SpeechSDK?

Locking into a single TTS provider's SDK means rewriting code when a better or less expensive model ships.

The SpeechSDK integrates all major providers into an easy-to-use, unified interface so you can swap models without breaking your application code.

$ npm i @speech-sdk/core

Supports

OpenAI
ElevenLabs
Google
Cartesia
+ 8 more providers

generate-speech.ts
import { generateSpeech } from '@speech-sdk/core';
import { createOpenAI } from '@speech-sdk/core/openai';

const myOpenAI = createOpenAI();

const result = await generateSpeech({
  model: myOpenAI('gpt-4o-mini-tts'),
  text: 'Hello from SpeechSDK!',
  voice: 'alloy',
});

result.audio.uint8Array;  // Uint8Array
result.audio.base64;      // string (lazy)
result.audio.mediaType;   // "audio/mpeg"

AI Engineering

For Production Voice Applications

Lazy base64 conversion

Only computes the format you access — uint8Array or base64 — and caches it. No unnecessary encoding or wasted memory.
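
The lazy, cached encoding described above can be sketched roughly as follows. This is an illustrative sketch only: the `LazyAudio` class and its `encodeCount` field are hypothetical names, not the SDK's actual internals.

```typescript
// Hypothetical sketch; LazyAudio is not the SDK's real class.
class LazyAudio {
  private cachedBase64?: string;
  // Counts encodings so the caching behavior can be observed.
  encodeCount = 0;

  constructor(
    readonly uint8Array: Uint8Array,
    readonly mediaType: string,
  ) {}

  // Encoded on first access, then served from the cache.
  get base64(): string {
    if (this.cachedBase64 === undefined) {
      let binary = '';
      for (let i = 0; i < this.uint8Array.length; i++) {
        binary += String.fromCharCode(this.uint8Array[i]);
      }
      this.cachedBase64 = btoa(binary);
      this.encodeCount += 1;
    }
    return this.cachedBase64;
  }
}

// Two bytes standing in for real audio data.
const audio = new LazyAudio(new Uint8Array([72, 105]), 'audio/mpeg');
```

Reading `audio.base64` repeatedly should trigger a single encoding, and code that only touches `audio.uint8Array` never pays for one.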

Content-type awareness

The mediaType is read directly from each provider's response headers. You always know the actual audio format — MP3 from OpenAI, WAV from Cartesia, etc.
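
As a rough illustration of reading the media type from response headers, the helper below takes a standard `Response` and returns the normalized content type. The function name `mediaTypeFrom` and the `application/octet-stream` fallback are assumptions made for this sketch; the SDK's own handling may differ.

```typescript
// Hypothetical helper; not part of the SDK's documented API.
function mediaTypeFrom(
  response: Response,
  fallback = 'application/octet-stream', // assumed fallback for a missing header
): string {
  const header = response.headers.get('content-type');
  if (header === null) return fallback;
  // Drop parameters such as "; rate=44100" and normalize case.
  const [essence = ''] = header.split(';');
  const normalized = essence.trim().toLowerCase();
  return normalized === '' ? fallback : normalized;
}
```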

Custom fetch & Base URL

Every provider accepts a custom fetch and baseURL. Point at OpenAI-compatible proxies, Azure OpenAI, LiteLLM, or local models. Swap in undici, a proxy-aware fetch, or a mock.
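
One way to build a custom fetch is to wrap an existing one. The sketch below adds a header before delegating and uses a mock in place of the network. `FetchLike`, `withHeader`, and the `x-trace-id` header are hypothetical names, and the exact option shape shown in the final comment is an assumption based on the description above.

```typescript
// Simplified fetch signature used for this sketch (string URLs only).
type FetchLike = (input: string, init?: RequestInit) => Promise<Response>;

// Returns a fetch that sets one extra header, then delegates to `inner`.
function withHeader(inner: FetchLike, name: string, value: string): FetchLike {
  return (input, init = {}) => {
    const headers = new Headers(init.headers);
    headers.set(name, value);
    return inner(input, { ...init, headers });
  };
}

// Mock fetch that records what it was asked for instead of hitting the network.
const seen: string[] = [];
const mockFetch: FetchLike = async (input, init) => {
  seen.push(`${input} ${new Headers(init?.headers).get('x-trace-id')}`);
  return new Response(null, { headers: { 'content-type': 'audio/mpeg' } });
};

const tracedFetch = withHeader(mockFetch, 'x-trace-id', 'demo');

// Assumed wiring, not verified against the SDK:
// const myOpenAI = createOpenAI({ baseURL: 'https://my-proxy.example/v1', fetch: tracedFetch });
```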

Smart retries

Built-in retry with exponential backoff via p-retry. Retries 5xx and network errors automatically. 4xx errors (auth failures, bad requests) abort immediately — no wasted time.
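
The retry policy described above can be approximated with a small hand-rolled helper. This is a stand-in for illustration, not the SDK's code (which is described as using p-retry). The names `withRetry` and `statusOf`, the retry count, and the base delay are all assumptions.

```typescript
// Extracts a numeric HTTP status from an unknown error, if it carries one.
function statusOf(error: unknown): number | undefined {
  if (typeof error === 'object' && error !== null && 'status' in error) {
    const status = (error as { status: unknown }).status;
    return typeof status === 'number' ? status : undefined;
  }
  return undefined;
}

// Retries failed attempts with exponential backoff; 4xx errors are rethrown at once.
async function withRetry<T>(
  attempt: () => Promise<T>,
  retries = 3, // assumed default
  baseDelayMs = 200, // assumed default
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await attempt();
    } catch (error) {
      const status = statusOf(error);
      const isClientError = status !== undefined && status >= 400 && status < 500;
      if (isClientError || i >= retries) throw error;
      // Exponential backoff: base, 2x base, 4x base, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
}
```

Errors without a status (for example, network failures) and 5xx responses are retried until the attempts run out, while a 401 or 400 surfaces on the first failure.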

One runtime dependency

The only runtime dependency is p-retry. The SDK uses raw fetch and Uint8Array, with no heavy audio libraries and no provider SDK wrappers. It works anywhere fetch works.

Works seamlessly with Speech Gateway

Speech Gateway adds production infrastructure — queuing, quality processing, voice management, and analytics. One config change to connect. Coming Soon.

PROVIDERS

Every model, one interface

Provider       Model String                         Default*
OpenAI         openai/gpt-4o-mini-tts               Yes
OpenAI         openai/tts-1
OpenAI         openai/tts-1-hd
ElevenLabs     elevenlabs/eleven_multilingual_v2    Yes
ElevenLabs     elevenlabs/eleven_v3
ElevenLabs     elevenlabs/eleven_flash_v2_5
ElevenLabs     elevenlabs/eleven_flash_v2
Deepgram       deepgram/aura-2                      Yes
Cartesia       cartesia/sonic-3                     Yes
Hume           hume/octave-2                        Yes
Google         google/gemini-2.5-flash-preview-tts  Yes
Google         google/gemini-2.5-pro-preview-tts
Fish Audio     fish-audio/s2-pro                    Yes
Unreal Speech  unreal-speech/default                Yes
Murf           murf/GEN2                            Yes
Resemble       resemble/default                     Yes
fal            fal-ai/*
Mistral        mistral/voxtral-mini-tts-2603        Yes

* Pass just the provider name to use its default model — e.g. model: 'openai' resolves to openai/gpt-4o-mini-tts.
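
The default-model shorthand can be pictured as a simple lookup. The defaults below are copied from the table above; the `resolveModel` function and the `DEFAULT_MODELS` map are hypothetical and only illustrate the idea, not the SDK's actual resolution logic.

```typescript
// Default model per provider prefix, taken from the provider table.
const DEFAULT_MODELS: Record<string, string> = {
  openai: 'gpt-4o-mini-tts',
  elevenlabs: 'eleven_multilingual_v2',
  deepgram: 'aura-2',
  cartesia: 'sonic-3',
  hume: 'octave-2',
  google: 'gemini-2.5-flash-preview-tts',
  'fish-audio': 's2-pro',
  'unreal-speech': 'default',
  murf: 'GEN2',
  resemble: 'default',
  mistral: 'voxtral-mini-tts-2603',
};

// Hypothetical resolver: expands a bare provider name to "provider/model".
function resolveModel(model: string): string {
  if (model.includes('/')) return model; // already fully qualified
  const fallback = DEFAULT_MODELS[model];
  if (fallback === undefined) {
    throw new Error(`No default model for provider "${model}"`);
  }
  return `${model}/${fallback}`;
}
```

In this sketch, a provider with no default in the table (such as fal) has no entry in the map, so a bare provider name for it is rejected rather than guessed.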

Frequently asked questions

Each provider has its own SDK, request format, auth pattern, and response shape. SpeechSDK gives you one interface for all of them — same function call, same result type, same error handling. Switch providers by simply changing a model string.

SpeechSDK

One SDK, every provider. Add text-to-speech to your app in minutes with a unified, open-source interface.