The Unified Text-to-Speech SDK

SpeechSDK is a free, open-source toolkit for building AI audio applications across multiple voice providers.

$ npm install @speech-sdk/core

import { generateSpeech } from '@speech-sdk/core';

const result = await generateSpeech({
  model: 'openai/gpt-4o-mini-tts',
  text: 'Hello from SpeechSDK!',
  voice: 'alloy',
});

result.audio.uint8Array;  // Uint8Array
result.audio.base64;      // string (lazy)
result.audio.mediaType;   // "audio/mpeg"

import { generateSpeech } from '@speech-sdk/core';

const result = await generateSpeech({
  model: 'elevenlabs/eleven_v3',
  text: 'Hello from SpeechSDK!',
  voice: 'EXAVITQu4vr4xnSDxMaL',
});

result.audio.uint8Array;  // Uint8Array
result.audio.base64;      // string (lazy)
result.audio.mediaType;   // "audio/mpeg"

import { generateSpeech } from '@speech-sdk/core';

const result = await generateSpeech({
  model: 'cartesia/sonic-3',
  text: 'Hello from SpeechSDK!',
  voice: 'a0e99841-438c-4a64-b679-ae501e7d6091',
});

result.audio.uint8Array;  // Uint8Array
result.audio.base64;      // string (lazy)
result.audio.mediaType;   // "audio/mpeg"

import { generateSpeech } from '@speech-sdk/core';

const result = await generateSpeech({
  model: 'google/gemini-3.1-flash-tts-preview',
  text: 'Hello from SpeechSDK!',
  voice: 'Kore',
});

result.audio.uint8Array;  // Uint8Array
result.audio.base64;      // string (lazy)
result.audio.mediaType;   // "audio/wav"

import { generateSpeech } from '@speech-sdk/core';

const result = await generateSpeech({
  model: 'xai/grok-tts',
  text: 'Hello from SpeechSDK!',
  voice: 'ava',
});

result.audio.uint8Array;  // Uint8Array
result.audio.base64;      // string (lazy)
result.audio.mediaType;   // "audio/mpeg"

import { generateSpeech } from '@speech-sdk/core';

const result = await generateSpeech({
  model: 'inworld/inworld-tts-1.5-max',
  text: 'Hello from SpeechSDK!',
  voice: 'Ashley',
});

result.audio.uint8Array;  // Uint8Array
result.audio.base64;      // string (lazy)
result.audio.mediaType;   // "audio/mpeg"


Providers

25+ Models

Built for Production

Open Source, Apache 2.0 License

One API, Every Provider

One interface across OpenAI, ElevenLabs, Deepgram, Cartesia, Google, Mistral, Hume, and more. Unified model strings, consistent response format, BYO API keys.

Streaming

Auto-Retries

Error Handling

Provider Options

Multi-Speaker Conversations

Generate a multi-speaker conversation in a single API call. Mix voices across providers, with automatic volume leveling and turn stitching.

Audio Tags

Voice Cloning

Pronunciations

Volume Leveling

Speed Control
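
To make "volume leveling and turn stitching" concrete, here is a minimal sketch of one common approach: scale each turn to a shared RMS level, then concatenate. The page does not say which leveling method SpeechSDK uses, so the function names, the target level, and the assumption that every turn is mono float PCM at the same sample rate are all hypothetical.

```typescript
// Hypothetical sketch: RMS-based leveling, not SpeechSDK's documented method.
// Assumes every turn is mono PCM in [-1, 1] at the same sample rate.

function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

function levelAndStitch(turns: Float32Array[], targetRms = 0.1): Float32Array {
  const totalLength = turns.reduce((n, turn) => n + turn.length, 0);
  const out = new Float32Array(totalLength);
  let offset = 0;
  for (const turn of turns) {
    const level = rms(turn);
    // Silent turns are left untouched to avoid dividing by zero.
    const gain = level > 0 ? targetRms / level : 1;
    for (let i = 0; i < turn.length; i++) {
      // Clamp so boosted peaks stay inside the valid sample range.
      out[offset + i] = Math.max(-1, Math.min(1, turn[i] * gain));
    }
    offset += turn.length;
  }
  return out;
}
```

A loud turn and a quiet turn passed through `levelAndStitch` come out at roughly the same RMS level, in their original order.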

Auto-Chunking & Timestamps

Splits long inputs at sentence boundaries for providers that limit input length.

Multilingual Splitter

Timestamps

SRT/VTT Captions

Output Formats
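
The sentence-boundary splitting described above can be sketched as follows. This is an illustration only: `chunkBySentence` is a hypothetical helper, and the SDK's own splitter (including its multilingual handling) may behave differently.

```typescript
// Hypothetical sketch of sentence-boundary chunking, not the SDK's splitter.
// Limitation: treats every '.', '!' or '?' as a sentence end, so "3.5" would split.

function chunkBySentence(text: string, maxChars: number): string[] {
  const matches = text.match(/[^.!?]+[.!?]*/g) ?? [];
  const sentences = matches.map((s) => s.trim()).filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length <= maxChars) {
      current = candidate;
    } else {
      if (current) chunks.push(current);
      // A single sentence longer than maxChars is kept whole in this sketch.
      current = sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const exampleChunks = chunkBySentence("First sentence. Second one! A third? Done.", 30);
// exampleChunks: ["First sentence. Second one!", "A third? Done."]
```

Each chunk stays under the limit where possible and never cuts a sentence in half, which is what keeps the stitched audio from breaking mid-phrase.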

Why SpeechSDK?

Locking into a single TTS provider's SDK means rewriting code when a better or less expensive model ships.

The SpeechSDK integrates all major providers into an easy-to-use, unified interface so you can swap models without breaking your application code.

$ npm i @speech-sdk/core

Supports

+ 10 providers

generate-conversation.ts

import { generateConversation } from "@speech-sdk/core";

const result = await generateConversation({
  turns: [
    {
      model: "elevenlabs/eleven_v3",
      voice: "EXAVITQu4vr4xnSDxMaL",
      text: "Hello from the SDK.",
    },
    {
      model: "google/gemini-3.1-flash-tts-preview",
      voice: "Kore",
      text: "One call. Multiple voices. Auto-leveled.",
    },
  ],
});

result.audio.uint8Array; // Uint8Array
result.audio.mediaType;  // "audio/mpeg"

AI Engineering

For Production Voice Applications

Smart retries

Jittered exponential backoff automatically retries 5xx and 429 responses. 429s honor Retry-After (capped at 60s) and expose the delay via ApiError.retryAfterMs.
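
As a rough illustration of the retry policy described above, the sketch below computes a delay for each retry attempt. Only the 60s Retry-After cap and the retryable status codes come from this page; the helper names, the 500 ms base, the 30s ceiling, and the "full jitter" strategy are assumptions, and the sketch handles Retry-After only in its delay-seconds form.

```typescript
// Hypothetical sketch; only the 60s Retry-After cap and the 5xx/429 rule come from the page.
const RETRY_AFTER_CAP_MS = 60_000;

function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

function retryDelayMs(
  attempt: number, // 0 for the first retry
  retryAfterSeconds?: number, // parsed Retry-After header, if the server sent one
  random: () => number = Math.random, // injectable for deterministic tests
  baseMs = 500, // assumed base delay
  maxMs = 30_000, // assumed ceiling for computed backoff
): number {
  if (retryAfterSeconds !== undefined) {
    // The server's hint wins, but never wait longer than the cap.
    return Math.min(retryAfterSeconds * 1000, RETRY_AFTER_CAP_MS);
  }
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  // Full jitter: pick uniformly in [0, ceiling) to spread out retries.
  return Math.floor(random() * ceiling);
}

retryDelayMs(0, 120); // 60000: a 120s Retry-After is capped at 60s
retryDelayMs(3, undefined, () => 0.5); // 2000: half of min(30000, 500 * 2^3)
```

Jitter matters when many clients hit the same rate limit at once: without it they would all retry at the same instant.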

Long inputs, handled

maxInputChars splits at sentence boundaries, stitches chunks into one audio file, and reconnects word-level timestamps end-to-end.
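
One way to picture how word-level timestamps stay continuous across stitched chunks: shift each chunk's timestamps by the total duration of the chunks before it. The types and the `mergeTimestamps` helper below are hypothetical and only illustrate the idea, not the SDK's internal representation.

```typescript
// Hypothetical shapes; the SDK's actual timestamp types are not shown on this page.
interface WordTimestamp {
  word: string;
  startMs: number;
  endMs: number;
}

interface ChunkResult {
  durationMs: number; // length of this chunk's audio
  words: WordTimestamp[]; // timestamps relative to the start of this chunk
}

function mergeTimestamps(chunks: ChunkResult[]): WordTimestamp[] {
  const merged: WordTimestamp[] = [];
  let offsetMs = 0;
  for (const chunk of chunks) {
    for (const w of chunk.words) {
      // Re-express each word relative to the start of the stitched audio.
      merged.push({
        word: w.word,
        startMs: w.startMs + offsetMs,
        endMs: w.endMs + offsetMs,
      });
    }
    offsetMs += chunk.durationMs;
  }
  return merged;
}
```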

Format conversions

Render wav, mp3, or pcm from any provider. Native pass-through where supported, lossless local conversion otherwise.
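
The simplest lossless local conversion is wrapping raw PCM in a WAV container, since WAV stores the same samples behind a 44-byte header. The sketch below shows that case for 16-bit little-endian PCM; `pcm16ToWav` is a hypothetical helper, not an SDK function, and the SDK's own converter may differ.

```typescript
// Hypothetical sketch: wrap 16-bit little-endian PCM in a canonical 44-byte WAV header.
function pcm16ToWav(pcm: Uint8Array, sampleRate: number, channels: number): Uint8Array {
  const bytesPerSample = 2;
  const blockAlign = channels * bytesPerSample;
  const wav = new Uint8Array(44 + pcm.length);
  const view = new DataView(wav.buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true); // byte rate
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true);
  wav.set(pcm, 44); // samples are copied unchanged, so nothing is lost
  return wav;
}
```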

Custom fetch & Base URL

Every provider accepts a custom fetch and baseURL — point at OpenAI-compatible proxies, Azure, LiteLLM, or local models.
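
A custom fetch is just a function with fetch's shape, so it can rewrite the target URL or add headers before delegating. The sketch below uses a pared-down fetch-like type to stay self-contained; `withProxy`, the header name, and the local proxy URL are hypothetical, and this page does not show how the function is handed to a provider.

```typescript
// Minimal fetch-like signature so the sketch does not depend on DOM or Node typings.
type FetchInit = { method?: string; headers?: Record<string, string>; body?: string };
type FetchLike = (url: string, init?: FetchInit) => Promise<unknown>;

// Hypothetical wrapper: reroutes requests from one base URL to another and adds headers.
function withProxy(
  inner: FetchLike,
  fromBase: string,
  toBase: string,
  extraHeaders: Record<string, string>,
): FetchLike {
  return (url, init = {}) => {
    const target = url.startsWith(fromBase) ? toBase + url.slice(fromBase.length) : url;
    return inner(target, {
      ...init,
      headers: { ...init.headers, ...extraHeaders },
    });
  };
}
```

In real use, `inner` would be the global `fetch` and `toBase` would point at the proxy or local server, for example `http://localhost:4000/v1` (an illustrative address).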

Words and captions

Word-level timestamps from native alignment or a one-shot STT fallback. timestampsToCaptions ships SRT or WebVTT in a single call.
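
To show what turning word timestamps into captions involves, here is a minimal SRT writer that groups a fixed number of words per cue. It is not the SDK's `timestampsToCaptions`; `wordsToSrt`, the `Word` shape, and the fixed-size grouping are hypothetical choices made for the example.

```typescript
// Hypothetical sketch, not the SDK's timestampsToCaptions implementation.
interface Word {
  word: string;
  startMs: number;
  endMs: number;
}

// SRT timestamps look like 00:00:01,500 (comma before the milliseconds).
function formatSrtTime(ms: number): string {
  const pad = (n: number, width: number): string => ("000" + Math.floor(n)).slice(-width);
  const hours = Math.floor(ms / 3_600_000);
  const minutes = Math.floor((ms % 3_600_000) / 60_000);
  const seconds = Math.floor((ms % 60_000) / 1000);
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)},${pad(ms % 1000, 3)}`;
}

function wordsToSrt(words: Word[], wordsPerCue = 4): string {
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const group = words.slice(i, i + wordsPerCue);
    const start = formatSrtTime(group[0].startMs);
    const end = formatSrtTime(group[group.length - 1].endMs);
    const text = group.map((w) => w.word).join(" ");
    // Each cue: 1-based index, time range, then the caption text.
    cues.push(`${cues.length + 1}\n${start} --> ${end}\n${text}`);
  }
  return cues.join("\n\n") + "\n";
}
```

WebVTT output would follow the same structure with a `WEBVTT` header line and a period instead of a comma before the milliseconds.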

Speechbase ready

Queuing, quality processing, voice management, and analytics — one config change to connect. Coming soon.

PROVIDERS

Every model, one interface


Provider	Model String	Default*
OpenAI	openai/gpt-4o-mini-tts	Yes
ElevenLabs	elevenlabs/eleven_v3	Yes
ElevenLabs	elevenlabs/eleven_flash_v2_5	—
ElevenLabs	elevenlabs/eleven_flash_v2	—
Deepgram	deepgram/aura-2	Yes
Cartesia	cartesia/sonic-3	Yes
Hume	hume/octave-2	Yes
Google	google/gemini-3.1-flash-tts-preview	Yes
Fish Audio	fish-audio/s2-pro	Yes
Inworld	inworld/inworld-tts-1.5-max	Yes
Murf	murf/GEN2	Yes
Smallest AI	smallest-ai/lightning-v3.1	Yes
Resemble	resemble/default	Yes
fal	fal-ai/*	—
Mistral	mistral/voxtral-mini-tts-2603	Yes
xAI	xai/grok-tts	Yes

* Pass just the provider name to use its default model — e.g. model: 'openai' resolves to openai/gpt-4o-mini-tts.
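
The default-model rule in the footnote can be sketched as a small resolver. The defaults below are copied from a few rows of the table above; `resolveModel` and its return shape are hypothetical and only illustrate the `provider/model` convention, not the SDK's internal code.

```typescript
// Defaults copied from the provider table (subset); the resolver is a hypothetical sketch.
const DEFAULT_MODELS: Record<string, string> = {
  openai: "gpt-4o-mini-tts",
  elevenlabs: "eleven_v3",
  deepgram: "aura-2",
  cartesia: "sonic-3",
  google: "gemini-3.1-flash-tts-preview",
};

function resolveModel(model: string): { provider: string; modelId: string } {
  // Split on the first slash only, so model ids that contain slashes stay intact.
  const slash = model.indexOf("/");
  if (slash !== -1) {
    return { provider: model.slice(0, slash), modelId: model.slice(slash + 1) };
  }
  const modelId = DEFAULT_MODELS[model];
  if (modelId === undefined) {
    throw new Error(`No default model for provider "${model}"`);
  }
  return { provider: model, modelId };
}

resolveModel("openai"); // { provider: "openai", modelId: "gpt-4o-mini-tts" }
resolveModel("elevenlabs/eleven_flash_v2_5"); // { provider: "elevenlabs", modelId: "eleven_flash_v2_5" }
```

Providers without a default in the table, such as fal, would need a full model string; in this sketch they throw.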

Frequently asked questions

How is this different from using provider SDKs directly?

Each provider has its own SDK, request format, auth pattern, and response shape. SpeechSDK is one API for every provider: same function call, same result type, same error handling. Switch providers by changing a model string.

What's the difference between a model and a voice?

Do I need my own API keys?

Does it work in the browser?

What audio formats are returned?

How does voice cloning work?

What is Speechbase?

One SDK, every provider. Add text-to-speech to your app in minutes with a unified, open-source interface.