
Getting Started

Get started with SpeechSDK — the universal text-to-speech SDK for JavaScript and TypeScript.

SpeechSDK is a lightweight, provider-agnostic TypeScript toolkit for text-to-speech. One API, 14 providers, zero lock-in.

Install

npm install @speech-sdk/core

For Agents

Give your AI coding assistant full knowledge of this library. Works with Claude Code, Cursor, Codex, and more.

npx skills add Jellypod-Inc/speech-sdk --skill speech-sdk

Quick Start

import { generateSpeech } from "@speech-sdk/core"

const result = await generateSpeech({
  model: "openai/gpt-4o-mini-tts",
  text: "Hello from SpeechSDK!",
  voice: "alloy",
})

result.audio.uint8Array // Uint8Array
result.audio.base64 // string (lazy-computed)
result.audio.mediaType // "audio/mpeg"
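
To persist the result in Node.js, one option is to pick a file extension from `mediaType` and write the bytes to disk. The sketch below is not part of SpeechSDK: `AudioLike`, `extensionFor`, and `saveAudio` are hypothetical helpers, and only the `"audio/mpeg"` media type is taken from the example above. The `"audio/wav"` entry is an assumption about what other output formats might report.

```typescript
import { writeFileSync } from "node:fs"

// Minimal shape of the `result.audio` fields shown in the Quick Start.
interface AudioLike {
  uint8Array: Uint8Array
  mediaType: string
}

// Hypothetical mapping. Only "audio/mpeg" appears in the docs above;
// "audio/wav" is an assumed value for WAV output.
const EXTENSIONS: Record<string, string> = {
  "audio/mpeg": "mp3",
  "audio/wav": "wav",
}

// Returns a file extension for a media type, falling back to "bin"
// when the media type is not in the table.
function extensionFor(mediaType: string): string {
  return EXTENSIONS[mediaType.toLowerCase()] ?? "bin"
}

// Writes the audio bytes to `<baseName>.<ext>` and returns the path used.
function saveAudio(audio: AudioLike, baseName: string): string {
  const path = `${baseName}.${extensionFor(audio.mediaType)}`
  writeFileSync(path, audio.uint8Array)
  return path
}
```

With the Quick Start result, `saveAudio(result.audio, "hello")` would write `hello.mp3`, since the media type there is `"audio/mpeg"`.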

That's it — no provider SDK to install, no client to initialize. Just pass a provider/model string and go.

How It Works

SpeechSDK resolves the provider from the model string, reads the API key from the corresponding environment variable, calls the provider's API, and returns a normalized result.

// These all work the same way
generateSpeech({ model: "openai/gpt-4o-mini-tts", text: "...", voice: "alloy" })
generateSpeech({
  model: "elevenlabs/eleven_v3",
  text: "...",
  voice: "voice-id",
})
generateSpeech({ model: "deepgram/aura-2", text: "...", voice: "thalia-en" })

// Pass just the provider name to use its default model
generateSpeech({ model: "openai", text: "...", voice: "alloy" })
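
The resolution step can be pictured with a small sketch. This is an illustration of the idea only, not SpeechSDK's actual implementation: `parseModelString` and `apiKeyEnvVar` are hypothetical names, and the `<PROVIDER>_API_KEY` naming convention is an assumption (check the Providers page for the real variable names).

```typescript
// Result of splitting a "provider/model" string.
interface ResolvedModel {
  provider: string
  // undefined means "use the provider's default model"
  model: string | undefined
}

// Splits on the first "/" only, so a model id that itself contains
// slashes would be kept intact (an assumption for this sketch).
function parseModelString(input: string): ResolvedModel {
  const slash = input.indexOf("/")
  const provider = slash === -1 ? input : input.slice(0, slash)
  if (provider.length === 0) {
    throw new Error(`Invalid model string: "${input}"`)
  }
  const model = slash === -1 ? undefined : input.slice(slash + 1)
  return { provider, model }
}

// Hypothetical convention: upper-case the provider name and append
// "_API_KEY", replacing any non-alphanumeric character with "_".
function apiKeyEnvVar(provider: string): string {
  return `${provider.toUpperCase().replace(/[^A-Z0-9]/g, "_")}_API_KEY`
}
```

Under these assumptions, `"openai/gpt-4o-mini-tts"` splits into provider `openai` and model `gpt-4o-mini-tts`, while a bare `"openai"` leaves the model undefined so a default can be chosen.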

Next Steps

  • Providers — see all 14 supported providers and their models
  • Streaming — stream audio chunk-by-chunk for low-latency playback
  • Multi-Speaker Conversation — generate a single audio file from a multi-turn, multi-voice script
  • Auto-Chunking — split long inputs at sentence boundaries and stitch the audio
  • Timestamps & Captions — word-level alignment and SRT / WebVTT caption generation
  • Audio Tags: write expressive cues once and have every provider handle them
  • Output Formats — request wav, mp3, or pcm with native pass-through
  • Pronunciations — substitute custom pronunciations with timestamps that still align to the original text
  • Speed Control — slow down or speed up generated audio without changing pitch
  • Voice Cloning — clone voices with reference audio
  • Configuration — custom API keys, base URLs, and fetch
  • Error Handling — handle API errors gracefully
