Text to Speech API

Generate audio from text for voiceovers, audiobooks, accessibility features, or interactive voice experiences.

Why Choose Text to Speech?

Global Support
30+ languages supported. Choose from the Standard, Neural, Long-Form, and Generative engines.
60+ Voices
Voices range from professional narration to conversational styles. Use OpenAI voice names with automatic language detection, or specify any Polly voice ID directly.
Automatic Language Detection
Using OpenAI voice names? AWS Comprehend automatically detects your content's language and selects an appropriate Polly voice—matching language, gender, and quality.
Advanced Control with SSML
Fine-tune pronunciation, emphasis, pauses, and prosody with SSML markup for complex audio requirements.

Quick Start: Available Endpoint

Endpoint	Method	What It Does	Powered By
`/v1/audio/speech`	POST	Turn text into natural-sounding speech	AWS Polly + AWS Comprehend

Feature Compatibility

Feature	Status	Notes
Voice Selection
OpenAI voice names		Mapped to Polly voices
Polly voice IDs		60+ voices across 30+ languages
Dynamic voice selection		Selects the best Polly voice for the detected language
Input
Plain text		Standard text input
SSML markup		Fine-grained speech control
Output Formats
MP3		Native Polly format
PCM		Native Polly format
Opus		Native Polly format
AAC		Encoded from PCM
FLAC		Encoded from PCM
WAV		Encoded from PCM
OGG (Vorbis)		Native Polly format
Control
`speed` parameter		0.2x to 2.0x playback speed
Extra model-specific params		Polly parameters that the OpenAI API does not define (see Provider-Specific Parameters)
Streaming
Byte streaming		Default streaming mode
SSE streaming		Event-based streaming
Usage tracking
Input text tokens		Character count (billing unit)
Output tokens		Not available

Legend:

Supported — Fully compatible with OpenAI API
Extra Feature — Enhanced capability beyond OpenAI API
Unsupported — Not available in this implementation

Advanced Features

OpenAI-Compatible with AWS Power

Models & Voices:

Use amazon.polly-standard, amazon.polly-neural, amazon.polly-long-form, or amazon.polly-generative (instead of tts-1/tts-1-hd)
OpenAI voice names work with automatic language detection and intelligent voice selection
Or specify any Polly voice ID directly for 60+ voices across 30+ languages

Enhanced Features:

SSML Support: Fine-grained control over pronunciation, emphasis, pauses, and prosody (see the SSML docs)
Flexible Formats: mp3, ogg, opus, pcm, wav, flac, aac (wav, flac, and aac are transcoded server-side via ffmpeg)
Streaming Options: Raw bytes (default) or SSE events with stream_format: "sse"
Speed Control: Adjust playback from 0.2x to 2.0x
Character-Based Billing: Usage tracks character counts—the native billing unit for AWS Polly and AWS Comprehend—rather than OpenAI-style tokens
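
The SSML option above can be sketched as follows. This is a minimal illustration, not the service's reference client: it assumes the endpoint treats an `input` value wrapped in `<speak>` tags as SSML (this page does not state how SSML is detected), and the helper name `build_ssml_request` is hypothetical.

```python
import json
from xml.sax.saxutils import escape


def build_ssml_request(sentences, pause_ms=500, voice="Joanna"):
    """Build a /v1/audio/speech request body whose input is an SSML document.

    Assumption: the server interprets input wrapped in <speak> tags as SSML.
    """
    # Escape &, < and > so that plain text cannot break the SSML markup.
    parts = [escape(sentence) for sentence in sentences]
    # <break> inserts a pause of the given length between sentences.
    separator = f'<break time="{pause_ms}ms"/>'
    ssml = "<speak>" + separator.join(parts) + "</speak>"
    return {
        "model": "amazon.polly-neural",
        "voice": voice,
        "input": ssml,
        "response_format": "mp3",
    }


if __name__ == "__main__":
    body = build_ssml_request(["Terms & conditions apply.", "Thank you."])
    print(json.dumps(body, indent=2))
```

Escaping the text before wrapping it matters because an unescaped `&` or `<` makes the SSML document invalid.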

Performance Tips: Optimize Speed & Cost

Use native Polly formats (mp3, ogg, opus, pcm) to skip server-side conversion
Specify a Polly voice ID to bypass language detection—faster responses, no AWS Comprehend charges
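
A client can check both tips before sending a request. The sketch below takes the native and transcoded format lists from the Feature Compatibility table; the set of OpenAI voice names is an assumption (the original six OpenAI voices), since this page does not list which names the server maps. Both function names are hypothetical.

```python
# Formats the Feature Compatibility table marks as native to Polly.
NATIVE_FORMATS = {"mp3", "pcm", "opus", "ogg"}
# Formats the table marks as encoded from PCM on the server.
TRANSCODED_FORMATS = {"aac", "flac", "wav"}
# Assumption: the OpenAI voice names the server recognizes.
OPENAI_VOICE_NAMES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}


def needs_transcoding(response_format):
    """Return True if the format requires server-side conversion."""
    fmt = response_format.lower()
    if fmt in NATIVE_FORMATS:
        return False
    if fmt in TRANSCODED_FORMATS:
        return True
    raise ValueError(f"unknown response_format: {response_format!r}")


def triggers_language_detection(voice):
    """Return True if the voice is an OpenAI name rather than a Polly voice ID."""
    return voice.lower() in OPENAI_VOICE_NAMES


if __name__ == "__main__":
    print(needs_transcoding("wav"), triggers_language_detection("Joanna"))
```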

Provider-Specific Parameters

Unlock advanced AWS Polly capabilities by passing provider-specific parameters directly in your requests. These parameters are forwarded to AWS Polly's synthesize_speech API and allow you to access features unique to Polly.

How It Works:

Add provider-specific fields at the top level of your request body alongside standard OpenAI parameters. The API automatically forwards these to AWS Polly.

Examples:

Lexicon Support:

Apply custom pronunciation lexicons to your speech synthesis:

{
  "model": "amazon.polly-neural",
  "voice": "Joanna",
  "input": "AWS Polly uses lexicons for custom pronunciation.",
  "response_format": "mp3",
  "LexiconNames": ["MyCustomLexicon"]
}

Sample Rate:

Specify custom audio sample rate (8000, 16000, 22050, 24000, 44100, or 48000 Hz):

{
  "model": "amazon.polly-neural",
  "voice": "Matthew",
  "input": "High quality audio at 24kHz.",
  "response_format": "mp3",
  "SampleRate": "24000"
}

Language Code:

Specify the language for bilingual voices (only useful for voices that support multiple languages):

{
  "model": "amazon.polly-neural",
  "voice": "Aditi",
  "input": "Hello, how are you?",
  "response_format": "mp3",
  "LanguageCode": "en-IN"
}

Configuration Options:

Option 1: Per-Request

Add provider-specific parameters directly in your request body (as shown in the examples above).

Option 2: Server-Wide Defaults

Configure default parameters for specific models via the DEFAULT_MODEL_PARAMS environment variable:

export DEFAULT_MODEL_PARAMS='{
  "amazon.polly-neural": {
    "SampleRate": "24000"
  }
}'

Note: Per-request parameters override server-wide defaults.
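
The precedence rule can be illustrated with a short sketch. It is not the server's implementation, only a model of the documented behavior: defaults for the requested model are read from `DEFAULT_MODEL_PARAMS`, then any field present in the request body replaces the default. The function name `resolve_params` is hypothetical.

```python
import json
import os


def resolve_params(model, request_body, env=os.environ):
    """Merge server-wide defaults for a model with per-request parameters."""
    # DEFAULT_MODEL_PARAMS holds a JSON object keyed by model name.
    defaults = json.loads(env.get("DEFAULT_MODEL_PARAMS", "{}"))
    merged = dict(defaults.get(model, {}))
    # Per-request values are applied last, so they override the defaults.
    merged.update(request_body)
    return merged
```

With the default shown above, a request that sets `"SampleRate": "16000"` resolves to 16000, while a request that omits `SampleRate` resolves to 24000.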

Behavior:

✅ Compatible parameters: Forwarded to Polly and applied
⚠️ Unsupported parameters: Return HTTP 400 with an error message

Available Parameters:

The following parameters from the AWS Polly SynthesizeSpeech API can be used:

LexiconNames (list): Apply pronunciation lexicons
SampleRate (string): Audio sample rate in Hz
LanguageCode (string): Language code for bilingual voices only (e.g., en-IN, hi-IN)
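
Because unsupported parameters are rejected with HTTP 400, it can help to check a request body on the client first. The sketch below uses the three parameters and types listed above. The list of standard fields is an assumption drawn from the fields used elsewhere on this page, and `check_provider_params` is a hypothetical helper, not part of the API.

```python
# Provider-specific parameters listed on this page, with their expected types.
PROVIDER_PARAMS = {"LexiconNames": list, "SampleRate": str, "LanguageCode": str}
# Assumption: the standard request fields, as used in this page's examples.
STANDARD_FIELDS = {"model", "voice", "input", "response_format", "speed", "stream_format"}


def check_provider_params(body):
    """Return a list of problems found in the top-level request fields."""
    problems = []
    for key, value in body.items():
        if key in STANDARD_FIELDS:
            continue
        expected = PROVIDER_PARAMS.get(key)
        if expected is None:
            problems.append(f"unsupported parameter: {key}")
        elif not isinstance(value, expected):
            problems.append(f"{key} must be of type {expected.__name__}")
    return problems


if __name__ == "__main__":
    # SampleRate is a string in the Polly API, so the integer 24000 is flagged.
    print(check_provider_params({"model": "amazon.polly-neural", "SampleRate": 24000}))
```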

Try It Now

Stream audio as bytes (default):

curl -OJ -X POST "$BASE/v1/audio/speech" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "amazon.polly-neural",
    "voice": "Amy",
    "input": "Welcome to the future of voice technology!",
    "response_format": "mp3"
  }'
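
The same request can be sent from Python using only the standard library. This is a rough equivalent of the curl command above, not an official client: the function names are hypothetical, and the base URL and API key are passed in by the caller (for example from the `BASE` and `OPENAI_API_KEY` environment variables used above).

```python
import json
import shutil
import urllib.request


def build_speech_request(base_url, api_key, body):
    """Create a POST request for /v1/audio/speech with a JSON body."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_speech(base_url, api_key, body, path):
    """Send the request and copy the audio bytes to a file as they arrive."""
    request = build_speech_request(base_url, api_key, body)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)


# Example call (performs a network request):
# save_speech(os.environ["BASE"], os.environ["OPENAI_API_KEY"],
#             {"model": "amazon.polly-neural", "voice": "Amy",
#              "input": "Welcome!", "response_format": "mp3"}, "speech.mp3")
```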

Stream audio as SSE events:

curl -N -X POST "$BASE/v1/audio/speech" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "amazon.polly-neural",
    "voice": "Amy",
    "input": "This audio streams as SSE events!",
    "response_format": "mp3",
    "stream_format": "sse"
  }'
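
Consuming the SSE stream requires decoding each event. The sketch below assumes the events follow the OpenAI speech streaming format, where `speech.audio.delta` events carry base64-encoded audio in an `audio` field and a `speech.audio.done` event ends the stream. This page does not document the event schema, so verify it against real output before relying on it. `iter_audio_chunks` is a hypothetical helper.

```python
import base64
import json


def iter_audio_chunks(lines):
    """Yield decoded audio bytes from an iterable of SSE lines."""
    for raw in lines:
        # HTTP responses yield bytes, so decode them before parsing.
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        line = raw.strip()
        # Only "data:" lines carry event payloads; skip blanks and comments.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        event = json.loads(payload)
        if event.get("type") == "speech.audio.delta":
            yield base64.b64decode(event["audio"])
        elif event.get("type") == "speech.audio.done":
            return
```

Joining the yielded chunks with `b"".join(...)` reconstructs the audio in the requested `response_format`.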

Ready to add voice to your application? Explore available voices and models in the Models API.