Text to Speech API

Generate audio from text for voiceovers, audiobooks, accessibility features, or interactive voice experiences.

Why Choose Text to Speech?

  • Global Support
    30+ languages supported. Choose from Neural, Generative, and Long-Form engines.

  • 60+ Voices
    From professional narration to conversational voices. Use OpenAI voice names with automatic language detection, or specify any Polly voice ID directly.

  • Automatic Language Detection
    Using OpenAI voice names? AWS Comprehend automatically detects your content's language and selects an appropriate Polly voice—matching language, gender, and quality.

  • Advanced Control with SSML
    Fine-tune pronunciation, emphasis, pauses, and prosody with SSML markup for complex audio requirements.

Quick Start: Available Endpoint

Endpoint | Method | What It Does | Powered By
/v1/audio/speech | POST | Turn text into natural-sounding speech | AWS Polly + AWS Comprehend

Feature Compatibility

Feature | Notes

Voice Selection
  OpenAI voice names | Mapped to Polly voices
  Polly voice IDs | 60+ voices across 30+ languages
  Dynamic voice selection | Selects the best Polly voice for the detected language

Input
  Plain text | Standard text input
  SSML markup | Fine-grained speech control

Output Formats
  MP3 | Native Polly format
  PCM | Native Polly format
  Opus | Native Polly format
  AAC | Encoded from PCM
  FLAC | Encoded from PCM
  WAV | Encoded from PCM
  OGG (Vorbis) | Native Polly format

Control
  speed parameter | 0.2x to 2.0x playback speed
  Extra model-specific params | Provider-specific parameters that are not part of the OpenAI API

Streaming
  Byte streaming | Default streaming mode
  SSE streaming | Event-based streaming

Usage tracking
  Input text tokens | Character count (billing unit)
  Output tokens | Not available

Legend:

  • Supported — Fully compatible with OpenAI API
  • Extra Feature — Enhanced capability beyond OpenAI API
  • Unsupported — Not available in this implementation

Advanced Features

AWS Polly: OpenAI-Compatible with AWS Power

Models & Voices:

  • Use amazon.polly-standard, amazon.polly-neural, amazon.polly-long-form, or amazon.polly-generative (instead of tts-1/tts-1-hd)
  • OpenAI voice names work with automatic language detection and intelligent voice selection
  • Or specify any Polly voice ID directly for 60+ voices across 30+ languages
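
The two voice-selection styles above can be sketched as request bodies. This is only an illustration: the helper name `speech_payload` is hypothetical, `alloy` is assumed to be one of the accepted OpenAI voice names, and `Joanna` is the Polly voice ID used elsewhere on this page.

```python
import json


def speech_payload(text, voice, model="amazon.polly-neural", response_format="mp3"):
    """Build a /v1/audio/speech request body (hypothetical helper)."""
    return {
        "model": model,
        "voice": voice,
        "input": text,
        "response_format": response_format,
    }


# OpenAI voice name: the server detects the language and picks a Polly voice.
auto_voice = speech_payload("Bonjour tout le monde", voice="alloy")

# Polly voice ID: used as given, with no language detection step.
fixed_voice = speech_payload("Hello, world", voice="Joanna")

print(json.dumps(auto_voice, ensure_ascii=False))
print(json.dumps(fixed_voice))
```

Either body can be sent unchanged to the endpoint listed under Quick Start.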

Enhanced Features:

  • SSML Support: Fine-grained control over pronunciation, emphasis, pauses, and prosody (see the SSML docs)
  • Flexible Formats: mp3, ogg, opus, pcm, wav, flac, aac (wav, flac, and aac are transcoded server-side via ffmpeg)
  • Streaming Options: Raw bytes (default) or SSE events with stream_format: "sse"
  • Speed Control: Adjust playback from 0.2x to 2.0x
  • Character-Based Billing: Usage tracks character counts—the native billing unit for AWS Polly and AWS Comprehend—rather than OpenAI-style tokens
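
As a rough illustration of the SSML support listed above, the sketch below builds an SSML string and places it in the `input` field. It assumes the API accepts SSML in `input` when the text is wrapped in a `<speak>` root, which is the Polly convention; this page does not state how SSML is distinguished from plain text. The helper `to_ssml` is hypothetical.

```python
from xml.sax.saxutils import escape


def to_ssml(first, second, pause_ms=500, rate="slow"):
    """Speak `first`, pause, then speak `second` at the given prosody rate."""
    return (
        "<speak>"
        f"{escape(first)}"  # escape &, <, > so the markup stays well-formed
        f'<break time="{pause_ms}ms"/>'
        f'<prosody rate="{rate}">{escape(second)}</prosody>'
        "</speak>"
    )


payload = {
    "model": "amazon.polly-neural",
    "voice": "Joanna",
    # Assumption: SSML goes in the same field as plain text.
    "input": to_ssml("Terms & conditions apply.", "Please read them carefully."),
    "response_format": "mp3",
}
print(payload["input"])
```

Escaping the text matters because a bare `&` or `<` in the spoken content would otherwise make the SSML invalid.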

Performance Tips: Optimize Speed & Cost

  • Use native Polly formats (mp3, ogg, opus, pcm) to skip server-side conversion
  • Specify a Polly voice ID to bypass language detection—faster responses, no AWS Comprehend charges
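
The first tip can be checked client-side before sending a request. The sketch below copies the native versus encoded split from the Feature Compatibility table; the function name `needs_transcoding` is hypothetical, and the lowercase `response_format` spellings for `pcm` and `opus` are assumptions.

```python
# Split taken from the Feature Compatibility table on this page.
NATIVE_FORMATS = {"mp3", "pcm", "opus", "ogg"}
TRANSCODED_FORMATS = {"aac", "flac", "wav"}


def needs_transcoding(response_format):
    """Return True if the server has to convert Polly's output to this format."""
    fmt = response_format.lower()
    if fmt in NATIVE_FORMATS:
        return False
    if fmt in TRANSCODED_FORMATS:
        return True
    raise ValueError(f"unknown response_format: {response_format!r}")


for fmt in ("mp3", "wav"):
    print(fmt, "transcoded" if needs_transcoding(fmt) else "native")
```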

Provider-Specific Parameters

Unlock advanced AWS Polly capabilities by passing provider-specific parameters directly in your requests. These parameters are forwarded to AWS Polly's synthesize_speech API and allow you to access features unique to Polly.

How It Works:

Add provider-specific fields at the top level of your request body alongside standard OpenAI parameters. The API automatically forwards these to AWS Polly.
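
In code, "top level" means the Polly fields sit next to `model`, `voice`, and `input` rather than under a nested key. A minimal sketch, with a hypothetical helper named `with_polly_params`:

```python
def with_polly_params(payload, **polly_params):
    """Return a copy of `payload` with Polly fields added at the top level."""
    clash = set(payload) & set(polly_params)
    if clash:
        # Refuse to silently overwrite a standard field such as "voice".
        raise ValueError(f"fields already present: {sorted(clash)}")
    return {**payload, **polly_params}


base = {"model": "amazon.polly-neural", "voice": "Joanna", "input": "Hello"}
body = with_polly_params(base, SampleRate="24000")
print(body)
```

The original `base` dictionary is left untouched, so it can be reused for requests that need different Polly options.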

Examples:

Lexicon Support:

Apply custom pronunciation lexicons to your speech synthesis:

{
  "model": "amazon.polly-neural",
  "voice": "Joanna",
  "input": "AWS Polly uses lexicons for custom pronunciation.",
  "response_format": "mp3",
  "LexiconNames": ["MyCustomLexicon"]
}

Sample Rate:

Specify custom audio sample rate (8000, 16000, 22050, 24000, 44100, or 48000 Hz):

{
  "model": "amazon.polly-neural",
  "voice": "Matthew",
  "input": "High quality audio at 24kHz.",
  "response_format": "mp3",
  "SampleRate": "24000"
}

Language Code:

Specify the language for bilingual voices (only useful for voices that support multiple languages):

{
  "model": "amazon.polly-standard",
  "voice": "Aditi",
  "input": "Hello, how are you?",
  "response_format": "mp3",
  "LanguageCode": "en-IN"
}

Configuration Options:

Option 1: Per-Request

Add provider-specific parameters directly in your request body, as shown in the examples above.

Option 2: Server-Wide Defaults

Configure default parameters for specific models via the DEFAULT_MODEL_PARAMS environment variable:

export DEFAULT_MODEL_PARAMS='{
  "amazon.polly-neural": {
    "SampleRate": "24000"
  }
}'

Note: Per-request parameters override server-wide defaults.
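
The precedence rule can be modelled as a dictionary merge. This is a sketch of the documented behavior, not the server's actual code; the function name `effective_params` and its `environ` argument are hypothetical.

```python
import json
import os


def effective_params(model, request_params, environ=None):
    """Merge server-wide defaults for `model` with per-request parameters."""
    environ = os.environ if environ is None else environ
    defaults_by_model = json.loads(environ.get("DEFAULT_MODEL_PARAMS", "{}"))
    # Later keys win, so per-request values override the defaults.
    return {**defaults_by_model.get(model, {}), **request_params}


env = {"DEFAULT_MODEL_PARAMS": '{"amazon.polly-neural": {"SampleRate": "24000"}}'}
print(effective_params("amazon.polly-neural", {}, env))
print(effective_params("amazon.polly-neural", {"SampleRate": "16000"}, env))
print(effective_params("amazon.polly-standard", {"LanguageCode": "en-IN"}, env))
```

The first call picks up the default sample rate, the second overrides it, and the third shows that defaults for one model do not leak into another.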

Behavior:

  • Compatible parameters: Forwarded to Polly and applied
  • ⚠️ Unsupported parameters: Return HTTP 400 with an error message

Available Parameters:

The following parameters from the AWS Polly SynthesizeSpeech API can be used:

  • LexiconNames (list): Apply pronunciation lexicons
  • SampleRate (string): Audio sample rate in Hz
  • LanguageCode (string): Language code for bilingual voices only (e.g., en-IN, hi-IN)
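
Because unsupported parameters come back as HTTP 400, it can be convenient to check names and types before sending. The sketch below uses only the three parameters and types listed above; the function name `check_polly_params` is hypothetical, and the server's own validation may be stricter.

```python
# Names and types taken from the list above.
EXPECTED_TYPES = {"LexiconNames": list, "SampleRate": str, "LanguageCode": str}


def check_polly_params(params):
    """Return a list of problems; an empty list means the params look forwardable."""
    problems = []
    for name, value in params.items():
        expected = EXPECTED_TYPES.get(name)
        if expected is None:
            problems.append(f"{name}: not in the documented parameter list")
        elif not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


print(check_polly_params({"SampleRate": 24000, "Engine": "neural"}))
print(check_polly_params({"LexiconNames": ["MyCustomLexicon"]}))
```

Note that `SampleRate` is a string, so passing the integer `24000` is reported as a problem.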

Try It Now

Stream audio as bytes (default):

curl -OJ -X POST "$BASE/v1/audio/speech" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "amazon.polly-neural",
    "voice": "Amy",
    "input": "Welcome to the future of voice technology!",
    "response_format": "mp3"
  }'
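
A rough Python equivalent of the curl call above, using only the standard library. The function names are hypothetical; the environment variable names `BASE` and `OPENAI_API_KEY` mirror the shell variables in the curl example.

```python
import json
import shutil
import urllib.request


def build_request(base_url, api_key, payload):
    """Create the POST request for /v1/audio/speech without sending it."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_speech(request, path, chunk_size=8192):
    """Stream the response body to `path` without holding it all in memory."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)


payload = {
    "model": "amazon.polly-neural",
    "voice": "Amy",
    "input": "Welcome to the future of voice technology!",
    "response_format": "mp3",
}
```

To run it, call `save_speech(build_request(os.environ["BASE"], os.environ["OPENAI_API_KEY"], payload), "speech.mp3")` after `import os`.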

Stream audio as SSE events:

curl -N -X POST "$BASE/v1/audio/speech" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "amazon.polly-neural",
    "voice": "Amy",
    "input": "This audio streams as SSE events!",
    "response_format": "mp3",
    "stream_format": "sse"
  }'
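
This page does not describe the SSE event schema. The sketch below assumes it follows OpenAI's speech streaming events: `speech.audio.delta` events carrying a base64-encoded `audio` field, followed by a final `speech.audio.done` event. Treat both event names and the field name as assumptions to verify against real output from `curl -N`. The function name `audio_from_sse` is hypothetical.

```python
import base64
import json


def audio_from_sse(lines):
    """Concatenate audio bytes from an iterable of decoded SSE lines."""
    audio = bytearray()
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if not data or data == "[DONE]":
            continue
        event = json.loads(data)
        if event.get("type") == "speech.audio.delta":  # assumed event name
            audio += base64.b64decode(event["audio"])  # assumed field name
        elif event.get("type") == "speech.audio.done":  # assumed event name
            break
    return bytes(audio)


def delta(chunk):
    """Build one sample SSE line for the demonstration below."""
    encoded = base64.b64encode(chunk).decode("ascii")
    return "data: " + json.dumps({"type": "speech.audio.delta", "audio": encoded})


sample = [delta(b"abc"), "", delta(b"def"), "", 'data: {"type": "speech.audio.done"}']
print(audio_from_sse(sample))  # b'abcdef'
```

In a real client, `lines` would be the decoded lines of the streaming HTTP response, and the returned bytes would be written to a file in the requested `response_format`.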

Ready to add voice to your application? Explore available voices and models in the Models API.