Text to Speech API

Generate natural-sounding speech from text with AWS Polly through an OpenAI-compatible interface.

Why Choose Text to Speech?

  • Global Support
    30+ languages supported. Choose from Standard, Neural, Generative, and Long-Form engines.

  • 60+ Voices
    From professional narration to conversational voices. Use OpenAI voice names with automatic language detection, or specify any Polly voice ID directly.

  • Automatic Language Detection
    Using OpenAI voice names? AWS Comprehend automatically detects your content's language and selects an appropriate Polly voice—matching language, gender, and quality.

  • Advanced Control with SSML
    Fine-tune pronunciation, emphasis, pauses, and prosody with SSML markup for complex audio requirements.

Quick Start: Available Endpoint

Endpoint | Method | What It Does | Powered By
/v1/audio/speech | POST | Turn text into natural-sounding speech | AWS Polly + AWS Comprehend

Feature Compatibility

Feature | Notes

Voice Selection
OpenAI voice names | Mapped to Polly voices
Polly voice IDs | 60+ voices across 30+ languages
Dynamic voice selection | Selects the best Polly voice for the detected language

Input
Plain text | Standard text input
SSML markup | Fine-grained speech control

Output Formats
MP3 | Native Polly format
PCM | Native Polly format
Opus | Native Polly format
AAC | Encoded from PCM
FLAC | Encoded from PCM
WAV | Encoded from PCM
OGG (Vorbis) | Native Polly format

Control
speed parameter | 0.2x to 2.0x playback speed
Provider-specific parameters | Polly parameters that the OpenAI API does not define

Streaming
Byte streaming | Default streaming mode
SSE streaming | Event-based streaming

Usage tracking
Input text tokens | Character count (billing unit)
Output tokens | Not available

Legend:

  • Supported — Fully compatible with OpenAI API
  • Extra Feature — Enhanced capability beyond OpenAI API
  • Unsupported — Not available in this implementation

Advanced Features

AWS Polly: OpenAI-Compatible with AWS Power

Models & Voices:

  • Use the native model IDs amazon.polly-standard, amazon.polly-neural, amazon.polly-long-form, or amazon.polly-generative
  • Or use OpenAI model names directly: tts-1 maps to amazon.polly-standard and tts-1-hd maps to amazon.polly-neural, both out of the box
  • OpenAI voice names work with automatic language detection and intelligent voice selection
  • Or specify any Polly voice ID directly for 60+ voices across 30+ languages

OpenAI Model Compatibility

stdapi.ai includes built-in model aliases that map OpenAI model names to AWS Polly engines:

  • tts-1 → amazon.polly-standard
  • tts-1-hd → amazon.polly-neural

These aliases enable seamless compatibility with OpenAI-based tools and applications without any configuration changes. You can also customize or override these aliases to suit your needs.
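
The alias behaviour described above can be pictured as a simple lookup. The sketch below is illustrative only: `MODEL_ALIASES` and `resolve_model` are hypothetical names, not stdapi.ai internals, and the server's real resolution logic may differ.

```python
# Hypothetical sketch of alias resolution; names are illustrative only.
MODEL_ALIASES = {
    "tts-1": "amazon.polly-standard",
    "tts-1-hd": "amazon.polly-neural",
}


def resolve_model(name: str, aliases: dict = MODEL_ALIASES) -> str:
    """Return the Polly model ID for an OpenAI alias, or the name unchanged."""
    return aliases.get(name, name)


print(resolve_model("tts-1-hd"))                 # amazon.polly-neural
print(resolve_model("amazon.polly-generative"))  # amazon.polly-generative
```

Names that are not aliases pass through untouched, which is why native model IDs and OpenAI model names can be mixed freely in the same client.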

Enhanced Features:

  • SSML Support: Fine-grained control over pronunciation, emphasis, pauses, and prosody (see the AWS Polly SSML documentation)
  • Flexible Formats: mp3, ogg, opus, pcm, wav, flac, aac (some transcoded server-side via ffmpeg)
  • Streaming Options: Raw bytes (default) or SSE events with stream_format: "sse"
  • Speed Control: Adjust playback from 0.25x to 4.0x
  • Character-Based Billing: Usage tracks character counts—the native billing unit for AWS Polly and AWS Comprehend—rather than OpenAI-style tokens
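
As a rough illustration of the SSML support listed above, the sketch below builds a small SSML document and places it in a request body. It assumes that SSML is sent in the standard `input` field wrapped in `<speak>` tags, which this page does not spell out, and `build_ssml` is a hypothetical helper. The `<break>` and `<prosody>` tags are standard SSML elements supported by Polly.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_ssml(intro: str, detail: str, pause_ms: int = 500) -> str:
    """Wrap two sentences in SSML with a pause and a slower second part."""
    return (
        "<speak>"
        f"{escape(intro)}"                       # escape &, <, > in user text
        f'<break time="{pause_ms}ms"/>'
        f'<prosody rate="slow">{escape(detail)}</prosody>'
        "</speak>"
    )


ssml = build_ssml("Terms & conditions apply.", "Please listen carefully.")
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed XML

# Assumption: the SSML document is passed as the regular "input" value.
payload = {
    "model": "amazon.polly-neural",
    "voice": "Joanna",
    "input": ssml,
    "response_format": "mp3",
}
print(payload["input"])
```

Escaping user-supplied text before embedding it matters because a stray `&` or `<` would otherwise make the SSML invalid.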

Performance Tips: Optimize Speed & Cost

  • Use native Polly formats (mp3, ogg, opus, pcm) to skip server-side conversion
  • Specify a Polly voice ID to bypass language detection—faster responses, no AWS Comprehend charges
  • Configure a default language via DEFAULT_TTS_LANGUAGE environment variable to skip language detection for all requests using OpenAI voice names

Language Detection Behavior

When using OpenAI voice names without specifying a default language, the system analyzes only the first 500 characters of your text to detect the language. This approach:

  • Works best with long, single-language texts where the first 500 characters are representative
  • May be inconsistent with very short texts (< 100 characters) where language detection has limited context
  • Can produce mixed results with multi-language content where different parts use different languages

For consistent behavior across requests, consider:

  • Setting DEFAULT_TTS_LANGUAGE for applications serving primarily one language
  • Using Polly voice IDs directly when you know the target language
  • Structuring multi-language applications to make separate API calls per language
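
The recommendations above can be sketched on the client side as follows. This is a minimal illustration, not the server's detection code: `detection_sample` and `per_language_payloads` are hypothetical helpers, and the voice IDs (`Joanna`, `Lea`) are examples chosen for this sketch.

```python
DETECTION_WINDOW = 500  # characters inspected for language detection, per the text above


def detection_sample(text: str, window: int = DETECTION_WINDOW) -> str:
    """Return the part of the text that language detection would look at."""
    return text[:window]


def per_language_payloads(segments, model="amazon.polly-neural"):
    """Build one request body per (voice_id, text) segment.

    Explicit Polly voice IDs mean no language detection is needed.
    """
    return [
        {"model": model, "voice": voice, "input": text, "response_format": "mp3"}
        for voice, text in segments
    ]


segments = [
    ("Joanna", "Welcome to our service."),     # US English voice
    ("Lea", "Bienvenue dans notre service."),  # French voice
]
payloads = per_language_payloads(segments)
print(len(detection_sample("x" * 800)))  # 500
print([p["voice"] for p in payloads])    # ['Joanna', 'Lea']
```

Checking `detection_sample` on real content is a quick way to see whether the opening of a document is representative of the rest.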

Provider-Specific Parameters

Unlock advanced AWS Polly capabilities by passing provider-specific parameters directly in your requests. These parameters are forwarded to AWS Polly's synthesize_speech API and allow you to access features unique to Polly.

How It Works:

Add provider-specific fields at the top level of your request body alongside standard OpenAI parameters. The API automatically forwards these to AWS Polly.

Examples:

Lexicon Support:

Apply custom pronunciation lexicons to your speech synthesis:

{
  "model": "amazon.polly-neural",
  "voice": "Joanna",
  "input": "AWS Polly uses lexicons for custom pronunciation.",
  "response_format": "mp3",
  "LexiconNames": ["MyCustomLexicon"]
}

Sample Rate:

Specify custom audio sample rate (8000, 16000, 22050, 24000, 44100, or 48000 Hz):

{
  "model": "amazon.polly-neural",
  "voice": "Matthew",
  "input": "High quality audio at 24kHz.",
  "response_format": "mp3",
  "SampleRate": "24000"
}

Language Code:

Specify the language for bilingual voices (only useful for voices that support multiple languages):

{
  "model": "amazon.polly-standard",
  "voice": "Aditi",
  "input": "Hello, how are you?",
  "response_format": "mp3",
  "LanguageCode": "en-IN"
}

Configuration Options:

Option 1: Per-Request

Add provider-specific parameters directly in your request body (as shown in examples above).

Option 2: Server-Wide Defaults

Configure default parameters for specific models via the DEFAULT_MODEL_PARAMS environment variable:

export DEFAULT_MODEL_PARAMS='{
  "amazon.polly-neural": {
    "SampleRate": "24000"
  }
}'

Note: Per-request parameters override server-wide defaults.
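
The precedence rule in the note above can be pictured as a dictionary merge in which request values win. The sketch below is an assumption about the semantics, not the server's code: `effective_params` is a hypothetical function, and the real implementation may merge differently (for example, deeply rather than shallowly).

```python
import json


def effective_params(model, request_params, default_model_params_json):
    """Merge server-wide defaults for a model with per-request parameters."""
    all_defaults = json.loads(default_model_params_json or "{}")
    defaults = all_defaults.get(model, {})
    # Later keys win, so per-request values override the defaults.
    return {**defaults, **request_params}


env_value = '{"amazon.polly-neural": {"SampleRate": "24000"}}'

print(effective_params("amazon.polly-neural", {}, env_value))
# {'SampleRate': '24000'}
print(effective_params("amazon.polly-neural", {"SampleRate": "16000"}, env_value))
# {'SampleRate': '16000'}
print(effective_params("amazon.polly-standard", {}, env_value))
# {}
```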

Behavior:

  • Compatible parameters: Forwarded to Polly and applied
  • ⚠️ Unsupported parameters: Return HTTP 400 with an error message

Available Parameters:

The following parameters from the AWS Polly SynthesizeSpeech API can be used:

  • LexiconNames (list): Apply pronunciation lexicons
  • SampleRate (string): Audio sample rate in Hz
  • LanguageCode (string): Language code for bilingual voices only (e.g., en-IN, hi-IN)
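
Because unsupported parameters are rejected with HTTP 400, it can help to check a request body before sending it. The sketch below is a hypothetical client-side check based only on the parameter list above. `STANDARD_FIELDS` is an assumed set of OpenAI-style fields, and the server's own validation may accept or reject more than this does.

```python
# Types taken from the parameter list above.
EXPECTED_TYPES = {"LexiconNames": list, "SampleRate": str, "LanguageCode": str}

# Assumption: the standard request fields used elsewhere on this page.
STANDARD_FIELDS = {"model", "voice", "input", "response_format", "speed", "stream_format"}


def check_payload(payload: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key, value in payload.items():
        if key in STANDARD_FIELDS:
            continue
        expected = EXPECTED_TYPES.get(key)
        if expected is None:
            problems.append(f"{key}: not a documented provider-specific parameter")
        elif not isinstance(value, expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


print(check_payload({"model": "amazon.polly-neural", "voice": "Matthew",
                     "input": "Hi", "SampleRate": 24000}))
# ['SampleRate: expected str, got int']
```

Note that `SampleRate` is a string in the examples on this page, so passing the integer `24000` is flagged here.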

Try It Now

Stream audio as bytes (default):

curl -OJ -X POST "$BASE/v1/audio/speech" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "amazon.polly-neural",
    "voice": "Amy",
    "input": "Welcome to the future of voice technology!",
    "response_format": "mp3"
  }'
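
For reference, the same byte-streaming request can be sketched in Python using only the standard library. The base URL `https://api.example.com` and the key `sk-placeholder` are placeholders, and `build_speech_request` and `save_speech` are hypothetical helper names. The network call is left commented out so the snippet runs without a server.

```python
import json
import urllib.request


def build_speech_request(base_url: str, api_key: str, payload: dict):
    """Prepare a POST request for the /v1/audio/speech endpoint."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_speech(request, path: str, chunk_size: int = 8192) -> None:
    """Send the request and write the audio bytes to a file as they arrive."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while chunk := response.read(chunk_size):
            out.write(chunk)


request = build_speech_request(
    "https://api.example.com",  # placeholder for $BASE
    "sk-placeholder",           # placeholder for $OPENAI_API_KEY
    {
        "model": "amazon.polly-neural",
        "voice": "Amy",
        "input": "Welcome to the future of voice technology!",
        "response_format": "mp3",
    },
)
# save_speech(request, "speech.mp3")  # uncomment to call a real server
```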

Stream audio as SSE events:

curl -N -X POST "$BASE/v1/audio/speech" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "amazon.polly-neural",
    "voice": "Amy",
    "input": "This audio streams as SSE events!",
    "response_format": "mp3",
    "stream_format": "sse"
  }'

Ready to add voice to your application? Explore available voices and models in the Models API.