# Speech to Text API
Transcribe audio to text with AWS Transcribe or AWS Bedrock audio-capable models through an OpenAI-compatible interface.
## Why Choose Speech to Text?

- **Multiple Transcription Options**: Choose AWS Transcribe for 100+ languages with speaker diarization, or use Bedrock audio models for advanced capabilities.
- **Real-Time or Batch**: Stream transcriptions in real time via SSE, or process files efficiently with either service.
- **Subtitle Generation**: Generate SRT and VTT subtitle files directly, with precise timing for video content.
- **Advanced Features**: Speaker diarization, word-level timestamps, and automatic language detection. Feature availability varies by model choice.
## Quick Start: Available Endpoint

| Endpoint | Method | What It Does | Powered By |
|---|---|---|---|
| `/v1/audio/transcriptions` | POST | Convert spoken audio to written text | AWS Transcribe or AWS Bedrock audio models |
## Feature Compatibility

| Feature | Status | Notes |
|---|---|---|
| **Input** | | |
| Audio file upload | | Multipart file upload |
| **Output Formats** | | |
| `json` | | Structured transcription |
| `text` | | Plain text output |
| `verbose_json` | | With timestamps and details |
| `diarized_json` | | With speaker identification |
| `srt` | | Subtitle format with timing |
| `vtt` | | WebVTT subtitle format |
| **Language** | | |
| Language specification | | ISO 639-1 language codes |
| Auto language detection | | Automatic identification |
| **Streaming** | | |
| SSE streaming | | Event-based streaming |
| **Advanced** | | |
| Timestamp granularity | | Word or segment level |
| Speaker diarization | | Automatic speaker separation |
| `known_speaker_names` | Unsupported | Not available |
| `known_speaker_references` | Unsupported | Not available |
| `chunking_strategy` | Partial | Only `auto` is supported |
| `temperature` | | Model temperature |
| `prompt` | | Extra transcription prompt |
| `logprobs` | | |
| **Usage Tracking** | | |
| Input audio duration | | Seconds (billing unit on AWS Transcribe) |
| Output text tokens | | On Bedrock models |
Legend:
- Supported — Fully compatible with OpenAI API
- Available on Select Models — Check your model's capabilities
- Partial — Supported with limitations
- Unsupported — Not available in this implementation
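To make the subtitle formats above concrete, here is a minimal Python sketch that converts a `verbose_json`-style segment list into SRT text. The segment shape (`start`/`end` seconds plus `text`) is an assumption modeled on OpenAI's `verbose_json` format, not a guaranteed payload; in practice the API can return SRT directly via `response_format=srt`.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render verbose_json-style segments as an SRT document."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)

# Example segments as they might appear in a verbose_json response
segments = [
    {"start": 0.0, "end": 2.5, "text": "Welcome to the meeting."},
    {"start": 2.5, "end": 5.0, "text": "Let's review the agenda."},
]
print(segments_to_srt(segments))
```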
## Model Support

### Amazon Models
| Model | Supported Languages | Notes |
|---|---|---|
| amazon.transcribe | 100+ | Full-featured transcription with speaker diarization and subtitle generation at the cost of higher latency |
**Configuration Required**

You must configure the `AWS_S3_BUCKET` or `AWS_TRANSCRIBE_S3_BUCKET` environment variable with a bucket in the main AWS region to use this model. This bucket is used for temporary storage during transcription processing.
### Mistral Models
| Model | Supported Languages | Notes |
|---|---|---|
| mistral.voxtral-mini-3b-2507 | 100+ | Compact model for fast transcription |
| mistral.voxtral-small-24b-2507 | 100+ | Larger model for enhanced accuracy |
**Mistral Voxtral Limitations**
Mistral Voxtral models have the following restrictions when running on AWS Bedrock:
- File size limit: ~2MB maximum input file size
- Audio channels: Mono channel audio only (single channel)
## Advanced Features

### Amazon Transcribe Features
Model & Features:

- Use `amazon.transcribe` with the same interface as OpenAI's Whisper API
- Or use the OpenAI model name directly: `whisper-1` works out of the box (maps to `amazon.transcribe`)
- Auto-detect the language, or specify it for faster processing
- Word-level or segment-level timestamps with `verbose_json`
- **Speaker Diarization**: Automatically identify and label different speakers with `diarized_json`
- **Native Subtitles**: SRT/VTT files generated directly by AWS Transcribe with precise timing
**OpenAI Model Compatibility**

stdapi.ai includes a built-in model alias that maps the OpenAI model name to AWS Transcribe:

- `whisper-1` → `amazon.transcribe`

This alias enables seamless compatibility with OpenAI-based tools and applications without any configuration changes. You can also customize or override this alias to suit your needs.
Note: The `prompt`, `temperature`, `chunking_strategy`, `known_speaker_names`, and `known_speaker_references` parameters are not supported, to ensure consistent transcription accuracy. AWS Transcribe provides automatic speaker diarization without requiring known speaker references.
**Performance Tips: Optimize Speed & Cost**

- Specify the language if you know it; skipping auto-detection gives faster processing and lower AWS costs.
## Try It Now
Transcribe audio to JSON:

```shell
curl -X POST "$BASE/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@meeting-recording.mp3 \
  -F model=amazon.transcribe \
  -F response_format=json
```
Generate subtitles:

```shell
curl -X POST "$BASE/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@video-audio.mp3 \
  -F model=amazon.transcribe \
  -F response_format=srt \
  -F language=en
```
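Once you have the SRT text back, a brief sketch of parsing it into structured cues with plain string handling. This assumes well-formed SRT output (numbered cues, `HH:MM:SS,mmm` timestamps, blank-line separators); it is a convenience sketch, not part of the API.

```python
import re

CUE_RE = re.compile(
    r"(\d+)\s*\n"                                              # cue index
    r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})" # start --> end
    r"\s*\n(.*?)(?:\n\n|\Z)",                                  # text until blank line
    re.S,
)

def parse_srt(srt_text: str) -> list[dict]:
    """Split an SRT document into cue dicts with index, start, end, text."""
    return [
        {"index": int(i), "start": start, "end": end, "text": text.strip()}
        for i, start, end, text in CUE_RE.findall(srt_text)
    ]

sample = """1
00:00:00,000 --> 00:00:02,500
Welcome to the meeting.

2
00:00:02,500 --> 00:00:05,000
Let's review the agenda.
"""
print(parse_srt(sample))
```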
Transcribe with speaker diarization:

```shell
curl -X POST "$BASE/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@meeting-recording.mp3 \
  -F model=amazon.transcribe \
  -F response_format=diarized_json
```
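To post-process a `diarized_json` response, here is a hedged sketch that groups segment text by speaker. The `segments` and `speaker` field names are assumptions about the response shape used purely for illustration; check the actual payload your deployment returns before relying on them.

```python
from collections import defaultdict

def transcript_by_speaker(response: dict) -> dict[str, str]:
    """Collect each speaker's lines from a diarized_json-style response."""
    lines = defaultdict(list)
    for seg in response.get("segments", []):
        lines[seg.get("speaker", "unknown")].append(seg["text"].strip())
    return {speaker: " ".join(texts) for speaker, texts in lines.items()}

# Illustrative response shape (assumed, not an exact API payload)
response = {
    "segments": [
        {"speaker": "spk_0", "text": "Shall we start?"},
        {"speaker": "spk_1", "text": "Yes, go ahead."},
        {"speaker": "spk_0", "text": "First item: the roadmap."},
    ]
}
for speaker, text in transcript_by_speaker(response).items():
    print(f"{speaker}: {text}")
```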
Ready to transcribe audio? Explore available transcription models in the Models API.