# Speech to Text API
Transcribe audio to text with AWS Transcribe or AWS Bedrock audio-capable models through an OpenAI-compatible interface.
## Why Choose Speech to Text?

- **Multiple Transcription Options**: Choose AWS Transcribe for 100+ languages with speaker diarization, or use Bedrock audio models for advanced capabilities.
- **Real-Time or Batch**: Stream transcriptions in real time via SSE, or process files efficiently with either service.
- **Subtitle Generation**: Generate SRT and VTT subtitle files directly, with precise timing for video content.
- **Advanced Features**: Speaker diarization, word-level timestamps, and automatic language detection. Feature availability varies by model choice.
## Quick Start: Available Endpoint

| Endpoint | Method | What It Does | Powered By |
|---|---|---|---|
| `/v1/audio/transcriptions` | POST | Convert spoken audio to written text | AWS Transcribe or AWS Bedrock audio models |
## Feature Compatibility

| Feature | Status | Notes |
|---|---|---|
| **Input** | | |
| Audio file upload | | Multipart file upload |
| **Output Formats** | | |
| `json` | | Structured transcription |
| `text` | | Plain text output |
| `verbose_json` | | With timestamps and details |
| `diarized_json` | | With speaker identification |
| `srt` | | Subtitle format with timing |
| `vtt` | | WebVTT subtitle format |
| **Language** | | |
| Language specification | | ISO 639-1 language codes |
| Auto language detection | | Automatic identification |
| **Streaming** | | |
| SSE streaming | | Event-based streaming |
| **Advanced** | | |
| Timestamp granularity | | Word or segment level |
| Speaker diarization | | Automatic speaker separation |
| `known_speaker_names` | Unsupported | Not available |
| `known_speaker_references` | Unsupported | Not available |
| `chunking_strategy` | Partial | Only `auto` is supported |
| `temperature` | | Model temperature |
| `prompt` | | Extra transcription prompt |
| `logprobs` | | |
| **Usage Tracking** | | |
| Input audio duration | | Seconds (billing unit on AWS Transcribe) |
| Output text tokens | | On Bedrock models |
Legend:
- Supported — Fully compatible with OpenAI API
- Available on Select Models — Check your model's capabilities
- Partial — Supported with limitations
- Unsupported — Not available in this implementation
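To make the subtitle formats above concrete, here is a minimal Python sketch that converts a `verbose_json`-style segment list into SRT text. The segment shape (`start`/`end` seconds plus `text`) is an assumption modeled on OpenAI's `verbose_json` format, not a guaranteed payload; in practice the API can return SRT directly via `response_format=srt`.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render verbose_json-style segments as an SRT document."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)

# Example segments as they might appear in a verbose_json response
segments = [
    {"start": 0.0, "end": 2.5, "text": "Welcome to the meeting."},
    {"start": 2.5, "end": 5.0, "text": "Let's review the agenda."},
]
print(segments_to_srt(segments))
```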
## Model Support

### Amazon Models
| Model | Supported Languages | Notes |
|---|---|---|
| amazon.transcribe | 100+ | Full-featured transcription with speaker diarization and subtitle generation at the cost of higher latency |
**Configuration Required**

You must configure the `AWS_S3_BUCKET` or `AWS_TRANSCRIBE_S3_BUCKET` environment variable with a bucket in the main AWS region to use this model. This bucket is used for temporary storage during transcription processing.
### Mistral Models
| Model | Supported Languages | Notes |
|---|---|---|
| mistral.voxtral-mini-3b-2507 | 100+ | Compact model for fast transcription |
| mistral.voxtral-small-24b-2507 | 100+ | Larger model for enhanced accuracy |
**Mistral Voxtral Limitations**
Mistral Voxtral models have the following restrictions when running on AWS Bedrock:
- File size limit: ~2MB maximum input file size
- Audio channels: Mono channel audio only (single channel)
## Advanced Features

### Amazon Transcribe Features
Model & Features:

- Use `amazon.transcribe` with the same interface as OpenAI's Whisper API
- Or use the OpenAI model name directly: `whisper-1` works out of the box (maps to `amazon.transcribe`)
- Auto-detect the language, or specify it for faster processing
- Word-level or segment-level timestamps with `verbose_json`
- **Speaker Diarization**: Automatically identify and label different speakers with `diarized_json`
- **Native Subtitles**: SRT/VTT files generated directly by AWS Transcribe with precise timing
**OpenAI Model Compatibility**

stdapi.ai includes a built-in model alias that maps the OpenAI model name to AWS Transcribe:

- `whisper-1` → `amazon.transcribe`

This alias enables seamless compatibility with OpenAI-based tools and applications without any configuration changes. You can also customize or override this alias to suit your needs.
Note: The `prompt`, `temperature`, `chunking_strategy`, `known_speaker_names`, and `known_speaker_references` parameters are not supported, to ensure consistent transcription accuracy. AWS Transcribe provides automatic speaker diarization without requiring known speaker references.
**Performance Tips: Optimize Speed & Cost**

- Specify the language if you know it; skipping auto-detection gives faster processing and lower AWS costs.
## Try It Now
Transcribe audio to JSON:

```shell
curl -X POST "$BASE/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@meeting-recording.mp3 \
  -F model=amazon.transcribe \
  -F response_format=json
```
Generate subtitles:

```shell
curl -X POST "$BASE/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@video-audio.mp3 \
  -F model=amazon.transcribe \
  -F response_format=srt \
  -F language=en
```
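Once you have the SRT text back, a brief sketch of parsing it into structured cues with plain string handling. This assumes well-formed SRT output (numbered cues, `HH:MM:SS,mmm` timestamps, blank-line separators); it is a convenience sketch, not part of the API.

```python
import re

CUE_RE = re.compile(
    r"(\d+)\s*\n"                                              # cue index
    r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})" # start --> end
    r"\s*\n(.*?)(?:\n\n|\Z)",                                  # text until blank line
    re.S,
)

def parse_srt(srt_text: str) -> list[dict]:
    """Split an SRT document into cue dicts with index, start, end, text."""
    return [
        {"index": int(i), "start": start, "end": end, "text": text.strip()}
        for i, start, end, text in CUE_RE.findall(srt_text)
    ]

sample = """1
00:00:00,000 --> 00:00:02,500
Welcome to the meeting.

2
00:00:02,500 --> 00:00:05,000
Let's review the agenda.
"""
print(parse_srt(sample))
```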
Transcribe with speaker diarization:

```shell
curl -X POST "$BASE/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@meeting-recording.mp3 \
  -F model=amazon.transcribe \
  -F response_format=diarized_json
```
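To post-process a `diarized_json` response, here is a hedged sketch that groups segment text by speaker. The `segments` and `speaker` field names are assumptions about the response shape used purely for illustration; check the actual payload your deployment returns before relying on them.

```python
from collections import defaultdict

def transcript_by_speaker(response: dict) -> dict[str, str]:
    """Collect each speaker's lines from a diarized_json-style response."""
    lines = defaultdict(list)
    for seg in response.get("segments", []):
        lines[seg.get("speaker", "unknown")].append(seg["text"].strip())
    return {speaker: " ".join(texts) for speaker, texts in lines.items()}

# Illustrative response shape (assumed, not an exact API payload)
response = {
    "segments": [
        {"speaker": "spk_0", "text": "Shall we start?"},
        {"speaker": "spk_1", "text": "Yes, go ahead."},
        {"speaker": "spk_0", "text": "First item: the roadmap."},
    ]
}
for speaker, text in transcript_by_speaker(response).items():
    print(f"{speaker}: {text}")
```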
Ready to transcribe audio? Explore available transcription models in the Models API.