Speech to Text API

Transcribe audio to text with AWS Transcribe or AWS Bedrock audio-capable models through an OpenAI-compatible interface.

Why Choose Speech to Text?

  • Multiple Transcription Options
    Choose AWS Transcribe for 100+ languages with speaker diarization, or use Bedrock audio models for advanced capabilities.

  • Real-Time or Batch
    Stream transcriptions in real time via SSE, or process files efficiently with either service.

  • Subtitle Generation
    Generate SRT and VTT subtitle files directly with precise timing for video content.

  • Advanced Features
    Speaker diarization, word-level timestamps, and automatic language detection. Feature availability varies by model choice.
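
When streaming is enabled, results arrive as server-sent events: `data:` lines separated by blank lines. A minimal sketch of consuming such a stream; the `text` field and the `[DONE]` sentinel are assumptions based on OpenAI's streaming conventions, not guarantees from this page:

```python
import json

def parse_sse_events(lines):
    """Yield the JSON payload of each SSE `data:` line.

    `lines` is any iterable of decoded text lines, e.g. an HTTP
    response body. Assumes one JSON object per data line, in the
    OpenAI streaming style (an assumption, not confirmed here).
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and event:/id:/comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        yield json.loads(payload)

# Example with a canned stream:
sample = [
    'data: {"text": "Hello"}',
    '',
    'data: {"text": " world"}',
    '',
    'data: [DONE]',
]
transcript = "".join(evt["text"] for evt in parse_sse_events(sample))
```

Inspect a real streamed response to confirm the exact event payload shape before relying on this.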

Quick Start: Available Endpoint

Endpoint | Method | What It Does | Powered By
/v1/audio/transcriptions | POST | Convert spoken audio to written text | AWS Transcribe or AWS Bedrock Audio Models

Feature Compatibility

Feature | Status | Notes
Input
Audio file upload | Supported | Multipart file upload
Output Formats
json | Supported | Structured transcription
text | Supported | Plain text output
verbose_json | Supported | With timestamps and details
diarized_json | Supported | With speaker identification
srt | Supported | Subtitle format with timing
vtt | Supported | WebVTT subtitle format
Language
Language specification | Supported | ISO-639-1 language codes
Auto language detection | Supported | Automatic identification
Streaming
SSE streaming | Supported | Event-based streaming
Advanced
Timestamp granularity | Supported | Word or segment level
Speaker diarization | Available on Select Models | Automatic speaker separation
known_speaker_names | Unsupported | Not available
known_speaker_references | Unsupported | Not available
chunking_strategy | Partial | Only auto is supported
temperature | Available on Select Models | Model temperature
prompt | Available on Select Models | Extra transcription prompt
logprobs | Unsupported |
Usage tracking
Input audio duration | Supported | Seconds (billing unit on AWS Transcribe)
Output text tokens | Available on Select Models | On models from Bedrock

Legend:

  • Supported — Fully compatible with OpenAI API
  • Available on Select Models — Check your model's capabilities
  • Partial — Supported with limitations
  • Unsupported — Not available in this implementation

Model Support

Amazon Models

Model | Supported Languages | Notes
amazon.transcribe | 100+ | Full-featured transcription with speaker diarization and subtitle generation, at the cost of higher latency

Configuration Required

You must configure the AWS_S3_BUCKET or AWS_TRANSCRIBE_S3_BUCKET environment variable with a bucket in the main AWS region to use this model. This bucket is used for temporary storage during transcription processing.

Mistral Models

Model | Supported Languages | Notes
mistral.voxtral-mini-3b-2507 | 100+ | Compact model for fast transcription
mistral.voxtral-small-24b-2507 | 100+ | Larger model for enhanced accuracy

Mistral Voxtral Limitations

Mistral Voxtral models have the following restrictions when running on AWS Bedrock:

  • File size limit: ~2MB maximum input file size
  • Audio channels: Mono channel audio only (single channel)
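
Given these limits, it can help to pre-check a file locally before uploading it to a Voxtral model. A standard-library sketch for WAV input only; the 2 MB threshold mirrors the approximate limit above, and other container formats would need a different probe:

```python
import io
import wave

MAX_BYTES = 2 * 1024 * 1024  # approximate Bedrock Voxtral input limit

def check_voxtral_wav(data: bytes) -> list[str]:
    """Return a list of problems that would block a Voxtral upload."""
    problems = []
    if len(data) > MAX_BYTES:
        problems.append(f"file is {len(data)} bytes, over the ~2MB limit")
    with wave.open(io.BytesIO(data), "rb") as w:
        if w.getnchannels() != 1:
            problems.append("audio must be mono (single channel)")
    return problems

# Example: a short in-memory mono WAV passes both checks.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)       # mono, as Voxtral requires
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)  # one second of silence
```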

Advanced Features

Amazon Transcribe Features

Model & Features:

  • Use amazon.transcribe with the same interface as OpenAI's Whisper API
  • Or use the OpenAI model name directly: whisper-1 works out of the box (maps to amazon.transcribe)
  • Auto-detect the language, or specify it for faster processing
  • Word-level or segment-level timestamps with verbose_json
  • Speaker Diarization: Automatically identify and label different speakers with diarized_json
  • Native Subtitles: SRT/VTT files generated directly by AWS Transcribe with precise timing
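
The verbose_json output can be post-processed client-side. A sketch that pulls words from a time window, assuming the OpenAI-style verbose_json shape (a top-level `words` list with `word`, `start`, and `end` in seconds); verify against a real response:

```python
def words_in_range(resp: dict, start_s: float, end_s: float) -> list[str]:
    """Return the words whose timestamps fall inside [start_s, end_s].

    Assumes OpenAI-style verbose_json with a `words` list whose
    entries carry `word`, `start`, and `end` in seconds.
    """
    return [
        w["word"]
        for w in resp.get("words", [])
        if w["start"] >= start_s and w["end"] <= end_s
    ]

# Example with a canned response:
resp = {
    "text": "hello world again",
    "words": [
        {"word": "hello", "start": 0.0, "end": 0.4},
        {"word": "world", "start": 0.5, "end": 0.9},
        {"word": "again", "start": 1.2, "end": 1.6},
    ],
}
```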

OpenAI Model Compatibility

stdapi.ai includes a built-in model alias that maps the OpenAI model name to AWS Transcribe:

  • whisper-1 → amazon.transcribe

This alias enables seamless compatibility with OpenAI-based tools and applications without any configuration changes. You can also customize or override this alias to suit your needs.

Note: The prompt, temperature, chunking_strategy, known_speaker_names, and known_speaker_references parameters are not supported to ensure consistent transcription accuracy. AWS Transcribe provides automatic speaker diarization without requiring known speaker references.

Performance Tips: Optimize Speed & Cost

  • Specify the language if you know it; this skips auto-detection for faster processing and lower AWS costs

Try It Now

Transcribe audio to JSON:

curl -X POST "$BASE/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@meeting-recording.mp3 \
  -F model=amazon.transcribe \
  -F response_format=json

Generate subtitles:

curl -X POST "$BASE/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@video-audio.mp3 \
  -F model=amazon.transcribe \
  -F response_format=srt \
  -F language=en
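
Both subtitle formats can be requested directly, but if you already have SRT output, converting it to WebVTT is mostly mechanical: prepend a `WEBVTT` header and switch the millisecond separator in cue timings from a comma to a period. A minimal sketch that ignores SRT edge cases such as styling tags:

```python
import re

def srt_to_vtt(srt: str) -> str:
    """Convert SRT cue timings (00:00:01,000) to VTT (00:00:01.000)."""
    body = re.sub(
        r"(\d{2}:\d{2}:\d{2}),(\d{3})",  # match commas only inside timestamps
        r"\1.\2",
        srt,
    )
    return "WEBVTT\n\n" + body

# Example cue; commas in the subtitle text itself are preserved.
srt = """1
00:00:00,000 --> 00:00:02,500
Hello, world.
"""
vtt = srt_to_vtt(srt)
```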

Transcribe with speaker diarization:

curl -X POST "$BASE/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@meeting-recording.mp3 \
  -F model=amazon.transcribe \
  -F response_format=diarized_json
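
A diarized transcript is often easier to read grouped by speaker. A sketch of that grouping; the segment shape used here (`segments` entries with `speaker` and `text`) is a hypothetical example, so inspect a real diarized_json response for the exact field names:

```python
from collections import defaultdict

def lines_by_speaker(resp: dict) -> dict[str, list[str]]:
    """Group segment texts by speaker label.

    The `segments`/`speaker`/`text` field names are assumed for
    illustration, not taken from a documented schema.
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for seg in resp.get("segments", []):
        grouped[seg["speaker"]].append(seg["text"].strip())
    return dict(grouped)

# Example with a canned response:
resp = {
    "segments": [
        {"speaker": "spk_0", "text": "Shall we start?"},
        {"speaker": "spk_1", "text": "Yes, go ahead."},
        {"speaker": "spk_0", "text": "Great."},
    ],
}
```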

Ready to transcribe audio? Explore available transcription models in the Models API.