Speech to Text API

Transcribe audio to text with AWS Transcribe or AWS Bedrock audio-capable models through an OpenAI-compatible interface.

Why Choose Speech to Text?

  • Multiple Transcription Options
    Choose AWS Transcribe for 100+ languages with speaker diarization, or use Bedrock audio models for advanced capabilities.

  • Real-Time or Batch
    Stream transcriptions in real time via SSE, or process complete files in batch; both services support either mode.

  • Subtitle Generation
    Generate SRT and VTT subtitle files directly with precise timing for video content.

  • Advanced Features
    Speaker diarization, word-level timestamps, and automatic language detection. Feature availability varies by model choice.

Quick Start: Available Endpoint

Endpoint | Method | What It Does | Powered By | MCP Tool
/v1/audio/transcriptions | POST | Convert spoken audio to written text | AWS Transcribe or AWS Bedrock Audio Models | openai_audio_transcription

Feature Compatibility

Feature | Status | Notes
Input
Audio file upload | | Multipart file upload
JSON body input | | Base64, data URI, HTTPS URL, or S3 URI (for MCP / AI agents)
Output Formats
json | | Structured transcription
text | | Plain text output
verbose_json | | With timestamps and details
diarized_json | | With speaker identification
srt | | Subtitle format with timing
vtt | | WebVTT subtitle format
Language
Language specification | | ISO-639-1 language codes
Auto language detection | | Automatic identification
Streaming
SSE streaming | | Event-based streaming
Advanced
Timestamp granularity | | Word or segment level
Speaker diarization | | Automatic speaker separation
known_speaker_names | | Not available
known_speaker_references | | Not available
chunking_strategy | | Only auto is supported
temperature | | Model temperature
prompt | | Extra transcription prompt
logprobs | | Log probabilities for token-level confidence scoring
Usage tracking
Input audio duration | | Seconds (billing unit on AWS Transcribe)
Output text tokens | | On models from Bedrock

Legend:

  • Supported — Fully compatible with OpenAI API
  • Available on Select Models — Check your model's capabilities
  • Partial — Supported with limitations
  • Unsupported — Not available in this implementation

Model Support

Amazon Models

Model | Supported Languages | Notes
amazon.transcribe | 100+ | Full-featured transcription with speaker diarization and subtitle generation, at the cost of higher latency

Configuration Required

You must configure the AWS_S3_BUCKET or AWS_TRANSCRIBE_S3_BUCKET environment variable with a bucket in the main AWS region to use this model. This bucket is used for temporary storage during transcription processing.
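
A minimal sketch of that configuration, assuming the service reads its environment from the shell that launches it. The bucket name `my-transcribe-temp-bucket` is a placeholder, not a value from this documentation, and only one of the two variables needs to be set.

```shell
# Hypothetical bucket name: replace with a bucket you own in the main AWS region.
export AWS_TRANSCRIBE_S3_BUCKET="my-transcribe-temp-bucket"

# Alternatively, set the general-purpose variable instead:
# export AWS_S3_BUCKET="my-transcribe-temp-bucket"
```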

Mistral Models

Model | Supported Languages | Notes
mistral.voxtral-mini-3b-2507 | 100+ | Compact model for fast transcription
mistral.voxtral-small-24b-2507 | 100+ | Larger model for enhanced accuracy

Mistral Voxtral Limitations

Mistral Voxtral models have the following restrictions when running on AWS Bedrock:

  • File size limit: ~2MB maximum input file size
  • Audio channels: Mono channel audio only (single channel)
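
A pre-flight check along these lines can catch oversized files before a request is sent. This is a sketch under assumptions: the helper name `fits_voxtral_limit` is hypothetical, and the threshold treats the approximate ~2MB limit as 2 × 1024 × 1024 bytes, which may not match the exact cutoff Bedrock enforces. It does not inspect the channel count.

```shell
# Assumed byte threshold for the ~2MB limit (the exact cutoff is not documented here).
VOXTRAL_MAX_BYTES=$((2 * 1024 * 1024))

# Hypothetical helper: exits 0 when the file is within the assumed limit,
# 1 when it is too large, and 2 when the path is not a regular file.
fits_voxtral_limit() {
  [ -f "$1" ] || return 2
  size=$(wc -c < "$1")                      # byte count of the file
  [ "$((size))" -le "$VOXTRAL_MAX_BYTES" ]  # arithmetic expansion trims any padding
}
```

For example, `fits_voxtral_limit meeting-recording.mp3 || echo "too large for Voxtral"`. If a recording is stereo or too large, re-encoding it to mono at a lower bitrate is one option, for instance with ffmpeg: `ffmpeg -i input.mp3 -ac 1 -b:a 32k output.mp3`.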

Advanced Features

Amazon Transcribe Features

Model & Features:

  • Use amazon.transcribe with the same interface as OpenAI's Whisper API
  • Or use the OpenAI model name directly: whisper-1 works out of the box (it maps to amazon.transcribe)
  • Auto-detect language or specify it for faster processing
  • Word-level or segment-level timestamps with verbose_json
  • Speaker Diarization: Automatically identify and label different speakers with diarized_json
  • Native Subtitles: SRT/VTT files generated directly by AWS Transcribe with precise timing
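
As a sketch of the timestamp option in the list above, the request below asks for word-level timestamps. It assumes the endpoint accepts OpenAI's `timestamp_granularities[]` form field, which this page does not name explicitly. The function name `transcribe_with_word_timestamps` is hypothetical, and `$BASE` and `$OPENAI_API_KEY` are the same variables used in the examples on this page.

```shell
# Hypothetical wrapper: transcribe a file and request word-level timestamps.
# Assumes the OpenAI-style field name timestamp_granularities[] is accepted.
transcribe_with_word_timestamps() {
  curl -X POST "$BASE/v1/audio/transcriptions" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F "file=@$1" \
    -F model=amazon.transcribe \
    -F response_format=verbose_json \
    -F "timestamp_granularities[]=word"
}
```

Call it as `transcribe_with_word_timestamps meeting-recording.mp3`; swapping `word` for `segment` would request segment-level timing under the same assumption.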

OpenAI Model Compatibility

stdapi.ai includes a built-in model alias that maps the OpenAI model name to AWS Transcribe:

  • whisper-1 → amazon.transcribe

This alias enables seamless compatibility with OpenAI-based tools and applications without any configuration changes. You can also customize or override this alias to suit your needs.
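
To illustrate the alias, here is a small sketch in which the model name defaults to `whisper-1`. The helper name `transcribe` is hypothetical; `$BASE` and `$OPENAI_API_KEY` are the variables used in the examples on this page.

```shell
# Hypothetical helper: the model defaults to the whisper-1 alias, which this
# page says maps to amazon.transcribe. Pass a second argument to override it.
transcribe() {
  curl -X POST "$BASE/v1/audio/transcriptions" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F "file=@$1" \
    -F "model=${2:-whisper-1}"
}
```

`transcribe meeting-recording.mp3` sends `whisper-1`, while `transcribe meeting-recording.mp3 amazon.transcribe` names the service directly; with the built-in alias in place, both requests should be handled by AWS Transcribe.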

Note: With amazon.transcribe, the prompt, temperature, chunking_strategy, known_speaker_names, and known_speaker_references parameters are not supported, which keeps transcription accuracy consistent. AWS Transcribe provides automatic speaker diarization without requiring known speaker references.

Performance Tips: Optimize Speed & Cost

  • Specify the language if you know it: this skips auto-detection, which means faster processing and lower AWS costs

Try It Now

Transcribe audio to JSON:

curl -X POST "$BASE/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@meeting-recording.mp3 \
  -F model=amazon.transcribe \
  -F response_format=json

Transcribe via JSON body (MCP and AI agents):

When using MCP tools or HTTP clients that cannot construct multipart requests, pass the audio as a data URI or URL:

# Data URI (inline base64)
curl -X POST "$BASE/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "data:audio/mp3;base64,<base64-encoded-audio>",
    "model": "amazon.transcribe",
    "response_format": "json"
  }'

# HTTPS URL (server fetches the audio)
curl -X POST "$BASE/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "https://example.com/audio.mp3",
    "model": "amazon.transcribe"
  }'
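
The `<base64-encoded-audio>` placeholder in the data URI example above can be filled from a local file. A sketch, assuming a POSIX shell with `base64` and `tr` available; the helper name `audio_json_body` is hypothetical, and the `audio/mp3` media type simply mirrors that example.

```shell
# Hypothetical helper: print a JSON request body with the file inlined as a data URI.
# The optional second argument overrides the model name.
audio_json_body() {
  # Encode the file and strip the line wrapping that some base64 tools add.
  b64=$(base64 < "$1" | tr -d '\n')
  printf '{"file": "data:audio/mp3;base64,%s", "model": "%s", "response_format": "json"}\n' \
    "$b64" "${2:-amazon.transcribe}"
}

# Usage (not executed here): pipe the body to curl, which reads it from stdin via -d @-
#   audio_json_body meeting-recording.mp3 | curl -X POST "$BASE/v1/audio/transcriptions" \
#     -H "Authorization: Bearer $OPENAI_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d @-
```

Base64 inflates the payload by roughly a third, so for large recordings the HTTPS URL or S3 URI inputs may be the more practical choice.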

Generate subtitles:

curl -X POST "$BASE/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@video-audio.mp3 \
  -F model=amazon.transcribe \
  -F response_format=srt \
  -F language=en

Transcribe with speaker diarization:

curl -X POST "$BASE/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@meeting-recording.mp3 \
  -F model=amazon.transcribe \
  -F response_format=diarized_json
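
Stream a transcription via SSE (a sketch under assumptions):

This page lists SSE streaming as a feature but does not show the request, so the `stream=true` form field below follows OpenAI's transcription API and is assumed to be accepted here. The function names are hypothetical, and the exact event payloads are not documented on this page, so the filter only strips the standard `data: ` prefix of the SSE format. Whether a given model streams is worth checking against its capabilities.

```shell
# Hypothetical filter: read an SSE stream on stdin and print each data payload,
# stopping at an OpenAI-style [DONE] sentinel if one is sent.
sse_data() {
  while IFS= read -r line; do
    line=${line%"$(printf '\r')"}   # tolerate CRLF line endings
    case $line in
      'data: [DONE]') break ;;
      'data: '*) printf '%s\n' "${line#data: }" ;;
    esac
  done
}

# Hypothetical wrapper: -N turns off curl's output buffering so events
# are printed as they arrive instead of at the end of the response.
stream_transcription() {
  curl -N -X POST "$BASE/v1/audio/transcriptions" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F "file=@$1" \
    -F model=amazon.transcribe \
    -F stream=true | sse_data
}
```

Run it as `stream_transcription meeting-recording.mp3`; each printed line is one event payload, which a JSON tool can then parse.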

Ready to transcribe audio? Explore available transcription models in the Models API.