Resilience & Failover¶
stdapi.ai on AWS is designed for high availability at every layer — from intelligent multi-region request routing to the underlying infrastructure running the service. This page covers both the application-level region routing for AWS Bedrock and the infrastructure resilience built into the Terraform module.
Region Routing¶
stdapi.ai can automatically distribute Bedrock requests across your configured AWS regions. When a region becomes temporarily unavailable or hits quota limits, requests are transparently routed to another region — no client changes needed.
Multiply Your Effective Quota
Each AWS region has its own independent quota. By configuring multiple regions, your effective quota scales proportionally — with 3 regions you get approximately 3× the tokens per minute and 3× the daily token limit compared to a single-region setup.
Overview¶
Region routing activates when you have two or more regions in AWS_BEDROCK_REGIONS. The server tracks the health of each region per model and steers traffic away from regions that are returning errors.
- Quota & Throttling: triggers on ThrottlingException, TooManyRequestsException, ServiceQuotaExceededException
- Regional Unavailability: triggers on ServiceUnavailableException, InternalServerException, ModelNotReadyException
- Exponential Backoff: for quota errors, the delay doubles per consecutive error, capped at 1 hour
- Fixed Backoff: for unavailability errors, a fixed configurable delay (default 30 s)
- Configurable Retry Count: set AWS_BEDROCK_MAX_RETRIES to control total retries; requests cycle across regions in order
Routing Strategies¶
Set the strategy with AWS_BEDROCK_REGION_ROUTING:
| Strategy | Description | Prompt Caching | Default |
|---|---|---|---|
| ordered | Try regions in the order listed in AWS_BEDROCK_REGIONS, skipping any that are currently blocked | Compatible | Yes |
| lowest_latency | Prefer the region with the lowest measured round-trip latency | Compatible | |
| round_robin | Distribute requests evenly across available regions | Not compatible | |
| disabled | No routing; each model uses its primary region only | Compatible | |
Ordered (default)¶
Regions are tried in the order they appear in AWS_BEDROCK_REGIONS. The first healthy region wins. This is the best choice when you want predictable routing and prompt caching, since requests for a given model tend to land on the same region as long as it is healthy.
Lowest Latency¶
At startup the server measures round-trip latency to each region and prefers the fastest one. If that region becomes blocked, the next-fastest is used. Good for latency-sensitive workloads where you want the server to pick the closest region automatically.
Round Robin¶
Requests rotate evenly across healthy regions. This maximizes aggregate throughput when you need to spread load, but is incompatible with prompt caching because consecutive requests for the same model may land on different regions.
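The strategies differ mainly in how they order the healthy candidate regions. The following is a minimal sketch of that idea, not stdapi.ai's actual implementation: the function name, the `blocked` set, the `latency_ms` mapping and the `tick` counter are all hypothetical, and treating the first configured region as the "primary" one for `disabled` is an assumption.

```python
def candidate_order(strategy, regions, blocked=frozenset(), latency_ms=None, tick=0):
    """Return the regions a request would try, in order (illustrative only)."""
    healthy = [r for r in regions if r not in blocked]
    if strategy == "disabled":
        # Assumption: the primary region is the first configured one.
        return list(regions[:1])
    if strategy == "ordered":
        # Keep the configured order, skipping blocked regions.
        return healthy
    if strategy == "lowest_latency":
        # Fastest measured region first.
        return sorted(healthy, key=lambda r: latency_ms[r])
    if strategy == "round_robin":
        if not healthy:
            return []
        # Rotate the starting point by a per-request counter.
        start = tick % len(healthy)
        return healthy[start:] + healthy[:start]
    raise ValueError(f"unknown strategy: {strategy}")


regions = ["us-east-1", "us-west-2", "eu-west-1"]
print(candidate_order("ordered", regions, blocked={"us-east-1"}))
print(candidate_order("round_robin", regions, tick=1))
```

Because `round_robin` changes the starting region on every request, consecutive requests for the same model land on different regions, which is why it cannot benefit from prompt caching.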
Configuration¶
# Required: at least two regions
export AWS_BEDROCK_REGIONS=us-east-1,us-west-2,eu-west-1
# Strategy (default: ordered)
export AWS_BEDROCK_REGION_ROUTING=ordered
# Total retries across all regions per request (default: 9)
# With 3 regions and 9 retries, the cycle is: r1, r2, r3, r1, r2, r3, r1, r2, r3, r1 (10 total attempts)
export AWS_BEDROCK_MAX_RETRIES=9
# Enable adaptive retry mode — dynamically throttles back retries under congestion (default: false)
export AWS_ADAPTIVE_RETRY=false
# How long to avoid a region after a quota/throttling error (seconds, default: 60)
# This is the base value — the actual delay doubles on each consecutive quota error,
# up to a hard ceiling of 1 hour.
export AWS_BEDROCK_REGION_ROUTING_QUOTA_BACKOFF_SECONDS=60
# Hard ceiling on quota backoff per region (seconds, default: 3600 = 1 hour)
export AWS_BEDROCK_REGION_ROUTING_MAX_QUOTA_BACKOFF_SECONDS=3600
# Factor × max quota backoff after which the consecutive-error counter resets (default: 2)
export AWS_BEDROCK_REGION_ROUTING_QUOTA_STALE_FACTOR=2
# How long to avoid a region after an unavailability error (seconds, default: 30)
export AWS_BEDROCK_REGION_ROUTING_UNAVAILABLE_BACKOFF_SECONDS=30
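To see what the quota backoff settings imply in practice, the doubling rule can be written out as a small Python sketch. The function name is hypothetical, and it assumes the first quota error uses the base value unmultiplied; the real server may count differently.

```python
def quota_backoff_seconds(consecutive_errors, base=60, ceiling=3600):
    """Delay before a region is retried after its Nth consecutive quota error.

    Assumption: the first error waits `base` seconds, and each further
    consecutive error doubles the delay until `ceiling` is reached.
    """
    if consecutive_errors < 1:
        return 0
    return min(base * 2 ** (consecutive_errors - 1), ceiling)


# Schedule for the default base (60 s) and ceiling (3600 s).
schedule = [quota_backoff_seconds(n) for n in range(1, 9)]
print(schedule)  # [60, 120, 240, 480, 960, 1920, 3600, 3600]
```

Under these assumptions, a region hits the 1-hour ceiling on its seventh consecutive quota error.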
Single-Region Deployments
With only one region configured, routing is automatically disabled regardless of the strategy setting.
How It Works¶
- Model discovery — At startup, stdapi.ai discovers which models are available in each configured region.
- Region selection — When a request arrives, the router picks the best region for that model based on the active strategy and current region health.
- Automatic failover with cycling — For synchronous and streaming requests without S3 inputs, the retry loop cycles through regions in priority order, wrapping back to the start after exhausting all regions, up to AWS_BEDROCK_MAX_RETRIES total retries. For example, with 3 regions and 9 retries the sequence is r1, r2, r3, r1, r2, r3, r1, r2, r3, r1. All retryable errors escalate to the next region immediately. When S3 inputs are present, the region is pinned and botocore's adaptive retries handle resilience within that region (see S3-Aware Region Selection).
- Backoff tracking — Regions that produce errors are temporarily deprioritized. Quota errors use exponential backoff (base interval doubles per consecutive error, capped at 1 hour); unavailability errors use a fixed backoff. Once the backoff expires, regions rejoin the rotation.
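The cycling order described in the steps above can be reproduced in a few lines of Python. This is only an illustration of the sequence (one initial attempt plus the configured number of retries); the function name is hypothetical, and it ignores region health, which the real router also takes into account.

```python
from itertools import cycle, islice


def attempt_sequence(regions, max_retries):
    """Regions visited by one request: the first attempt plus max_retries retries."""
    return list(islice(cycle(regions), max_retries + 1))


print(attempt_sequence(["r1", "r2", "r3"], 9))
# ['r1', 'r2', 'r3', 'r1', 'r2', 'r3', 'r1', 'r2', 'r3', 'r1']
```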
%%{init: {'flowchart': {'htmlLabels': true}} }%%
flowchart LR
client["Your App"] -->|API request| stdapi
subgraph stdapi ["<img src='../styles/logo.svg' style='height:48px;width:auto;vertical-align:middle;' /> stdapi.ai"]
router["Region Router\n(strategy + health)"]
end
router -->|"region selected"| r1
subgraph aws ["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /> AWS Bedrock"]
r1["us-east-1"]
r2["us-west-2"]
r3["eu-west-1"]
end
r1 -->|"ThrottlingException"| router
router -->|"retry → next region"| r2
r2 -->|"✓ success"| stdapi
stdapi -->|response| client
Failover Scope¶
| API Style | Failover Behavior |
|---|---|
| Synchronous (Converse, InvokeModel) | Automatic retry cycling across regions within the same request; S3-pinned requests stay on the pinned region with botocore adaptive retries |
| Streaming (ConverseStream, InvokeModelWithResponseStream) | Retry cycling across regions before the stream opens; once streaming begins the region is locked. S3-pinned requests stay on the pinned region with botocore adaptive retries. |
| Asynchronous (StartAsyncInvoke) | Region is selected once at job start; no mid-job failover |
Logging¶
Every request log includes a model_regions field (a set) showing which AWS region(s) handled the request. A single request may touch more than one region when failover occurs mid-request.
{
"type": "request",
"model_id": "anthropic.claude-sonnet-4-5-20250929-v1:0",
"model_regions": ["us-east-1"],
...
}
Elevated Log Level on Failover
When a region is skipped due to a quota or unavailability error, the request log level is elevated to warning so these events are visible even when filtering for warnings only.
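If the request logs are collected as JSON lines, the model_regions field makes failovers easy to spot. The sketch below assumes one JSON object per line with the fields shown in the example above; the sample lines are invented for illustration.

```python
import json


def failover_requests(log_lines):
    """Yield request log entries whose model_regions names more than one region."""
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip lines that are not JSON.
        if not isinstance(entry, dict):
            continue
        if entry.get("type") == "request" and len(entry.get("model_regions", [])) > 1:
            yield entry


# Hypothetical log lines for demonstration.
sample = [
    '{"type": "request", "model_id": "anthropic.claude-sonnet-4-5-20250929-v1:0", "model_regions": ["us-east-1"]}',
    '{"type": "request", "model_id": "anthropic.claude-sonnet-4-5-20250929-v1:0", "model_regions": ["us-east-1", "us-west-2"]}',
    "not json",
]
for entry in failover_requests(sample):
    print(entry["model_id"], sorted(entry["model_regions"]))
```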
S3 Data Handling¶
Many Bedrock operations accept S3 URIs as input (e.g. images, PDFs) or produce S3 output (e.g. async invocations). stdapi.ai includes several features to handle S3 data seamlessly across regions.
S3-Aware Region Selection¶
When a request references S3 data, the router takes the data location into account:
- S3 inputs present — All S3-sourced input files for the request are tracked and their regions are ranked by descending total data volume. Only the single best region is used — the retry loop is pinned to it. This is required because S3 content blocks are resolved once for a specific region and cannot be re-resolved for a different one; retrying on another region would send a cross-region S3 reference that Bedrock cannot access. If none of the S3 input regions are regions where the model is available, the router falls back to the first model region that has a configured S3 bucket (the object will be copied there before invocation). If no such bucket region exists either, the request is rejected with an error.
- S3 required, no S3 inputs — Operations that need an S3 bucket (e.g. async invocations) restrict candidates to model regions that have a configured S3 bucket. If no region has a bucket, the request is rejected with an error.
- No S3 constraint — All regions where the model is available are considered.
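The three cases above can be condensed into one selection function. This is a simplified sketch with hypothetical names and data shapes (S3 inputs as `(region, size_in_bytes)` pairs); it mirrors the rules as described here and is not taken from the stdapi.ai source.

```python
from collections import defaultdict


def select_regions(model_regions, s3_inputs=(), bucket_regions=(), s3_required=False):
    """Return the candidate regions for one request (illustrative only).

    model_regions:  regions where the model is available, in priority order
    s3_inputs:      (region, size_in_bytes) pairs for S3-sourced input files
    bucket_regions: regions that have a configured S3 bucket
    """
    if s3_inputs:
        # Rank the S3 regions by the total volume of input data they hold.
        volume = defaultdict(int)
        for region, size in s3_inputs:
            volume[region] += size
        for region in sorted(volume, key=volume.get, reverse=True):
            if region in model_regions:
                return [region]  # Pinned to the single best region.
        # No overlap: fall back to the first model region with a bucket.
        for region in model_regions:
            if region in bucket_regions:
                return [region]  # Pinned; inputs are copied there first.
        raise LookupError("no viable region for model + S3 inputs")
    if s3_required:
        candidates = [r for r in model_regions if r in bucket_regions]
        if not candidates:
            raise LookupError("no region has a configured bucket")
        return candidates
    return list(model_regions)


print(select_regions(
    ["us-east-1", "us-west-2"],
    s3_inputs=[("us-west-2", 4_000_000), ("us-east-1", 1_000_000)],
))  # ['us-west-2']
```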
%%{init: {'flowchart': {'htmlLabels': true}} }%%
flowchart TD
req["Incoming Request"] --> check{"S3 inputs\npresent?"}
check -->|"Yes"| rank["Rank S3 regions\nby data volume"]
rank --> overlap{"Model available\nin any S3 region?"}
overlap -->|"Yes"| pin["Pin to single\nbest region"]
overlap -->|"No"| bucketed{"Model region\nwith S3 bucket?"}
bucketed -->|"Yes"| pin_bucket["Pin to first bucketed\nmodel region\n(object will be copied)"]
bucketed -->|"No"| err["❌ Error: no viable region\nfor model + S3 inputs"]
check -->|"No"| s3req{"S3 bucket\nrequired?"}
s3req -->|"Yes"| s3cap["Model regions\nwith S3 bucket only"]
s3cap --> empty{"Any found?"}
empty -->|"No"| err2["❌ Error: no region\nhas a configured bucket"]
empty -->|"Yes"| multi["Multi-region\ncandidates"]
s3req -->|"No"| multi
pin --> invoke["Invoke Bedrock"]
pin_bucket --> invoke
multi --> invoke
No Cross-Region Failover with S3 Inputs
When S3 input files are present, the region is locked before the request is made. If that region is throttled or unavailable, the request fails rather than retrying on another region with a stale S3 reference.
S3 HTTP URL to S3 URI Conversion¶
If a user passes an S3 HTTP URL (including presigned URLs) as input, stdapi.ai automatically converts it to an s3:// URI when the bucket is recognized. This avoids unnecessary HTTP round-trips and allows Bedrock to access the object directly.
Recognized buckets include:
- The application's own buckets (AWS_S3_BUCKET and AWS_S3_REGIONAL_BUCKETS)
- Any bucket listed in AWS_S3_ACCEPTED_BUCKETS
Both virtual-hosted style (https://bucket.s3.region.amazonaws.com/key) and path-style (https://s3.region.amazonaws.com/bucket/key) URLs are supported.
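As a rough illustration of the conversion, the sketch below handles only the two URL forms shown above and drops the query string (where a presigned URL carries its signature). The function name and regular expressions are assumptions for this example, not the server's actual parsing code.

```python
import re
from urllib.parse import unquote, urlparse

# Host patterns for the two documented URL styles.
_VIRTUAL_HOSTED = re.compile(r"(?P<bucket>.+)\.s3\.[a-z0-9-]+\.amazonaws\.com")
_PATH_STYLE = re.compile(r"s3\.[a-z0-9-]+\.amazonaws\.com")


def to_s3_uri(url, known_buckets):
    """Return an s3:// URI for a recognized bucket, or None to leave the URL as-is."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    path = parsed.path.lstrip("/")
    if _PATH_STYLE.fullmatch(host):
        # Path-style: the first path segment is the bucket name.
        bucket, _, key = path.partition("/")
    else:
        match = _VIRTUAL_HOSTED.fullmatch(host)
        if match is None:
            return None
        bucket, key = match.group("bucket"), path
    if bucket not in known_buckets or not key:
        return None
    return f"s3://{bucket}/{unquote(key)}"


known = {"my-data-bucket", "my-eu-bucket"}
print(to_s3_uri("https://my-data-bucket.s3.us-east-1.amazonaws.com/docs/a.pdf?X-Amz-Signature=abc", known))
print(to_s3_uri("https://s3.eu-west-1.amazonaws.com/my-eu-bucket/img/cat.png", known))
```

URLs for unrecognized buckets return `None` in this sketch, so a caller would fall back to fetching them over HTTP.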
Cross-Region S3 Copy¶
When the selected Bedrock region differs from the region where the input S3 object resides, stdapi.ai copies the object to a bucket in the target region before invoking the model. The copy uses server-side copy for objects up to 5 GiB and multipart copy for larger objects.
%%{init: {'flowchart': {'htmlLabels': true}} }%%
flowchart LR
input["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>s3://bucket-us-east-1/file"]
copy["Server-side copy\n≤5 GiB: single copy\n>5 GiB: multipart"]
dest["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>s3://bucket-us-west-2/file"]
bedrock["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>Bedrock us-west-2"]
input -->|"selected region ≠ object region"| copy
copy --> dest
dest --> bedrock
Accepted S3 Buckets¶
You can declare external S3 buckets that the application has read access to. These buckets are then recognized for S3 HTTP URL conversion and region-aware routing:
export AWS_S3_ACCEPTED_BUCKETS='{"my-data-bucket": "us-east-1", "my-eu-bucket": "eu-west-1"}'
Keys are bucket names, values are the AWS region where each bucket resides.
Regional S3 Buckets¶
Asynchronous invocations require an S3 bucket in the same region as the Bedrock endpoint. When routing is enabled, configure regional buckets so the router can place async jobs in any eligible region:
export AWS_S3_REGIONAL_BUCKETS='{"us-east-1": "my-bucket-use1", "us-west-2": "my-bucket-usw2"}'
Note
If a region has no configured bucket, it is excluded from async invocation routing but remains available for synchronous and streaming requests.
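The effect of this setting on async routing can be pictured as a simple filter over the model's regions. A minimal sketch, assuming the JSON shape shown above (region name mapped to bucket name) and a hypothetical function name:

```python
import json


def async_eligible_regions(model_regions, regional_buckets_json):
    """Pair each model region with its bucket, dropping regions without one."""
    buckets = json.loads(regional_buckets_json)
    return [(region, buckets[region]) for region in model_regions if region in buckets]


config = '{"us-east-1": "my-bucket-use1", "us-west-2": "my-bucket-usw2"}'
print(async_eligible_regions(["us-east-1", "us-west-2", "eu-west-1"], config))
# [('us-east-1', 'my-bucket-use1'), ('us-west-2', 'my-bucket-usw2')]
```

Here eu-west-1 is skipped for async jobs because it has no bucket, while it would still serve synchronous and streaming requests.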
Model Region Restrict¶
You can restrict specific models to a fixed set of regions. This is useful when a model offers important features only in certain regions (e.g. Nova grounding is only available in us-east-1):
export AWS_BEDROCK_MODEL_REGION_RESTRICT='{"amazon.nova-pro-v1:0": ["us-east-1"]}'
Keys are Bedrock model IDs (or prefixes). Values are ordered lists of allowed regions. The model is only made available in those regions—no fallback to other regions occurs. The order of the list determines the routing priority when multiple regions are listed.
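A lookup against this mapping might work as sketched below. Two details are assumptions made for the example rather than documented behavior: when several keys match, the longest prefix wins, and the result is intersected with the configured regions.

```python
import json


def allowed_regions(model_id, configured_regions, restrict_json):
    """Return the regions a model may use, in routing priority order."""
    restrict = json.loads(restrict_json)
    matches = [key for key in restrict if model_id.startswith(key)]
    if not matches:
        # No restriction applies to this model.
        return list(configured_regions)
    key = max(matches, key=len)  # Assumption: the longest matching prefix wins.
    # The order of the restrict list sets the routing priority.
    return [region for region in restrict[key] if region in configured_regions]


restrict = '{"amazon.nova-pro-v1:0": ["us-east-1"]}'
print(allowed_regions("amazon.nova-pro-v1:0", ["us-west-2", "us-east-1"], restrict))
# ['us-east-1']
```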
Deprecated Model Fallback¶
When a client sends a request using a model ID that has been retired or superseded, stdapi.ai can transparently reroute it to the recommended replacement — no client changes needed.
This is controlled by AWS_BEDROCK_DEPRECATED_MODEL_FALLBACK (default: true).
How it works¶
- On a cache miss, the deprecation registry is consulted for a replacement.
- If the replacement is itself deprecated, the chain is followed until a live model is found or the chain ends.
- If a live replacement is found, the request proceeds with it. A warning is recorded in the request log and the log level is elevated to warning so the event is visible in monitoring.
- If no live model is found at the end of the chain, a 404 is returned naming both the original deprecated ID and the last replacement tried.
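The chain-following logic described above can be sketched as a short loop. The registry shape (deprecated ID mapped to replacement ID), the function name and the cycle guard are assumptions made for illustration; the sample mapping reuses the Titan to Nova pair quoted later in this page.

```python
def resolve_replacement(model_id, registry, live_models):
    """Follow the replacement chain for a deprecated model.

    Returns (live_model_or_None, last_id_tried).
    """
    seen = {model_id}
    current = model_id
    while current in registry:
        current = registry[current]
        if current in live_models:
            return current, current
        if current in seen:
            break  # Guard against a cyclic registry.
        seen.add(current)
    return None, current


registry = {"amazon.titan-text-lite-v1": "amazon.nova-lite-v1:0"}
print(resolve_replacement("amazon.titan-text-lite-v1", registry, {"amazon.nova-lite-v1:0"}))
# ('amazon.nova-lite-v1:0', 'amazon.nova-lite-v1:0')
```

When the first element is `None`, the second element is the "last replacement tried" that a 404 message would name alongside the original ID.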
AWS_BEDROCK_LEGACY — Use with caution
Setting AWS_BEDROCK_LEGACY=true forces stdapi.ai to keep serving legacy (end-of-life) models. AWS may deny requests to such models with an access error if you have not been actively using the model recently, causing failover to break silently. Only set this option if using a legacy model is absolutely required.
Legacy model warnings¶
Using a legacy model (one AWS has scheduled for end-of-life) also emits a warning-level log entry, including the EOL date when known:
Model 'anthropic.claude-3-5-haiku-20241022-v1:0' is legacy and will reach end-of-life on 2026-06-19. Please migrate to a supported model.
Models whose EOL date falls within the current cache window are proactively excluded at cache refresh time, so they are never served to clients even if AWS has not yet removed them from the available models list.
Strict mode¶
Set AWS_BEDROCK_DEPRECATED_MODEL_FALLBACK=false to disable the fallback. Requests using a deprecated model ID will fail with a 404 that includes the recommended replacement, forcing clients to update their code explicitly:
Model 'amazon.titan-text-lite-v1' is deprecated or pending deprecation, please use 'amazon.nova-lite-v1:0' instead.
Extending the registry¶
The built-in deprecation registry covers all models listed in the AWS Bedrock model lifecycle. Use AWS_BEDROCK_DEPRECATED_MODELS to add custom mappings or override existing ones.
Infrastructure Resilience¶
The Terraform module deploys stdapi.ai following AWS best practices for high availability and fault tolerance. Every component is designed to handle failures transparently — no additional configuration required.
- Multi-AZ Fargate Tasks: ECS tasks spread across all Availability Zones; a single AZ failure does not interrupt service
- Stateless Service Design: stdapi.ai holds no local state, so failed tasks are replaced instantly with zero data loss
- ALB Health Checks: unhealthy tasks drained and replaced within seconds; traffic rerouted to healthy AZs automatically
- Bedrock Cross-Region Inference: Bedrock-native routing across AWS regions provides an extra failover layer on top of stdapi.ai's own region routing
- S3 Eleven-Nines Durability: 99.999999999% object durability; regional buckets co-located with each Bedrock endpoint
- Fast Task Startup: new tasks become healthy in under 30 seconds, minimising the recovery window after any failure
- Zero-Downtime Updates: rolling deployments and ALB connection draining ensure in-flight requests always complete cleanly
%%{init: {'flowchart': {'htmlLabels': true}} }%%
flowchart TB
client["Your App"]
subgraph deployment["AWS Region (ECS deployment)"]
alb["<img src='../styles/logo_amazon_load_balancing.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>ALB + WAF"]
b_local["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>AWS Bedrock"]
s3r["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>S3"]
subgraph az_a["Availability Zone A"]
ecs_a["<img src='../styles/logo.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>stdapi.ai<br/>ECS Fargate"]
end
subgraph az_b["Availability Zone B"]
ecs_b["<img src='../styles/logo.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>stdapi.ai<br/>ECS Fargate"]
end
end
subgraph br2["Bedrock Region 2"]
b2["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>AWS Bedrock"]
bs2["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>Regional S3"]
end
subgraph brn["Bedrock Region N"]
bn["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>AWS Bedrock"]
bsn["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>Regional S3"]
end
client -->|"HTTPS"| alb
alb --> ecs_a & ecs_b
ecs_a & ecs_b --> s3r
ecs_a & ecs_b --> b_local
ecs_a & ecs_b -.->|"region routing"| b2
ecs_a & ecs_b -.->|"region routing"| bn
b_local -.-|"cross-region inference"| b2
b2 -.-|"cross-region inference"| bn
Multi-AZ & ECS Service Resilience¶
Stateless by design. stdapi.ai stores no local state — all persistent data lives in S3. Each ECS Fargate task is fully replaceable: ECS can terminate and relaunch a failed task without losing any data, and a request that was in flight on the failed task can simply be retried by the client.
Multi-AZ spread. The Terraform module places ECS tasks across all available Availability Zones in the region. If an AZ experiences a partial or full failure, tasks in the remaining AZs continue to process requests without interruption. The default configuration maintains at least one task per Availability Zone, guaranteeing availability even during a task replacement event.
Auto-scaling. Task count scales automatically based on CPU utilisation, memory utilisation, and ALB request count — whichever metric signals pressure first. Fargate Spot is optionally available for cost-sensitive deployments — see Cost-Optimized Deployment for the trade-offs.
Terraform Module
Minimum capacity defaults to the number of deployed Availability Zones (one task per AZ). Maximum capacity is configurable (ecs_max_capacity, default: 10). Auto-scaling targets CPU and memory utilisation as well as ALB request count per target, so the service scales out under any of these pressure signals.
Fast startup. The stdapi.ai container image is optimised for minimal startup time — a new task typically becomes healthy in under 30 seconds. Fast startup is critical for recovery: when ECS detects a failed task it launches a replacement immediately, keeping the degraded window short and ensuring the service restores full capacity without manual intervention.
Zero-downtime updates. ECS rolling deployments start the new container version and wait for it to pass health checks before draining the old task. The ALB connection draining period lets in-flight requests complete on the outgoing task before it is deregistered. Application updates never interrupt ongoing API calls.
ALB Resilience¶
The Application Load Balancer is a fully managed, natively multi-AZ AWS service:
- Cross-AZ load balancing — traffic is distributed evenly across tasks in all healthy AZs.
- Health check integration — the ALB polls /health on each task; a task is removed from rotation after two consecutive failed checks (~60 s) and re-added as soon as it recovers.
- WAF protection — when enabled, WAF sits in front of the ALB and mitigates DDoS and rate-limit abuse before requests reach the application.
Terraform Module
- Load balancing algorithm — uses weighted_random with anomaly mitigation enabled, automatically reducing traffic sent to tasks exhibiting elevated error rates before they are fully drained.
- Idle timeout — set to 3600 s (1 hour) via alb_idle_timeout to accommodate long-running streaming LLM responses. Without a sufficiently large timeout, the ALB may terminate connections mid-stream for slow or large generations.
ALB is not a single point of failure
AWS manages ALB node redundancy across AZs automatically. An AZ failure reduces capacity but does not take the load balancer offline.
Bedrock Cross-Region Inference¶
AWS Bedrock supports cross-region inference profiles, which allow Bedrock to automatically route a model invocation to another AWS region when the primary region is throttled or temporarily unavailable. stdapi.ai enables cross-region inference by default (AWS_BEDROCK_CROSS_REGION_INFERENCE_GLOBAL=true).
This creates two complementary failover layers:
| Layer | Where it operates | When it triggers |
|---|---|---|
| stdapi.ai region routing | Application level — across your configured regions | Quota exceeded, throttling, regional unavailability |
| Bedrock cross-region inference | Bedrock service level — transparent within AWS | Bedrock-internal capacity events |
Together, they maximize model availability without any client-side changes.
Compliance-aware cross-region inference
Set AWS_BEDROCK_CROSS_REGION_INFERENCE_GLOBAL=false to restrict Bedrock to region-local inference, ensuring data stays within a specific geography (e.g. EU-only for GDPR compliance). See Data Sovereignty & Compliance and the GDPR deployment example.
S3 Resilience¶
S3 stores multimodal inputs and outputs (images, PDFs, audio) used by Bedrock operations:
- 99.999999999% (11 nines) object durability — data is stored redundantly across multiple devices and AZs within a region.
- 99.99% availability — S3 Standard is designed for 99.99% availability, with no planned downtime.
- Regional buckets — for multi-region deployments, each Bedrock region has a dedicated S3 bucket co-located in the same region. This eliminates cross-region data transfer for async and multimodal operations and satisfies data residency requirements.
Ultimate Multi-Region Deployment¶
For the highest possible resilience, deploy two independent stdapi.ai stacks in separate AWS regions and connect them with AWS Global Accelerator. Global Accelerator routes each client to the nearest healthy region using geographic proximity — both regions are active simultaneously. If one region's ALB fails health checks, GA automatically reroutes its traffic to the other region within seconds.
Additional Bedrock regions (without ECS) can be added to AWS_BEDROCK_REGIONS in each stack to expand model availability and quota without deploying more ECS infrastructure.
What this adds on top of a single-region deployment:
| Component | Single region | Multi-region + GA |
|---|---|---|
| ECS Fargate | Multi-AZ in one region | Multi-AZ in two regions |
| ALB | One ALB | One ALB per region |
| Entry point | ALB DNS name | Two static Anycast IPs via Global Accelerator |
| Traffic routing | — | Geographic proximity (nearest region wins) |
| Regional failover | None | Automatic, within seconds |
| Bedrock quota | One region's quota | Multiple independent quotas |
%%{init: {'flowchart': {'htmlLabels': true}} }%%
flowchart LR
client["Your App"]
ga["<img src='../styles/logo_amazon_global_accelerator.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>Global Accelerator"]
subgraph region_a["AWS Region A"]
direction TB
alb_a["<img src='../styles/logo_amazon_load_balancing.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>ALB + WAF"]
ecs_a["<img src='../styles/logo.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>stdapi.ai<br/>ECS Fargate"]
b_a["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>AWS Bedrock"]
s3_a["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>S3"]
end
subgraph region_b["AWS Region B"]
direction TB
alb_b["<img src='../styles/logo_amazon_load_balancing.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>ALB + WAF"]
ecs_b["<img src='../styles/logo.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>stdapi.ai<br/>ECS Fargate"]
b_b["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>AWS Bedrock"]
s3_b["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>S3"]
end
subgraph region_c["AWS Region C (Bedrock only)"]
direction TB
b_c["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>AWS Bedrock"]
s3_c["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>Regional S3"]
end
client -->|"HTTPS"| ga
ga -->|"geo-routing"| alb_a
ga -->|"geo-routing"| alb_b
alb_a --> ecs_a
alb_b --> ecs_b
ecs_a --> b_a & s3_a
ecs_b --> b_b & s3_b
ecs_a & ecs_b -.->|"region routing"| b_c
How Global Accelerator integrates:
- Geographic proximity routing — GA accepts each client's traffic at the nearest AWS edge location, then carries it over the AWS backbone to the ALB in the closest healthy region. Both regions serve live traffic simultaneously.
- Health-based failover — GA continuously health-checks each ALB endpoint. If a region's ALB stops responding, GA automatically reroutes its traffic to the other region within seconds — with no DNS TTL delay.
- Single Anycast entry point — clients always connect to the same two static IPs regardless of which region handles the request. No client reconfiguration is needed during a regional failure.
API key synchronisation
Both ECS stacks must share the same API key so clients can reach either region transparently. Use api_key_secretsmanager_secret pointing to a cross-region replicated Secrets Manager secret, or set the same key via api_key_value in both modules.
Best Practices¶
Infrastructure:
- Use the Terraform module — The stdapi-ai Terraform module provisions all resilience features out of the box: multi-AZ ECS, ALB health checks, auto-scaling, WAF, and CloudWatch alarms. Deploying manually risks missing critical settings.
- Run at least two Bedrock regions — Configure aws_bedrock_regions with two or more regions to unlock quota multiplication and automatic failover. A single region is a single point of failure for quota limits.
Region routing:
- Start with ordered — It provides failover without sacrificing prompt caching.
- Use lowest_latency only if your server's network position varies or you want the fastest region chosen automatically.
- Use round_robin for high-throughput batch workloads where prompt caching is not needed.
- Keep backoff values moderate — The defaults (60 s for quota, 30 s for unavailability) work well for most workloads. Very short backoffs may cause premature retries against a region that is still overloaded.
- Tune AWS_BEDROCK_MAX_RETRIES — The default of 9 provides strong resilience across multiple regions. Lower it (e.g. 3) to fail faster; raise it for workloads that can tolerate longer retry windows during sustained outages.
- Consider AWS_ADAPTIVE_RETRY — Enable this when many concurrent clients share the same endpoint and sustained congestion is likely. It paces retries based on real-time error signals, reducing the risk of retry storms, at the cost of potentially higher per-request latency under load. Avoid it for latency-sensitive or low-traffic workloads.
- Monitor model_regions in logs — If one region consistently appears in error logs, consider adjusting its quota or removing it from the region list.
- Declare accepted buckets — If your users provide S3 URLs from buckets outside the application's own buckets, add them to AWS_S3_ACCEPTED_BUCKETS so the router can resolve their region and convert HTTP URLs to S3 URIs.
- Pin models when needed — Use AWS_BEDROCK_MODEL_REGION_RESTRICT for models that have region-specific features (e.g. grounding) to guarantee those features are always available. The model will be restricted exclusively to the listed regions.
- Plan for model deprecations — Keep AWS_BEDROCK_DEPRECATED_MODEL_FALLBACK=true (the default) so clients survive AWS model retirements without downtime. Switch to false in environments where you want to enforce explicit client migrations.
Next Steps¶
- Getting Started — Deploy to AWS with Terraform in 5 minutes
- Advanced Deployment — Multi-region Terraform examples with resilience configured
- Configuration Reference — All routing and failover environment variables
- Data Sovereignty & Compliance — GDPR-compliant region configuration