
Resilience & Failover

stdapi.ai on AWS is designed for high availability at every layer — from intelligent multi-region request routing to the underlying infrastructure running the service. This page covers both the application-level region routing for AWS Bedrock and the infrastructure resilience built into the Terraform module.


Region Routing

stdapi.ai can automatically distribute Bedrock requests across your configured AWS regions. When a region becomes temporarily unavailable or hits quota limits, requests are transparently routed to another region — no client changes needed.

Multiply Your Effective Quota

Each AWS region has its own independent quota. By configuring multiple regions, your effective quota scales with the number of regions: assuming similar per-region quotas, 3 regions give you approximately 3× the tokens per minute and 3× the daily token limit of a single-region setup.
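As a worked example, using placeholder per-region limits (not actual AWS quotas), the effective quota for a model is the sum of the independent regional quotas:

```python
# Placeholder tokens-per-minute quotas for one model; real values depend on your account.
regional_tpm = {"us-east-1": 200_000, "us-west-2": 200_000, "eu-west-1": 200_000}


def effective_quota(quotas: dict) -> int:
    """Aggregate quota when requests can be spread across every listed region."""
    return sum(quotas.values())


single = regional_tpm["us-east-1"]
total = effective_quota(regional_tpm)
print(f"{total} TPM, {total / single:.0f}x a single region")
```

With equal quotas in three regions this yields 3× the single-region figure; unequal regional quotas simply add up to a different multiple.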

Overview

Region routing activates when you have two or more regions in AWS_BEDROCK_REGIONS. The server tracks the health of each region per model and steers traffic away from regions that are returning errors.

  • Quota & Throttling
    Triggers on ThrottlingException, TooManyRequestsException, ServiceQuotaExceededException

  • Regional Unavailability
    Triggers on ServiceUnavailableException, InternalServerException, ModelNotReadyException

  • Exponential Backoff
    Quota errors: delay doubles per consecutive error, capped at 1 hour

  • Fixed Backoff
    Unavailability errors: fixed configurable delay, default 30 s

  • Configurable Retry Count
    Set AWS_BEDROCK_MAX_RETRIES to control total retries; requests cycle across regions in order


Routing Strategies

Set the strategy with AWS_BEDROCK_REGION_ROUTING:

| Strategy | Description | Prompt Caching | Default |
|---|---|---|---|
| ordered | Try regions in the order listed in AWS_BEDROCK_REGIONS, skipping any that are currently blocked | Compatible | Yes |
| lowest_latency | Prefer the region with the lowest measured round-trip latency | Compatible | |
| round_robin | Distribute requests evenly across available regions | Not compatible | |
| disabled | No routing; each model uses its primary region only | Compatible | |

Ordered (default)

Regions are tried in the order they appear in AWS_BEDROCK_REGIONS. The first healthy region wins. This is the best choice when you want predictable routing and prompt caching, since requests for a given model tend to land on the same region as long as it is healthy.

Lowest Latency

At startup the server measures round-trip latency to each region and prefers the fastest one. If that region becomes blocked, the next-fastest is used. Good for latency-sensitive workloads where you want the server to pick the closest region automatically.

Round Robin

Requests rotate evenly across healthy regions. This maximizes aggregate throughput when you need to spread load, but is incompatible with prompt caching because consecutive requests for the same model may land on different regions.
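To make the differences concrete, here is a minimal Python sketch of how each strategy could order candidate regions for one model. It is not stdapi.ai's implementation: the `RegionOrderer` class is hypothetical, the latencies are placeholder values standing in for the startup measurements, and treating the first configured region as the primary region for `disabled` is an assumption.

```python
from __future__ import annotations

import itertools


class RegionOrderer:
    """Hypothetical sketch: order candidate regions for one model under a routing strategy."""

    def __init__(self, regions: list[str], strategy: str = "ordered",
                 latencies: dict[str, float] | None = None) -> None:
        self.regions = regions            # as listed in AWS_BEDROCK_REGIONS
        self.strategy = strategy          # value of AWS_BEDROCK_REGION_ROUTING
        self.latencies = latencies or {}  # placeholder for startup round-trip measurements
        self._turn = itertools.count()    # rotation counter used by round_robin

    def candidates(self, blocked: set[str]) -> list[str]:
        if self.strategy == "disabled":
            return self.regions[:1]       # assumption: primary region = first configured
        healthy = [r for r in self.regions if r not in blocked]
        if self.strategy == "lowest_latency":
            # Fastest first; if the fastest is blocked, the next-fastest leads.
            return sorted(healthy, key=lambda r: self.latencies.get(r, float("inf")))
        if self.strategy == "round_robin" and healthy:
            start = next(self._turn) % len(healthy)
            return healthy[start:] + healthy[:start]
        return healthy                    # "ordered": configuration order, blocked regions skipped
```

Only the round_robin branch changes the first candidate from one call to the next, which is why consecutive requests can land on different regions and miss the prompt cache.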


Configuration

# Required: at least two regions
export AWS_BEDROCK_REGIONS=us-east-1,us-west-2,eu-west-1

# Strategy (default: ordered)
export AWS_BEDROCK_REGION_ROUTING=ordered

# Total retries across all regions per request (default: 9)
# With 3 regions and 9 retries, the cycle is: r1, r2, r3, r1, r2, r3, r1, r2, r3, r1 (10 total attempts)
export AWS_BEDROCK_MAX_RETRIES=9

# Enable adaptive retry mode — dynamically throttles back retries under congestion (default: false)
export AWS_ADAPTIVE_RETRY=false

# How long to avoid a region after a quota/throttling error (seconds, default: 60)
# This is the base value — the actual delay doubles on each consecutive quota error,
# up to a hard ceiling of 1 hour.
export AWS_BEDROCK_REGION_ROUTING_QUOTA_BACKOFF_SECONDS=60

# Hard ceiling on quota backoff per region (seconds, default: 3600 = 1 hour)
export AWS_BEDROCK_REGION_ROUTING_MAX_QUOTA_BACKOFF_SECONDS=3600

# Multiplier on the max quota backoff: once (factor × max quota backoff) seconds have passed,
# the consecutive-error counter resets (default: 2, i.e. 7200 s with the default ceiling)
export AWS_BEDROCK_REGION_ROUTING_QUOTA_STALE_FACTOR=2

# How long to avoid a region after an unavailability error (seconds, default: 30)
export AWS_BEDROCK_REGION_ROUTING_UNAVAILABLE_BACKOFF_SECONDS=30
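The backoff settings above combine roughly as follows. This is a minimal sketch under stated assumptions, not the server's code: `RegionBackoff` is a hypothetical name, and measuring the stale window from the most recent quota error is an assumption about how the stale factor is applied.

```python
# Defaults mirror the documented environment variables above.
QUOTA_BACKOFF = 60        # AWS_BEDROCK_REGION_ROUTING_QUOTA_BACKOFF_SECONDS
MAX_QUOTA_BACKOFF = 3600  # AWS_BEDROCK_REGION_ROUTING_MAX_QUOTA_BACKOFF_SECONDS
QUOTA_STALE_FACTOR = 2    # AWS_BEDROCK_REGION_ROUTING_QUOTA_STALE_FACTOR
UNAVAILABLE_BACKOFF = 30  # AWS_BEDROCK_REGION_ROUTING_UNAVAILABLE_BACKOFF_SECONDS


class RegionBackoff:
    """Hypothetical backoff state for one (model, region) pair; times are in seconds."""

    def __init__(self) -> None:
        self.quota_streak = 0
        self.last_quota_error = float("-inf")
        self.blocked_until = 0.0

    def on_quota_error(self, now: float) -> float:
        # Assumption: the streak resets when the previous quota error is older than the stale window.
        if now - self.last_quota_error > QUOTA_STALE_FACTOR * MAX_QUOTA_BACKOFF:
            self.quota_streak = 0
        self.quota_streak += 1
        self.last_quota_error = now
        # Exponential: base, 2x base, 4x base, ... clamped to the hard ceiling.
        delay = min(QUOTA_BACKOFF * 2 ** (self.quota_streak - 1), MAX_QUOTA_BACKOFF)
        self.blocked_until = max(self.blocked_until, now + delay)
        return delay

    def on_unavailable_error(self, now: float) -> float:
        # Fixed delay, regardless of how many errors came before.
        self.blocked_until = max(self.blocked_until, now + UNAVAILABLE_BACKOFF)
        return UNAVAILABLE_BACKOFF

    def is_blocked(self, now: float) -> bool:
        return now < self.blocked_until
```

With the defaults, eight back-to-back quota errors produce delays of 60, 120, 240, 480, 960, 1920, 3600 and 3600 seconds; the seventh would be 3840 s but is clamped to the 1-hour ceiling.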

Single-Region Deployments

With only one region configured, routing is automatically disabled regardless of the strategy setting.


How It Works

  1. Model discovery — At startup, stdapi.ai discovers which models are available in each configured region.
  2. Region selection — When a request arrives, the router picks the best region for that model based on the active strategy and current region health.
  3. Automatic failover with cycling — For synchronous and streaming requests without S3 inputs, the retry loop cycles through regions in priority order, wrapping back to the start after exhausting all regions, up to AWS_BEDROCK_MAX_RETRIES total retries. For example, with 3 regions and 9 retries the sequence is r1, r2, r3, r1, r2, r3, r1, r2, r3, r1. All retryable errors escalate to the next region immediately. When S3 inputs are present, the region is pinned and botocore's adaptive retries handle resilience within that region (see S3-Aware Region Selection).
  4. Backoff tracking — Regions that produce errors are temporarily deprioritized. Quota errors use exponential backoff (base interval doubles per consecutive error, capped at 1 hour); unavailability errors use a fixed backoff. Once the backoff expires, regions rejoin the rotation.
%%{init: {'flowchart': {'htmlLabels': true}} }%%
flowchart LR
    client["Your App"] -->|API request| stdapi

    subgraph stdapi ["<img src='../styles/logo.svg' style='height:48px;width:auto;vertical-align:middle;' /> stdapi.ai"]
        router["Region Router\n(strategy + health)"]
    end

    router -->|"region selected"| r1

    subgraph aws ["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /> AWS Bedrock"]
        r1["us-east-1"]
        r2["us-west-2"]
        r3["eu-west-1"]
    end

    r1 -->|"ThrottlingException"| router
    router -->|"retry → next region"| r2
    r2 -->|"✓ success"| stdapi
    stdapi -->|response| client
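The failover cycle described in step 3 can be sketched in a few lines of Python. `RetryableError` is a hypothetical stand-in for the throttling and unavailability exceptions listed earlier, and the sketch leaves out the backoff bookkeeping and the S3-pinned path.

```python
from __future__ import annotations

from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical stand-in for ThrottlingException, ServiceUnavailableException, etc."""


def attempt_sequence(regions: list[str], max_retries: int) -> list[str]:
    """One initial attempt plus max_retries retries, wrapping around the priority-ordered regions."""
    if not regions:
        raise ValueError("at least one region is required")
    return [regions[i % len(regions)] for i in range(max_retries + 1)]


def invoke_with_failover(regions: list[str], max_retries: int,
                         call: Callable[[str], T]) -> tuple[str, T]:
    """Try regions in turn; a retryable error escalates immediately to the next region."""
    last_error = None
    for region in attempt_sequence(regions, max_retries):
        try:
            return region, call(region)
        except RetryableError as exc:
            last_error = exc  # no in-region retry: move straight to the next region
    raise last_error  # every attempt failed; the sequence is never empty here
```

With three regions and `max_retries=9`, `attempt_sequence` returns the same ten-attempt sequence shown in the configuration comment.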

Failover Scope

| API Style | Failover Behavior |
|---|---|
| Synchronous (Converse, InvokeModel) | Automatic retry cycling across regions within the same request; S3-pinned requests stay on the pinned region with botocore adaptive retries |
| Streaming (ConverseStream, InvokeModelWithResponseStream) | Retry cycling across regions before the stream opens; once streaming begins, the region is locked. S3-pinned requests stay on the pinned region with botocore adaptive retries |
| Asynchronous (StartAsyncInvoke) | Region is selected once at job start; no mid-job failover |

Logging

Every request log includes a model_regions field (a set of region names, serialized as a JSON array) showing which AWS region(s) handled the request. A single request may touch more than one region when failover occurs mid-request.

{
  "type": "request",
  "model_id": "anthropic.claude-sonnet-4-5-20250929-v1:0",
  "model_regions": ["us-east-1"],
  ...
}

Elevated Log Level on Failover

When a region is skipped due to a quota or unavailability error, the request log level is elevated to warning so these events are visible even when filtering for warnings only.
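To act on these logs, a small script can tally model_regions across request entries. This sketch assumes the logs are exported as one JSON object per line and relies only on the `type` and `model_regions` fields shown above; `failover_stats` is a hypothetical name.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def failover_stats(lines: Iterable[str]) -> Tuple[Counter, int]:
    """Tally regions across request logs and count requests that touched more than one region."""
    per_region: Counter = Counter()
    multi_region = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("type") != "request":
            continue  # ignore non-request log entries
        regions = record.get("model_regions") or []
        per_region.update(regions)
        if len(regions) > 1:
            multi_region += 1  # failover happened within this request
    return per_region, multi_region
```

A region whose count keeps appearing in multi-region requests is a good candidate for a quota increase or removal from AWS_BEDROCK_REGIONS.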


S3 Data Handling

Many Bedrock operations accept S3 URIs as input (e.g. images, PDFs) or produce S3 output (e.g. async invocations). stdapi.ai includes several features to handle S3 data seamlessly across regions.

S3-Aware Region Selection

When a request references S3 data, the router takes the data location into account:

  • S3 inputs present — All S3-sourced input files for the request are tracked and their regions are ranked by descending total data volume. Only the single best region is used, and the retry loop is pinned to it. This is required because S3 content blocks are resolved once for a specific region and cannot be re-resolved for a different one; retrying on another region would send a cross-region S3 reference that Bedrock cannot access. If the model is not available in any of the S3 input regions, the router falls back to the first model region that has a configured S3 bucket (the object is copied there before invocation). If no such region exists, the request is rejected with an error.
  • S3 required, no S3 inputs — Operations that need an S3 bucket (e.g. async invocations) restrict candidates to model regions that have a configured S3 bucket. If no region has a bucket, the request is rejected with an error.
  • No S3 constraint — All regions where the model is available are considered.
%%{init: {'flowchart': {'htmlLabels': true}} }%%
flowchart TD
    req["Incoming Request"] --> check{"S3 inputs\npresent?"}

    check -->|"Yes"| rank["Rank S3 regions\nby data volume"]
    rank --> overlap{"Model available\nin any S3 region?"}
    overlap -->|"Yes"| pin["Pin to single\nbest region"]
    overlap -->|"No"| bucketed{"Model region\nwith S3 bucket?"}
    bucketed -->|"Yes"| pin_bucket["Pin to first bucketed\nmodel region\n(object will be copied)"]
    bucketed -->|"No"| err["❌ Error: no viable region\nfor model + S3 inputs"]

    check -->|"No"| s3req{"S3 bucket\nrequired?"}
    s3req -->|"Yes"| s3cap["Model regions\nwith S3 bucket only"]
    s3cap --> empty{"Any found?"}
    empty -->|"No"| err2["❌ Error: no region\nhas a configured bucket"]
    empty -->|"Yes"| multi["Multi-region\ncandidates"]

    s3req -->|"No"| multi

    pin --> invoke["Invoke Bedrock"]
    pin_bucket --> invoke
    multi --> invoke
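The decision tree above can be written as a single function. This is a hypothetical Python sketch with assumed inputs (each S3 input is reduced to a region and a size in bytes); choosing the highest-volume input region where the model is available is one reading of "single best region".

```python
from __future__ import annotations


def select_regions(model_regions: list[str], s3_inputs: list[tuple[str, int]],
                   bucket_regions: set[str], s3_required: bool) -> list[str]:
    """Return candidate regions; a one-element list means the request is pinned.

    model_regions  -- priority-ordered regions where the model is available
    s3_inputs      -- (object_region, size_in_bytes) for each S3-sourced input
    bucket_regions -- regions that have a configured application bucket
    s3_required    -- True when the operation needs a bucket (e.g. async invocation)
    """
    if s3_inputs:
        volume: dict[str, int] = {}
        for region, size in s3_inputs:
            volume[region] = volume.get(region, 0) + size
        # Rank input regions by descending total data volume.
        for region in sorted(volume, key=volume.__getitem__, reverse=True):
            if region in model_regions:
                return [region]  # pinned: S3 references are resolved for this region only
        for region in model_regions:
            if region in bucket_regions:
                return [region]  # pinned: the object is copied here before invocation
        raise LookupError("no viable region for model + S3 inputs")
    if s3_required:
        candidates = [r for r in model_regions if r in bucket_regions]
        if not candidates:
            raise LookupError("no region has a configured bucket")
        return candidates
    return list(model_regions)
```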

No Cross-Region Failover with S3 Inputs

When S3 input files are present, the region is locked before the request is made. If that region is throttled or unavailable, the request fails rather than retrying on another region with a stale S3 reference.

S3 HTTP URL to S3 URI Conversion

If a user passes an S3 HTTP URL (including presigned URLs) as input, stdapi.ai automatically converts it to an s3:// URI when the bucket is recognized. This avoids unnecessary HTTP round-trips and allows Bedrock to access the object directly.

Recognized buckets include:

  • The application's own buckets (AWS_S3_BUCKET and AWS_S3_REGIONAL_BUCKETS)
  • Any bucket listed in AWS_S3_ACCEPTED_BUCKETS

Both virtual-hosted style (https://bucket.s3.region.amazonaws.com/key) and path-style (https://s3.region.amazonaws.com/bucket/key) URLs are supported.
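A minimal sketch of the conversion for the two URL styles above, assuming a plain set of recognized bucket names (in practice built from AWS_S3_BUCKET, AWS_S3_REGIONAL_BUCKETS and AWS_S3_ACCEPTED_BUCKETS). The regular expressions cover only the documented hostname forms, and `to_s3_uri` is a hypothetical name.

```python
from __future__ import annotations

import re
from urllib.parse import unquote, urlsplit

# Only the two documented forms are handled; legacy and dual-stack endpoints are not.
_VIRTUAL_HOSTED = re.compile(r"^(?P<bucket>[a-z0-9.-]+)\.s3\.(?P<region>[a-z0-9-]+)\.amazonaws\.com$")
_PATH_STYLE = re.compile(r"^s3\.(?P<region>[a-z0-9-]+)\.amazonaws\.com$")


def to_s3_uri(url: str, known_buckets: set[str]) -> str | None:
    """Convert an S3 HTTP(S) URL (presigned or not) to s3://bucket/key if the bucket is recognized."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return None
    host = parts.hostname or ""  # urlsplit lowercases the hostname
    path = unquote(parts.path)   # the query string (presigned signature) is dropped
    match = _VIRTUAL_HOSTED.match(host)
    if match:
        bucket, key = match["bucket"], path[1:]
    elif _PATH_STYLE.match(host):
        bucket, _, key = path[1:].partition("/")
    else:
        return None  # not an S3 endpoint: leave the URL unchanged
    if bucket not in known_buckets or not key:
        return None
    return f"s3://{bucket}/{key}"
```

Dropping the query string is what lets presigned URLs be converted: the signature parameters are not needed once the object is referenced by its s3:// URI.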

Cross-Region S3 Copy

When the selected Bedrock region differs from the region where the input S3 object resides, stdapi.ai copies the object to a bucket in the target region before invoking the model. The copy uses server-side copy for objects up to 5 GiB and multipart copy for larger objects.
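The 5 GiB threshold translates into a simple copy plan. The sketch below only computes the plan; each branch would map to the S3 API calls `copy_object` (single copy) or `create_multipart_upload`, `upload_part_copy` with `CopySourceRange`, and `complete_multipart_upload` (multipart). The 512 MiB default part size is an arbitrary assumption; S3 accepts parts between 5 MiB and 5 GiB and at most 10,000 parts per upload.

```python
from __future__ import annotations

MIB = 1024 ** 2
GIB = 1024 ** 3
MAX_SINGLE_COPY = 5 * GIB      # S3 limit for a single server-side CopyObject call
MAX_PARTS = 10_000             # S3 limit on parts per multipart upload
DEFAULT_PART_SIZE = 512 * MIB  # hypothetical choice within S3's 5 MiB to 5 GiB part range


def plan_copy(size: int, part_size: int = DEFAULT_PART_SIZE) -> tuple[str, list[tuple[int, int]]]:
    """Return ("single", []) or ("multipart", inclusive byte ranges, one per UploadPartCopy)."""
    if size <= MAX_SINGLE_COPY:
        return "single", []
    # Grow the part size if needed so the object fits in MAX_PARTS parts (ceiling division).
    part_size = max(part_size, -(-size // MAX_PARTS))
    ranges = [(start, min(start + part_size, size) - 1) for start in range(0, size, part_size)]
    return "multipart", ranges
```

Each (first, last) pair becomes a `CopySourceRange` of the form `bytes=first-last`.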

%%{init: {'flowchart': {'htmlLabels': true}} }%%
flowchart LR
    input["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>s3://bucket-us-east-1/file"]
    copy["Server-side copy\n≤5 GiB: single copy\n>5 GiB: multipart"]
    dest["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>s3://bucket-us-west-2/file"]
    bedrock["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>Bedrock us-west-2"]

    input -->|"selected region ≠ object region"| copy
    copy --> dest
    dest --> bedrock

Accepted S3 Buckets

You can declare external S3 buckets that the application has read access to. These buckets are then recognized for S3 HTTP URL conversion and region-aware routing:

export AWS_S3_ACCEPTED_BUCKETS='{"my-data-bucket": "us-east-1", "my-eu-bucket": "eu-west-1"}'

Keys are bucket names, values are the AWS region where each bucket resides.

Regional S3 Buckets

Asynchronous invocations require an S3 bucket in the same region as the Bedrock endpoint. When routing is enabled, configure regional buckets so the router can place async jobs in any eligible region:

export AWS_S3_REGIONAL_BUCKETS='{"us-east-1": "my-bucket-use1", "us-west-2": "my-bucket-usw2"}'

Note

If a region has no configured bucket, it is excluded from async invocation routing but remains available for synchronous and streaming requests.
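Note that the two JSON variables use opposite key directions: AWS_S3_ACCEPTED_BUCKETS maps bucket to region, while AWS_S3_REGIONAL_BUCKETS maps region to bucket. A hypothetical loader that normalizes both into one bucket-to-region lookup and filters async-eligible regions might look like this (AWS_S3_BUCKET is left out because its region is not given by these variables):

```python
from __future__ import annotations

import json
import os
from typing import Mapping


def load_bucket_config(env: Mapping[str, str] | None = None) -> tuple[dict[str, str], set[str]]:
    """Build a bucket -> region lookup and the set of regions that have a regional bucket."""
    env = os.environ if env is None else env
    regional = json.loads(env.get("AWS_S3_REGIONAL_BUCKETS", "{}"))  # region -> bucket
    accepted = json.loads(env.get("AWS_S3_ACCEPTED_BUCKETS", "{}"))  # bucket -> region
    bucket_region = {bucket: region for region, bucket in regional.items()}
    bucket_region.update(accepted)
    return bucket_region, set(regional)


def async_regions(model_regions: list[str], regional_bucket_regions: set[str]) -> list[str]:
    """Async jobs need a same-region bucket; other regions stay usable for sync and streaming."""
    return [r for r in model_regions if r in regional_bucket_regions]
```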


Model Region Restrict

You can restrict specific models to a fixed set of regions. This is useful when a model offers important features only in certain regions (e.g. Nova grounding is only available in us-east-1):

export AWS_BEDROCK_MODEL_REGION_RESTRICT='{"amazon.nova-pro-v1:0": ["us-east-1"]}'

Keys are Bedrock model IDs (or prefixes). Values are ordered lists of allowed regions. The model is only made available in those regions—no fallback to other regions occurs. The order of the list determines the routing priority when multiple regions are listed.
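A sketch of how the restriction could be applied, assuming it is intersected with the regions where the model was actually discovered and that the longest matching key wins when several prefixes match (precedence is not specified above). `allowed_regions` is a hypothetical helper.

```python
from __future__ import annotations

import json


def allowed_regions(model_id: str, discovered: list[str],
                    restrict: dict[str, list[str]]) -> list[str]:
    """Regions a model may be routed to under AWS_BEDROCK_MODEL_REGION_RESTRICT."""
    matches = [key for key in restrict if model_id.startswith(key)]
    if not matches:
        return list(discovered)  # unrestricted: every region where the model was discovered
    allowed = restrict[max(matches, key=len)]  # assumption: the most specific key wins
    # Keep the configured order (routing priority); never fall back to other regions.
    return [region for region in allowed if region in discovered]


restrict = json.loads('{"amazon.nova-pro-v1:0": ["us-east-1"]}')
print(allowed_regions("amazon.nova-pro-v1:0", ["us-west-2", "us-east-1"], restrict))
```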


Deprecated Model Fallback

When a client sends a request using a model ID that has been retired or superseded, stdapi.ai can transparently reroute it to the recommended replacement — no client changes needed.

This is controlled by AWS_BEDROCK_DEPRECATED_MODEL_FALLBACK (default: true).

How it works

  1. On a cache miss (the requested model ID is not among the models currently available), the deprecation registry is consulted for a replacement.
  2. If the replacement is itself deprecated, the chain is followed until a live model is found or the chain ends.
  3. If a live replacement is found, the request proceeds with it. A warning is recorded in the request log and the log level is elevated to warning so the event is visible in monitoring.
  4. If no live model is found at the end of the chain, a 404 is returned naming both the original deprecated ID and the last replacement tried.
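The chain-following steps can be sketched as below. The function name, the set of available models, and the cycle guard are assumptions for illustration; the strict-mode message reuses the wording documented under Strict mode.

```python
from __future__ import annotations


def resolve_model(model_id: str, available: set[str], registry: dict[str, str],
                  fallback: bool = True) -> str:
    """Return a live model ID, following the deprecation registry on a cache miss."""
    if model_id in available:
        return model_id
    if model_id not in registry:
        raise LookupError(f"Model '{model_id}' not found")
    if not fallback:  # strict mode: AWS_BEDROCK_DEPRECATED_MODEL_FALLBACK=false
        raise LookupError(
            f"Model '{model_id}' is deprecated or pending deprecation, "
            f"please use '{registry[model_id]}' instead."
        )
    seen = {model_id}
    current = model_id
    while current in registry:
        current = registry[current]
        if current in available:
            return current  # live replacement found (the server also logs a warning)
        if current in seen:
            break  # defensive guard against a cyclic registry
        seen.add(current)
    raise LookupError(f"Model '{model_id}' is deprecated; last replacement tried '{current}' is not available")
```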

AWS_BEDROCK_LEGACY — Use with caution

Setting AWS_BEDROCK_LEGACY=true forces stdapi.ai to keep serving legacy (end-of-life) models. AWS may deny requests to such models with an access error if you have not been actively using the model recently, causing failover to break silently. Only set this option if using a legacy model is absolutely required.

Legacy model warnings

Using a legacy model (one AWS has scheduled for end-of-life) also emits a warning-level log entry, including the EOL date when known:

Model 'anthropic.claude-3-5-haiku-20241022-v1:0' is legacy and will reach end-of-life on 2026-06-19. Please migrate to a supported model.

Models whose EOL date falls within the current cache window are proactively excluded at cache refresh time, so they are never served to clients even if AWS has not yet removed them from the available models list.

Strict mode

Set AWS_BEDROCK_DEPRECATED_MODEL_FALLBACK=false to disable the fallback. Requests using a deprecated model ID will fail with a 404 that includes the recommended replacement, forcing clients to update their code explicitly:

Model 'amazon.titan-text-lite-v1' is deprecated or pending deprecation, please use 'amazon.nova-lite-v1:0' instead.

Extending the registry

The built-in deprecation registry covers all models listed in the AWS Bedrock model lifecycle. Use AWS_BEDROCK_DEPRECATED_MODELS to add custom mappings or override existing ones.


Infrastructure Resilience

The Terraform module deploys stdapi.ai following AWS best practices for high availability and fault tolerance. Every component is designed to handle failures transparently — no additional configuration required.

  • Multi-AZ Fargate Tasks
    ECS tasks spread across all Availability Zones; a single AZ failure does not interrupt service

  • Stateless Service Design
    stdapi.ai holds no local state; failed tasks are replaced automatically with zero data loss

  • ALB Health Checks
    Unhealthy tasks are taken out of rotation after two failed health checks and replaced automatically; traffic is rerouted to healthy AZs

  • Bedrock Cross-Region Inference
    Bedrock-native routing across AWS regions provides an extra failover layer on top of stdapi.ai's own region routing

  • S3 Eleven-Nines Durability
    99.999999999% object durability; regional buckets co-located with each Bedrock endpoint

  • Fast Task Startup
    New tasks become healthy in under 30 seconds, minimising the recovery window after any failure

  • Zero-Downtime Updates
    Rolling deployments and ALB connection draining ensure in-flight requests always complete cleanly

%%{init: {'flowchart': {'htmlLabels': true}} }%%
flowchart TB
    client["Your App"]

    subgraph deployment["AWS Region (ECS deployment)"]
        alb["<img src='../styles/logo_amazon_load_balancing.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>ALB + WAF"]
        b_local["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>AWS Bedrock"]
        s3r["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>S3"]

        subgraph az_a["Availability Zone A"]
            ecs_a["<img src='../styles/logo.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>stdapi.ai<br/>ECS Fargate"]
        end
        subgraph az_b["Availability Zone B"]
            ecs_b["<img src='../styles/logo.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>stdapi.ai<br/>ECS Fargate"]
        end
    end

    subgraph br2["Bedrock Region 2"]
        b2["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>AWS Bedrock"]
        bs2["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>Regional S3"]
    end
    subgraph brn["Bedrock Region N"]
        bn["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>AWS Bedrock"]
        bsn["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>Regional S3"]
    end

    client -->|"HTTPS"| alb
    alb --> ecs_a & ecs_b
    ecs_a & ecs_b --> s3r
    ecs_a & ecs_b --> b_local
    ecs_a & ecs_b -.->|"region routing"| b2
    ecs_a & ecs_b -.->|"region routing"| bn
    b_local -.-|"cross-region inference"| b2
    b2 -.-|"cross-region inference"| bn

Multi-AZ & ECS Service Resilience

Stateless by design. stdapi.ai stores no local state; all persistent data lives in S3. Each ECS Fargate task is fully replaceable: ECS can terminate and relaunch a failed task without losing data, and any request interrupted by the failure can simply be retried by the client.

Multi-AZ spread. The Terraform module places ECS tasks across all available Availability Zones in the region. If an AZ experiences a partial or full failure, tasks in the remaining AZs continue to process requests without interruption. The default configuration maintains at least one task per Availability Zone, guaranteeing availability even during a task replacement event.

Auto-scaling. Task count scales automatically based on CPU utilisation, memory utilisation, and ALB request count — whichever metric signals pressure first. Fargate Spot is optionally available for cost-sensitive deployments — see Cost-Optimized Deployment for the trade-offs.

Terraform Module

Minimum capacity defaults to the number of deployed Availability Zones (one task per AZ). Maximum capacity is configurable (ecs_max_capacity, default: 10). Auto-scaling targets CPU and memory utilisation as well as ALB request count per target, so the service scales out under any of these pressure signals.

Fast startup. The stdapi.ai container image is optimised for minimal startup time — a new task typically becomes healthy in under 30 seconds. Fast startup is critical for recovery: when ECS detects a failed task it launches a replacement immediately, keeping the degraded window short and ensuring the service restores full capacity without manual intervention.

Zero-downtime updates. ECS rolling deployments start the new container version and wait for it to pass health checks before draining the old task. The ALB connection draining period lets in-flight requests complete on the outgoing task before it is deregistered. Application updates never interrupt ongoing API calls.

ALB Resilience

The Application Load Balancer is a fully managed, natively multi-AZ AWS service:

  • Cross-AZ load balancing — traffic is distributed evenly across tasks in all healthy AZs.
  • Health check integration — the ALB polls /health on each task; a task is removed from rotation after two consecutive failed checks (~60 s) and re-added once it passes health checks again.
  • WAF protection — when enabled, WAF sits in front of the ALB and mitigates DDoS and rate-limit abuse before requests reach the application.

Terraform Module

  • Load balancing algorithm — uses weighted_random with anomaly mitigation enabled, automatically reducing traffic sent to tasks exhibiting elevated error rates before they are fully drained.
  • Idle timeout — set to 3600 s (1 hour) (alb_idle_timeout) to accommodate long-running streaming LLM responses. Without a sufficiently large timeout, the ALB may terminate connections mid-stream for slow or large generations.

ALB is not a single point of failure

AWS manages ALB node redundancy across AZs automatically. An AZ failure reduces capacity but does not take the load balancer offline.

Bedrock Cross-Region Inference

AWS Bedrock supports cross-region inference profiles, which allow Bedrock to automatically route a model invocation to another AWS region when the primary region is throttled or temporarily unavailable. stdapi.ai enables cross-region inference by default (AWS_BEDROCK_CROSS_REGION_INFERENCE_GLOBAL=true).

This creates two complementary failover layers:

| Layer | Where it operates | When it triggers |
|---|---|---|
| stdapi.ai region routing | Application level, across your configured regions | Quota exceeded, throttling, regional unavailability |
| Bedrock cross-region inference | Bedrock service level, transparent within AWS | Bedrock-internal capacity events |

Together, they maximize model availability without any client-side changes.

Compliance-aware cross-region inference

Set AWS_BEDROCK_CROSS_REGION_INFERENCE_GLOBAL=false to restrict Bedrock to region-local inference, ensuring data stays within a specific geography (e.g. EU-only for GDPR compliance). See Data Sovereignty & Compliance and the GDPR deployment example.

S3 Resilience

S3 stores multimodal inputs and outputs (images, PDFs, audio) used by Bedrock operations:

  • 99.999999999% (11 nines) object durability — data is stored redundantly across multiple devices and AZs within a region.
  • 99.99% availability SLA — designed for continuous availability with no planned downtime.
  • Regional buckets — for multi-region deployments, each Bedrock region has a dedicated S3 bucket co-located in the same region. This eliminates cross-region data transfer for async and multimodal operations and satisfies data residency requirements.

Ultimate Multi-Region Deployment

For the highest possible resilience, deploy two independent stdapi.ai stacks in separate AWS regions and connect them with AWS Global Accelerator. Global Accelerator routes each client to the nearest healthy region using geographic proximity — both regions are active simultaneously. If one region's ALB fails health checks, GA automatically reroutes its traffic to the other region within seconds.

Additional Bedrock regions (without ECS) can be added to AWS_BEDROCK_REGIONS in each stack to expand model availability and quota without deploying more ECS infrastructure.

What this adds on top of a single-region deployment:

| Component | Single region | Multi-region + GA |
|---|---|---|
| ECS Fargate | Multi-AZ in one region | Multi-AZ in two regions |
| ALB | One ALB | One ALB per region |
| Entry point | ALB DNS name | Two static Anycast IPs via Global Accelerator |
| Traffic routing | N/A | Geographic proximity (nearest region wins) |
| Regional failover | None | Automatic, within seconds |
| Bedrock quota | One region's quota | Multiple independent quotas |
%%{init: {'flowchart': {'htmlLabels': true}} }%%
flowchart LR
    client["Your App"]
    ga["<img src='../styles/logo_amazon_global_accelerator.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>Global Accelerator"]

    subgraph region_a["AWS Region A"]
        direction TB
        alb_a["<img src='../styles/logo_amazon_load_balancing.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>ALB + WAF"]
        ecs_a["<img src='../styles/logo.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>stdapi.ai<br/>ECS Fargate"]
        b_a["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>AWS Bedrock"]
        s3_a["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>S3"]
    end

    subgraph region_b["AWS Region B"]
        direction TB
        alb_b["<img src='../styles/logo_amazon_load_balancing.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>ALB + WAF"]
        ecs_b["<img src='../styles/logo.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>stdapi.ai<br/>ECS Fargate"]
        b_b["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>AWS Bedrock"]
        s3_b["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>S3"]
    end

    subgraph region_c["AWS Region C (Bedrock only)"]
        direction TB
        b_c["<img src='../styles/logo_amazon_bedrock.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>AWS Bedrock"]
        s3_c["<img src='../styles/logo_amazon_s3.svg' style='height:48px;width:auto;vertical-align:middle;' /><br/>Regional S3"]
    end

    client -->|"HTTPS"| ga
    ga -->|"geo-routing"| alb_a
    ga -->|"geo-routing"| alb_b
    alb_a --> ecs_a
    alb_b --> ecs_b
    ecs_a --> b_a & s3_a
    ecs_b --> b_b & s3_b
    ecs_a & ecs_b -.->|"region routing"| b_c

How Global Accelerator integrates:

  • Geographic proximity routing — client traffic enters the AWS global network at the edge location nearest the client, then travels over the AWS backbone to the ALB in the closest healthy region. Both regions serve live traffic simultaneously.
  • Health-based failover — GA continuously health-checks each ALB endpoint. If a region's ALB stops responding, GA automatically reroutes its traffic to the other region within seconds — with no DNS TTL delay.
  • Single Anycast entry point — clients always connect to the same two static IPs regardless of which region handles the request. No client reconfiguration is needed during a regional failure.

API key synchronisation

Both ECS stacks must share the same API key so clients can reach either region transparently. Use api_key_secretsmanager_secret pointing to a cross-region replicated Secrets Manager secret, or set the same key via api_key_value in both modules.


Best Practices

Infrastructure:

  • Use the Terraform module — The stdapi-ai Terraform module provisions all resilience features out of the box: multi-AZ ECS, ALB health checks, auto-scaling, WAF, and CloudWatch alarms. Deploying manually risks missing critical settings.
  • Run at least two Bedrock regions — Configure aws_bedrock_regions with two or more regions to unlock quota multiplication and automatic failover. A single region is a single point of failure for quota limits.

Region routing:

  • Start with ordered — It provides failover without sacrificing prompt caching.
  • Use lowest_latency only if your server's network position varies or you want the fastest region chosen automatically.
  • Use round_robin for high-throughput batch workloads where prompt caching is not needed.
  • Keep backoff values moderate — The defaults (60 s for quota, 30 s for unavailability) work well for most workloads. Very short backoffs may cause premature retries against a region that is still overloaded.
  • Tune AWS_BEDROCK_MAX_RETRIES — The default of 9 provides strong resilience across multiple regions. Lower it (e.g. 3) to fail faster; raise it for workloads that can tolerate longer retry windows during sustained outages.
  • Consider AWS_ADAPTIVE_RETRY — Enable this when many concurrent clients share the same endpoint and sustained congestion is likely. It paces retries based on real-time error signals, reducing the risk of retry storms — at the cost of potentially higher per-request latency under load. Avoid it for latency-sensitive or low-traffic workloads.
  • Monitor model_regions in logs — If one region consistently appears in error logs, consider adjusting its quota or removing it from the region list.
  • Declare accepted buckets — If your users provide S3 URLs from buckets outside the application's own buckets, add them to AWS_S3_ACCEPTED_BUCKETS so the router can resolve their region and convert HTTP URLs to S3 URIs.
  • Pin models when needed — Use AWS_BEDROCK_MODEL_REGION_RESTRICT for models that have region-specific features (e.g. grounding) to guarantee those features are always available. The model will be restricted exclusively to the listed regions.
  • Plan for model deprecations — Keep AWS_BEDROCK_DEPRECATED_MODEL_FALLBACK=true (the default) so clients survive AWS model retirements without downtime. Switch to false in environments where you want to enforce explicit client migrations.

Next Steps