Scaling eSIM Provisioning: API Architecture Behind 150+ Countries
The Problem: One API, 150+ Countries, Millions of Profiles
When we started building Rivio’s backend, the core challenge was straightforward to state but difficult to solve: expose a single API that lets a Flutter mobile app provision, manage, and meter eSIM profiles across more than 150 countries, each served by different mobile carriers with different technical integrations, different provisioning protocols, and different billing models.
This post walks through the architecture we built to solve that problem. It covers the GSMA Remote SIM Provisioning standard we build on, how we integrate with SM-DP+ servers, how profile lifecycle management works at scale, and the API design patterns that make pay-as-you-go eSIM viable as a product.
GSMA Remote SIM Provisioning (RSP) Architecture
The entire eSIM ecosystem rests on a standard called Remote SIM Provisioning (RSP), defined in the GSMA’s SGP.22 specification. RSP defines the actors, protocols, and security model that allow an eSIM profile to be prepared on a server, downloaded over the internet, and installed on a device’s eUICC chip — all without physical access.
The key actors in RSP for consumer devices:
- SM-DP+ (Subscription Manager — Data Preparation): The server that prepares, stores, and delivers eSIM profiles. Each carrier operates or contracts an SM-DP+. This is the primary system our backend talks to.
- SM-DS (Subscription Manager — Discovery Server): An optional registry that helps devices discover which SM-DP+ holds a profile for them. When we provision a profile, we can register it with an SM-DS so the user’s device can find and download it without scanning a QR code.
- LPA (Local Profile Assistant): Software on the end-user device (built into iOS and Android) that handles the actual download and installation of profiles onto the eUICC.
- eUICC: The tamper-resistant secure element chip on the device that stores and executes eSIM profiles.
In Rivio’s architecture, our backend acts as an intermediary between the user’s mobile app and multiple SM-DP+ servers. The app never communicates directly with an SM-DP+ — all provisioning flows through our API, which handles carrier selection, profile preparation, activation code generation, and lifecycle tracking.
SM-DP+ Integration: The Core of Provisioning
Integrating with an SM-DP+ means implementing the ES2+ interface, the server-to-server protocol defined in SGP.22. This is not a simple REST API. ES2+ uses HTTPS with mutual TLS (mTLS), and the payload format follows a specific JSON structure with mandatory fields for profile ordering, confirmation, and release.
The typical provisioning flow from our backend’s perspective:
- Order Profile: We send a DownloadOrder request to the SM-DP+ specifying the ICCID (or requesting auto-assignment), the target EID if known, and the profile type. The SM-DP+ acknowledges and begins profile preparation.
- Confirm Order: Once the SM-DP+ has the profile ready, we confirm the order. At this point, the SM-DP+ generates an activation code (a combination of SM-DP+ address and matching ID).
- Release Profile: We call ReleaseProfile to make the profile available for download. The SM-DP+ now accepts connections from the target device’s LPA.
- Notify SM-DS: Optionally, we register the profile with an SM-DS so the device can discover it via a push notification rather than requiring a QR code scan.
Each of these steps can fail independently. The SM-DP+ might reject the order if profile inventory is exhausted. Confirmation might time out under load. Release might fail if there is a mismatch between the EID we provided and the one the device presents during download. We handle each failure mode explicitly, with different retry and escalation strategies per step.
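As a rough illustration of handling each step's failure separately, the flow can be sketched as three calls that each tag their own errors. This is a minimal sketch under stated assumptions, not our production code: the `client` wrapper, its method signatures, `StepFailed`, and the `FakeSmdp` test double are all hypothetical, and the ICCID and activation code values are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


class StepFailed(Exception):
    """Carries the name of the failed step so the caller can choose a strategy."""

    def __init__(self, step: str, cause: Exception):
        super().__init__(f"{step} failed: {cause}")
        self.step = step
        self.cause = cause


@dataclass
class ProvisionResult:
    iccid: str
    activation_code: str


def _run(step: str, call):
    """Run one step and tag any failure with the step name."""
    try:
        return call()
    except Exception as exc:
        raise StepFailed(step, exc) from exc


def provision(client, eid: Optional[str], profile_type: str) -> ProvisionResult:
    # `client` is a hypothetical wrapper around one carrier's ES2+ endpoint.
    iccid = _run("DownloadOrder", lambda: client.download_order(eid, profile_type))
    code = _run("ConfirmOrder", lambda: client.confirm_order(iccid, eid))
    _run("ReleaseProfile", lambda: client.release_profile(iccid))
    return ProvisionResult(iccid=iccid, activation_code=code)


class FakeSmdp:
    """In-memory stand-in for an SM-DP+, used only to exercise the sketch."""

    def __init__(self, fail_on: Optional[str] = None):
        self.fail_on = fail_on
        self.calls = []

    def _maybe_fail(self, step: str):
        self.calls.append(step)
        if step == self.fail_on:
            raise TimeoutError(f"{step} timed out")

    def download_order(self, eid, profile_type):
        self._maybe_fail("DownloadOrder")
        return "8900000000000000001"  # placeholder ICCID

    def confirm_order(self, iccid, eid):
        self._maybe_fail("ConfirmOrder")
        return "1$smdp.example.com$MATCHING-ID"  # placeholder activation code

    def release_profile(self, iccid):
        self._maybe_fail("ReleaseProfile")
```

Calling `provision(FakeSmdp(fail_on="ConfirmOrder"), None, "data")` raises a `StepFailed` whose `step` is `"ConfirmOrder"`, which is the information a caller needs to pick a retry or escalation path for that step alone.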
Profile Lifecycle Management
An eSIM profile is not a static artifact. It has a lifecycle that our system must track:
Preparation: The carrier’s SM-DP+ creates the profile. We track this via the ICCID returned in the order response.
Download: The user’s device contacts the SM-DP+ and downloads the profile. We receive a ProfileDownloaded callback from the SM-DP+.
Installation: The LPA installs the profile on the eUICC. We receive a ProfileInstalled notification.
Enablement: The profile is activated and the device registers on the carrier’s network. We detect this via RADIUS Access-Accept messages from the carrier.
Disablement: The user (or our system) disables the profile. The device disconnects from the carrier network.
Deletion: The profile is removed from the eUICC, freeing a slot. We trigger a DeleteProfile request via ES2+ when the user explicitly removes a country from their account.
We maintain a state machine for each profile in PostgreSQL, with every transition logged to an append-only audit table. This gives us full traceability — we can answer questions like “why did this user’s Japan profile fail to activate on April 3rd at 14:22 UTC?” by replaying the state transitions and the SM-DP+ callback payloads.
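To make the idea of a per-profile state machine with an append-only log concrete, here is a small in-memory sketch. The state names follow the lifecycle stages above, but the exact set of allowed transitions is an assumption made for illustration, and the `audit` list merely stands in for the PostgreSQL audit table.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class ProfileState(Enum):
    PREPARED = "prepared"
    DOWNLOADED = "downloaded"
    INSTALLED = "installed"
    ENABLED = "enabled"
    DISABLED = "disabled"
    DELETED = "deleted"


# Assumed transition table; the real rules may differ.
ALLOWED = {
    ProfileState.PREPARED: {ProfileState.DOWNLOADED},
    ProfileState.DOWNLOADED: {ProfileState.INSTALLED},
    ProfileState.INSTALLED: {ProfileState.ENABLED, ProfileState.DELETED},
    ProfileState.ENABLED: {ProfileState.DISABLED},
    ProfileState.DISABLED: {ProfileState.ENABLED, ProfileState.DELETED},
    ProfileState.DELETED: set(),
}


@dataclass
class Profile:
    iccid: str
    state: ProfileState = ProfileState.PREPARED
    audit: list = field(default_factory=list)  # only ever appended to

    def transition(self, new_state: ProfileState, payload=None) -> None:
        """Move to new_state if allowed, recording the change and its payload."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} is not allowed")
        self.audit.append(
            {
                "at": datetime.now(timezone.utc),
                "from": self.state,
                "to": new_state,
                "payload": payload,  # e.g. the raw SM-DP+ callback body
            }
        )
        self.state = new_state
```

Replaying `profile.audit` in order reconstructs how a profile reached its current state, and a rejected transition leaves both the state and the log untouched.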
Multi-Carrier Routing
Rivio covers 150+ countries, but no single carrier covers all of them. Our carrier matrix maps each country to one or more carriers, ranked by preference based on network quality, cost, and historical reliability.
When the Flutter app requests a profile for a given country, our routing layer:
- Looks up the country in the carrier matrix
- Selects the highest-ranked available carrier
- Checks that carrier’s SM-DP+ profile inventory (we cache inventory counts with a 5-minute TTL)
- Falls back to the next carrier if inventory is exhausted or if the primary carrier’s SM-DP+ is degraded
This routing is invisible to the user. They request “data in Japan” and get a working profile. The carrier behind it might be different from the one they got last month if we renegotiated terms or if a new carrier joined our network with better coverage.
We store routing decisions with every provisioning event, which feeds back into our carrier scoring algorithm. If a carrier’s profiles consistently fail activation in a specific country, the algorithm gradually de-prioritizes that carrier for that country.
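The lookup-and-fallback logic can be sketched in a few lines. This is an illustrative sketch only: `Carrier`, `InventoryCache`, `select_carrier`, and the shape of the carrier matrix (a dict from country code to a preference-ordered list) are assumptions, and the scoring feedback loop is left out.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Carrier:
    name: str
    degraded: bool = False  # set by health checks in a real system


class InventoryCache:
    """Caches per-carrier profile counts for a fixed TTL (5 minutes by default)."""

    def __init__(self, fetch: Callable[[str], int], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, int]] = {}

    def count(self, carrier_name: str) -> int:
        now = self._clock()
        entry = self._entries.get(carrier_name)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh
        value = self._fetch(carrier_name)
        self._entries[carrier_name] = (now, value)
        return value


def select_carrier(country: str, matrix: Dict[str, List[Carrier]],
                   inventory: InventoryCache) -> Optional[Carrier]:
    """Return the highest-ranked carrier that is healthy and has stock."""
    for carrier in matrix.get(country, []):
        if carrier.degraded:
            continue
        if inventory.count(carrier.name) <= 0:
            continue
        return carrier
    return None  # no carrier can serve this country right now
```

Injecting the clock keeps the TTL behaviour testable without sleeping; a `None` result is the signal to surface a "temporarily unavailable" error rather than attempt an order.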
Pay-As-You-Go: Balance and Usage Metering
The hardest backend problem at Rivio is not provisioning — it is metering. Pay-as-you-go means we charge users per megabyte of actual data consumed, not for a fixed data package. This requires near-real-time usage tracking across dozens of carriers with different reporting capabilities.
CDR ingestion: Carriers send us Call Detail Records (CDRs) via SFTP drops, real-time API feeds, or RADIUS accounting messages. Each format is different. We normalize everything into a canonical CDR schema in our ingestion pipeline before it hits the rating engine.
Rating engine: Takes a normalized CDR, looks up the applicable tariff (which varies by country, carrier, and sometimes time of day), and computes the cost in the user’s wallet currency. The rated event is written to the ledger.
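A stripped-down version of the rating step might look like the following. It is a sketch under several assumptions: the `Cdr` fields, the tariff table keyed by (country, carrier), the decimal-megabyte unit, the four-decimal rounding, and the single `fx_rate` multiplier are all illustrative, and time-of-day tariffs are omitted.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP
from typing import Dict, Tuple


@dataclass(frozen=True)
class Cdr:
    """Canonical usage record after normalization (fields are illustrative)."""
    user_id: str
    country: str
    carrier: str
    bytes_used: int


BYTES_PER_MB = Decimal(1_000_000)  # assumption: decimal megabytes


def rate(cdr: Cdr, tariffs: Dict[Tuple[str, str], Decimal],
         fx_rate: Decimal = Decimal("1")) -> Decimal:
    """Return the cost of one CDR in the user's wallet currency."""
    price_per_mb = tariffs[(cdr.country, cdr.carrier)]
    megabytes = Decimal(cdr.bytes_used) / BYTES_PER_MB
    cost = megabytes * price_per_mb * fx_rate
    # Decimal avoids binary floating-point drift in money calculations.
    return cost.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)
```

For example, 250,000,000 bytes at a tariff of 0.004 per MB rates to 1.0000 in the tariff currency.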
Balance enforcement: If real-time RADIUS updates indicate a user’s balance is approaching zero, we send a Disconnect-Request to the carrier’s AAA server. The session terminates and the user is prompted to top up.
Reconciliation: Weekly, we reconcile our internal CDR-derived totals against the carrier’s invoice. Discrepancies happen — timezone mismatches, duplicate CDRs, late-arriving records. Our reconciliation pipeline flags anything above 0.5% variance for manual review.
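The 0.5% check itself is simple arithmetic. The sketch below assumes variance is measured relative to the carrier's invoiced total, which is one reasonable reading of the rule above; the function names are hypothetical.

```python
from decimal import Decimal

VARIANCE_THRESHOLD = Decimal("0.005")  # 0.5%


def variance(internal_total: Decimal, invoiced_total: Decimal) -> Decimal:
    """Relative difference between our CDR-derived total and the invoice."""
    if invoiced_total == 0:
        # No invoice amount to compare against: any internal usage is a mismatch.
        return Decimal(0) if internal_total == 0 else Decimal("Infinity")
    return abs(internal_total - invoiced_total) / invoiced_total


def needs_manual_review(internal_total: Decimal, invoiced_total: Decimal) -> bool:
    return variance(internal_total, invoiced_total) > VARIANCE_THRESHOLD
```

With an internal total of 1000 and an invoice of 1004, the variance is about 0.4% and passes; against an invoice of 1010 it is about 1% and gets flagged.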
API Design Patterns
The API that serves the Flutter app follows a few deliberate design choices:
Idempotency keys on all write operations. Every POST request (provision profile, top up balance, change profile state) requires an Idempotency-Key header. We store the key and response for 24 hours. If the client retries (e.g., after a network timeout), we return the stored response instead of executing the operation again. This prevents double-provisioning — a scenario that would cost us real money.
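The replay behaviour can be illustrated with a small in-memory store. This is a sketch, not our implementation: `IdempotencyStore` is a hypothetical name, a real deployment would use a shared datastore with an atomic insert so that two concurrent requests with the same key cannot both execute, and expired entries are not purged here.

```python
import time
from typing import Any, Callable, Dict, Tuple


class IdempotencyStore:
    """Remembers the response for each key for a fixed TTL (24 hours by default)."""

    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._responses: Dict[str, Tuple[float, Any]] = {}

    def run(self, key: str, operation: Callable[[], Any]) -> Any:
        """Execute operation once per key; replay the stored response on retries."""
        now = self._clock()
        entry = self._responses.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # retry within the TTL: do not execute again
        response = operation()
        self._responses[key] = (now, response)
        return response
```

A handler would call `store.run(request_key, lambda: provision_profile(...))`, where `request_key` comes from the Idempotency-Key header and `provision_profile` stands for whatever the write operation is.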
Push-driven status updates. The app does not poll for provisioning status. When a profile transitions state (downloaded, installed, enabled), our backend pushes an event via Firebase Cloud Messaging to the app, which then fetches the updated profile object. This keeps the app responsive without hammering our API.
Rate limiting per operation class. Provisioning endpoints are rate-limited more aggressively (5 requests per minute per user) than read endpoints (60 rpm). Provisioning involves downstream SM-DP+ calls that are expensive and slow; we protect both our system and our carrier partners from burst traffic.
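One way to express per-class limits is a sliding-window counter keyed by user and operation class. The limits below come from the numbers above; the sliding-window algorithm, the class names, and the in-memory storage are assumptions chosen for illustration.

```python
import time
from collections import defaultdict, deque
from typing import Callable

LIMITS = {"provisioning": 5, "read": 60}  # requests per window, per user
WINDOW_SECONDS = 60.0


class RateLimiter:
    """Sliding-window limiter keyed by (user, operation class)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._hits = defaultdict(deque)

    def allow(self, user_id: str, op_class: str) -> bool:
        now = self._clock()
        hits = self._hits[(user_id, op_class)]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= WINDOW_SECONDS:
            hits.popleft()
        if len(hits) >= LIMITS[op_class]:
            return False
        hits.append(now)
        return True
```

A rejected call would typically map to an HTTP 429 response; rejected calls are not recorded, so they do not extend the window.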
Versioned API with mandatory minimum. The app sends an X-API-Version header. We support the current and previous major version simultaneously. If the app sends a version below the minimum supported, the API returns a 426 Upgrade Required with a link to the app store. This prevents old clients from hitting deprecated provisioning flows.
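The version gate amounts to a small check at the edge of the API. In this sketch the header format ("MAJOR" or "MAJOR.MINOR"), the minimum version value, the response body fields, the placeholder URL, and the choice of 400 for a missing or malformed header are all assumptions.

```python
from typing import Optional, Tuple

MIN_SUPPORTED_MAJOR = 3  # hypothetical value
APP_STORE_URL = "https://example.com/app"  # placeholder link


def check_api_version(header_value: Optional[str]) -> Tuple[int, dict]:
    """Return (status, body) for the X-API-Version header of a request."""
    try:
        major = int((header_value or "").split(".")[0])
    except ValueError:
        # Missing or malformed header.
        return 400, {"error": "invalid_api_version"}
    if major < MIN_SUPPORTED_MAJOR:
        return 426, {"error": "upgrade_required", "upgrade_url": APP_STORE_URL}
    return 200, {}
```

Only a non-200 result short-circuits the request; otherwise the handler proceeds as normal.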
Error Handling and Retry Strategy
SM-DP+ servers are not always fast or reliable. Our retry strategy is tiered:
- Transient errors (5xx, timeout): Exponential backoff, 2s initial, 5min cap, 8 max attempts. Jittered to prevent thundering herd across concurrent requests.
- Client errors (4xx): No retry. Log the error, map it to a user-facing message, and return immediately. Common causes: EID mismatch (user scanned QR on wrong device), profile not found (ICCID expired).
- Partial success: If DownloadOrder succeeds but ConfirmOrder fails, we do not retry the order; we retry the confirmation only. Each lifecycle step has its own retry context.
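The transient-error tier can be sketched as a delay generator plus a small wrapper. The parameters match the numbers above (2s initial, 5min cap, 8 attempts); the use of "full jitter" (a uniformly random delay up to the exponential bound) and the function names are assumptions, since other jitter schemes would fit the description equally well.

```python
import random
import time
from typing import Callable, Iterable, Iterator, Optional


def backoff_delays(initial: float = 2.0, cap: float = 300.0, max_attempts: int = 8,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield the wait before each retry: max_attempts - 1 jittered delays."""
    for n in range(max_attempts - 1):
        bound = min(cap, initial * 2 ** n)
        yield rng() * bound  # full jitter: uniform in [0, bound)


def call_with_retry(operation: Callable, is_transient: Callable[[Exception], bool],
                    sleep: Callable[[float], None] = time.sleep,
                    delays: Optional[Iterable[float]] = None):
    """Retry operation on transient errors until the delay schedule runs out."""
    schedule = iter(delays if delays is not None else backoff_delays())
    while True:
        try:
            return operation()
        except Exception as exc:
            delay = next(schedule, None)
            if not is_transient(exc) or delay is None:
                raise  # non-retryable error, or attempts exhausted
            sleep(delay)
```

Wrapping only the failed step, for example `call_with_retry(lambda: client.confirm_order(iccid, eid), is_timeout)` with a hypothetical `client` and `is_timeout` predicate, is what keeps a confirmation failure from re-running the order.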
All failures surface in our Grafana dashboards: provisioning success rate by carrier, by country, by hour. If a carrier’s success rate drops below 95%, an alert fires.
Security Model
mTLS everywhere. All SM-DP+ communication uses mutual TLS with carrier-issued client certificates. Certificates are stored in HashiCorp Vault, injected into provisioning pods at startup, and rotated 30 days before expiry.
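The rotation rule reduces to a date comparison against the certificate's expiry. A minimal sketch, assuming the expiry timestamp has already been read from the certificate; `needs_rotation` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

ROTATION_LEAD = timedelta(days=30)


def needs_rotation(not_after: datetime, now: datetime) -> bool:
    """True once we are within 30 days of the certificate's expiry."""
    return now >= not_after - ROTATION_LEAD


# Example with a placeholder expiry date.
expiry = datetime(2025, 6, 30, tzinfo=timezone.utc)
due = needs_rotation(expiry, datetime(2025, 6, 1, tzinfo=timezone.utc))
```

Both timestamps should be timezone-aware (or both naive) so the comparison is well defined.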
Profile encryption. eSIM profiles contain sensitive credentials (Ki, OPc values). These are encrypted end-to-end from the SM-DP+ to the eUICC. Our backend never sees the decrypted profile content — we handle metadata (ICCID, EID, state) only.
Audit logging. Every provisioning action, every balance change, and every carrier API call is logged with a correlation ID, user ID, timestamp, and response code. Logs are immutable (append-only to a write-once object store) and retained for 7 years per telecom regulatory requirements.
Monitoring and Observability
We track three primary SLIs for the provisioning system:
- Provisioning success rate: Target 99.2%. Measured as the percentage of ProvisionProfile API calls that result in an enabled profile within 5 minutes. Currently averaging 99.4%.
- Time to activation: Target p95 under 90 seconds. Measured from API call to first RADIUS Access-Accept. Current p95: 47 seconds.
- Balance accuracy: Target real-time estimates within 3% of finalized CDR totals. Current average delta: 1.1%.
Each SLI feeds into an error budget. When the budget is exhausted (e.g., provisioning success rate drops below 99.2% for a rolling 7-day window), we freeze feature deployments and focus on reliability.
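For the provisioning success rate, the error budget arithmetic can be sketched as follows. The function names are hypothetical, the counts are assumed to cover the rolling 7-day window, and exact fractions are used so that a rate of exactly 99.2% does not trip the freeze through floating-point rounding.

```python
from fractions import Fraction

SUCCESS_TARGET = Fraction("0.992")  # 99.2% provisioning success


def error_budget_remaining(total: int, failed: int,
                           target: Fraction = SUCCESS_TARGET) -> Fraction:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    if total == 0:
        return Fraction(1)  # no traffic, nothing spent
    allowed_failures = total * (1 - target)
    return 1 - Fraction(failed) / allowed_failures


def deployments_frozen(total: int, failed: int) -> bool:
    # Freeze once the success rate has dropped below the target for the window.
    return error_budget_remaining(total, failed) < 0
```

With 100,000 provisioning calls in the window, the budget is 800 failures: 400 failures leaves half the budget, and the 801st failure triggers the freeze.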
Distributed tracing via OpenTelemetry ties together the full request path: Flutter app call, API gateway, provisioning service, SM-DP+ call, carrier callback, and balance update. A single trace ID follows the request across all systems, making incident investigation a matter of searching for one string rather than correlating timestamps across log files.
Lessons from Production
After running this system in production across 150+ countries, a few lessons stand out:
Carrier APIs are the bottleneck, not your code. We spent months optimizing our internal latency before realizing that 80% of our p99 provisioning time was waiting on SM-DP+ responses. The real optimization was improving timeout handling and fallback routing.
Idempotency is not optional in telecom. A single duplicate provisioning request creates an orphaned profile that costs money and confuses the user. Idempotency keys paid for themselves in the first week.
CDR reconciliation never fully automates. We got it to 98% automated, but the remaining 2% — timezone edge cases, carrier-specific CDR quirks, profiles that activate in one country but roam to another within the same CDR period — always requires human review. Budget engineering time for it.
Building a provisioning backend for a global pay-as-you-go eSIM product is essentially building a mini telecom billing system on top of a multi-vendor profile management layer. The standards (RSP, SGP.22, ES2+) give you a foundation, but the real engineering is in the routing, metering, and error handling that makes it all invisible to the user.