How does Rivio handle SM-DP+ server outages without losing provisioning requests?

Every provisioning request that enters our system is persisted to a durable queue before we attempt the SM-DP+ call. If the downstream SM-DP+ returns a 5xx error or times out (we enforce a 12-second ceiling), the request enters an exponential backoff retry loop with jitter: an initial delay of 2 seconds, capped at 5 minutes, with a maximum of 8 attempts. Each attempt is logged with a correlation ID that traces back to the original user action in the Flutter app. If all retries are exhausted, the request moves to a dead-letter queue and our on-call team is paged via PagerDuty. The user receives a push notification explaining the delay and offering a manual retry button. In practice, transient SM-DP+ failures resolve within the first two retries about 94% of the time. For extended outages affecting a single carrier, our routing layer automatically shifts new requests to an alternate carrier covering the same country, provided one exists in our carrier matrix.
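
As a rough illustration of the retry timing, here is a minimal sketch using the figures from the answer above. The doubling factor and the "full jitter" style (a uniform pick between zero and the ceiling) are assumptions, and the function names are hypothetical.

```python
import random
from typing import Optional

INITIAL_DELAY_S = 2.0   # initial delay, from the text
MAX_DELAY_S = 300.0     # 5-minute cap, from the text
MAX_ATTEMPTS = 8        # from the text


def backoff_ceiling(retry: int, factor: float = 2.0) -> float:
    """Upper bound on the wait before retry number `retry` (1-based)."""
    if retry < 1:
        raise ValueError("retry must be >= 1")
    return min(MAX_DELAY_S, INITIAL_DELAY_S * factor ** (retry - 1))


def backoff_delay(retry: int, rng: Optional[random.Random] = None) -> float:
    """Full jitter (an assumption): pick uniformly between 0 and the ceiling."""
    rng = rng or random.Random()
    return rng.uniform(0.0, backoff_ceiling(retry))


# 8 attempts leave at most 7 waits between them.
ceilings = [backoff_ceiling(r) for r in range(1, MAX_ATTEMPTS)]
```

Under the doubling assumption the seven waits top out at 128 seconds, so the 5-minute cap would only come into play with a steeper factor or a larger attempt budget.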

What happens to unused balance when a user switches between country profiles?

Balance in Rivio's system is not tied to a specific eSIM profile or carrier — it lives in our internal ledger as a single wallet per user account. When a user travels from Germany to Japan and their device activates a different carrier profile, the balance does not reset or transfer. Instead, the new carrier's CDR (Call Detail Record) feed reports usage back to our metering pipeline, and we debit the same wallet. This is a deliberate architectural decision that makes the pay-as-you-go model work across 150+ countries without requiring users to purchase separate balances per destination. The ledger uses double-entry bookkeeping: every data usage event creates both a debit against the user's wallet and a credit to the carrier settlement account. We reconcile CDR-reported usage against carrier invoices on a weekly cycle, and discrepancies above 0.5% trigger an automated investigation workflow.
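
To make the double-entry idea concrete, here is a small in-memory sketch. The account naming scheme (`wallet:`, `settlement:`, `payments:incoming`) and the class itself are hypothetical, not the production schema.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List


@dataclass(frozen=True)
class Entry:
    account: str
    amount: Decimal  # negative = debit, positive = credit


class Ledger:
    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def _post(self, debit_account: str, credit_account: str, amount: Decimal) -> None:
        if amount <= 0:
            raise ValueError("amount must be positive")
        # Both legs are appended together so the ledger always sums to zero.
        self.entries.append(Entry(debit_account, -amount))
        self.entries.append(Entry(credit_account, amount))

    def top_up(self, user_id: str, amount: Decimal) -> None:
        self._post("payments:incoming", f"wallet:{user_id}", amount)

    def record_usage(self, user_id: str, carrier_id: str, cost: Decimal) -> None:
        # Same wallet regardless of which country's carrier reported the usage.
        self._post(f"wallet:{user_id}", f"settlement:{carrier_id}", cost)

    def balance(self, account: str) -> Decimal:
        return sum((e.amount for e in self.entries if e.account == account), Decimal("0"))
```

Usage in Germany and then Japan would call `record_usage` with different carrier IDs but the same user ID, so the wallet balance simply keeps decreasing.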

How does mTLS work in the SM-DP+ integration and what happens when certificates rotate?

All communication between our provisioning service and SM-DP+ servers uses mutual TLS (mTLS), as mandated by the GSMA SGP.22 specification. We maintain a certificate store where each carrier's SM-DP+ endpoint has a dedicated client certificate and CA chain. During the TLS handshake, both sides present certificates: our client cert proves we are an authorized LPA proxy, and the SM-DP+ server cert proves its identity. Certificates typically have a 12-month validity period. We track expiration dates in a certificate inventory service that begins alerting 60 days before expiry. The rotation process is semi-automated: our platform generates a new CSR, submits it to the carrier's certificate authority, and once the signed certificate is returned, it is deployed to our provisioning nodes via a rolling update — zero downtime. We keep the old certificate active for a 48-hour overlap window to handle in-flight requests. A failed TLS handshake triggers an immediate alert because it means either certificate expiry or a configuration mismatch, both of which halt provisioning for that carrier entirely.
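
The two time windows mentioned above (60-day alerting and the 48-hour overlap) reduce to simple date arithmetic. The sketch below assumes the inventory is a mapping from carrier name to certificate expiry time; both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

ALERT_WINDOW = timedelta(days=60)     # alert lead time, from the text
OVERLAP_WINDOW = timedelta(hours=48)  # old cert stays active this long, from the text


def expiring_certs(inventory: Dict[str, datetime], now: datetime) -> List[str]:
    """Carriers whose client certificate expires within the alert window."""
    return sorted(
        carrier for carrier, not_after in inventory.items()
        if not_after - now <= ALERT_WINDOW
    )


def old_cert_still_active(new_cert_deployed_at: datetime, now: datetime) -> bool:
    """True while the previous certificate is inside the overlap window."""
    return now - new_cert_deployed_at < OVERLAP_WINDOW
```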

How does Rivio meter data usage in real time when CDRs from carriers arrive with delays?

This is one of the harder problems in pay-as-you-go eSIM. Carriers generate CDRs at different cadences — some send near-real-time records every 30 seconds, while others batch them every 15 minutes or even hourly. Our metering pipeline handles this by maintaining two parallel views of usage. The first is an estimated real-time view: when a user's session is active, we receive RADIUS accounting interim-updates from most carriers, which give us byte-count snapshots. We use these to update an approximate balance in Redis, which the Flutter app polls every 60 seconds. The second is the authoritative view: when finalized CDRs arrive, they flow through a Kafka topic into our rating engine, which applies the correct per-country, per-carrier tariff and updates the PostgreSQL ledger. If the finalized CDR shows higher usage than our real-time estimate (typically a delta under 2%), we adjust the balance retroactively. If a user's balance hits zero based on real-time estimates, we send a RADIUS Disconnect-Request to the carrier to terminate the session, preventing negative balance accrual. This dual-view architecture gives users responsive balance updates while maintaining financial accuracy.
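
A simplified, in-memory sketch of the two views follows. It assumes interim updates carry cumulative byte counts for the session and a flat per-megabyte price; the real system keeps the estimate in Redis and the authoritative balance in PostgreSQL, and the class name is hypothetical.

```python
from decimal import Decimal

BYTES_PER_MB = Decimal(1_000_000)  # assumption: decimal megabytes


class SessionMeter:
    """Tracks one active session against one wallet."""

    def __init__(self, ledger_balance: Decimal, price_per_mb: Decimal) -> None:
        self.ledger_balance = ledger_balance     # authoritative view
        self.estimated_balance = ledger_balance  # approximate real-time view
        self.price_per_mb = price_per_mb

    def _cost(self, total_bytes: int) -> Decimal:
        return Decimal(total_bytes) / BYTES_PER_MB * self.price_per_mb

    def on_interim_update(self, session_bytes: int) -> bool:
        """Apply a cumulative byte snapshot; return True if the session should be cut."""
        self.estimated_balance = self.ledger_balance - self._cost(session_bytes)
        return self.estimated_balance <= 0

    def on_final_cdr(self, session_bytes: int) -> Decimal:
        """Settle from the finalized CDR and realign the estimate with the ledger."""
        self.ledger_balance -= self._cost(session_bytes)
        self.estimated_balance = self.ledger_balance
        return self.ledger_balance
```

A `True` return from `on_interim_update` is the point at which the Disconnect-Request described above would be sent.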

Scaling eSIM Provisioning: API Architecture Behind 150+ Countries

The Problem: One API, 150+ Countries, Millions of Profiles

When we started building Rivio’s backend, the core challenge was straightforward to state but difficult to solve: expose a single API that lets a Flutter mobile app provision, manage, and meter eSIM profiles across more than 150 countries, each served by different mobile carriers with different technical integrations, different provisioning protocols, and different billing models.

This post walks through the architecture we built to solve that problem. It covers the GSMA Remote SIM Provisioning standard we build on, how we integrate with SM-DP+ servers, how profile lifecycle management works at scale, and the API design patterns that make pay-as-you-go eSIM viable as a product.

GSMA Remote SIM Provisioning (RSP) Architecture

The entire eSIM ecosystem rests on a standard called Remote SIM Provisioning (RSP), defined in the GSMA’s SGP.22 specification. RSP defines the actors, protocols, and security model that allow an eSIM profile to be prepared on a server, downloaded over the internet, and installed on a device’s eUICC chip — all without physical access.

The key actors in RSP for consumer devices:

SM-DP+ (Subscription Manager — Data Preparation): The server that prepares, stores, and delivers eSIM profiles. Each carrier operates or contracts an SM-DP+. This is the primary system our backend talks to.
SM-DS (Subscription Manager — Discovery Server): An optional registry that helps devices discover which SM-DP+ holds a profile for them. When we provision a profile, we can register it with an SM-DS so the user’s device can find and download it without scanning a QR code.
LPA (Local Profile Assistant): Software on the end-user device (built into iOS and Android) that handles the actual download and installation of profiles onto the eUICC.
eUICC: The tamper-resistant secure element chip on the device that stores and executes eSIM profiles.

In Rivio’s architecture, our backend acts as an intermediary between the user’s mobile app and multiple SM-DP+ servers. The app never communicates directly with an SM-DP+ — all provisioning flows through our API, which handles carrier selection, profile preparation, activation code generation, and lifecycle tracking.

SM-DP+ Integration: The Core of Provisioning

Integrating with an SM-DP+ means implementing the ES2+ interface, the server-to-server protocol defined in SGP.22. This is not a simple REST API. ES2+ uses HTTPS with mutual TLS (mTLS), and the payload format follows a specific JSON structure with mandatory fields for profile ordering, confirmation, and release.

The typical provisioning flow from our backend’s perspective:

Order Profile — We send a DownloadOrder request to the SM-DP+ specifying the ICCID (or requesting auto-assignment), the target EID if known, and the profile type. The SM-DP+ acknowledges and begins profile preparation.
Confirm Order — Once the SM-DP+ has the profile ready, we confirm the order. At this point, the SM-DP+ generates an activation code (a combination of SM-DP+ address and matching ID).
Release Profile — We call ReleaseProfile to make the profile available for download. The SM-DP+ now accepts connections from the target device’s LPA.
Notify SM-DS — Optionally, we register the profile with an SM-DS so the device can discover it via a push notification rather than requiring a QR code scan.

Each of these steps can fail independently. The SM-DP+ might reject the order if profile inventory is exhausted. Confirmation might time out under load. Release might fail if there is a mismatch between the EID we provided and the one the device presents during download. We handle each failure mode explicitly, with different retry and escalation strategies per step.
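
One way to keep the steps independent is to record which ones have completed and resume from the first incomplete one. The sketch below is an assumption about structure, not our production code: the step callables stand in for the real ES2+ calls, and `run_provisioning` and `StepFailed` are hypothetical names.

```python
from typing import Callable, Dict, Set

STEPS = ("download_order", "confirm_order", "release_profile")


class StepFailed(Exception):
    def __init__(self, step: str) -> None:
        super().__init__(f"step failed: {step}")
        self.step = step


def run_provisioning(actions: Dict[str, Callable[[], None]], completed: Set[str]) -> None:
    """Run the steps in order, skipping any already recorded in `completed`."""
    for step in STEPS:
        if step in completed:
            continue
        try:
            actions[step]()
        except Exception as exc:
            # Stop here; earlier steps stay recorded so they are not repeated.
            raise StepFailed(step) from exc
        completed.add(step)
```

Calling `run_provisioning` again with the same `completed` set resumes at the failed step rather than placing a second order.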

Profile Lifecycle Management

An eSIM profile is not a static artifact. It has a lifecycle that our system must track:

Preparation — The carrier’s SM-DP+ creates the profile. We track this via the ICCID returned in the order response.

Download — The user’s device contacts the SM-DP+ and downloads the profile. We receive a ProfileDownloaded callback from the SM-DP+.

Installation — The LPA installs the profile on the eUICC. We receive a ProfileInstalled notification.

Enablement — The profile is activated and the device registers on the carrier’s network. We detect this via RADIUS Access-Accept messages from the carrier.

Disablement — The user (or our system) disables the profile. The device disconnects from the carrier network.

Deletion — The profile is removed from the eUICC, freeing a slot. We trigger a DeleteProfile request via ES2+ when the user explicitly removes a country from their account.

We maintain a state machine for each profile in PostgreSQL, with every transition logged to an append-only audit table. This gives us full traceability — we can answer questions like “why did this user’s Japan profile fail to activate on April 3rd at 14:22 UTC?” by replaying the state transitions and the SM-DP+ callback payloads.
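
A minimal in-memory sketch of such a state machine is shown here. The transition table is an assumption (the lifecycle above names the states but not every permitted edge), and the audit list stands in for the append-only PostgreSQL table.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import List, Tuple


class State(Enum):
    PREPARED = "prepared"
    DOWNLOADED = "downloaded"
    INSTALLED = "installed"
    ENABLED = "enabled"
    DISABLED = "disabled"
    DELETED = "deleted"


# Assumed transition table; adjust to the edges your integration really allows.
ALLOWED = {
    State.PREPARED: {State.DOWNLOADED},
    State.DOWNLOADED: {State.INSTALLED},
    State.INSTALLED: {State.ENABLED, State.DELETED},
    State.ENABLED: {State.DISABLED},
    State.DISABLED: {State.ENABLED, State.DELETED},
    State.DELETED: set(),
}


class Profile:
    def __init__(self, iccid: str) -> None:
        self.iccid = iccid
        self.state = State.PREPARED
        self.audit: List[Tuple[datetime, State, State, str]] = []  # append-only

    def transition(self, new_state: State, reason: str) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {new_state.value}")
        self.audit.append((datetime.now(timezone.utc), self.state, new_state, reason))
        self.state = new_state
```

Replaying `audit` in order reconstructs how a profile reached its current state, which is the property the traceability question above relies on.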

Multi-Carrier Routing

Rivio covers 150+ countries, but no single carrier covers all of them. Our carrier matrix maps each country to one or more carriers, ranked by preference based on network quality, cost, and historical reliability.

When the Flutter app requests a profile for a given country, our routing layer:

Looks up the country in the carrier matrix
Selects the highest-ranked available carrier
Checks that carrier’s SM-DP+ profile inventory (we cache inventory counts with a 5-minute TTL)
Falls back to the next carrier if inventory is exhausted or if the primary carrier’s SM-DP+ is degraded
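
The four routing steps above can be sketched as a single lookup over a preference-ordered list. The `Carrier` fields and the matrix shape are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Carrier:
    name: str
    inventory: int          # cached profile count (5-minute TTL in the text)
    degraded: bool = False  # SM-DP+ health flag


def select_carrier(matrix: Dict[str, List[Carrier]], country: str) -> Optional[Carrier]:
    """Return the highest-ranked usable carrier, or None if the country has none."""
    for carrier in matrix.get(country, []):  # list is ordered by preference
        if carrier.inventory > 0 and not carrier.degraded:
            return carrier
    return None
```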

This routing is invisible to the user. They request “data in Japan” and get a working profile. The carrier behind it might be different from the one they got last month if we renegotiated terms or if a new carrier joined our network with better coverage.

We store routing decisions with every provisioning event, which feeds back into our carrier scoring algorithm. If a carrier’s profiles consistently fail activation in a specific country, the algorithm gradually de-prioritizes that carrier for that country.

Pay-As-You-Go: Balance and Usage Metering

The hardest backend problem at Rivio is not provisioning — it is metering. Pay-as-you-go means we charge users per megabyte of actual data consumed, not for a fixed data package. This requires near-real-time usage tracking across dozens of carriers with different reporting capabilities.

CDR ingestion: Carriers send us Call Detail Records (CDRs) via SFTP drops, real-time API feeds, or RADIUS accounting messages. Each format is different. We normalize everything into a canonical CDR schema in our ingestion pipeline before it hits the rating engine.
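
To illustrate normalization, the sketch below maps two input shapes onto one canonical record. `Acct-Session-Id`, `Acct-Input-Octets`, and `Acct-Output-Octets` are standard RADIUS accounting attributes; everything else (the canonical fields, carrying the ICCID in `User-Name`, and the CSV column layout) is a hypothetical choice for the example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CanonicalCdr:
    session_id: str
    iccid: str
    total_bytes: int
    source: str


def from_radius(attrs: Dict[str, str]) -> CanonicalCdr:
    """Normalize a RADIUS accounting record (octet wraparound is ignored here)."""
    return CanonicalCdr(
        session_id=attrs["Acct-Session-Id"],
        iccid=attrs["User-Name"],  # assumption: the carrier sends the ICCID here
        total_bytes=int(attrs["Acct-Input-Octets"]) + int(attrs["Acct-Output-Octets"]),
        source="radius",
    )


def from_csv_row(row: str) -> CanonicalCdr:
    """Normalize a hypothetical SFTP row: session_id,iccid,uplink_kb,downlink_kb."""
    session_id, iccid, up_kb, down_kb = row.strip().split(",")
    return CanonicalCdr(session_id, iccid, (int(up_kb) + int(down_kb)) * 1000, "sftp")
```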

Rating engine: Takes a normalized CDR, looks up the applicable tariff (which varies by country, carrier, and sometimes time of day), and computes the cost in the user’s wallet currency. The rated event is written to the ledger.
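
A stripped-down rating function might look like the following. The tariff and exchange-rate values are made up for the example, time-of-day pricing is left out, and the four-decimal rounding is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical per-megabyte tariffs in EUR, keyed by (country, carrier).
TARIFFS = {
    ("JP", "carrier_a"): Decimal("0.004"),
    ("DE", "carrier_b"): Decimal("0.003"),
}
# Hypothetical conversion rates from EUR to the wallet currency.
FX_FROM_EUR = {"EUR": Decimal("1"), "USD": Decimal("1.10")}

BYTES_PER_MB = Decimal(1_000_000)  # assumption: decimal megabytes


def rate(country: str, carrier: str, total_bytes: int, wallet_currency: str) -> Decimal:
    """Cost of one normalized usage record in the user's wallet currency."""
    megabytes = Decimal(total_bytes) / BYTES_PER_MB
    cost_eur = megabytes * TARIFFS[(country, carrier)]
    cost = cost_eur * FX_FROM_EUR[wallet_currency]
    return cost.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)
```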

Balance enforcement: If real-time RADIUS updates indicate a user’s balance is approaching zero, we send a Disconnect-Request to the carrier’s AAA server. The session terminates and the user is prompted to top up.

Reconciliation: Weekly, we reconcile our internal CDR-derived totals against the carrier’s invoice. Discrepancies happen — timezone mismatches, duplicate CDRs, late-arriving records. Our reconciliation pipeline flags anything above 0.5% variance for manual review.
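
The variance check itself is a one-line ratio. This sketch assumes the variance is measured relative to the carrier invoice; the function names are hypothetical.

```python
from decimal import Decimal

VARIANCE_THRESHOLD = Decimal("0.005")  # 0.5%, from the text


def variance(internal_total: Decimal, invoice_total: Decimal) -> Decimal:
    """Relative difference, measured against the carrier invoice (an assumption)."""
    if invoice_total == 0:
        raise ValueError("invoice total must be non-zero")
    return abs(internal_total - invoice_total) / invoice_total


def needs_review(internal_total: Decimal, invoice_total: Decimal) -> bool:
    """True when the discrepancy is strictly above the threshold."""
    return variance(internal_total, invoice_total) > VARIANCE_THRESHOLD
```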

API Design Patterns

The API that serves the Flutter app follows a few deliberate design choices:

Idempotency keys on all write operations. Every POST request (provision profile, top up balance, change profile state) requires an Idempotency-Key header. We store the key and response for 24 hours. If the client retries (e.g., after a network timeout), we return the stored response instead of executing the operation again. This prevents double-provisioning — a scenario that would cost us real money.
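
The core of the idempotency behavior fits in a few lines. This is a single-process sketch with an in-memory dictionary; a real deployment would need a shared store and an atomic "insert if absent" so that two concurrent retries cannot both execute. The class name and injectable clock are assumptions.

```python
import time
from typing import Any, Callable, Dict, Tuple

TTL_SECONDS = 24 * 3600  # stored responses live for 24 hours, from the text


class IdempotencyStore:
    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._responses: Dict[str, Tuple[float, Any]] = {}

    def execute(self, key: str, operation: Callable[[], Any]) -> Any:
        """Run `operation` once per key; replay the stored response on retries."""
        now = self._clock()
        hit = self._responses.get(key)
        if hit is not None and now - hit[0] < TTL_SECONDS:
            return hit[1]
        response = operation()
        self._responses[key] = (now, response)
        return response
```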

Push-driven status updates. The app does not poll for provisioning status. When a profile transitions state (downloaded, installed, enabled), our backend pushes an event via Firebase Cloud Messaging to the app, which then fetches the updated profile object. This keeps the app responsive without hammering our API.

Rate limiting per operation class. Provisioning endpoints are rate-limited more aggressively (5 requests per minute per user) than read endpoints (60 requests per minute). Provisioning involves downstream SM-DP+ calls that are expensive and slow; we protect both our system and our carrier partners from burst traffic.
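
One possible shape for per-class limits is a sliding-window log keyed by user and operation class. The algorithm choice is an assumption (the limits could equally be enforced with token buckets at the gateway), and the class names in `LIMITS` are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple

LIMITS = {"provisioning": 5, "read": 60}  # requests per window, from the text
WINDOW_SECONDS = 60.0


class RateLimiter:
    def __init__(self) -> None:
        self._hits: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str, op_class: str, now: float) -> bool:
        """Record and allow the request if the user is under the class limit."""
        hits = self._hits[(user_id, op_class)]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= WINDOW_SECONDS:
            hits.popleft()
        if len(hits) >= LIMITS[op_class]:
            return False
        hits.append(now)
        return True
```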

Versioned API with mandatory minimum. The app sends an X-API-Version header. We support the current and previous major version simultaneously. If the app sends a version below the minimum supported, the API returns a 426 Upgrade Required with a link to the app store. This prevents old clients from hitting deprecated provisioning flows.
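
The version gate can be sketched as a small function returning a status code and body. The version numbers, the response body shape, and the placeholder URL are all hypothetical; only the header name and the 426 status come from the description above.

```python
from typing import Dict, Tuple

MIN_SUPPORTED_MAJOR = 3  # hypothetical: previous major version
CURRENT_MAJOR = 4        # hypothetical: current major version
STORE_URL = "https://example.com/app"  # placeholder app store link


def check_version(header_value: str) -> Tuple[int, Dict[str, str]]:
    """Map an X-API-Version header value to an HTTP status and response body."""
    try:
        major = int(header_value.split(".")[0])
    except ValueError:
        return 400, {"error": "invalid X-API-Version"}
    if major > CURRENT_MAJOR:
        return 400, {"error": "unknown X-API-Version"}
    if major < MIN_SUPPORTED_MAJOR:
        return 426, {"error": "upgrade required", "upgrade_url": STORE_URL}
    return 200, {}
```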

Error Handling and Retry Strategy

SM-DP+ servers are not always fast or reliable. Our retry strategy is tiered:

Transient errors (5xx, timeout): Exponential backoff with a 2-second initial delay, a 5-minute cap, and at most 8 attempts. Delays are jittered to prevent a thundering herd across concurrent requests.
Client errors (4xx): No retry. Log the error, map it to a user-facing message, and return immediately. Common causes: EID mismatch (user scanned QR on wrong device), profile not found (ICCID expired).
Partial success: If DownloadOrder succeeds but ConfirmOrder fails, we do not retry the order — we retry the confirmation only. Each lifecycle step has its own retry context.
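
The first two tiers come down to classifying each SM-DP+ response. In this sketch a timeout is represented as `None`, and treating any unexpected status as non-retryable is an assumption; the names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "success"
    RETRY = "retry"  # transient: back off and try the same step again
    FAIL = "fail"    # permanent: surface a user-facing error, no retry


def classify(status: Optional[int]) -> Outcome:
    """Classify an HTTP status from the SM-DP+; None means the call timed out."""
    if status is None:
        return Outcome.RETRY
    if 200 <= status < 300:
        return Outcome.SUCCESS
    if 500 <= status < 600:
        return Outcome.RETRY
    # 4xx, plus anything unexpected, is treated as non-retryable (an assumption).
    return Outcome.FAIL
```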

All failures surface in our Grafana dashboards: provisioning success rate by carrier, by country, by hour. If a carrier’s success rate drops below 95%, an alert fires.

Security Model

mTLS everywhere. All SM-DP+ communication uses mutual TLS with carrier-issued client certificates. Certificates are stored in HashiCorp Vault, injected into provisioning pods at startup, and rotated 30 days before expiry.

Profile encryption. eSIM profiles contain sensitive credentials (Ki, OPc values). These are encrypted end-to-end from the SM-DP+ to the eUICC. Our backend never sees the decrypted profile content — we handle metadata (ICCID, EID, state) only.

Audit logging. Every provisioning action, every balance change, and every carrier API call is logged with a correlation ID, user ID, timestamp, and response code. Logs are immutable (append-only to a write-once object store) and retained for 7 years per telecom regulatory requirements.

Monitoring and Observability

We track three primary SLIs for the provisioning system:

Provisioning success rate — Target: 99.2%. Measured as the percentage of ProvisionProfile API calls that result in an enabled profile within 5 minutes. Currently averaging 99.4%.
Time to activation — Target: p95 under 90 seconds. Measured from API call to first RADIUS Access-Accept. Current p95: 47 seconds.
Balance accuracy — Target: real-time estimates within 3% of finalized CDR totals. Current average delta: 1.1%.

Each SLI feeds into an error budget. When the budget is exhausted (e.g., provisioning success rate drops below 99.2% for a rolling 7-day window), we freeze feature deployments and focus on reliability.
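
For the provisioning success rate, the error budget arithmetic can be written out as below. This is a sketch using exact fractions to avoid floating-point edge cases at the boundary; the function names are hypothetical, and the counts would come from the rolling 7-day window.

```python
from fractions import Fraction

TARGET = Fraction(992, 1000)  # 99.2% success target, from the text


def budget_remaining(total: int, failed: int) -> Fraction:
    """Share of the window's error budget still unspent (negative if overspent)."""
    allowed_failures = total * (1 - TARGET)
    if allowed_failures == 0:
        return Fraction(1) if failed == 0 else Fraction(-1)
    return 1 - Fraction(failed) / allowed_failures


def should_freeze_deploys(total: int, failed: int) -> bool:
    """True once the success rate has dropped below the target for the window."""
    return budget_remaining(total, failed) < 0
```

For example, 100,000 provisioning calls allow 800 failures at a 99.2% target, so 600 failures leave a quarter of the budget.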

Distributed tracing via OpenTelemetry ties together the full request path: Flutter app call, API gateway, provisioning service, SM-DP+ call, carrier callback, and balance update. A single trace ID follows the request across all systems, making incident investigation a matter of searching for one string rather than correlating timestamps across log files.

Lessons from Production

After running this system in production across 150+ countries, a few lessons stand out:

Carrier APIs are the bottleneck, not your code. We spent months optimizing our internal latency before realizing that 80% of our p99 provisioning time was waiting on SM-DP+ responses. The real optimization was improving timeout handling and fallback routing.

Idempotency is not optional in telecom. A single duplicate provisioning request creates an orphaned profile that costs money and confuses the user. Idempotency keys paid for themselves in the first week.

CDR reconciliation never fully automates. We got it to 98% automated, but the remaining 2% — timezone edge cases, carrier-specific CDR quirks, profiles that activate in one country but roam to another within the same CDR period — always requires human review. Budget engineering time for it.

Building a provisioning backend for a global pay-as-you-go eSIM product is essentially building a mini telecom billing system on top of a multi-vendor profile management layer. The standards (RSP, SGP.22, ES2+) give you a foundation, but the real engineering is in the routing, metering, and error handling that makes it all invisible to the user.
