Engineering · 16 May 2026 · ~12 min read

Designing a double-entry ledger for an LLM gateway

How to charge customers for upstream OpenAI / Anthropic / Gemini usage with a fixed markup, without double-charging on retries, leaking holds on crashes, or breaking when streaming responses abort mid-flight. The design we landed on after four iterations.

The problem

ScopeVeil is an OpenAI-compatible gateway. A customer calls /v1/chat/completions, we forward the request to OpenAI, Anthropic, Google, Mistral, Cohere, or Groq, count tokens against the provider's pricing table, add our markup, and bill the customer against prepaid credit.

That sentence took about a paragraph. The implementation took four weeks and four rewrites. Here is why.

The naive shape of the problem looks like a function: cost(tokens, model) * (1 + markup). The actual shape is a distributed system. Money moves between accounts. Requests fail partway through. Streams get aborted by the client TCP closing. Two requests from the same customer hit the same row at the same time. Stripe webhooks fire twice. The gateway crashes between when you reserved the customer's credit and when you charged it.

Any one of those will quietly corrupt the books.

v1. Charge after the response

The first version was a Stripe-style "authorize then capture" but collapsed into a single step. Pseudocode:

// v1: do not do this.
const response = await openai.chat.completions.create(req);
const tokens = response.usage;
const cost = price(model, tokens) * (1 + markup);

await db.transaction(async (t) => {
  const balance = await t.balance(orgId);
  if (balance < cost) throw new InsufficientFunds();
  await t.charge(orgId, cost);
});

return response;

You see the bug. Two concurrent requests from the same org both read the same balance, both pass the check, both charge. You can double-spend. SELECT FOR UPDATE fixes the race but introduces a worse problem: a long request holds the row for the full duration of the upstream call. We saw locks of 30 seconds for big completions, blocking every other request from that org.

We also had no answer for the question "what happens if the upstream returns 429 after we charged the hold?". We hadn't charged anything yet, so we couldn't refund, but the conceptual model was already wrong.

v2. Double-entry ledger with holds

Banks figured this out a century ago. Money does not get added or subtracted; it gets moved between accounts, and every move is an entry in a ledger. The current balance is the sum of all entries. No row is ever mutated.

Our ledger has six entry types:

  • topup: customer added prepaid credit (positive, committed).
  • hold: request in flight, reserved against the balance (negative, pending).
  • capture: request completed, definitive charge (negative, committed). Releases the corresponding hold.
  • release: hold cancelled without capture, e.g. upstream returned 429 (positive, committed). Releases the hold.
  • refund: manual reversal of a capture (positive, committed).
  • adjust: admin override with audit (any sign, committed).

A request now looks like:

// v2: hold → call upstream → capture or release.
const hold = await ledger.hold({
  orgId,
  amountUsd: estimateMaxCost(model, req),
  idempotencyKey: `gw:hold:${orgId}:${requestId}`,
});

try {
  const upstream = await openai.chat.completions.create(req);
  const real = computeCost(model, upstream.usage, markup);
  await ledger.captureHold({
    orgId,
    holdId: hold.id,
    actualAmountUsd: real,
    captureIdempotencyKey: `gw:cap:${orgId}:${requestId}`,
  });
  return upstream;
} catch (err) {
  await ledger.releaseHold({
    orgId,
    holdId: hold.id,
    releaseIdempotencyKey: `gw:rel:${orgId}:${requestId}`,
    reason: err.message,
  });
  throw err;
}

The balance is computed as SUM(amount_usd) WHERE org_id = $1 AND status IN ('pending', 'committed'). Holds count against available balance even while pending, so two concurrent requests cannot both reserve more than the customer has. The SELECT FOR UPDATE happens during the hold insert, which is fast and bounded (typically under 1 millisecond) instead of during the upstream call.

The upstream call is fully outside the database transaction. The row is not locked while we wait on OpenAI.

Reconciling streaming

Streaming responses break the symmetry. The client is reading bytes before you know the total token count. You cannot block the stream to do bookkeeping. You also cannot wait until the stream ends to start emitting bytes. That defeats the point.

The solution we settled on: the adapter exposes both a readable stream that we forward to the client byte for byte, and a finalUsage promise that resolves when the upstream emits its final usage event. We chain capture off the promise, not the response:

const stream = await adapter.invokeStream({ model, req });

stream.finalUsage
  .then(async (usage) => {
    const real = computeCost(price, usage, markup);
    await ledger.captureHold({ ... });
  })
  .catch(async (err) => {
    await ledger.releaseHold({ ... });
  });

return new Response(stream.readable, {
  headers: { 'Content-Type': 'text/event-stream' },
});

Two important properties: the client sees the first byte immediately, and if the client closes the connection mid-stream the upstream call still completes (or fails) and we still capture or release. We do not leak the hold.

Idempotency, scoped per org

Every entry has an idempotency_key with a global unique index. Inserting an entry whose key already exists is a no-op that returns the existing row. This is what makes the gateway safe to retry: if your network blinks and you retry the same request, you get the same hold, the same capture, the same response.

The trap is cross-org collision. A malicious customer could pick an idempotency key that matches another org's pending hold and read its row back, leaking metadata. We fix that by validating the org_id on every lookup:

// Idempotency lookup is ALWAYS scoped by org_id.
SELECT * FROM ledger_entries
WHERE idempotency_key = $1 AND org_id = $2;

We also prefix the key with the org id in our own usage (gw:cap:{orgId}:{requestId}), so collisions across orgs cannot happen even by accident. Belt and suspenders, since the database-level scoping is what actually enforces the invariant.

For Stripe webhooks we use the Stripe event.id as the idempotency key on the corresponding topup ledger entry. Stripe will retry a webhook on transient failures, so without this we would credit the customer twice for the same payment.

Orphan holds and crash recovery

The hold/capture flow assumes you always reach the second step. In practice the gateway can crash between hold and capture. The machine reboots, the request is gone, and the hold sits there eating the customer's available balance forever.

We run a cleanup job every 60 seconds that releases pending holds older than 5 minutes:

-- Index makes this fast even with millions of rows.
CREATE INDEX ledger_pending_age_idx
  ON ledger_entries (created_at)
  WHERE status = 'pending';

-- Cleanup query:
UPDATE ledger_entries
SET status = 'released', settled_at = NOW()
WHERE status = 'pending'
  AND created_at < NOW() - INTERVAL '5 minutes';

Five minutes is the longest plausible upstream call we have seen, plus a generous buffer. Anything older than that is dead, by definition. If by some miracle the original request comes back and tries to capture, the capture finds the hold already released and either returns the existing capture (idempotent path) or refuses (the request lost the race, charge a fresh hold). The customer is never overcharged.

Markup transparency in metadata

Our pricing is upstream provider cost plus a fixed percent markup per tier. Internally the capture amount is the total. The customer sees a single number on their dashboard. But for our own accounting, and for the customer who wants to know exactly what went to OpenAI versus what went to us, we need both.

The ledger stores the total in amount_usd. The breakdown lives in the capture's metadata JSONB:

metadata: {
  provider: 'openai',
  model: 'gpt-4o-mini',
  input_tokens: 412,
  output_tokens: 180,
  cost_raw_usd: '0.00018600',
  markup_usd: '0.00001860',
  markup_pct: 10,
  tier: 'pro',
}

This is what lets us run the revenue query at week boundary:

WITH week AS (SELECT date_trunc('week', NOW())::date AS start_at)
SELECT
  COALESCE(SUM(amount_usd * -1) FILTER (WHERE type = 'capture'), 0) AS gross_charged,
  COALESCE(SUM((metadata->>'cost_raw_usd')::numeric)
    FILTER (WHERE type = 'capture' AND metadata ? 'cost_raw_usd'), 0) AS upstream_cost,
  COALESCE(SUM((metadata->>'markup_usd')::numeric)
    FILTER (WHERE type = 'capture' AND metadata ? 'markup_usd'), 0) AS markup_revenue
FROM ledger_entries
WHERE created_at >= (SELECT start_at FROM week);

The customer dashboard shows the same numbers per request, so the markup is never a black box. We had a soft principle that transparency was a feature; this is the implementation.

Things I would do differently

  • Schema versioning from day one. We added markup_usd to metadata four weeks in. Older captures do not have it. The revenue query handles the absence (metadata ? 'markup_usd'), but it would have been cheaper to land schema decisions before we started writing records.
  • Idempotency keys with a TTL. Our unique index is forever. A bug somewhere can pollute the key space and collide weeks later. A composite of (org_id, key, created_at) with a partial index over the last 90 days would be cleaner.
  • Decimal everywhere from the first line. Floating point cost in JavaScript will quietly accumulate error. We use decimal.js for every dollar amount; if you fight us for picking a heavy dependency over native numbers, write yourself a test that sums 0.1 + 0.2 a million times and check the result against a CPA.
  • Cleanup job is not optional. Without it, a single gateway crash quietly degrades the system over weeks. We almost shipped without one because the happy path tests passed. Add the cleanup before you add the feature it protects.