Engineering · 23 May 2026 · ~10 min read

Why our SDK ships only metadata: privacy by architecture

Most LLM observability tools log your prompts and completions because they need to in order to show you a useful product. ScopeVeil does not, because we decided a database that cannot leak content is more valuable than a database that can replay it. This is a writeup of what we ship on the wire, what we deliberately do not, and the trade-offs we accepted.

ScopeVeil team · SDK on GitHub

The threat model

When you point an LLM SDK at an observability vendor, you are handing them the prompt and the completion. Not because they announce it loudly, but because the obvious way to build the product is to capture the request and response and write them to a database. That database then lives behind some company's auth, some company's backups, and some company's subpoena response process.

The threat model we cared about was not malicious employees or dramatic breaches. It was this: a regulator, a customer, or a court asks "what is in your database about my conversations?", and the honest answer is "everything". We did not want to be in that position. Not for our customers, not for ourselves.

The architectural answer is to keep the prompt out of the observability path entirely. If the data is never in the database, it cannot be leaked, subpoenaed, or accidentally indexed. It cannot be analyzed for training. It cannot be retained past a retention window because the retention window of nothing is zero.

What ships on the wire

The SDK wraps your existing OpenAI, Anthropic, Mistral, Cohere, Google, or Bedrock client. When your code calls client.chat.completions.create({...}), the wrapper forwards the call untouched, awaits the response, and only then emits an event to our ingest endpoint. The event shape is fixed and small. Here is the full TypeScript type:

export interface LLMEvent {
  provider: 'openai' | 'anthropic' | 'google' | 'mistral'
          | 'cohere' | 'ollama' | 'azure' | 'bedrock'
          | 'groq' | 'xai' | 'perplexity' | 'deepseek'
          | 'together' | 'fireworks' | 'openrouter';
  model: string;
  model_version?: string;
  input_tokens: number;
  output_tokens: number;
  cache_tokens?: number;
  latency_ms: number;
  ttft_ms?: number;
  feature_tag?: string;
  user_id_hash?: string;
  environment?: 'production' | 'staging' | 'development';
  timestamp: string;
  is_error?: boolean;
  error_code?: string;
  error_message?: string;
}

That is it. There is no messages, no prompt, no completion, no tool_calls, no system prompt, no embedded function arguments. The transport batches these events and posts JSON to https://ingest.scopeveil.com/v1/ingest/batch every two seconds or every fifty events, whichever fires first.
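
The "two seconds or fifty events, whichever fires first" rule can be sketched as a small batcher. This is an illustration, not the SDK's transport: the class name EventBatcher, the send callback, and the choice to start the two-second clock when the first event of a batch arrives are all assumptions.

```typescript
// Hypothetical batcher: flushes when maxBatch events are queued or when
// intervalMs has passed since the first queued event, whichever comes first.
class EventBatcher<T> {
  private queue: T[] = [];
  private timer: ReturnType<typeof setTimeout> | undefined = undefined;
  private readonly send: (batch: T[]) => void;
  private readonly maxBatch: number;
  private readonly intervalMs: number;

  constructor(send: (batch: T[]) => void, maxBatch = 50, intervalMs = 2000) {
    this.send = send;
    this.maxBatch = maxBatch;
    this.intervalMs = intervalMs;
  }

  push(event: T): void {
    this.queue.push(event);
    if (this.queue.length >= this.maxBatch) {
      this.flush(); // size trigger fired first
    } else if (this.timer === undefined) {
      // First event of a new batch: start the time trigger.
      this.timer = setTimeout(() => this.flush(), this.intervalMs);
    }
  }

  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.queue.length === 0) return; // nothing to send
    const batch = this.queue;
    this.queue = [];
    this.send(batch);
  }
}
```

In real use the send callback would POST the batch as JSON to the ingest URL. The exact request body shape is not specified here, so the sketch leaves it to the caller.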

A real event from production looks like this:

{
  "provider": "anthropic",
  "model": "claude-sonnet-4-6",
  "input_tokens": 412,
  "output_tokens": 180,
  "cache_tokens": 0,
  "latency_ms": 1340,
  "ttft_ms": 280,
  "feature_tag": "support-bot",
  "user_id_hash": "8a1f...c2",
  "environment": "production",
  "timestamp": "2026-05-23T12:30:11.452Z",
  "is_error": false
}

A reader of that record knows when, which model, how big, how fast, and which user-shaped bucket of activity (the hash is one-way). They do not know what was asked, what was answered, what tools were called, or what the system prompt contained. That information stays inside your process and dies with it.
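
The forward, await, then emit flow can be sketched roughly as follows. The names observeCall and CallMeta are hypothetical, and so are the choices to stamp the event at call start, to emit on the error path, and to use the error's name as error_code. The point of the sketch is that the response passes through to the caller while only timing and error metadata reach the emitter.

```typescript
// Hypothetical subset of the event: timing and error metadata only.
interface CallMeta {
  latency_ms: number;
  timestamp: string;
  is_error: boolean;
  error_code?: string;
}

async function observeCall<R>(
  call: () => Promise<R>,
  emit: (meta: CallMeta) => void,
): Promise<R> {
  // Telemetry must never break the caller's request.
  const safeEmit = (meta: CallMeta): void => {
    try { emit(meta); } catch { /* drop the event */ }
  };
  const started = Date.now();
  const timestamp = new Date(started).toISOString();
  let response: R;
  try {
    response = await call(); // the provider call itself is untouched
  } catch (err) {
    const code = err instanceof Error ? err.name : "unknown";
    safeEmit({ latency_ms: Date.now() - started, timestamp, is_error: true, error_code: code });
    throw err; // the caller still sees the original error
  }
  safeEmit({ latency_ms: Date.now() - started, timestamp, is_error: false });
  return response; // returned as-is; its content never reaches emit
}
```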

What does not ship, on purpose

Some of the things we considered capturing and then chose not to:

Messages array. The natural way to support replay, debugging, and prompt versioning. We do not capture it. If you want replay, copy the messages to your own store. Your store, your retention, your decision.
Completion content. Same reasoning. Also avoids the awkward situation where a customer's user generates content that would be a liability for us to retain.
Tool call arguments. Tool-using agents often pass structured data that includes PII (addresses, IDs, account numbers). We do not parse them, count them, or store them. We do not even know if a call used tools.
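Raw user identifiers. The SDK exposes a user_id field but hashes it at the boundary using a deterministic SHA-256 wrapper before it ever leaves the process. The hash is stable per user (so dashboards group correctly) and cannot be inverted to recover the identifier. Because it is deterministic, a guessable identifier such as an email address or a sequential ID can still be confirmed by hashing candidates, so high-entropy IDs are the safer input.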
Cost amounts. The client never reports dollar figures. Cost is computed server-side from provider, model, input_tokens, output_tokens, cache_tokens, and our pricing table. Clients never send a price, so they cannot misstate one, accidentally or otherwise.
Free-form metadata. The only customer-controlled text field is feature_tag, which is short, opaque to us, and intended for use like "checkout-summary". We do not parse it. If you put a credit card there, that is on you, but the field is not where an attacker would look.
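
The user_id hashing described in the list above can be illustrated with Node's built-in crypto module. This is a minimal sketch: the function names are hypothetical, and the keyed variant is shown only as a common hardening option, not as something the SDK is stated to do.

```typescript
import { createHash, createHmac } from "node:crypto";

// Deterministic SHA-256 digest of a raw identifier, hex-encoded (64 chars).
// The same input always yields the same output, so grouping still works.
function hashUserId(userId: string): string {
  return createHash("sha256").update(userId, "utf8").digest("hex");
}

// Assumed variant: mixing in a secret key (HMAC-SHA-256) means an outsider
// cannot confirm a guessed identifier without also knowing the key.
function hashUserIdKeyed(userId: string, secretKey: string): string {
  return createHmac("sha256", secretKey).update(userId, "utf8").digest("hex");
}
```

The plain digest is enough for stable grouping. The keyed form trades a little key management for resistance to guessing when identifiers are low-entropy.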

The transport runs sanitizeEvent() before serializing to JSON, which strips any property that is not in the type above. If a future code path tries to attach a stray field, sanitization drops it on the floor. The wire is the type.
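
An allowlist pass of this kind could look like the following. Only the name sanitizeEvent comes from the SDK; the body is an assumption about how such a filter might be written, with the key list copied from the LLMEvent type.

```typescript
// Keys copied from the LLMEvent type; anything else is dropped.
const ALLOWED_KEYS = [
  "provider", "model", "model_version", "input_tokens", "output_tokens",
  "cache_tokens", "latency_ms", "ttft_ms", "feature_tag", "user_id_hash",
  "environment", "timestamp", "is_error", "error_code", "error_message",
] as const;

// Sketch: copy only allowlisted own properties into a fresh object.
function sanitizeEvent(raw: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const key of ALLOWED_KEYS) {
    if (Object.prototype.hasOwnProperty.call(raw, key) && raw[key] !== undefined) {
      clean[key] = raw[key];
    }
  }
  return clean;
}
```

Building a new object from an allowlist, rather than deleting known-bad keys, means an unexpected field is dropped by default. Note that this sketch filters by key only and does not validate value types.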

How cost is computed without the prompt

A common reaction is: if you do not see the prompt, how do you know what to charge? The provider tells us. We trust those counts because they are the ones the provider bills on, and the customer pays us based on the same numbers.

When the wrapped client gets a response back from OpenAI, the response includes a usage object with prompt_tokens, completion_tokens, and on newer endpoints a cached_tokens breakdown. Those numbers are already counted by the provider on their side, against their tokenizer, on their actual billed run. We do not retokenize. We forward the counts.
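
Forwarding the counts might look like this for an OpenAI chat completion. The usage field names follow OpenAI's response format, where the cached count sits under prompt_tokens_details. The function name and the mapping onto our event fields are assumptions for illustration.

```typescript
// Only the parts of OpenAI's `usage` object that the sketch reads.
interface OpenAIUsage {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

interface TokenCounts {
  input_tokens: number;
  output_tokens: number;
  cache_tokens: number;
}

// Copy the provider's own counts; nothing is retokenized.
function countsFromUsage(usage: OpenAIUsage): TokenCounts {
  return {
    input_tokens: usage.prompt_tokens,
    output_tokens: usage.completion_tokens,
    // Older endpoints omit the breakdown, so default to zero.
    cache_tokens: usage.prompt_tokens_details?.cached_tokens ?? 0,
  };
}
```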

Server-side, the ingest looks up the provider and model in a pricing table and multiplies:

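// Pseudocode for the cost calculation, post-ingest.
// rate.input, rate.output, and rate.cache_discount are per-token amounts.
const rate = pricing.lookup(event.provider, event.model);
// cache_tokens is optional on the event; treat a missing value as 0 so the sum is never NaN.
const cacheTokens = event.cache_tokens ?? 0;
const upstream = event.input_tokens * rate.input
               + event.output_tokens * rate.output
               - cacheTokens * rate.cache_discount;
const markup = upstream * (rate.markup_pct / 100);
const total = upstream + markup;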

The pricing table lives in our database, versioned per-provider per-model per-date. The customer can see the rate, the markup, and the math on the dashboard. There is no opaque "compute unit". There is tokens times a public price, plus a known percentage.

The same property holds for latency, throughput, error rates, and cache hit ratios. All of those are derivable from token counts, timestamps, and error codes. None of them need the message body.
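
As a rough illustration, the following sketch derives three such metrics from event fields alone. The metric definitions (cache hit ratio as cached tokens over input tokens, throughput as output tokens per second of summed latency) and the names summarize and Summary are assumptions, not the dashboard's actual formulas.

```typescript
// Only the event fields the metrics need.
interface MetricEvent {
  input_tokens: number;
  output_tokens: number;
  cache_tokens?: number;
  latency_ms: number;
  is_error?: boolean;
}

interface Summary {
  errorRate: number;          // fraction of events with is_error set
  cacheHitRatio: number;      // cached tokens / input tokens
  outputTokensPerSec: number; // output tokens / summed latency in seconds
}

function summarize(events: MetricEvent[]): Summary {
  let errors = 0, input = 0, cached = 0, output = 0, latencyMs = 0;
  for (const e of events) {
    if (e.is_error) errors += 1;
    input += e.input_tokens;
    cached += e.cache_tokens ?? 0;
    output += e.output_tokens;
    latencyMs += e.latency_ms;
  }
  return {
    errorRate: events.length > 0 ? errors / events.length : 0,
    cacheHitRatio: input > 0 ? cached / input : 0,
    outputTokensPerSec: latencyMs > 0 ? output / (latencyMs / 1000) : 0,
  };
}
```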

What you lose

This is the section we wrote first internally, because we wanted to be honest about the cost. If you are evaluating ScopeVeil against tools that capture full content, here is what you do not get from us, and what we recommend instead.

Playground replay. No "click an event and re-run the same prompt against a different model" button. You can build that yourself by storing your own prompts and using our gateway for the actual call. We give you the model, tokens, and latency so you can compare apples-to-apples once you decide what to re-run.
Prompt similarity clustering. Tools that have content can group "summarize this email" with "give me a tldr of this message" and call them the same intent. We cannot do that. We can group by feature_tag, which you control, so you can carry that structure in yourself.
Post-hoc PII auditing. Tools that retain content can scan it for PII leakage and alert you. We trust you to do that pre-flight in your own code. We can tell you which feature_tag spends the most tokens at suspicious times of day, but we cannot tell you what was in the message.
Reproducing a bug from a stack trace. When a customer says "this completion was weird", you cannot pull it from us. We did not see it. Capture it in your own logs, gated by a feature flag, with whatever retention policy your team decided is correct.

Every one of these features is buildable on top of ScopeVeil. None of them is buildable inside ScopeVeil. We think that is the right line. Your prompts, your store.

What you gain

A short, defensible privacy story. Our privacy policy does not have a section on retention of message content because there is no message content. It has a section on metadata. That is easier to defend in a compliance review, a procurement questionnaire, or a regulator's first email.
Trivial GDPR / LGPD / CCPA data subject access. When an end user asks "what do you have on me", the answer is a hashed identifier, tokens billed, and timestamps. We can return that JSON in under a second and the request is closed. Compare that with tools that need to grep through prompt content for anything that might be PII.
Cheaper storage and faster aggregation. Events are a few hundred bytes each as serialized JSON. A million events fit in a few hundred megabytes, including indexes. Aggregations like "cost per feature_tag per day" run on SUM(input_tokens * rate) from a narrow table, with no full-text scan and no JSON parsing of message arrays.
Lower legal surface for our customers. Your customer's user types their full name into your chatbot. That is now in your process and your store, but it is not in ours. When your customer asks where their user data goes, you can show them our event type and let them count fields.
No accidental training corpus. A vendor that has your prompts is one product decision away from training on them. "Anonymized aggregate", "improving model quality", whatever the framing. We literally cannot, because the input does not exist on our side.
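
To make the "cost per feature_tag per day" aggregation concrete, here is an in-memory sketch of the same group-and-sum. In production this would be a SQL query over the events table; the function name, the single flat rate, and the "(untagged)" bucket are assumptions for illustration.

```typescript
interface TaggedEvent {
  feature_tag?: string;
  timestamp: string; // ISO 8601 in UTC, e.g. "2026-05-23T12:30:11.452Z"
  input_tokens: number;
  output_tokens: number;
}

// Hypothetical per-token rates; real rates vary by provider, model, and date.
interface Rate { input: number; output: number }

// Sum cost per (UTC day, feature_tag) using only narrow numeric fields.
function costPerTagPerDay(events: TaggedEvent[], rate: Rate): Map<string, number> {
  const totals = new Map<string, number>();
  for (const e of events) {
    const day = e.timestamp.slice(0, 10); // "YYYY-MM-DD"
    const key = `${day}|${e.feature_tag ?? "(untagged)"}`;
    const cost = e.input_tokens * rate.input + e.output_tokens * rate.output;
    totals.set(key, (totals.get(key) ?? 0) + cost);
  }
  return totals;
}
```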

Trade-offs we are still debating

The metadata-only line is not without internal tension. A few open questions, posted here in the spirit of writing them down so they do not rot:

Prompt fingerprinting. We could hash the prompt with a salt and ship the hash, which would let us count repeated calls without seeing content. The argument against is that any fingerprint is one rainbow table away from being content for a small enough prompt space, and the bound is hard to prove. We have not shipped it.
Length distributions. Bucketed prompt length (in tokens) is already implied by input_tokens. We expose it. We do not expose distributions of individual message lengths inside a multi-turn conversation, because that shape leaks more than the customer probably intends.
Tool call presence. A boolean "did this request use tools" would help dashboards distinguish agent traffic from chat traffic, and arguably leaks nothing. We are likely to ship this as an opt-in flag on the SDK.
Error message bodies. Provider error messages sometimes echo the offending prompt back. We truncate error_message at 500 characters before shipping, but truncation is not redaction. Long term we will content-filter known echo patterns on the client.
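
The error_message cap from the last item could be implemented along these lines. The helper name is hypothetical, and counting by Unicode code point rather than by UTF-16 unit is an assumption.

```typescript
const MAX_ERROR_MESSAGE_CHARS = 500;

// Keep at most `max` code points. Array.from iterates by code point,
// so a surrogate pair (e.g. an emoji) is never cut in half.
function truncateErrorMessage(message: string, max: number = MAX_ERROR_MESSAGE_CHARS): string {
  return Array.from(message).slice(0, max).join("");
}
```

This shows why truncation is not redaction: an echoed prompt shorter than the cap passes through unchanged.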

The pattern in all of these: the question is not "can we ship this without privacy risk?", it is "can we ship this without undoing the property that the database cannot leak content?". When the answer is unclear, we default to not shipping the feature, and we put the question in a list like this one so it stays visible.