The exact prompt-caching strategy that makes our AI summaries 90% cheaper

Pro AI features run inference. Inference costs money. The naive way to run something like AI Project Context — synthesise a 250–450 word brief across six recordings — is to send all six transcripts to the model, every time the user opens the project. At 25K input tokens per recording, six recordings is 150K input tokens per request. Open the project ten times a week, that's 1.5M input tokens per project per week, at roughly $3 per million for the model we use. One project. One user. Three dollars a week on inference, at a $24.99/mo intro (or $34.99/mo standard) subscription. The math does not work.

The fix is structural. We cache aggressively, against a key that captures exactly what's allowed to change.

The cache key.

Every Pro AI synthesis is keyed to a content signature — a SHA-256 hash computed over the deterministic inputs that affect the output. For Project Context, that's the project's recording set: the recording UUIDs, their transcript content (post any speaker reassignments and inline edits), and the speaker library state for the people who appear. The hash is recomputed on every project open. If the hash matches the cached version, the brief is served from cache; if it doesn't, the brief is regenerated and the cache is updated.

What's deliberately not in the cache key: the user's UI state, the time of day, the device opening the project, the model version (we cache per-model). What is in the key: every byte that could change the output if the model saw it.

The result is that opening a project ten times a week, with no underlying changes, costs roughly the same as opening it once. Add a recording, the cache invalidates — but only that project's. Edit a transcript, same. Reassign a speaker, same. The cache invalidation is precise; we don't blow away the world when one thing changes.

What 90% means here.

"90% cheaper" is not marketing rounding. We instrument every Pro AI inference call and tag it with cache-status (hit / miss / partial). Across our last three months of production traffic, 91.4% of Pro AI requests served from cache. The remaining 8.6% were genuine cache invalidations — new recordings, transcript edits, speaker reassignments — and were the only requests that touched the model.

Per-user inference cost dropped from roughly $14/month at the naive baseline to roughly $1.20/month at the cached baseline, on the heavy-user cohort. The average user costs us less than $0.30/mo on Pro AI inference. That's the gap that lets us run Pro AI at $24.99/mo intro (rolling to $34.99 standard) with margin to invest in better models — and to subsidise the 60-day intro window in the first place.

The provider's prompt-cache layer, on top of ours.

Modern frontier model providers (Anthropic, OpenAI) ship a prompt-cache layer of their own — you can mark a prompt prefix as cacheable, and subsequent requests with the same prefix bypass most of the input-token cost. We use this layer too, on top of our own content-signature cache. The combination is what gets us to single-digit-cents per summary on the cache-miss path.

The prefix we cache is the system prompt plus the framework definitions plus any project-specific scaffold. The variable suffix is the actual transcript content. A cache miss in our content-signature cache still hits a partial-cache in the provider's prompt-cache layer, which means a "miss" doesn't cost full-prompt prices either. Layered caches compound.

What we deliberately don't cache.

Two things, on principle.

Compatibility Analysis output. The consent gate is tied to the recording's consent state, which can be revoked. Caching the analysis would create a window where a revoked-consent recording's analysis lingers in cache; the simplest fix is just not to cache it. Compatibility runs end-to-end every time. It is the most expensive Pro AI feature per invocation, and that is intentional.

Anything that crosses a privacy boundary. The cache lives in your private iCloud, encrypted, scoped to your account. We do not maintain a Bonfiyah-side cache that would amount to a transcript copy on our infrastructure. The architectural decision predates the privacy commitment — but it is also the reason the privacy commitment is enforceable.

Why publish this.

Two reasons. First, because the prompt-caching strategy is one of the load-bearing decisions that lets a deliberately-bounded iOS team profitably run a Pro AI tier — and the post-AI-development indie-hacker audience is who I want to publish into. There are many small teams who could ship something like this if the caching architecture were less of a black box.

Second, because I want it to be checkable. If we ever change this — if the cache hit-rate falls because we expanded a feature carelessly, if the per-summary cost drifts up — the gap shows up in our pricing math first. Publishing the architecture means the math is auditable.

— Richard

The prompt-caching strategy that makes our AI summaries 90% cheaper.

The cache key.

What 90% means here.

The provider's prompt-cache layer, on top of ours.

What we deliberately don't cache.

Why publish this.

Two ways to make Bonfiyah cheaper.

More posts like this