How we caught three false-positive speaker IDs

Cross-recording voice ID is the layer beneath Promise Tracker, People Memory, Team Dynamics, Compatibility, and a handful of other Pro AI features. When it works, the user never thinks about it: Sarah is one identity; her Promise Tracker history rolls up cleanly; her Speaker Insights profile is coherent. When it fails, every downstream feature inherits the failure, and the failure is hard to detect because the transcripts are internally consistent — they're just attributing words to the wrong person.

This post covers three production cases since launch where it did fail, what surfaced each one, and what shipped as a result.

Case 1 — Two siblings, one identity.

A user emailed support: their Promise Tracker was telling them their brother had agreed to four things, but the brother had only been on two of those calls. The other two were their other brother. Two siblings, similar register, in-room recordings, both bound to the same speaker library identity.

The two voice signatures were within 0.18 cosine distance of each other, well below the splitting threshold. Same household, same accent, similar fundamental frequency, similar speech rate. The model was not "wrong" so much as the two voices were genuinely close in the embedding space.

What shipped: the ability to manually split an identity in the speaker library. Tap the identity, see the constituent recordings, choose which recordings belong to which sibling. Each side keeps its own embedding centroid. Future recordings now match against two centroids; the closer one wins. This works perfectly for the user-flagged case but doesn't help users who haven't flagged the issue. Which is why we then shipped...
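To make the two-centroid matching concrete, here is a minimal sketch. It assumes embeddings are plain non-zero float vectors and that a centroid is the element-wise mean of an identity's recordings; the function names and the toy 3-dimensional vectors are illustrative, not our production code.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def centroid(embeddings):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

def closest_identity(embedding, centroids):
    """Return (name, distance) for the nearest centroid."""
    return min(
        ((name, cosine_distance(embedding, c)) for name, c in centroids.items()),
        key=lambda pair: pair[1],
    )

# After a manual split, each sibling keeps a centroid built from the
# recordings the user assigned to them (toy values, not real embeddings).
centroids = {
    "sibling_a": centroid([[0.9, 0.1, 0.3], [1.0, 0.2, 0.3]]),
    "sibling_b": centroid([[0.8, 0.5, 0.3], [0.9, 0.6, 0.2]]),
}

new_recording = [0.8, 0.6, 0.3]
name, distance = closest_identity(new_recording, centroids)
print(name, round(distance, 4))
```

With the toy values above, the new recording lands much closer to the second centroid, so it binds to that sibling rather than to a single blended identity.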

Case 2 — Same person, two identities.

The opposite failure. A user noticed that Speaker Insights had two profiles for what was clearly one person — same name, same workplace, same recurring topics. The user had been recording on a Mac (close-mic via the built-in array) and on an iPhone (in-room, four feet away). The channel difference was substantial enough that the close-mic embeddings and the in-room embeddings drifted across the matching threshold; the same voice produced two distinct centroids.

This was a calibration miss. We'd tested cohort matching across iPhone-to-iPhone and across the same iPhone over time, but our test corpus didn't have systematic close-mic Mac vs. in-room iPhone pairs of the same speaker. The mismatch was real.

What shipped: a "merge identities" affordance in the speaker library, plus a recalibrated channel-aware similarity threshold. The merge action joins the centroids and tells future matching to weight intra-channel similarity differently from inter-channel similarity. We also added a Mac vs. iPhone augmentation to our internal evaluation set, which is how Case 3 surfaced.
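One simple way to picture a channel-aware threshold is a looser distance limit when the two recordings come from different capture channels. The sketch below assumes exactly that; the threshold values and channel names are made up for illustration and are not our calibrated numbers, and the real matcher may weight similarity rather than switch limits.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values, chosen only to make the example readable.
    same_channel: float = 0.30
    cross_channel: float = 0.45  # looser: channel mismatch inflates distance

def is_same_speaker(distance, channel_a, channel_b, thresholds=Thresholds()):
    """Decide a match using a limit that depends on whether channels differ."""
    if channel_a == channel_b:
        limit = thresholds.same_channel
    else:
        limit = thresholds.cross_channel
    return distance <= limit

# The same embedding distance is a match across channels but not within one.
print(is_same_speaker(0.40, "mac_close_mic", "iphone_in_room"))
print(is_same_speaker(0.40, "iphone_in_room", "iphone_in_room"))
```

The design point is that a single global threshold forces a trade-off: tight enough to keep siblings apart on one device, it splits one person across two devices.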

Case 3 — One label, two voices, silently mixed.

The most insidious. A four-person standup, recorded on an iPhone in the room. The upstream transcription provider's diarization returned three speaker labels (Speaker A, B, C) for what should have been four. We assumed the meeting had only three participants. The user noticed when the Promise Tracker output had attributed a commitment to "Marcus" that was actually said by "Dani." Looking at the transcript, half of "Marcus" was Marcus and half was Dani, who had a very similar voice and was sitting close to the microphone.

This is a hybrid cluster. The upstream transcription provider did not split the two voices into separate diarization labels, so our cohort layer had nothing to work with. It bound the merged label to "Marcus" because the first few utterances were Marcus's, and the rest carried forward.

Detecting this in production is hard because the transcript looks correct. The user only caught it because they remembered the specific commitment.

What shipped: the per-utterance on-device correction pass that's now part of every recording. Within each cluster assigned by the upstream provider, we compute per-utterance embeddings and re-cluster by ECAPA distance. If two sub-clusters separate cleanly above a calibrated margin, the cluster is split, the sub-clusters get reassigned identities, and the user never sees the original hybrid label. This is the pass described on the Voice ID page.
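The split decision can be sketched as follows. This is a deliberately simplified stand-in, not the shipped algorithm: it seeds two groups with the farthest-apart pair of utterances, assigns every utterance to the nearer seed, and splits only if the two group centroids end up farther apart than a margin. The function names, the toy 2-dimensional embeddings, and the margin value are all assumptions for illustration.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def centroid(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def maybe_split(embeddings, margin):
    """Return (left, right) utterance indices if the cluster looks hybrid, else None."""
    if len(embeddings) < 2:
        return None
    # Seed the two candidate sub-clusters with the farthest-apart utterances.
    i, j = max(
        combinations(range(len(embeddings)), 2),
        key=lambda p: cosine_distance(embeddings[p[0]], embeddings[p[1]]),
    )
    left, right = [], []
    for idx, emb in enumerate(embeddings):
        d_i = cosine_distance(emb, embeddings[i])
        d_j = cosine_distance(emb, embeddings[j])
        (left if d_i <= d_j else right).append(idx)
    if not right:  # every utterance is identical to the first seed
        return None
    gap = cosine_distance(
        centroid([embeddings[k] for k in left]),
        centroid([embeddings[k] for k in right]),
    )
    return (left, right) if gap > margin else None

# One diarization label that actually contains two voices (toy embeddings).
hybrid = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.95, 0.05]]
# One diarization label that really is a single voice.
clean = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05]]

print(maybe_split(hybrid, margin=0.3))
print(maybe_split(clean, margin=0.3))
```

The margin is what keeps the pass conservative: a genuinely single-speaker cluster always has a farthest pair, but its two halves stay close together, so it is left alone.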

On our 12,000-utterance evaluation set, the re-clustering pass corrects the majority (roughly 73%) of silent-error cases automatically. The remaining 27% can be caught via the Resplit Voice Matching surface, which lets you manually split a hybrid cluster from inside the recording view.

What we learned.

Three things, ranked by how much they've changed how we ship.

Voice ID failures look like nothing. The transcript reads fine; the summary reads fine; the speaker labels are stable. The only thing that's wrong is the attribution. We now monitor for cross-recording attribution drift — when a single identity's behavioural signature shifts more than a calibrated amount between consecutive recordings, the identity is flagged for human review. About 1.4% of identity timelines hit this flag in any given month. Most are real (someone got a cold, the audio environment changed) but the catch-rate on actual bad bindings has gone up.
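A drift check of this kind can be sketched in a few lines. The sketch assumes each recording contributes one signature vector per identity and that "shift" is the cosine distance between consecutive signatures; the function names and the threshold value are illustrative, not our calibrated setting.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def drift_flags(timeline, max_shift):
    """Indices of recordings whose signature moved more than max_shift
    away from the immediately preceding recording of the same identity."""
    return [
        idx
        for idx in range(1, len(timeline))
        if cosine_distance(timeline[idx - 1], timeline[idx]) > max_shift
    ]

# One identity's signatures in recording order (toy values).
timeline = [[1.0, 0.0], [0.98, 0.05], [0.2, 0.9]]
flagged = drift_flags(timeline, max_shift=0.2)
print(flagged)  # indices to send for human review
```

A flag is only a prompt for review. As noted above, most flagged timelines have an innocent explanation, so the check should surface candidates rather than rebind anything automatically.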

Hybrid clusters are the worst kind of failure. If diarization splits two voices into separate labels and we get the names wrong, the user catches it the moment they open the transcript. If diarization merges two voices and we silently propagate the merge, the user finds out three weeks later when the Promise Tracker says someone promised something they never said. The automatic per-utterance correction pass is the most consequential thing we shipped post-launch.

Our evaluation set was thinner than we thought. The training and evaluation corpus we'd inherited was too clean — single-channel, controlled audio, evenly-distributed speakers. Real-world data is messier in specific ways (channel-mix, voice-similarity edge cases, hybrid clusters). We've expanded the evaluation set materially since each of these cases. The 97.3% top-1 accuracy number we cite on the Voice ID page is from the expanded corpus, not the original. We do not benchmark against the cleaner corpus internally because that number would be misleading.

Why publish this.

Voice biometrics are a category where over-promising and under-disclosing are the norm. Most products in adjacent categories quote a single accuracy number, refuse to discuss failure modes, and treat the binding as authoritative for downstream features. We are not in a position — and do not want to be — to do that. Our Voice ID powers Compatibility Analysis, which is a feature with real consequences if the binding is wrong. The system has to be auditable, the failure modes have to be public, and the fixes have to be checkable.

If you're a legal, medical, or research evaluator looking at Bonfiyah for serious work, this is the post we want you to have read.

— Richard
