Call Transcription: Accuracy, Benchmarks, and Residency

Compare call transcription accuracy benchmarks, pricing models, and data residency requirements for HIPAA, GDPR, and LGPD compliance.

May 25, 2026 | Tessera | 7 min read | Salad, AssemblyAI, OpenAI, Whisper, Deepgram

Call transcription converts call audio into searchable, analyzable text. This guide covers how accuracy is measured, how pricing models compare, and which compliance rules govern your data.

For implementation details, see our audio transcription API documentation.

How Call Transcription Works

The transcription pipeline has three stages: preprocessing, speech-to-text inference, and post-processing. Each stage affects final accuracy and latency.

Preprocessing and Signal Cleaning

Raw call-center audio contains background noise, varying sample rates, and overlapping speech. Preprocessing applies noise reduction, normalizes sample rates, and prepares the signal for the model. Models trained on clean speech degrade quickly when exposed to telephony artifacts, so this step is critical.

Diarization and Speaker Identification

Diarization labels each audio segment with the corresponding speaker. A transcript without speaker labels is difficult to audit or analyze. Advanced systems use voice activity detection and speaker embedding models to distinguish agents, customers, and hold music.
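
As a rough illustration of the voice activity detection step, the sketch below marks speech spans with a simple per-frame energy threshold. This is an assumption-laden toy: production systems typically use trained VAD and speaker embedding models, and the function names, the 20 ms frame size, and the 0.02 threshold are all hypothetical choices for this example.

```python
import math


def frame_rms(samples, frame_len):
    """Yield the root-mean-square energy of each full frame."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        yield math.sqrt(sum(s * s for s in frame) / frame_len)


def detect_speech(samples, sample_rate=8000, frame_ms=20, threshold=0.02):
    """Return (start_sec, end_sec) spans whose frame energy reaches the threshold."""
    frame_len = sample_rate * frame_ms // 1000
    segments, seg_start, n_frames = [], None, 0
    for idx, rms in enumerate(frame_rms(samples, frame_len)):
        n_frames = idx + 1
        t = idx * frame_ms / 1000
        if rms >= threshold and seg_start is None:
            seg_start = t                      # speech begins
        elif rms < threshold and seg_start is not None:
            segments.append((seg_start, t))    # speech ends
            seg_start = None
    if seg_start is not None:                  # audio ended mid-speech
        segments.append((seg_start, n_frames * frame_ms / 1000))
    return segments


# Synthetic 8 kHz signal: 0.5 s silence, 1.0 s of a 200 Hz tone, 0.5 s silence.
rate = 8000
silence = [0.0] * (rate // 2)
tone = [0.3 * math.sin(2 * math.pi * 200 * n / rate) for n in range(rate)]
segments = detect_speech(silence + tone + silence, sample_rate=rate)
print(segments)
```

Each detected span would then be passed to a speaker embedding model to assign a speaker label; that step is not shown here.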

Inference and Post-Processing

Neural networks convert speech to text, then post-processing adds punctuation, capitalization, and speaker tags. Some models also perform intent classification or entity extraction within the same step.

Tessera supports models optimized for these tasks. See available models to find the right balance of speed and accuracy.

Call Transcription Accuracy Benchmarks

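Word Error Rate (WER) measures the share of incorrectly recognized words: the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript. Lower is better.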
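
WER is conventionally computed as the word-level edit distance between a reference transcript and the model's output, divided by the reference word count. The sketch below shows that calculation. The function names are ours, and the lowercase-and-split normalization is a simplifying assumption; published benchmarks usually apply more careful text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / words in the reference."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


reference = "please confirm your account number before we continue"
hypothesis = "please confirm your count number before continue"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}")  # one substitution + one deletion over 8 words -> 25.0%
```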

Benchmark Data and Comparisons

Salad’s benchmark reports WER as low as 4.2% to 4.9% on Common Voice and TED-LIUM. OpenAI Whisper scores 7.3% to 9.75% WER on the same datasets. Deepgram Nova 2 ranges from 5.56% to 12.43% WER.

Whisper large-v3 reduces errors by 10% to 20% compared to large-v2.

Lab vs. Real-World Performance

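Clean read-speech datasets like LibriSpeech overstate performance on call audio. Evaluate on conversational and noisy datasets such as Switchboard, CHiME, or CallHome, which better reflect spontaneous and overlapping speech. General-purpose models tend to struggle with domain jargon, while models fine-tuned on in-domain calls usually handle it better.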

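Multilingual contact centers benefit from models trained on multilingual datasets such as FLEURS or Mozilla Common Voice. Test code-switching separately, because both datasets consist mainly of single-language utterances.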

Real-World Audio Challenges

Call audio rarely matches clean benchmarks. Codecs like G.711 or Opus introduce artifacts, and network jitter, packet loss, and background noise further degrade accuracy. Studio-trained models see WER spike 15% to 30% on telephony audio.

Ask vendors for performance breakdowns by accent region, not aggregate scores.

Understanding WER vs. CER

WER is the industry standard, but some teams track Character Error Rate (CER) for complex character sets or punctuation. For voice agents parsing structured commands, CER offers a more precise downstream metric.
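
To see why the two metrics can diverge on structured commands, the sketch below scores the same hypothetical utterance both ways by changing only the unit of comparison. The example sentence and helper names are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance over any two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units) -> float:
    """Edit distance divided by the number of reference units."""
    return edit_distance(ref_units, hyp_units) / len(ref_units)


reference = "transfer 250 to account 4417"
hypothesis = "transfer 250 to account 4470"

wer = error_rate(reference.split(), hypothesis.split())  # units are words
cer = error_rate(list(reference), list(hypothesis))      # units are characters
print(f"WER: {wer:.1%}  CER: {cer:.1%}")
```

WER counts the wrong account number as a single word error regardless of how many digits differ, while CER reflects the two mistaken characters. For a downstream parser, neither number says whether the command is still usable, so it is worth also tracking exact-match accuracy on the extracted fields.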

Latency and Throughput Considerations

Accuracy is only half the equation. Latency determines whether a transcription system can support real-time use cases.

Voice agents and live captioning require p95 latency under 500 milliseconds. Compliance archiving and post-call analytics can tolerate 2 to 5 seconds of processing delay. High-volume contact centers should look for vendors offering horizontal scaling, GPU acceleration, and clear SLAs on processing time.
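
A p95 figure can be checked directly from logged request latencies. The sketch below uses the nearest-rank percentile method, which is one of several common definitions; the latency samples are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical per-request latencies in milliseconds.
latencies_ms = [210, 180, 250, 300, 190, 220, 480, 205, 230, 610,
                195, 240, 260, 215, 185, 225, 235, 245, 200, 270]

p95 = percentile(latencies_ms, 95)
budget_ms = 500  # real-time threshold cited above
print(f"p95 = {p95} ms, within budget: {p95 < budget_ms}")
```

Note that the single 610 ms outlier does not break the budget here, which is why p95 is usually reported alongside p99 or the maximum.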

Streaming vs. Batch Processing

Streaming transcription delivers text in near real-time as audio is captured, requiring sub-second latency to support live captioning or AI agent routing. Batch processing handles completed audio files, allowing higher-accuracy models at the cost of delay. Some systems combine both: an initial low-latency transcript is refined asynchronously by a higher-accuracy model.

Map your use case to the appropriate mode. Live routing demands streaming, while compliance archiving benefits from batch refinement.

Pricing Models: Flat vs. Pay-Per-Minute

Pricing structure directly affects total cost of ownership.

Pay-Per-Minute Pricing

Pay-per-minute pricing ties costs directly to call volume. Sudden call surges raise costs unpredictably, and depending on the vendor you may pay for every minute processed, including retries and failed attempts.

Flat Monthly Pricing

Flat monthly pricing caps spend regardless of call volume, making unit economics predictable. For organizations processing millions of minutes per month, this model is often more cost-effective.

Hidden Costs in Pricing

Beyond base rates, watch for minimum monthly commitments, retention fees for audio stored beyond standard windows, and per-call API fees for diarization or custom vocabulary loading. Model your total cost of ownership using peak call volumes plus a 20% buffer for seasonal spikes.
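
The buffer rule above can be turned into a small cost model. The sketch below compares a pay-per-minute plan with a flat fee at peak volume plus 20%. Every price and volume in it is a hypothetical placeholder, not a vendor quote, and it assumes the add-on fees apply only to the usage-based plan.

```python
def planning_volume(peak_minutes: float, buffer: float = 0.20) -> float:
    """Peak monthly minutes plus a seasonal buffer (20% by default)."""
    return peak_minutes * (1 + buffer)


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    return minutes * rate_per_minute


def break_even_minutes(flat_monthly_fee: float, rate_per_minute: float) -> float:
    """Monthly volume above which the flat fee is cheaper than paying per minute."""
    return flat_monthly_fee / rate_per_minute


# Hypothetical inputs for illustration only.
peak = 400_000        # minutes in the busiest month
rate = 0.006          # USD per minute
flat_fee = 2_500.0    # USD per month
extras = 150.0        # USD per month for storage, diarization, and similar add-ons

volume = planning_volume(peak)
usage_total = per_minute_cost(volume, rate) + extras
print(f"Planning volume: {volume:,.0f} min")
print(f"Pay-per-minute: ${usage_total:,.2f}  Flat: ${flat_fee:,.2f}")
print(f"Break-even: {break_even_minutes(flat_fee, rate):,.0f} min/month")
```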

Enterprise contracts often bundle these features, but usage-based plans charge per feature. See tiers and limits for how Tessera structures usage.

Volume Tiers and Commitment Discounts

Most vendors offer graduated pricing where the per-minute rate drops as monthly usage increases, with additional discounts for annual prepayment. Model costs across a 12 to 24-month horizon, accounting for expected growth. Negotiate clear overage caps and grace periods to protect against unpredictable call surges.
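
A growth projection under tiered pricing might look like the sketch below. It assumes marginal tiers, where each slice of usage is billed at its own tier's rate; some vendors instead apply the lower rate to all minutes once a threshold is crossed, so check the contract. The tier boundaries, rates, and growth figure are hypothetical.

```python
# Hypothetical tiers: (upper bound of the tier in minutes, USD per minute).
TIERS = [
    (100_000, 0.010),
    (500_000, 0.008),
    (float("inf"), 0.006),
]


def graduated_cost(minutes: float, tiers=TIERS) -> float:
    """Charge each slice of usage at the rate of the tier it falls into."""
    total, lower = 0.0, 0.0
    for upper, rate in tiers:
        if minutes <= lower:
            break
        total += (min(minutes, upper) - lower) * rate
        lower = upper
    return total


def project_costs(start_minutes: float, monthly_growth: float, months: int):
    """Yield (month, minutes, cost) assuming compound monthly growth."""
    minutes = start_minutes
    for month in range(1, months + 1):
        yield month, minutes, graduated_cost(minutes)
        minutes *= 1 + monthly_growth


total = sum(cost for _, _, cost in project_costs(80_000, 0.05, 12))
print(f"12-month projected spend: ${total:,.2f}")
```

Running the same projection with and without an overage cap makes the value of that negotiation point concrete.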

Data Residency and Compliance Requirements

Data residency rules dictate where audio and transcripts are processed and stored. Contractual and technical controls keep sensitive data within jurisdiction boundaries.

HIPAA (US)

HIPAA focuses on protection, not location. PHI may be processed anywhere if safeguards exist.

BAA Required: Vendors must sign a Business Associate Agreement before handling PHI.
Residency Claims: US-only processing is a commercial claim. Verify region-locking capabilities.

GDPR + Article 9 (EU)

Clinic recordings often contain health data, which is classified as special category data under GDPR Article 9.

Processing Location: Transfers outside the EEA require a lawful basis and a Chapter V transfer mechanism.
Contracts: A Data Processing Agreement (Article 28) is mandatory. Transfers that rely on SCCs also need a Transfer Impact Assessment.
Schrems II & AI Act: Under Schrems II, SCCs may need supplementary measures where the destination country's surveillance laws undermine their protections. The AI Act adds transparency and logging requirements but does not replace GDPR. See our AI Act compliance guide.

LGPD (Brazil) & Mexico (LFPDPPP)

Both laws regulate sensitive health data without mandating local-only processing.

Contracts: A data processing agreement defines security obligations and instruction limits.
Transfers: Cross-border transfers need a valid legal basis (LGPD Article 33; Mexican consent/notice frameworks).
Verification: Vendors must prove hosting regions, backups, and subprocessor chains.

Data Retention, Deletion, and Encryption

GDPR and LGPD grant erasure rights. Vendors must purge audio, transcripts, and metadata from production and backups. Verify automated retention policies and deletion certificates.
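
An automated retention policy reduces to a scheduled job that selects records older than the retention window. The sketch below shows only that selection step. The `Recording` type, the field names, and the 90-day window are hypothetical, and a real job would also need to purge transcripts, metadata, and backup copies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Recording:
    recording_id: str
    created_at: datetime


def due_for_deletion(recordings, retention_days, now=None):
    """Return IDs of recordings older than the retention window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [r.recording_id for r in recordings if r.created_at < cutoff]


now = datetime(2026, 5, 25, tzinfo=timezone.utc)
recordings = [
    Recording("call-001", now - timedelta(days=120)),
    Recording("call-002", now - timedelta(days=10)),
]
expired = due_for_deletion(recordings, retention_days=90, now=now)
print(expired)
```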

Audio and transcripts must be encrypted in transit (TLS 1.2+) and at rest (AES-256). Vendors should provide key management options, including customer-managed keys (CMK). Audit logs must track every access event, and regular third-party audits (SOC 2 Type II, ISO 27001) verify security controls.
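
On the client side, the TLS 1.2+ requirement can be enforced rather than assumed. The sketch below builds a Python `ssl` context that rejects older protocol versions; the helper name is ours. Encryption at rest with AES-256 is handled by the storage layer or a key management service and is not shown.

```python
import ssl


def make_tls_context() -> ssl.SSLContext:
    """Client context that verifies certificates and refuses anything older than TLS 1.2."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_tls_context()
print(context.minimum_version.name, context.verify_mode.name, context.check_hostname)
```

The resulting context can be passed to standard-library clients, for example `urllib.request.urlopen(url, context=context)`, when uploading audio to a transcription endpoint.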

How to Vet Your Transcription Vendor

Use this checklist to evaluate vendors against compliance and technical requirements.

Universal Questions

Are you a processor or business associate of our organization?
Will you sign the required contract before any data is shared?
- HIPAA: BAA
- GDPR: DPA under Article 28
- LGPD/Mexico: Processing/transfer agreement
Where are production, backups, and support-access locations?
Do any subprocessors, remote admins, or AI model providers outside the required jurisdiction access the data?
What transfer mechanism do you use if data leaves the country or region?
Can you provide an up-to-date subprocessor list and architecture diagram?
Can you prove all residency and no-export claims with contract and technical controls?

Law-Specific Requirements

HIPAA: Sign a BAA before any PHI goes to the vendor.
GDPR: Sign a DPA and use SCCs plus a Schrems II transfer assessment for EEA exports.
LGPD: Sign processor terms and use an LGPD-compliant transfer basis under Articles 33 to 36.
Mexico: Ensure privacy notice alignment and transfer terms are in place for sensitive data.

Red Flags When Vetting Vendors

Vendors that refuse to share their subprocessor list or architecture diagrams lack transparency. Claims of zero data retention without automated deletion workflows are usually marketing copy. Avoid providers that cannot demonstrate their models handle your specific call codecs or accent profiles before you sign a contract.

FAQ

What is a good Word Error Rate for call transcription?

On clean speech, leading models reach WER under 5%. The benchmarks cited above show 4.2% to 4.9% on Common Voice and TED-LIUM, while Whisper ranges from 7.3% to 9.75%. Analytics use cases typically tolerate 10% to 15% WER, while transcripts used as legal evidence generally need under 5%.

Does call transcription require a BAA or DPA?

Yes. HIPAA mandates a BAA for US health data. GDPR requires a DPA under Article 28 plus SCCs for cross-border transfers. LGPD and Mexico’s LFPDPPP also require processor agreements for sensitive data. Missing these contracts risks regulatory fines.

How does data residency affect call transcription?

Residency dictates where audio and transcripts are processed and stored. Providers hosted in the EU or Latin America can keep data within jurisdiction boundaries, which simplifies GDPR and LGPD compliance. Buyers must still verify storage, backup, and support access locations to validate vendor claims.

What is the difference between Whisper and proprietary STT models?

Proprietary models are often optimized for telephony, low latency, or specific accents, and can outperform open-source models on domain jargon. Whisper is open source and excels in multilingual support; large-v3 reduces errors by 10% to 20% over large-v2.

How do I handle overlapping speech in call transcription?

Use vendors with advanced diarization and models trained on conversational datasets like CHiME or AMI. Systems using voice activity detection to segment audio before inference handle multi-speaker calls more accurately.

How do I measure transcription ROI?

Track time saved on manual note-taking, reduced compliance audit hours, and improved customer satisfaction. Calculate cost per accurate minute versus human transcription labor. Most organizations see positive ROI within three to six months.
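
A back-of-the-envelope version of that calculation is sketched below. "Cost per accurate minute" has no standard definition, so dividing price by word accuracy (1 - WER) is our assumption, and every input figure is hypothetical.

```python
def cost_per_accurate_minute(cost_per_minute: float, wer: float) -> float:
    """One possible definition: price per minute divided by word accuracy (1 - WER)."""
    if not 0 <= wer < 1:
        raise ValueError("wer must be in [0, 1)")
    return cost_per_minute / (1 - wer)


def monthly_net_savings(hours_saved: float, hourly_labor_cost: float,
                        transcription_cost: float) -> float:
    """Labor cost avoided each month minus what the transcription service costs."""
    return hours_saved * hourly_labor_cost - transcription_cost


def payback_months(one_time_cost: float, net_savings: float) -> float:
    if net_savings <= 0:
        raise ValueError("no payback without positive monthly savings")
    return one_time_cost / net_savings


# Hypothetical inputs for illustration only.
savings = monthly_net_savings(hours_saved=400, hourly_labor_cost=30.0,
                              transcription_cost=3_000.0)
months = payback_months(one_time_cost=36_000.0, net_savings=savings)
print(f"Net savings: ${savings:,.2f}/month, payback in {months:.1f} months")
print(f"Cost per accurate minute: ${cost_per_accurate_minute(0.009, 0.10):.4f}")
```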
