The brief
Our client, a European insurance provider, needed a more efficient way to evaluate their call center performance. In particular, they wanted to reduce reliance on manual analysis and introduce automation through an AI call center transcription and diarization solution.
Challenge
Before partnering with WebbyLab, our client’s call center quality analysts had to manually listen to call recordings, score conversations, and give recommendations for improvement. This approach was slow and resource-heavy.
The company also lacked the in-house AI expertise needed to implement a pilot solution quickly. Our team came to the rescue and was tasked with designing and validating an AI call center quality assurance system capable of:
- Converting audio into text with accurate support for Spanish dialects
- Performing speaker diarization to distinguish between agents and clients
- Running entirely on the client’s infrastructure to ensure data confidentiality
- Providing a backend service with APIs for smooth integration into the call center’s existing workflows
The technical hurdles included background noise, overlapping voices, mixed languages, and the need to process long recordings efficiently. In short, the system had to match the accuracy of human analysts while scaling far beyond the capacity of manual review.
Solution
To deliver the AI call center automation system our client requested, we worked through several stages of research and development. These included:
Speech recognition model evaluation and selection
We started with an in-depth analysis of leading speech recognition models, focusing on the following criteria:
- Spanish dialects support
- High transcription accuracy
- Local deployment capability
- Diarization support
- High performance and scalability
Candidates for evaluation included OpenAI’s Whisper, Mozilla’s DeepSpeech, and Facebook AI’s Wav2Vec 2.0.
After comparative testing on real call recordings, Whisper, extended with WhisperX, proved the most accurate. Its ability to handle accents, background noise, and even mixed languages with high fidelity made it the optimal choice.
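For illustration, a minimal WhisperX transcription pass might look like the sketch below. The model size, compute type, and file name are our assumptions rather than the exact production configuration:

```python
# Minimal WhisperX transcription sketch (model size, compute type, and
# file name are illustrative assumptions, not the production setup).
import whisperx

device = "cpu"  # the prototype ran on CPU; switch to "cuda" if a GPU is available

# Load Whisper through WhisperX; int8 quantization keeps CPU memory use modest
model = whisperx.load_model("large-v2", device, compute_type="int8")

audio = whisperx.load_audio("call_recording.wav")
result = model.transcribe(audio, language="es")  # force Spanish, or omit to auto-detect

# Re-align the output with a phoneme model for accurate word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code="es", device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in result["segments"]:
    print(f'[{segment["start"]:.1f}s-{segment["end"]:.1f}s] {segment["text"]}')
```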
Speaker diarization
To correctly separate agent and client speech, we integrated WhisperX with PyAnnote speaker diarization. This pipeline, sketched in code after the list below, enabled:
- Segmenting audio files with voice activity detection (VAD)
- Transcribing each segment with Whisper while running PyAnnote speaker diarization in parallel
- Assigning a speaker_id to each text segment
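The sketch below shows roughly how the diarization step plugs in, continuing the transcription example above. PyAnnote's diarization models are gated behind a Hugging Face token, and the exact entry points have shifted between WhisperX releases, so treat the names as indicative:

```python
# Diarization sketch, continuing the transcription example above
# (reuses `audio`, `result`, and `device`). The token is a placeholder.
import whisperx

HF_TOKEN = "hf_..."  # PyAnnote diarization models require a Hugging Face token

# PyAnnote-based diarization pipeline bundled with WhisperX
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=2)  # agent + client

# Attach a speaker_id to each transcribed segment and word
result = whisperx.assign_word_speakers(diarize_segments, result)

for segment in result["segments"]:
    print(segment.get("speaker", "UNKNOWN"), segment["text"].strip())
```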
Backend development and optimization
To embed the system into the client’s workflows, we built a backend service on Express.js with two main API endpoints:
- POST /transcribe. Accepts audio files, queues them for processing in Redis, and returns a request_id.
- GET /status/{id}. Allows clients to check whether a task is pending, processing, completed, or failed, and retrieve transcripts when ready.
Processed data is stored in MySQL, with audio files saved locally but ready for migration to cloud storage if needed. Redis allowed us to scale horizontally, handling multiple audio jobs in parallel.
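A typical client interaction with these endpoints might look like the following Python sketch. The endpoint paths, request_id field, and status values come from the description above; the base URL, upload field name, and transcript field are illustrative assumptions:

```python
# Hypothetical client-side usage of the transcription API described above.
import time
import requests

BASE_URL = "http://localhost:3000"  # assumed deployment address

# Submit an audio file; the service queues the job in Redis
with open("call_recording.wav", "rb") as f:
    response = requests.post(f"{BASE_URL}/transcribe", files={"audio": f})
request_id = response.json()["request_id"]

# Poll until the job is no longer pending or processing
while True:
    status = requests.get(f"{BASE_URL}/status/{request_id}").json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(5)

if status["status"] == "completed":
    print(status["transcript"])  # field name assumed for illustration
```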
Results
The R&D prototype of our multilingual AI speech-to-text system was tested on over 50 real call center recordings. Key outcomes included:
- Transcription quality. ~92% word-level accuracy (word error rate of ~8%; see the worked example after this list), even with background noise, accented speech, or mixed languages.
- Processing speed. 1 minute of audio processed in 45–80 seconds on CPU.
- Diarization accuracy. ~87% (correctly identified speakers in almost 9 out of 10 cases), with most errors occurring during cross-talk or very short interjections.
- Language support. Spanish dialects, with automatic dialect detection at ~90% accuracy.
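For context on the first metric: word error rate counts substitutions, deletions, and insertions against the number of words in a reference transcript, so a WER of ~8% corresponds to roughly 92% word-level accuracy. Here is a toy example using the jiwer library (our illustrative choice; the case study does not state which evaluation tooling was used):

```python
# Word error rate = (substitutions + deletions + insertions) / reference words.
# jiwer is an illustrative tooling choice, not necessarily what was used here.
from jiwer import wer

reference = "buenos dias le llamo por mi poliza de seguro"  # 9 words
hypothesis = "buenos dias llamo por mi poliza del seguro"   # 1 deletion, 1 substitution

error = wer(reference, hypothesis)
print(f"WER: {error:.2%}")  # 22.22% on this toy pair; accuracy is roughly 1 - WER
```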
The business impact was also significant:
- Manual listening time was reduced by 70–80%, allowing analytics teams to focus on edge cases rather than every call
- The volume of analyzed calls increased dramatically, providing broader insights into customer satisfaction
- Reviewing one call now takes less than a minute compared to 5–10 minutes previously
Implementing AI speech-to-text with diarization provides a solid foundation for further integration with NLP and sentiment analysis, opening the door to advanced customer experience analytics.
We used a range of technologies throughout the development process:
- Speech recognition: Whisper, WhisperX
- Diarization / VAD: PyAnnote
- Audio processing (ML): Python
- Backend API: Node.js, Express.js
- ORM / DB access: Sequelize.js
- Job queue: Redis
- Database: MySQL
- Media storage: Amazon S3, MinIO
- Frontend (dashboard): React, Zustand
- Deployment / Infrastructure: AWS ECS on Fargate