The brief
Our client, a European insurance provider, needed a more efficient way to evaluate their call center performance. In particular, they wanted to reduce reliance on manual analysis and introduce automation through an AI call center transcription and diarization solution.
Challenge
Before partnering with WebbyLab, our client’s call center quality analysts had to manually listen to call recordings, score conversations, and give recommendations for improvement. This approach was slow and resource-heavy.
The company also lacked the in-house AI expertise needed to implement a pilot solution quickly. Our team came to the rescue and was tasked with designing and validating an AI call center quality assurance system capable of:
- Converting audio into text with accurate support for Spanish dialects
- Performing speaker diarization to distinguish between agents and clients
- Running entirely on the client’s infrastructure to ensure data confidentiality
- Providing a backend service with APIs for smooth integration into the call center’s existing workflows
The technical hurdles included background noise, overlapping voices, mixed languages, and the need to process long recordings efficiently. In short, the system had to match the accuracy of human analysts while scaling far beyond the capacity of manual review.
Solution
To deliver the AI call center automation system our client requested, we worked through several stages of research and development. These included:
Speech recognition model evaluation and selection
We started with an in-depth analysis of leading speech recognition models, focusing on the following criteria:
- Spanish dialects support
- High transcription accuracy
- Local deployment capability
- Diarization support
- High performance and scalability
Candidates for evaluation included OpenAI’s Whisper, Mozilla’s DeepSpeech, and Facebook AI’s Wav2Vec 2.0.
After comparative testing on real call recordings, Whisper, extended with WhisperX, proved the most accurate. Its ability to handle accents, background noise, and even mixed languages with high fidelity made it the optimal choice.
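For illustration, a minimal WhisperX transcription pass might look like the sketch below. The model size, compute type, and file name are our assumptions rather than the exact production configuration:

```python
# Minimal WhisperX transcription sketch (model size, compute type, and
# file name are illustrative assumptions, not the production setup).
import whisperx

device = "cpu"  # the prototype ran on CPU; switch to "cuda" if a GPU is available

# Load Whisper through WhisperX; int8 quantization keeps CPU memory use modest
model = whisperx.load_model("large-v2", device, compute_type="int8")

audio = whisperx.load_audio("call_recording.wav")
result = model.transcribe(audio, language="es")  # force Spanish, or omit to auto-detect

# Re-align the output with a phoneme model for accurate word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code="es", device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in result["segments"]:
    print(f'[{segment["start"]:.1f}s-{segment["end"]:.1f}s] {segment["text"]}')
```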
Speaker diarization
To correctly separate agent and client speech, we integrated WhisperX with PyAnnote speaker diarization. This pipeline, sketched in code after the list below, enabled:
- Segmenting audio files with voice activity detection (VAD)
- Transcribing each segment with Whisper while running PyAnnote speaker diarization in parallel
- Assigning a speaker_id to each text segment
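The sketch below shows roughly how the diarization step plugs in, continuing the transcription example above. PyAnnote's diarization models are gated behind a Hugging Face token, and the exact entry points have shifted between WhisperX releases, so treat the names as indicative:

```python
# Diarization sketch, continuing the transcription example above
# (reuses `audio`, `result`, and `device`). The token is a placeholder.
import whisperx

HF_TOKEN = "hf_..."  # PyAnnote diarization models require a Hugging Face token

# PyAnnote-based diarization pipeline bundled with WhisperX
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=2)  # agent + client

# Attach a speaker_id to each transcribed segment and word
result = whisperx.assign_word_speakers(diarize_segments, result)

for segment in result["segments"]:
    print(segment.get("speaker", "UNKNOWN"), segment["text"].strip())
```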
Backend development and optimization
To embed the system into the client’s workflows, we built a backend service on Express.js with two main API endpoints:
- POST /transcribe. Accepts audio files, queues them for processing in Redis, and returns a request_id.
- GET /status/{id}. Allows clients to check whether a task is pending, processing, completed, or failed, and retrieve transcripts when ready.
Processed data is stored in MySQL, with audio files saved locally but ready for migration to cloud storage if needed. Redis allowed us to scale horizontally, handling multiple audio jobs in parallel.
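A typical client interaction with these endpoints might look like the following Python sketch. The endpoint paths, request_id field, and status values come from the description above; the base URL, upload field name, and transcript field are illustrative assumptions:

```python
# Hypothetical client-side usage of the transcription API described above.
import time
import requests

BASE_URL = "http://localhost:3000"  # assumed deployment address

# Submit an audio file; the service queues the job in Redis
with open("call_recording.wav", "rb") as f:
    response = requests.post(f"{BASE_URL}/transcribe", files={"audio": f})
request_id = response.json()["request_id"]

# Poll until the job is no longer pending or processing
while True:
    status = requests.get(f"{BASE_URL}/status/{request_id}").json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(5)

if status["status"] == "completed":
    print(status["transcript"])  # field name assumed for illustration
```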
Results
The R&D prototype of our multilingual AI speech-to-text system was tested on over 50 real call center recordings. Key outcomes included:
- Transcription quality. ~92% word-level accuracy (word error rate of ~8%; see the worked example after this list), even with background noise, accented speech, or mixed languages.
- Processing speed. 1 minute of audio processed in 45–80 seconds on CPU.
- Diarization accuracy. ~87% (correctly identified speakers in almost 9 out of 10 cases), with most errors occurring during cross-talk or very short interjections.
- Language support. Spanish dialects, with automatic dialect detection at ~90% accuracy.
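For context on the first metric: word error rate counts substitutions, deletions, and insertions against the number of words in a reference transcript, so a WER of ~8% corresponds to roughly 92% word-level accuracy. Here is a toy example using the jiwer library (our illustrative choice; the case study does not state which evaluation tooling was used):

```python
# Word error rate = (substitutions + deletions + insertions) / reference words.
# jiwer is an illustrative tooling choice, not necessarily what was used here.
from jiwer import wer

reference = "buenos dias le llamo por mi poliza de seguro"  # 9 words
hypothesis = "buenos dias llamo por mi poliza del seguro"   # 1 deletion, 1 substitution

error = wer(reference, hypothesis)
print(f"WER: {error:.2%}")  # 22.22% on this toy pair; accuracy is roughly 1 - WER
```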
The business impact was also significant:
- Manual listening time was reduced by 70–80%, allowing analytics teams to focus on edge cases rather than every call
- The volume of analyzed calls increased dramatically, providing broader insights into customer satisfaction
- Reviewing one call now takes less than a minute compared to 5–10 minutes previously
Implementing AI speech-to-text with diarization provides a solid foundation for further integration with NLP and sentiment analysis, opening the door to advanced customer experience analytics.
We used a range of technologies throughout the development process:
- Speech recognition: Whisper, WhisperX
- Diarization / VAD: PyAnnote
- Audio processing (ML): Python
- Backend API: Node.js, Express.js
- ORM / DB access: Sequelize.js
- Job queue: Redis
- Database: MySQL
- Media storage: Amazon S3, MinIO
- Frontend (dashboard): React, Zustand
- Deployment / Infrastructure: AWS ECS on Fargate