
AI Call Center Automation with Speech Recognition and Diarization (MVP)

Research and development of an AI-driven speech recognition and diarization system, resulting in a locally deployed backend solution that automates call center conversation analysis
Client: NDA
Industry: Insurance, AI
Services provided: R&D, business analysis, backend development, AI/ML development, on-premise deployment, workflow integration
Duration: 2 months
Team: 2 professionals

The brief

Our client, a European insurance provider, required a more efficient method to evaluate their call center performance. In particular, they wanted to reduce reliance on manual analysts and introduce automation through an AI call center transcription and diarization solution.


Challenge

Before partnering with WebbyLab, our client’s call center quality analysts had to manually listen to call recordings, score conversations, and give recommendations for improvement. This approach was slow and resource-heavy.

The company also lacked the in-house AI expertise necessary to implement a pilot solution quickly. Our team stepped in and was tasked with designing and validating an AI call center quality assurance system capable of:

  • Converting audio into text with accurate support for Spanish dialects
  • Performing speaker diarization to distinguish between agents and clients
  • Running entirely on the client’s infrastructure to ensure data confidentiality
  • Providing a backend service with APIs for smooth integration into the call center’s existing workflows

The technical hurdles included handling background noise, overlapping voices, mixed languages, and the need to process long recordings efficiently. In short, the system had to match the accuracy of human analysts while scaling far beyond the capacity of manual review.


Solution

To deliver the AI call center automation system our client requested, we worked through several stages of research and development:

Speech recognition model evaluation and selection

We started with an in-depth analysis of leading speech recognition models, focusing on the following criteria:

  • Spanish dialects support
  • High transcription accuracy
  • Local deployment capability
  • Diarization support
  • High performance and scalability

Candidates for evaluation included OpenAI’s Whisper, Mozilla’s DeepSpeech, and Facebook AI’s Wav2Vec 2.0.

After comparative testing on real call recordings, Whisper, extended via WhisperX speech recognition, proved the most accurate. Its ability to handle accents, background noise, and even mixed languages with high fidelity made it the optimal choice.
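
For illustration, a minimal WhisperX transcription pass might look like the Python sketch below. The model size, CPU device, and compute type are assumptions for this example, and the exact API varies between WhisperX versions:

```python
import whisperx

device = "cpu"  # the prototype ran fully on the client's own infrastructure

# Load the Whisper model through WhisperX; "large-v2" and int8 are assumed choices
model = whisperx.load_model("large-v2", device, compute_type="int8")

# Load a call recording and transcribe it, hinting Spanish as the language
audio = whisperx.load_audio("call_recording.wav")
result = model.transcribe(audio, language="es")

# Align word-level timestamps, which WhisperX adds on top of vanilla Whisper
align_model, metadata = whisperx.load_align_model(language_code="es", device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in result["segments"]:
    print(f'[{segment["start"]:.1f}s - {segment["end"]:.1f}s] {segment["text"]}')
```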

Speaker diarization

To correctly separate agent and client speech, we integrated WhisperX with PyAnnote diarization. This enabled:

  • Segmenting audio files with voice activity detection (VAD)
  • Transcribing each segment with Whisper, while simultaneously handling PyAnnote speaker diarization
  • Assigning a speaker_id to each text segment, as sketched below
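
As a rough continuation of the transcription sketch above, the PyAnnote-based diarization wrapper that ships with WhisperX can be used as follows. The Hugging Face token is a placeholder, the speaker counts are assumed for a two-party call, and the wrapper's module path differs between WhisperX versions:

```python
import whisperx

device = "cpu"
HF_TOKEN = "hf_..."  # placeholder: PyAnnote's pretrained models need a Hugging Face token

# Run PyAnnote speaker diarization through WhisperX's wrapper
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)
audio = whisperx.load_audio("call_recording.wav")
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=2)  # agent + client

# Merge the diarization output with the aligned transcript (`result` from the
# previous sketch), attaching a speaker label to every text segment
result = whisperx.assign_word_speakers(diarize_segments, result)

for segment in result["segments"]:
    print(f'{segment.get("speaker", "UNKNOWN")}: {segment["text"]}')
```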

Backend development and optimization 

To embed the system into the client’s workflows, we built a backend service on Express.js with two main API endpoints:

  • POST /transcribe. Accepts audio files, queues them for processing in Redis, and returns a request_id.
  • GET /status/{id}. Allows clients to check whether a task is pending, processing, completed, or failed, and retrieve transcripts when ready.

Processed data is stored in MySQL, with audio files saved locally but ready for migration to cloud storage if needed. Redis lets the service scale horizontally, handling multiple audio jobs in parallel.
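
As a hedged illustration of that API contract, a client script could submit a recording and poll for the result as below. Only the endpoint paths come from the case; the host, form field, JSON field names, and polling interval are assumptions:

```python
import time
import requests

BASE_URL = "http://localhost:3000"  # placeholder host for the Express.js service

# Submit a recording; the service queues it in Redis and returns a request_id
with open("call_recording.wav", "rb") as audio_file:
    response = requests.post(f"{BASE_URL}/transcribe", files={"audio": audio_file})
request_id = response.json()["request_id"]

# Poll the status endpoint until the job is completed or failed
while True:
    status = requests.get(f"{BASE_URL}/status/{request_id}").json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(5)  # assumed polling interval

if status["status"] == "completed":
    for segment in status["transcript"]:
        print(segment["speaker_id"], segment["text"])
```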

Results

The R&D prototype of the multilingual AI speech-to-text system was tested on over 50 real call center recordings. Key outcomes included:

  • Transcription quality. ~92% accuracy (word error rate ~8%; see the sketch after this list), even under noisy conditions, with accented speech or mixed languages.
  • Processing speed. 1 minute of audio processed in 45–80 seconds on CPU.
  • Diarization accuracy. ~87% (correctly identified speakers in almost 9 out of 10 cases), with most errors occurring during cross-talk or very short interjections.
  • Language support. Spanish dialects, with automatic dialect detection accuracy of ~90%.
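
For context on the first figure: word error rate (WER) counts word-level substitutions, deletions, and insertions against a human reference transcript. One quick way to reproduce such a measurement is the open-source jiwer library; the sample strings below are invented for illustration and are not from the client's data:

```python
from jiwer import wer

# Invented reference (human) and hypothesis (machine) transcripts
reference = "buenos días le llamo por la póliza de mi coche"
hypothesis = "buenos días le llamo por la poliza de mi coche"

# WER = (substitutions + deletions + insertions) / words in the reference
error_rate = wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")  # one substitution over ten words = 10.00%
```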

The business impact was also significant:

  • Manual listening time was reduced by 70–80%, allowing analytics teams to focus on edge cases rather than every call
  • The volume of analyzed calls increased dramatically, providing broader insights into customer satisfaction
  • Reviewing one call now takes less than a minute compared to 5–10 minutes previously

Implementing AI speech-to-text with diarization has become a perfect foundation for further integration with NLP and sentiment analysis, opening the door to advanced customer experience analytics.

We used a range of technologies throughout the development process:


Speech recognition: Whisper, WhisperX

Diarization / VAD: PyAnnote

Audio processing (ML): Python

Backend API: Node.js, Express.js

ORM / DB access: Sequelize.js

Job queue: Redis

Database: MySQL

Media storage: Amazon S3, MinIO

Frontend (dashboard): React, Zustand

Deployment / Infrastructure: AWS ECS on Fargate
