arxiv:2601.18415

Pisets: A Robust Speech Recognition System for Lectures and Interviews

Published on Jan 26 · Submitted by Roman Derunets on Feb 9

Abstract

A three-component speech-to-text system combines Wav2Vec2, AST, and Whisper models with curriculum learning and uncertainty modeling to improve transcription accuracy and reduce hallucinations in Russian speech recognition.

AI-generated summary

This work presents "Pisets", a speech-to-text system for scientists and journalists based on a three-component architecture aimed at improving speech recognition accuracy while minimizing the errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false-positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition with Whisper. Curriculum learning methods and diverse Russian-language speech corpora significantly enhanced the system's effectiveness. Additionally, advanced uncertainty modeling techniques were introduced, contributing to further improvements in transcription quality. The proposed approaches yield more robust transcription of long audio recordings across varied acoustic conditions than WhisperX and the vanilla Whisper model. The source code of the "Pisets" system is publicly available on GitHub: https://github.com/bond005/pisets.

Community

Paper author · Paper submitter

This paper presents Pisets, an offline ASR system designed for long-form audio such as lectures and interviews, where standard end-to-end models often hallucinate or degrade.


Key idea:
Pisets uses a multi-stage pipeline instead of a single monolithic model (a minimal code sketch follows the list):

  1. Wav2Vec2-based speech detection to over-segment audio with high recall
  2. AST (Audio Spectrogram Transformer) to filter false positives
  3. Whisper for final transcription on cleaned segments
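
A minimal sketch of stages 2 and 3, assuming generic Hugging Face checkpoints (`MIT/ast-finetuned-audioset-10-10-0.4593` and `openai/whisper-small`) and a simple keep-if-speech rule; the real Pisets components are the authors' fine-tuned models from the GitHub repository, and stage 1 is assumed to have already produced the candidate segments:

```python
from transformers import pipeline

# Stage 2: AST audio classifier used as a false-positive filter
# (this checkpoint is an assumption, not the authors' fine-tuned model).
ast_filter = pipeline("audio-classification",
                      model="MIT/ast-finetuned-audioset-10-10-0.4593")

# Stage 3: Whisper transcribes only the segments that survive filtering.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def transcribe_segments(segments, sampling_rate=16_000):
    """segments: candidate speech chunks (1-D float waveforms) from the
    over-segmenting Wav2Vec2 detector of stage 1."""
    texts = []
    for wav in segments:
        top = ast_filter({"raw": wav, "sampling_rate": sampling_rate},
                         top_k=1)[0]
        # Keep-if-speech rule (an assumption): drop segments whose top
        # label does not look like speech.
        if "speech" not in top["label"].lower():
            continue
        out = asr({"raw": wav, "sampling_rate": sampling_rate})
        texts.append(out["text"].strip())
    return " ".join(texts)
```

The point of this shape is that the stage-1 detector can afford false positives: the AST filter discards non-speech before Whisper ever sees it, which is what keeps Whisper from hallucinating text on silence or noise.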

Fine-tuning matters:
Rather than relying purely on off-the-shelf models, the authors fine-tune their own components, especially the speech detection and filtering stages, using curriculum learning and diverse real-world data. This targeted fine-tuning is key to reducing hallucinations and improving robustness on noisy, long recordings.
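
A hedged sketch of what curriculum-style data ordering can look like; using segment duration as the difficulty proxy and a three-stage easy-to-hard schedule are illustrative assumptions, not the paper's exact recipe:

```python
def curriculum_stages(examples, n_stages=3,
                      difficulty=lambda ex: ex["duration_sec"]):
    """Yield progressively larger easy-to-hard training subsets."""
    ordered = sorted(examples, key=difficulty)
    for stage in range(1, n_stages + 1):
        # Stage 1 fine-tunes on the easiest third, stage 2 on the easiest
        # two thirds, and the final stage on the full sorted set.
        yield ordered[: len(ordered) * stage // n_stages]
```

Each yielded subset would be passed to the trainer in turn, so the model sees clean, easy speech before the noisy, hard recordings.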

Why it’s interesting:

  • Shows that careful fine-tuning of intermediate models can outperform larger end-to-end setups
  • Emphasizes pipeline design + training strategy over simply scaling model size
  • Evaluated with both WER and semantic metrics, highlighting transcription quality beyond surface accuracy (a small example of both metric types follows)
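
A small illustration of the two evaluation axes, using `jiwer` for WER and sentence embeddings for a semantic score; the embedding model and cosine similarity here are stand-ins, since the paper's exact semantic metric is not reproduced in this summary:

```python
from jiwer import wer
from sentence_transformers import SentenceTransformer, util

reference  = "the lecture covers transformer architectures in detail"
hypothesis = "the lecture covers transformer architecture in detail"

# Surface accuracy: word error rate over the token sequences.
print("WER:", wer(reference, hypothesis))

# Semantic quality: cosine similarity of sentence embeddings, which
# stays high when an error does not change the meaning.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
ref_emb, hyp_emb = encoder.encode([reference, hypothesis],
                                  convert_to_tensor=True)
print("semantic similarity:", util.cos_sim(ref_emb, hyp_emb).item())
```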

Takeaway:
Pisets argues that reliable ASR for real-world, long-form audio still benefits from modular systems and task-specific fine-tuning, rather than relying solely on large general-purpose models.

arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/pisets-a-robust-speech-recognition-system-for-lectures-and-interviews-5426-84640597

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications


Models citing this paper: 2
Datasets citing this paper: 0
Spaces citing this paper: 4
Collections including this paper: 0