Granite-Speech-4.1-2B-NAR
Model Summary: Granite-Speech-4.1-2B-NAR is a non-autoregressive (NAR) speech recognition model that formulates ASR as conditional transcript editing. Instead of decoding tokens one at a time, it edits a CTC hypothesis in a single forward pass using a bidirectional LLM, achieving competitive accuracy with faster inference than autoregressive alternatives. The model is based on the NLE (Non-autoregressive LLM-based Editing) architecture described in this paper.
For applications where accuracy is the primary concern, consider granite-speech-4.1-2b, an autoregressive model from the Granite Speech 4.1 family that achieves higher transcription accuracy at the cost of increased inference latency. Granite-speech-4.1-2b produces punctuated and capitalized transcripts, supports automatic speech translation (AST) and keyword-biased recognition, and adds Japanese support.
When speaker or word-timing information is needed, consider using granite-speech-4.1-2b-plus, which extends the above model with speaker-attributed ASR (speaker labels + word transcripts) and word-level timing information.
Release Date: April 2026
License: Apache 2.0
Supported Languages: English, French, German, Spanish, Portuguese
Intended Use: The model is intended for automatic speech recognition tasks, particularly in latency-sensitive applications where fast inference is critical.
Evaluation Results
Open ASR leaderboard results
RTFx vs. WER on the Open ASR Leaderboard (as of Apr 2026).
Additional results
- Greedy decoding with bfloat16 inference.
- WER computed with jiwer after whisper_normalizer EnglishTextNormalizer normalization. Open ASR Leaderboard results may differ slightly due to normalization and scoring pipeline differences.
- Measured RTFx of ~1820 on a single H100 GPU (batched inference, batch size 128).
| Dataset | WER | Dataset | WER |
|---|---|---|---|
| LibriSpeech clean | 1.29 | MLS EN | 4.77 |
| LibriSpeech other | 2.75 | MLS DE | 4.75 |
| CommonVoice 15 EN | 6.50 | MLS ES | 3.31 |
| CommonVoice 15 DE | 4.73 | MLS FR | 4.52 |
| CommonVoice 15 ES | 4.02 | MLS PT | 11.86 |
| CommonVoice 15 FR | 7.17 | AMI IHM | 7.91 |
| CommonVoice 15 PT | 2.57 | AMI SDM | 19.59 |
| Earnings-22 | 8.48 | GigaSpeech | 10.12 |
| SPGISpeech | 3.04 | TED-LIUM | 3.67 |
| VoxPopuli | 5.83 | | |
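For reference, the scoring recipe from the notes above can be reproduced with a short script. This is a minimal sketch; the reference and hypothesis strings are placeholders.
# Minimal sketch of the scoring recipe described in the notes above.
# The reference/hypothesis strings are placeholders.
from jiwer import wer
from whisper_normalizer.english import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()
references = ["He hoped there would be stew for dinner."]   # ground-truth transcripts
hypotheses = ["he hoped there would be stew for dinner"]    # model outputs
refs = [normalizer(r) for r in references]
hyps = [normalizer(h) for h in hypotheses]
print(f"WER: {wer(refs, hyps):.4f}")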
Usage
Installation
We require flash_attention_2 for inference, since this backend supports sequence packing and respects the is_causal=False flag.
Tested with transformers==4.57.6 and transformers==5.5.3, using torch==2.9.1.
# Fresh install (CUDA 12.8, Python 3.10+)
pip install torch==2.9.1 torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu128
pip install transformers==4.57.6 accelerate==1.13.0 safetensors==0.7.0 huggingface-hub==0.36.2 tokenizers==0.22.2
pip install soundfile
pip install flash-attn==2.8.3 --no-build-isolation
Inference with transformers
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoFeatureExtractor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "ibm-granite/granite-speech-4.1-2b-nar"
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map=device,
    dtype=torch.bfloat16,
).eval()
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name, trust_remote_code=True)
# Load sample audio from the repo
audio_path = hf_hub_download(repo_id=model_name, filename="10226_10111_000000.wav")
waveform, sr = torchaudio.load(audio_path)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
waveform = waveform.squeeze(0)
# Extract features and run inference
inputs = feature_extractor([waveform], device=device)
output = model.generate(**inputs)
print(f"Prediction: {output.text_preds[0]}")
Model Architecture
The architecture consists of three components:
(1) CTC Speech Encoder (440M params)
A 16-layer Conformer encoder trained with CTC on character-level targets. It processes 16kHz audio with stacked log-mel features (80 mel bins, 2-frame stacking) and uses block attention with 4-second audio blocks and self-conditioning at layer 8.
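To make the front end concrete, here is a rough sketch of 80-dim log-mel extraction with 2-frame stacking; the 25 ms window and 10 ms hop are typical values assumed for illustration, not confirmed hyperparameters of this model.
# Rough sketch of the feature front end: 80 log-mel bins with 2-frame stacking.
# Window/hop sizes are assumed typical values, not confirmed model settings.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80)
waveform = torch.randn(16000)                   # 1 second of dummy 16 kHz audio
feats = mel(waveform).clamp(min=1e-10).log()    # (80, T) log-mel frames
feats = feats.transpose(0, 1)                   # (T, 80)
T = feats.shape[0] - feats.shape[0] % 2         # trim to an even frame count
stacked = feats[:T].reshape(T // 2, 2 * 80)     # 2-frame stacking -> (T/2, 160)
print(stacked.shape)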
The encoder has a dual CTC head: alongside the character-level output, a secondary BPE head produces CTC logits over the LLM's 100K token vocabulary. The BPE head uses posterior-weighted pooling (window size 4) with importance weights derived from mid-layer blank probabilities (1 - blank_prob).
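A minimal sketch of the posterior-weighted pooling described above; tensor names and the exact normalization are illustrative assumptions.
# Sketch of posterior-weighted pooling: frames are pooled in windows of 4 with
# weights (1 - blank_prob) taken from a mid-layer CTC head. Details such as the
# per-window normalization are assumptions for illustration.
import torch

def posterior_weighted_pool(frames, blank_prob, window=4):
    # frames: (B, T, D) encoder states; blank_prob: (B, T) mid-layer blank posteriors
    B, T, D = frames.shape
    T = T - T % window                                     # trim to a window multiple
    w = (1.0 - blank_prob[:, :T]).reshape(B, T // window, window)
    w = w / w.sum(dim=-1, keepdim=True).clamp(min=1e-6)    # normalize within each window
    x = frames[:, :T].reshape(B, T // window, window, D)
    return (w.unsqueeze(-1) * x).sum(dim=2)                # (B, T // window, D)

pooled = posterior_weighted_pool(torch.randn(2, 100, 512), torch.rand(2, 100))
print(pooled.shape)  # torch.Size([2, 25, 512])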
(2) Q-Former Projector (160M params)
A 2-layer window Q-Former that downsamples the concatenated hidden representations from 4 encoder layers (layers 4, 8, 12, 16) by 5x. Each 15-frame window is reduced to 3 queries via cross-attention, resulting in a 10Hz acoustic embedding rate for the LLM (2x downsampling in the encoder, then a further 5x in the projector).
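A shape-level sketch of the window-query downsampling; the attention module and dimensions below are illustrative stand-ins, not the actual projector implementation.
# Shape-level sketch of window-query downsampling: each 15-frame window of
# encoder states is compressed to 3 learned queries (5x reduction).
# Dimensions and the single attention layer are illustrative assumptions.
import torch
import torch.nn as nn

d_model, window, n_queries = 512, 15, 3
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
queries = nn.Parameter(torch.randn(n_queries, d_model))

enc = torch.randn(1, 150, d_model)                    # (B, T, D) encoder states, T % 15 == 0
B, T, D = enc.shape
windows = enc.reshape(B * T // window, window, D)     # one row per 15-frame window
q = queries.unsqueeze(0).expand(windows.shape[0], -1, -1)
out, _ = attn(q, windows, windows)                    # queries cross-attend to their window
out = out.reshape(B, T // window * n_queries, D)      # (B, T/5, D): 5x fewer positions
print(out.shape)  # torch.Size([1, 30, 512])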
(3) Bidirectional LLM Editor (1B params, LoRA-adapted)
granite-4.0-1b-base with its causal attention mask removed, enabling bidirectional context. Adapted with LoRA (rank 128) applied to both attention and MLP layers. The LLM receives concatenated audio embeddings and an interleaved CTC hypothesis with insertion slots, then predicts the edited transcript in a single parallel forward pass using a CTC objective.
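For reference, the LoRA setup corresponds to a PEFT configuration along these lines; the target module names and scaling factor are assumptions for a Granite-style decoder, not values taken from the released training code.
# Illustrative PEFT configuration matching the description above: rank-128 LoRA
# on both attention and MLP projections. Module names and lora_alpha are assumptions.
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,
    lora_alpha=256,  # assumption; the actual scaling is not documented here
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)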
How Granite-Speech NAR Works
- The frozen CTC encoder produces acoustic embeddings and an initial hypothesis
- The hypothesis is interleaved with insertion slots (blank tokens between consecutive hypothesis tokens)
- The projected audio embeddings are concatenated with the interleaved hypothesis embeddings
- The bidirectional LLM predicts edits (copy, insert, delete, replace) at all positions simultaneously
- CTC greedy decoding (argmax + collapse) produces the final transcript
This design exploits the identity mapping bias of Transformers: residual connections and tied embeddings make the model naturally inclined to copy input tokens, so it focuses learning capacity on corrections rather than full reconstruction.
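A toy sketch of this editing interface; the token strings, blank symbol, and the "predicted" edit below are made-up placeholders.
# Toy sketch of the editing interface: the CTC hypothesis is interleaved with
# insertion slots, the bidirectional LLM scores all positions in parallel, and
# greedy CTC collapse yields the final transcript. All tokens here are placeholders.
BLANK = "<blank>"

def interleave_with_slots(hypothesis):
    # ["the", "cat"] -> ["<blank>", "the", "<blank>", "cat", "<blank>"]
    out = [BLANK]
    for tok in hypothesis:
        out.extend([tok, BLANK])
    return out

def ctc_collapse(tokens):
    # Standard CTC greedy collapse: merge repeated tokens, then drop blanks.
    collapsed, prev = [], None
    for tok in tokens:
        if tok != prev and tok != BLANK:
            collapsed.append(tok)
        prev = tok
    return collapsed

ctc_hypothesis = ["the", "cat", "sat", "on", "mat"]   # encoder draft with a missing word
slots = interleave_with_slots(ctc_hypothesis)
# `slots` is what the LLM edits; pretend it filled one insertion slot ("the")
# and copied everything else:
edited = ["<blank>", "the", "<blank>", "cat", "<blank>", "sat", "<blank>",
          "on", "the", "mat", "<blank>"]
print(ctc_collapse(edited))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']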
Training Data
The model was trained on approximately 130K hours of speech across five languages (English, Spanish, French, German, Portuguese), using publicly available datasets including CommonVoice 15, MLS, LibriSpeech, Libriheavy long, AMI, Granary VoxPopuli, Granary YODAS, Earnings-22, Fisher, CallHome, and SwitchBoard. For additional training details, see the paper.
Infrastructure
Training was completed on IBM's Blue Vela cluster using 16 H100 GPUs (2 nodes) for 5 epochs (3 days).
Ethical Considerations and Limitations
The model is designed specifically for automatic speech recognition and does not generate free-form text, which limits the risk of hallucination compared to general-purpose speech-language models. However, transcription accuracy varies across languages and acoustic conditions. Performance may be weaker on languages with less training data (e.g., Portuguese) or in challenging acoustic environments (e.g., far-field, overlapping speech).
The model's editing approach is conservative by design — it prefers deletions over insertions, which reduces hallucination risk but may occasionally drop words in noisy conditions.
To enhance safety, we recommend using granite-speech-4.1-2b-nar alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas.
Resources
📄 Read our papers:
- NLE: Non-autoregressive LLM-based ASR by Transcript Editing
- Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities
- Self-Speculative Decoding for LLM-based ASR with CTC Encoder Drafts
- Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction
- In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions
- Speaker Attributed Automatic Speech Recognition Using Speech Aware LLMs
⭐️ Learn about Granite: https://www.ibm.com/granite
💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources