hviske-v5.3

Danish ASR. Fine-tuned from syvai/hviske-v5.1 on the CoRal v3 train splits with layer-wise learning-rate decay (encoder LR = 0.75 × decoder LR) for 5 epochs.

Results on CoRal v3 full test sets

Evaluated on the complete test splits (17,560 samples). Two normalization conventions:

  • raw: jiwer on un-normalized references and hypotheses
  • strict: lowercase + punctuation strip + Danish digit-to-word (num2words(lang="da")), the apples-to-apples normalization for comparing against published Whisper-style numbers
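
The exact normalizer isn't included here; below is a minimal sketch of the strict convention, using jiwer and num2words (whose Danish converter is registered under the code "dk"):

import re

import jiwer
from num2words import num2words

def strict_normalize(text: str) -> str:
    # Lowercase, then replace punctuation with spaces
    # (\w is Unicode-aware in Python 3, so æ/ø/å survive)
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Spell out digit runs in Danish, e.g. "42" -> "toogfyrre"
    text = re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="dk"), text)
    return " ".join(text.split())

ref = "Der er 42 elever i klassen."
hyp = "der er toogfyrre elever i klassen"
print(jiwer.wer(strict_normalize(ref), strict_normalize(hyp)))  # 0.0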

Greedy decoding (num_beams=1)

Split         N        raw WER   strict WER   raw CER   strict CER
read_aloud    9,122    10.26%    9.37%        4.17%     3.80%
conversation  8,438    21.30%    19.63%       12.12%    11.56%
weighted avg  17,560   15.56%    14.30%       7.99%     7.53%

Beam search (num_beams=5, length_penalty=1.0)

Split         N        raw WER   strict WER   raw CER   strict CER
read_aloud    9,122    9.86%     9.01%        3.98%     3.63%
conversation  8,438    20.89%    19.21%       11.90%    11.35%
weighted avg  17,560   15.16%    13.91%       7.78%     7.34%
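
The weighted-average rows are sample-count-weighted means over the two splits; a quick check for the greedy raw WER (the same arithmetic reproduces every averaged column):

# weighted avg = sum(N_i * metric_i) / sum(N_i)
n_read, n_conv = 9_122, 8_438
avg = (n_read * 10.26 + n_conv * 21.30) / (n_read + n_conv)
print(f"{avg:.2f}%")  # 15.56%, matching the greedy table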

Beam search costs roughly 75% more inference time but lowers the weighted-average WER by about 0.4 pp. The transcribe() helper defaults to greedy decoding; for beam search, call model.generate(..., num_beams=5, length_penalty=1.0) directly or monkey-patch the wrapper.
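
For reference, a beam-search sketch assuming the model exposes a standard Whisper-style generate interface (the exact kwargs accepted by the remote code may differ); processor, model, and audio are loaded as in the Usage section below:

# On this manual path, audio must already be a 16 kHz float32 array
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)
ids = model.generate(
    input_features,
    num_beams=5,
    length_penalty=1.0,
    language="da",
    task="transcribe",
)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])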

Versus hviske-v5.2

Metric                   v5.2     v5.3     Δ (pp)
read_aloud raw WER       10.94%   10.26%   −0.68
read_aloud strict WER    9.54%    9.37%    −0.17
conversation raw WER     21.87%   21.30%   −0.57
weighted avg raw WER     16.20%   15.56%   −0.64
weighted avg strict WER  14.90%   14.30%   −0.60

Usage

import torch, numpy as np, soundfile as sf
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Custom model code lives on the Hub, hence trust_remote_code
processor = AutoProcessor.from_pretrained("syvai/hviske-v5.3", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "syvai/hviske-v5.3", trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()

# Any sample rate works; the model resamples to 16 kHz internally
audio, sr = sf.read("your_audio.wav")
audio = np.asarray(audio, dtype=np.float32)

# transcribe() is batched, hence the single-element lists
hyp = model.transcribe(
    processor=processor,
    language="da",
    audio_arrays=[audio],
    sample_rates=[sr],
)[0]
print(hyp)

Audio longer than ~35 s is automatically chunked. Input is resampled to 16 kHz internally.
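
Because transcribe() takes parallel lists, several files can go through in one call. A sketch (file paths are placeholders; whether the helper batches internally or loops is up to the remote code):

files = ["a.wav", "b.wav"]  # hypothetical paths
arrays, rates = [], []
for path in files:
    audio, sr = sf.read(path)
    arrays.append(np.asarray(audio, dtype=np.float32))
    rates.append(sr)

hyps = model.transcribe(
    processor=processor,
    language="da",
    audio_arrays=arrays,
    sample_rates=rates,
)
for path, hyp in zip(files, hyps):
    print(path, hyp)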

Training details

  • Starting point: syvai/hviske-v5.1
  • Data: CoRal-project/coral-v3, both the read_aloud (299,255 samples) and conversation (147,249 samples) train splits, interleaved and shuffled per epoch (seed=42)
  • Eval during training: 10% of each test split (912 + 843 samples)
  • Best checkpoint: the eval point with the lowest average WER, reached at 90% of training
  • Hyperparameters:
    • Epochs: 5
    • Batch: 16 micro × 8 grad-accum = 128 effective
    • Layer-wise LR decay (LLRD): encoder = 0.75 × decoder (see the sketch after this list)
      • Decoder peak: 2e-4
      • Encoder peak: 1.5e-4
    • Schedule: 1,500-step linear warmup → cosine decay to zero
    • Optimizer: bnb AdamW8bit
    • Augmentation: SpecAugment (2 frequency masks × 27 bins, 2 time masks × 100 frames)
    • Max audio length: 31 s (longer samples are dropped)
    • Precision: bf16
  • Hardware: single NVIDIA RTX PRO 6000 Blackwell Max-Q (98 GB)
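
A sketch of how the two learning rates and the schedule can be wired together. The encoder/decoder attribute names assume the usual Whisper layout (model.model.encoder / model.model.decoder), model is loaded as in the Usage section, and the step count is approximate, so treat this as an illustration rather than the actual training script:

import bitsandbytes as bnb
from transformers import get_cosine_schedule_with_warmup

decoder_lr = 2e-4
optimizer = bnb.optim.AdamW8bit([
    # Encoder group runs at 0.75x the decoder peak LR (= 1.5e-4)
    {"params": model.model.encoder.parameters(), "lr": 0.75 * decoder_lr},
    {"params": model.model.decoder.parameters(), "lr": decoder_lr},
])

# ~3,488 optimizer steps per epoch at an effective batch of 128
# (446,504 train samples, before the 31 s length filter)
steps_per_epoch = (299_255 + 147_249) // 128
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_500,
    num_training_steps=5 * steps_per_epoch,
)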

License

Apache-2.0, inherited from the base model.
