# hviske-v5.3

Danish ASR. Fine-tuned from syvai/hviske-v5.1 on the CoRal v3 train splits with layer-wise learning-rate decay (encoder LR = 0.75 × decoder LR) for 5 epochs.
## Results on CoRal v3 full test sets
Evaluated on the complete test splits (17,560 samples). Two normalization conventions:

- raw: `jiwer` on un-normalized references and hypotheses
- strict: lowercase + punctuation strip + Danish digit-to-word (`num2words(lang="da")`), the apples-to-apples normalization for comparing against published Whisper-style numbers
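For illustration, a minimal sketch of the strict convention. This is not the exact evaluation script; the punctuation regex is an assumption, and `refs`/`hyps` are placeholders:

```python
import re
import jiwer
from num2words import num2words

def strict_normalize(text: str) -> str:
    # Lowercase, then strip punctuation (\w is Unicode-aware, so æ/ø/å survive).
    text = re.sub(r"[^\w\s]", "", text.lower())
    # Digits to Danish words, e.g. "2" -> "to", per the card's num2words(lang="da").
    text = re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="da"), text)
    return " ".join(text.split())

refs = ["Der er 2 æbler."]   # placeholder references
hyps = ["der er to æbler"]   # placeholder hypotheses
print(jiwer.wer([strict_normalize(r) for r in refs],
                [strict_normalize(h) for h in hyps]))
```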
### Greedy decoding (`num_beams=1`)
| Split | N | raw WER | strict WER | raw CER | strict CER |
|---|---|---|---|---|---|
| read_aloud | 9,122 | 10.26% | 9.37% | 4.17% | 3.80% |
| conversation | 8,438 | 21.30% | 19.63% | 12.12% | 11.56% |
| weighted avg | 17,560 | 15.56% | 14.30% | 7.99% | 7.53% |
### Beam search (`num_beams=5`, `length_penalty=1.0`)
| Split | N | raw WER | strict WER | raw CER | strict CER |
|---|---|---|---|---|---|
| read_aloud | 9,122 | 9.86% | 9.01% | 3.98% | 3.63% |
| conversation | 8,438 | 20.89% | 19.21% | 11.90% | 11.35% |
| weighted avg | 17,560 | 15.16% | 13.91% | 7.78% | 7.34% |
Beam search costs ~75% more inference time but lowers avg WER by 0.4 pp. The `transcribe()` helper defaults to greedy; for beam search, call `model.generate(..., num_beams=5, length_penalty=1.0)` directly or monkey-patch the wrapper.
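A sketch of the direct `generate` route, assuming the remote-code processor exposes a Whisper-style interface with `input_features`; unlike the `transcribe()` helper shown under Usage, this bypasses the wrapper's chunking and resampling, so prepare 16 kHz input yourself:

```python
import torch
import soundfile as sf
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("syvai/hviske-v5.3", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "syvai/hviske-v5.3", trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()

# Assumed to be 16 kHz mono already; no internal resampling on this path.
audio, sr = sf.read("your_16khz_audio.wav")
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# Beam search instead of the greedy default.
ids = model.generate(
    inputs.input_features.to(model.device, model.dtype),
    num_beams=5,
    length_penalty=1.0,
)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```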
## Versus hviske-v5.2
| Metric | v5.2 | v5.3 | Δ (pp) |
|---|---|---|---|
| read_aloud raw WER | 10.94% | 10.26% | -0.68 |
| read_aloud strict WER | 9.54% | 9.37% | -0.17 |
| conversation raw WER | 21.87% | 21.30% | -0.57 |
| weighted avg raw WER | 16.20% | 15.56% | -0.64 |
| weighted avg strict WER | 14.90% | 14.30% | -0.60 |
## Usage
```python
import torch, numpy as np, soundfile as sf
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("syvai/hviske-v5.3", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "syvai/hviske-v5.3", trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()

# Load audio as float32; any sample rate works (resampled to 16 kHz internally).
audio, sr = sf.read("your_audio.wav")
audio = np.asarray(audio, dtype=np.float32)

# Custom remote-code helper; greedy decoding by default.
hyp = model.transcribe(
    processor=processor,
    language="da",
    audio_arrays=[audio],
    sample_rates=[sr],
)[0]
print(hyp)
```
Audio longer than ~35 s is automatically chunked. Input is resampled to 16 kHz internally.
## Training details
- Starting point: `syvai/hviske-v5.1`
- Data: `CoRal-project/coral-v3`, both `read_aloud` (299,255 samples) and `conversation` (147,249) train splits, interleaved and per-epoch shuffled (seed=42)
- Eval during training: 10% of each test split (912 + 843 samples) for best-checkpoint tracking
- Best-checkpoint tracking: kept the checkpoint with the lowest avg eval WER (reached at 90% of training)
- Hyperparameters:
  - Epochs: 5
  - Batch: 16 micro × 8 grad-accum = 128 effective
  - Layer-wise LR decay (LLRD): encoder LR = 0.75 × decoder LR (see the sketch after this list)
  - Decoder peak LR: `2e-4`
  - Encoder peak LR: `1.5e-4`
  - Schedule: 1,500-step linear warmup → cosine decay to zero
  - Optimizer: bitsandbytes `AdamW8bit`
- Augmentation: SpecAugment (2 freq masks × 27 bins, 2 time masks × 100 frames)
- Max audio length: 31 s (longer is dropped)
- Precision: bf16
- Hardware: single NVIDIA RTX PRO 6000 Blackwell Max-Q (98 GB)
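With a single encoder scale factor, the LLRD setup above reduces to two optimizer param groups. A minimal sketch, reusing the `model` from the Usage section; the grouping-by-name heuristic and the step count are assumptions, not the actual training code:

```python
import bitsandbytes as bnb
from transformers import get_cosine_schedule_with_warmup

def param_groups(model, decoder_lr=2e-4, encoder_scale=0.75):
    # Split parameters by name: encoder runs at 0.75x the decoder peak LR.
    enc, dec = [], []
    for name, p in model.named_parameters():
        if p.requires_grad:
            (enc if ".encoder." in name else dec).append(p)
    return [
        {"params": enc, "lr": decoder_lr * encoder_scale},  # 1.5e-4
        {"params": dec, "lr": decoder_lr},                  # 2e-4
    ]

optimizer = bnb.optim.AdamW8bit(param_groups(model))

# Back-of-envelope: (299,255 + 147,249) samples / 128 effective batch
# ~= 3,488 optimizer steps per epoch, times 5 epochs.
total_steps = 3_488 * 5
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1_500, num_training_steps=total_steps
)
```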
## License
Apache-2.0, inherited from the base model.