hviske-v5.3

Danish ASR. Fine-tuned from syvai/hviske-v5.1 on the CoRal v3 train splits with layer-wise learning-rate decay (encoder LR = 0.75 × decoder LR) for 5 epochs.

Results on CoRal v3 full test sets

Evaluated on the complete test splits (17,560 samples). Two normalization conventions:

  • raw: jiwer on un-normalized references and hypotheses
  • strict: lowercase + punctuation strip + Danish digit-to-word (num2words(lang="da")), the apples-to-apples normalization for comparing against published Whisper-style numbers
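
The exact normalizer isn't included here; below is a minimal sketch of the strict convention, using jiwer and num2words (whose Danish converter is registered under the code "dk"):

import re

import jiwer
from num2words import num2words

def strict_normalize(text: str) -> str:
    # Lowercase, then replace punctuation with spaces
    # (\w is Unicode-aware in Python 3, so æ/ø/å survive)
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Spell out digit runs in Danish, e.g. "42" -> "toogfyrre"
    text = re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="dk"), text)
    return " ".join(text.split())

ref = "Der er 42 elever i klassen."
hyp = "der er toogfyrre elever i klassen"
print(jiwer.wer(strict_normalize(ref), strict_normalize(hyp)))  # 0.0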

Greedy decoding (num_beams=1)

Split         N        raw WER   strict WER   raw CER   strict CER
read_aloud    9,122    10.26%    9.37%        4.17%     3.80%
conversation  8,438    21.30%    19.63%       12.12%    11.56%
weighted avg  17,560   15.56%    14.30%       7.99%     7.53%

Beam search (num_beams=5, length_penalty=1.0)

Split         N        raw WER   strict WER   raw CER   strict CER
read_aloud    9,122    9.86%     9.01%        3.98%     3.63%
conversation  8,438    20.89%    19.21%       11.90%    11.35%
weighted avg  17,560   15.16%    13.91%       7.78%     7.34%
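
The weighted-average rows are sample-count-weighted means over the two splits; a quick check for the greedy raw WER (the same arithmetic reproduces every averaged column):

# weighted avg = sum(N_i * metric_i) / sum(N_i)
n_read, n_conv = 9_122, 8_438
avg = (n_read * 10.26 + n_conv * 21.30) / (n_read + n_conv)
print(f"{avg:.2f}%")  # 15.56%, matching the greedy table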

Beam search costs roughly 75% more inference time but lowers the weighted-average WER by about 0.4 pp. The transcribe() helper defaults to greedy decoding; for beam search, call model.generate(..., num_beams=5, length_penalty=1.0) directly or monkey-patch the wrapper.
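
For reference, a beam-search sketch assuming the model exposes a standard Whisper-style generate interface (the exact kwargs accepted by the remote code may differ); processor, model, and audio are loaded as in the Usage section below:

# On this manual path, audio must already be a 16 kHz float32 array
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)
ids = model.generate(
    input_features,
    num_beams=5,
    length_penalty=1.0,
    language="da",
    task="transcribe",
)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])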

Versus hviske-v5.2

Metric                   v5.2     v5.3     Δ (pp)
read_aloud raw WER       10.94%   10.26%   −0.68
read_aloud strict WER    9.54%    9.37%    −0.17
conversation raw WER     21.87%   21.30%   −0.57
weighted avg raw WER     16.20%   15.56%   −0.64
weighted avg strict WER  14.90%   14.30%   −0.60

Usage

import torch, numpy as np, soundfile as sf
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Custom model code lives on the Hub, hence trust_remote_code
processor = AutoProcessor.from_pretrained("syvai/hviske-v5.3", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "syvai/hviske-v5.3", trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()

# Any sample rate works; the model resamples to 16 kHz internally
audio, sr = sf.read("your_audio.wav")
audio = np.asarray(audio, dtype=np.float32)

# transcribe() is batched, hence the single-element lists
hyp = model.transcribe(
    processor=processor,
    language="da",
    audio_arrays=[audio],
    sample_rates=[sr],
)[0]
print(hyp)

Audio longer than ~35 s is automatically chunked. Input is resampled to 16 kHz internally.
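
Because transcribe() takes parallel lists, several files can go through in one call. A sketch (file paths are placeholders; whether the helper batches internally or loops is up to the remote code):

files = ["a.wav", "b.wav"]  # hypothetical paths
arrays, rates = [], []
for path in files:
    audio, sr = sf.read(path)
    arrays.append(np.asarray(audio, dtype=np.float32))
    rates.append(sr)

hyps = model.transcribe(
    processor=processor,
    language="da",
    audio_arrays=arrays,
    sample_rates=rates,
)
for path, hyp in zip(files, hyps):
    print(path, hyp)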

Training details

  • Starting point: syvai/hviske-v5.1
  • Data: CoRal-project/coral-v3, both the read_aloud (299,255 samples) and conversation (147,249 samples) train splits, interleaved and shuffled per epoch (seed=42)
  • Eval during training: 10% of each test split (912 + 843 samples)
  • Best checkpoint: the eval point with the lowest average WER, reached at 90% of training
  • Hyperparameters:
    • Epochs: 5
    • Batch: 16 micro × 8 grad-accum = 128 effective
    • Layer-wise LR decay (LLRD): encoder = 0.75 × decoder (see the sketch after this list)
      • Decoder peak: 2e-4
      • Encoder peak: 1.5e-4
    • Schedule: 1,500-step linear warmup → cosine decay to zero
    • Optimizer: bnb AdamW8bit
    • Augmentation: SpecAugment (2 frequency masks × 27 bins, 2 time masks × 100 frames)
    • Max audio length: 31 s (longer samples are dropped)
    • Precision: bf16
  • Hardware: single NVIDIA RTX PRO 6000 Blackwell Max-Q (98 GB)
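
A sketch of how the two learning rates and the schedule can be wired together. The encoder/decoder attribute names assume the usual Whisper layout (model.model.encoder / model.model.decoder), model is loaded as in the Usage section, and the step count is approximate, so treat this as an illustration rather than the actual training script:

import bitsandbytes as bnb
from transformers import get_cosine_schedule_with_warmup

decoder_lr = 2e-4
optimizer = bnb.optim.AdamW8bit([
    # Encoder group runs at 0.75x the decoder peak LR (= 1.5e-4)
    {"params": model.model.encoder.parameters(), "lr": 0.75 * decoder_lr},
    {"params": model.model.decoder.parameters(), "lr": decoder_lr},
])

# ~3,488 optimizer steps per epoch at an effective batch of 128
# (446,504 train samples, before the 31 s length filter)
steps_per_epoch = (299_255 + 147_249) // 128
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_500,
    num_training_steps=5 * steps_per_epoch,
)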

License

Apache-2.0, inherited from the base model.
