Fine-tuning MedASR for Indian Regional Languages
Hi, I’m working on medical ASR use cases and looking for guidance on fine-tuning MedASR for Indian regional languages such as Hindi, Tamil, Telugu, Kannada, and Bengali. Any recommendations on datasets, multilingual fine-tuning strategies, or evaluation best practices would be really helpful. Happy to collaborate.
Thanks for reaching out! MedASR is pre-trained and fine-tuned on English-only data, so we are not yet sure how it will perform in other languages. At a minimum, you will need a different tokenizer, because the current one covers English only. Unfortunately, we are not very familiar with datasets or evaluation for these languages at the moment, but if you already have data, you should be able to fine-tune MedASR following https://github.com/google-health/medasr/blob/main/notebooks/fine_tune_with_hugging_face.ipynb.
Hi, any guidance on how I can build the tokeniser, let's say for Hindi audio and text? Also, what would be considered an ideal number of training-data hours for fine-tuning? And if I change the tokeniser, doesn't that mean I need to retrain the model, i.e. I won't be able to use the existing weights, right?
Thanks for the contribution
Hey @darknight054 , apologies for the delayed response.
To build a tokeniser for Hindi (or any other Indian language), you need to train a new subword tokeniser using a library such as SentencePiece. You will need a large text corpus in the target language, ideally medical reports or health-related documents. It is important to apply proper Unicode normalisation during preprocessing to ensure consistent handling of Indic scripts. The resulting tokeniser then replaces the current English-only vocabulary in the model configuration.
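As a concrete sketch of the preprocessing step: NFC normalisation matters for Devanagari because the same visible character can be stored as different code-point sequences (for example, precomposed nukta letters versus base letter + combining nukta), and a tokeniser trained on mixed encodings will fragment its vocabulary. The corpus filename and vocabulary size below are illustrative, and the SentencePiece call is shown commented out so the snippet has no external dependency:

```python
import unicodedata

def normalize_line(text: str) -> str:
    # NFC gives every Hindi string a single canonical code-point sequence,
    # so visually identical words always map to the same tokens.
    return unicodedata.normalize("NFC", text.strip())

# "क़" stored as the precomposed code point U+0958 and as
# "क" + nukta (U+0915 U+093C) normalise to the same sequence:
assert normalize_line("\u0958") == normalize_line("\u0915\u093c")

# With a normalised corpus written to disk, a subword model can then be
# trained (requires the sentencepiece package; parameters are illustrative):
# import sentencepiece as spm
# spm.SentencePieceTrainer.train(
#     input="hindi_medical_corpus.txt",
#     model_prefix="hindi_subword",
#     vocab_size=8000,
#     model_type="bpe",
#     character_coverage=0.9995,  # high coverage for Devanagari
# )
```

A high `character_coverage` value is commonly recommended for non-Latin scripts so that rare but valid characters are not dropped from the vocabulary.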
Regarding the weights, you do not need to retrain the entire model. The pre-trained acoustic encoder can be reused, as it has already learned general speech representations. However, because the output layer and any token-embedding layers are tied to the vocabulary size and token IDs, those components must be reinitialised and retrained. During fine-tuning, the model learns to map the existing acoustic representations to the new Hindi subword units.
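Conceptually, this is selective weight transfer: copy every pretrained parameter into the new model except the ones whose shape depends on the vocabulary. The sketch below uses plain dicts standing in for state dicts, and the parameter names `output_head.` and `token_embedding.` are placeholders, not MedASR's actual module names; in PyTorch the same idea is a filtered state dict passed to `load_state_dict(..., strict=False)`:

```python
def transfer_compatible_weights(pretrained, new_model,
                                skip_prefixes=("output_head.", "token_embedding.")):
    """Copy every pretrained parameter except those tied to the old vocabulary.

    pretrained / new_model: mappings of parameter name -> weights.
    skip_prefixes: name prefixes of vocabulary-sized layers (placeholders here).
    """
    merged = dict(new_model)  # start from the freshly initialised model
    for name, weight in pretrained.items():
        if name.startswith(skip_prefixes):
            continue  # vocabulary-sized layers keep their new random init
        merged[name] = weight  # reuse the learned acoustic weights
    return merged

# Toy illustration: encoder weights are reused, the head is not.
pretrained = {"encoder.layer0": "learned", "output_head.weight": "learned_en"}
fresh = {"encoder.layer0": "random", "output_head.weight": "random_hi"}
merged = transfer_compatible_weights(pretrained, fresh)
```

After this transfer, only the reinitialised layers start from scratch, which is why fine-tuning can converge with far less data than training the whole model.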
In terms of data scale, there is no strict threshold, but a few hundred hours of transcribed speech is often a reasonable starting point when adapting to a new language. This model may require more data than general-domain ASR models due to specialised terminology. You can follow the fine-tuning workflow here.
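On evaluation: the standard ASR metric is word error rate (WER), which is edit distance over words divided by the reference length, and it works unchanged on Devanagari text once transcripts are normalised consistently. A minimal self-contained implementation, as a sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, two-row dynamic programming.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)

# Works on Hindi transcripts too, e.g. a one-word deletion out of four:
# wer("मरीज़ को बुखार है", "मरीज़ को बुखार") -> 0.25
```

For medical ASR it is also worth reporting error rates on a held-out list of domain terms (drug names, anatomy) separately, since overall WER can hide failures on exactly the vocabulary that matters clinically.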
Thank you!