Abstract
A large-scale Dutch medical language corpus containing approximately 100 million documents with 35 billion tokens has been created for pre-training and downstream natural language processing tasks.
Background: Dutch medical corpora are scarce, limiting NLP development.
Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources.
Results: The resulting corpus comprises approximately 35 billion tokens across the medical domain in about 100 million documents, freely available on Hugging Face.
Conclusion: This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.