arxiv:2604.25374

Language corpora for the Dutch medical domain

Published on Apr 28
Authors:

Abstract

Background: Dutch medical corpora are scarce, limiting NLP development.
Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources.
Results: The resulting corpus comprises approximately 35 billion tokens across the medical domain in about 100 million documents, freely available on Hugging Face.
Conclusion: This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.

AI-generated summary

A large-scale Dutch medical language corpus containing approximately 100 million documents with 35 billion tokens has been created for pre-training and downstream natural language processing tasks.


