---
library_name: transformers
datasets:
- HuggingFaceTB/cosmo2_training_data_subset_1M
---
cosmo2-tokenizer
Tokenizer used for training cosmo2. It was trained on 1M samples drawn from the following mix of sources:
- FineWeb-Edu 70%
- Cosmopedia v2 15%
- StarCoderData 8%
- OpenWebMath 5%
- StackOverflow 2%
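
The mixture above can be turned into approximate per-source sample counts with a short sketch. It assumes the percentages are shares of the 1M samples (rather than shares of tokens or bytes), which this card does not state explicitly. The `load_tokenizer` helper and its default `repo_id` are hypothetical: the Hub id is inferred from the card title and dataset namespace, so adjust it if the tokenizer lives elsewhere.

```python
# Mixture from the list above: source name -> share in percent.
MIXTURE_PERCENT = {
    "FineWeb-Edu": 70,
    "Cosmopedia v2": 15,
    "StarCoderData": 8,
    "OpenWebMath": 5,
    "StackOverflow": 2,
}

# "1M samples" from the card; assumes the shares are by sample count.
TOTAL_SAMPLES = 1_000_000


def samples_per_source(total=TOTAL_SAMPLES, mixture=MIXTURE_PERCENT):
    """Return approximate sample counts per source using integer arithmetic."""
    return {name: total * pct // 100 for name, pct in mixture.items()}


def load_tokenizer(repo_id="HuggingFaceTB/cosmo2-tokenizer"):
    """Load the tokenizer from the Hub.

    The default repo_id is an assumption, not something stated in this card.
    """
    # Imported lazily so the arithmetic above works without `transformers`.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


if __name__ == "__main__":
    # Sanity check: the listed shares should cover the whole training subset.
    assert sum(MIXTURE_PERCENT.values()) == 100
    for name, count in samples_per_source().items():
        print(f"{name}: {count:,}")
```

Calling `load_tokenizer()` requires the `transformers` package and network access to the Hub; the returned object can then be used like any other `transformers` tokenizer, for example `tokenizer("def add(a, b): return a + b")["input_ids"]`.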