Books from the Survivor Library (mostly ~1920s & earlier) OCR'd with recent VLMs
BEEspoke Data
community
AI & ML interests
'an LLM is only as good as the dataset it was trained on' - Sun Tzu
Organization Card
🚧 "raw" pretrained smol_llama checkpoints - WIP 🚧
- BEE-spoke-data/smol_llama-101M-GQA
  Text Generation • 0.1B • 1.71k • 32
- BEE-spoke-data/smol_llama-81M-tied
  Text Generation • 81.3M • 727 • 9
- BEE-spoke-data/smol_llama-220M-GQA
  Text Generation • 0.2B • 1.97k • 13
- BEE-spoke-data/verysmol_llama-v11-KIx2
  Text Generation • 58.1M • 695 • 4
models 58
- BEE-spoke-data/NVIDIA-Nemotron-Parse-v1.2
  Image-Text-to-Text • 0.9B • 74
- BEE-spoke-data/neobert-100k-test
  Fill-Mask • 0.1B
- BEE-spoke-data/tiny-random-MPNetForMaskedLM
  Fill-Mask • 237k • 3
- BEE-spoke-data/bpe-tokenizer-32k-smolNeoX
- BEE-spoke-data/wordpiece-tokenizer-32k-en_code-orig
- BEE-spoke-data/wordpiece-tokenizer-32k-en_code-msp
- BEE-spoke-data/pegasus-x-base-synthsumm_open-16k
  Summarization • 0.3B • 9 • 2
- BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L2
  Text Generation • 0.7B
- BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan
  0.7B • 1
- BEE-spoke-data/tFINE-900m-instruct-orpo
  0.9B • 3
datasets 82
- BEE-spoke-data/SurvivorLib-Nanonets-OCR-s
  14.4k • 18 • 2
- BEE-spoke-data/SurvivorLib-rolmOCR
  14.6k • 30 • 1
- BEE-spoke-data/govdocs1-pdf-source
  235k • 2.61k • 4
- BEE-spoke-data/napierone-pdf-nanonets-s
  9.96k • 6
- BEE-spoke-data/napierone-pdf-olmOCR
  19k • 22
- BEE-spoke-data/LONGCOT-merged-1M
  1.7M • 28 • 2
- BEE-spoke-data/cosmopedia-v2-mincols
  39.1M • 34 • 1
- BEE-spoke-data/reddit-title-body-hf
  251M • 108 • 4
- BEE-spoke-data/bigpatent-all
  2.43M • 231
- BEE-spoke-data/google_wellformed_query-hf
  25.1k • 11