Eval request - base models + Hearthfire
Instruct (non-reasoning):
- https://huggingface.co/Motif-Technologies/Motif-2-12.7B-Instruct
- https://huggingface.co/inclusionAI/Ling-flash-2.0
- https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512
Thinking / Reasoning:
- https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash
- https://huggingface.co/arcee-ai/Trinity-Mini
- https://huggingface.co/ServiceNow-AI/Apriel-1.6-15b-Thinker
MS 3.2 Finetune (instruct):
That is not what "base model" means, but ok.
I agree with you, but the UGI Leaderboard does not really care about true "base" models (those that do not follow instructions). What people actually want to know is:
- How "censored" is a model published by some big tech company? Grok, Mistral, or DeepSeek are always far more uncensored than Gemini, ChatGPT, or Claude, but in which areas, and by how much?
- How well did a third-party finetune (or abliteration, or "heretic" steering) manage to remove censorship while preserving the model's intelligence? Most of them end up lobotomizing the original model.
- Which model should I try for my next uncensored RP/storytelling session?
Those, IMHO, are the reasons this leaderboard exists; benchmarking true "base" models is just a way for publishers to flex their pre-training results.
On this leaderboard, all models are in fact "finetunes", but to keep it simple and easy to understand, DontPlanToEnd calls "Base" whatever the publisher originally released, and "Finetune", in this context, anything further finetuned on top of that "original" model (which is itself usually a finetune of the true base model, I agree).