ResearchGym: Evaluating Language Model Agents on Real-World AI Research Paper • 2602.15112 • Published 11 days ago • 20
Article OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments +3 16 days ago • 30
Article We Got Claude to Build CUDA Kernels and teach open models! +2 about 1 month ago • 142
Article Community Evals: Because we're done trusting black-box leaderboards over the community +5 24 days ago • 80
Article Alyah ⭐️: Toward Robust Evaluation of Emirati Dialect Capabilities in Arabic LLMs Jan 27 • 24
Article AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality Jan 21 • 31
AgriLLM Collection A collection of the artifacts for the AgriLLM initiative. • 5 items • Updated Dec 15, 2025 • 5
Article The Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator Dec 17, 2025 • 47
Article Introducing Falcon-H1-Arabic: Pushing the Boundaries of Arabic Language AI with Hybrid Architecture Jan 5 • 40
Dalla models Collection Dalla is a family of Arabic language models optimized for Arabic text processing through advanced tokenization techniques. • 4 items • Updated Dec 16, 2025 • 2
Can a Multichoice Dataset be Repurposed for Extractive Question Answering? Paper • 2404.17342 • Published Apr 26, 2024 • 2
Jais-2-Family Collection The 2nd generation of the Jais Large Language Models Family • 4 items • Updated 7 days ago • 13
Ministral 3 Collection A collection of edge models, with Base, Instruct and Reasoning variants, in 3 different sizes: 3B, 8B and 14B. All with vision capabilities. • 9 items • Updated Dec 2, 2025 • 155
Mistral Large 3 Collection A state-of-the-art, open-weight, general-purpose multimodal model with a granular Mixture-of-Experts architecture. • 4 items • Updated Dec 2, 2025 • 91
Sparse Auto-Encoders (SAEs) for Mechanistic Interpretability Collection A compilation of sparse auto-encoders trained on large language models. • 37 items • Updated Dec 16, 2025 • 23