Andrew Reed
andrewrreed
AI & ML interests
Applied ML, Practical AI, Inference & Deployment, LLMs, Multi-modal Models, RAG
LLM as a Judge
Curated resources that support the use of LLMs to serve as automatic evaluators of other LLM outputs.
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges
  Paper • 2310.17631 • Published • 35
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
  Paper • 2310.08491 • Published • 57
- Generative Judge for Evaluating Alignment
  Paper • 2310.05470 • Published • 1
- Calibrating LLM-Based Evaluator
  Paper • 2309.13308 • Published • 12
Hallucination Detection
- vectara/hallucination_evaluation_model
  Text Classification • Updated • 202k • 340
- notrichardren/HaluEval
  Viewer • Updated • 35k • 70
- TRUE: Re-evaluating Factual Consistency Evaluation
  Paper • 2204.04991 • Published • 1
- Fine-grained Hallucination Detection and Editing for Language Models
  Paper • 2401.06855 • Published • 4
Eval Leaderboards
- Arena Leaderboard
  🏆 Running • 4.73k • View the latest LMArena model leaderboard
- Open LLM Leaderboard
  🏆 Running on CPU Upgrade • 13.9k • Track, rank and evaluate open LLMs and chatbots
- MTEB Leaderboard
  🥇 Running on CPU Upgrade • 7.07k • Embedding Leaderboard
- LLM-Perf Leaderboard
  🏆 Running • Featured • 584 • Explore LLM performance across hardware configurations
Small, but mighty chat models
AI x Audio
Awesome Spaces
- StableDesign
  🏆 Running on Zero • 117 • Generate furnished interior from empty room photo
- IllusionDiffusion
  👁 Running on Zero • Featured • 5.37k • Generate stunning high quality illusion artwork
- InstantMesh
  📚 Runtime error • Featured • 1.57k • Create a 3D model from an image in 10 seconds!
- Sing an idea ➡️ Music
  🔥 Runtime error • Featured • 184 • Bring song ideas to life