LLM Model Leaderboard

Compare the top large language models by ELO score (based on Chatbot Arena data as of Q4 2024).

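| Rank | Model | Provider | ELO | Context | License | Notes |
|------|-------|----------|-----|---------|---------|-------|
| 1 | GPT-4o | OpenAI | 1287 | 128k | Proprietary | Best at complex reasoning. |
| 2 | Claude 3.5 Sonnet | Anthropic | 1270 | 200k | Proprietary | Long context champion. |
| 3 | Gemini 1.5 Pro | Google | 1260 | 1M | Proprietary | Massive context window. |
| 4 | Claude 3 Opus | Anthropic | 1248 | 200k | Proprietary | Maximum depth. |
| 5 | GPT-4 Turbo | OpenAI | 1245 | 128k | Proprietary | Fast flagship. |
| 6 | Llama 3.1 405B | Meta | 1225 | 128k | Open | Best open-source. |
| 7 | Mistral Large 2 | Mistral | 1210 | 32k | Open | European leader. |
| 8 | Claude 3 Haiku | Anthropic | 1168 | 200k | Proprietary | Fastest Claude. |
| 9 | Gemma 2 27B | Google | 1140 | 8k | Open | Small, efficient. |
| 10 | GPT-3.5 Turbo | OpenAI | 1105 | 16k | Proprietary | Budget option. |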

ELO scores are estimates based on public Chatbot Arena data and may change. Always verify with the latest sources.
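
Since the scores can change, the Rank column is best derived from the ELO values rather than maintained by hand. The following is a minimal sketch, assuming the table is held as a list of (model, ELO) pairs; the names `rows` and `assign_ranks` are hypothetical and not part of any existing tool.

```python
# Hypothetical in-memory copy of the Model and ELO columns from the table.
rows = [
    ("GPT-4o", 1287),
    ("Claude 3.5 Sonnet", 1270),
    ("Gemini 1.5 Pro", 1260),
    ("Claude 3 Opus", 1248),
    ("GPT-4 Turbo", 1245),
    ("Llama 3.1 405B", 1225),
    ("Mistral Large 2", 1210),
    ("Claude 3 Haiku", 1168),
    ("Gemma 2 27B", 1140),
    ("GPT-3.5 Turbo", 1105),
]


def assign_ranks(rows):
    """Sort by ELO (highest first) and number the rows starting at 1."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(rank, name, elo) for rank, (name, elo) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    for rank, name, elo in assign_ranks(rows):
        print(f"{rank:>2}  {name:<18} {elo}")
```

Recomputing ranks this way keeps the numbering consistent with the scores whenever a snapshot is refreshed.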

Frequently Asked Questions

What is ELO Score?

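ELO (more properly "Elo", after its creator Arpad Elo) is a rating system borrowed from chess. For LLMs, it is derived from head-to-head comparisons in which human judges pick the better of two model responses: a model gains points when its response is preferred and loses points when it is not. A higher score is better.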
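
To make the idea concrete, here is a minimal sketch of the classic chess-style Elo formulas: the expected score of one model against another, and the rating update after a single comparison. The step size `k=32` and the function names are assumptions for illustration; Chatbot Arena's actual fitting procedure may differ from this simple online update.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B (between 0 and 1) under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one head-to-head comparison.

    score_a is 1.0 if A's response was preferred, 0.0 if B's was, 0.5 for a tie.
    k is an assumed step size controlling how far a single result moves a rating.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # A 17-point gap (1287 vs 1270) implies only a slight edge.
    print(f"{expected_score(1287, 1270):.3f}")  # prints 0.524
    # Two equally rated models: the winner gains k/2 points, the loser drops k/2.
    print(update_ratings(1200, 1200, score_a=1.0))  # prints (1216.0, 1184.0)
```

Under these formulas, small gaps between neighboring entries on the leaderboard correspond to win rates close to 50%, so nearby ranks should not be over-interpreted.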

GPT-4o vs Claude 3.5 Sonnet?

They trade blows. GPT-4o is often better at complex math and code, while Claude 3.5 Sonnet excels at long documents, summarization, and nuanced writing. Claude also has a significantly larger context window (200k vs. 128k tokens).

What does 'Open' license mean?

'Open' means the model weights are publicly available and can often be self-hosted. It does NOT necessarily mean the model is free for commercial use; always check the specific license (e.g., Llama 3 Community License).

Where is this data from?

This data is a snapshot inspired by LMSYS Chatbot Arena and public benchmarks. ELO scores fluctuate daily. For the latest figures, visit the LMSYS Leaderboard.