LLM Model Leaderboard
Compare the top large language models by ELO score (based on Chatbot Arena data as of Q4 2024).
| Rank | Model | ELO | Context | License | Notes |
|---|---|---|---|---|---|
| 1 | GPT-4o (OpenAI) | 1287 | 128k | Proprietary | Best at complex reasoning. |
| 2 | Claude 3.5 Sonnet (Anthropic) | 1270 | 200k | Proprietary | Long context champion. |
| 3 | Gemini 1.5 Pro (Google) | 1260 | 1M | Proprietary | Massive context window. |
| 4 | Claude 3 Opus (Anthropic) | 1248 | 200k | Proprietary | Maximum depth. |
| 5 | GPT-4 Turbo (OpenAI) | 1245 | 128k | Proprietary | Fast flagship. |
| 6 | Llama 3.1 405B (Meta) | 1225 | 128k | Open | Best open-source. |
| 7 | Mistral Large 2 (Mistral) | 1210 | 32k | Open | European leader. |
| 8 | Claude 3 Haiku (Anthropic) | 1168 | 200k | Proprietary | Fastest Claude. |
| 9 | Gemma 2 27B (Google) | 1140 | 8k | Open | Small, efficient. |
| 10 | GPT-3.5 Turbo (OpenAI) | 1105 | 16k | Proprietary | Budget option. |
ELO scores are estimates based on public Chatbot Arena data and may change. Always verify them against the latest leaderboard.
Frequently Asked Questions
What is ELO Score?
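ELO (properly "Elo", after its creator Arpad Elo) is a rating system borrowed from chess. For LLMs, it is derived from head-to-head comparisons in which human judges see responses from two models and pick the better one. A model gains rating when it wins and loses rating when it is beaten, and the size of the change depends on how unexpected the result was. Higher is better.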
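To make the rating arithmetic concrete, here is a minimal sketch of the classic chess-style Elo formulas in Python. It is an illustration only: the function names, the K-factor of 32, and the scale constant of 400 are conventional textbook choices assumed here, and Chatbot Arena's actual computation may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one comparison.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    k = 32 is an assumed K-factor, not a value taken from the leaderboard.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


# Ratings for the top two models in the table above.
gpt4o, sonnet = 1287.0, 1270.0

# Implied chance that a judge prefers GPT-4o in a single comparison.
p = expected_score(gpt4o, sonnet)
print(f"{p:.3f}")  # 0.524

# Ratings after a single (hypothetical) comparison won by Claude 3.5 Sonnet.
new_gpt4o, new_sonnet = update(gpt4o, sonnet, score_a=0.0)
print(f"{new_gpt4o:.1f} {new_sonnet:.1f}")
```

Under these assumptions, a 17-point gap translates to only about a 52% chance of being preferred, which is why neighboring models on the leaderboard are hard to tell apart in everyday use.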
GPT-4o vs Claude 3.5 Sonnet?
They trade blows: only 17 ELO points separate them in the table above (1287 vs 1270). GPT-4o is often better at complex math/code, while Claude 3.5 Sonnet excels at long documents, summarization, and nuanced writing. Claude also has a larger context window (200k vs 128k tokens).
What does 'Open' license mean?
'Open' means the model weights are publicly available and can often be self-hosted. It does NOT necessarily mean it is free for commercial use; always check the specific license (e.g., Llama 3 Community License).
Where is this data from?
This data is a snapshot inspired by LMSYS Chatbot Arena and public benchmarks. ELO scores fluctuate daily. For the latest figures, visit the LMSYS Chatbot Arena leaderboard.