LLM Leaderboard

Compare the top large language models by Elo score (based on Chatbot Arena data as of Q4 2024).

Rank	Model	Developer	Elo	Context (tokens)	License	Notes
1	GPT-4o	OpenAI	1287	128k	Proprietary	Best at complex reasoning.
2	Claude 3.5 Sonnet	Anthropic	1270	200k	Proprietary	Long context champion.
3	Gemini 1.5 Pro	Google	1260	1M	Proprietary	Massive context window.
4	Claude 3 Opus	Anthropic	1248	200k	Proprietary	Maximum depth.
5	GPT-4 Turbo	OpenAI	1245	128k	Proprietary	Fast flagship.
6	Llama 3.1 405B	Meta	1225	128k	Open	Best open-source.
7	Mistral Large 2	Mistral	1210	32k	Open	European leader.
8	Claude 3 Haiku	Anthropic	1168	200k	Proprietary	Fastest Claude.
9	Gemma 2 27B	Google	1140	8k	Open	Small, efficient.
10	GPT-3.5 Turbo	OpenAI	1105	16k	Proprietary	Budget option.

Elo scores are estimates based on public Chatbot Arena data and may change. Always verify against the latest sources.

Frequently Asked Questions

What is an Elo score?

Elo is a rating system borrowed from chess and named after its creator, Arpad Elo. For LLMs, ratings are derived from head-to-head comparisons in which human judges pick the better of two model responses. A higher rating means the model wins more of those comparisons.
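
To make the rating idea concrete, here is a minimal sketch of the classic chess-style Elo update in Python. It is an illustration only: the function names and the K-factor of 32 are assumptions chosen for the example, and the leaderboard's own rating computation may differ from this simple online update.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one comparison.

    score_a is 1.0 if A's response was picked, 0.0 if B's was, 0.5 for a tie.
    k (assumed 32 here) controls how far a single vote moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated models; a judge prefers A's response.
new_a, new_b = update(1200.0, 1200.0, score_a=1.0)
print(new_a, new_b)
```

An upset (a lower-rated model winning) moves the ratings more than an expected result does, which is why ratings settle as more votes accumulate.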

GPT-4o vs Claude 3.5 Sonnet?

They trade blows. GPT-4o is often better at complex math and code, while Claude 3.5 Sonnet excels at long documents, summarization, and nuanced writing. Claude 3.5 Sonnet also has a larger context window (200k vs. 128k tokens).
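
How close is the race in rating terms? The sketch below converts the gap between the two scores listed in the table (1287 and 1270) into an expected preference rate using the standard Elo formula. The helper name `win_probability` is made up for this example, and the result is only as reliable as the estimated scores it starts from.

```python
def win_probability(rating_gap: float) -> float:
    """Chance the higher-rated model is preferred, given its rating lead."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))


gap = 1287 - 1270  # GPT-4o minus Claude 3.5 Sonnet, per the table
p = win_probability(gap)
print(f"{gap}-point lead -> {p:.1%} expected preference rate")
```

A 17-point lead works out to roughly a 52% chance of being preferred in any single comparison, which is close to a coin flip.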

What does 'Open' license mean?

'Open' means the model weights are publicly available and can often be self-hosted. It does NOT necessarily mean the model is free for commercial use; always check the specific license (e.g., the Llama 3.1 Community License).

Where is this data from?

This data is a snapshot inspired by the LMSYS Chatbot Arena and public benchmarks. Elo scores shift as new votes come in. For the latest figures, visit the LMSYS Chatbot Arena leaderboard.