Benchmark and compare reasoning model performance across multiple tasks
DeepSeek R1 reports evaluation results on math (AIME, MATH-500), code (Codeforces, LiveCodeBench), and knowledge and reading-comprehension (MMLU, DROP) benchmarks. Users can set these per-benchmark metrics side by side with those of GPT-4o, Claude-3.5-Sonnet, and OpenAI-o1.
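A side-by-side comparison of this kind can be sketched in a few lines of Python. This is a minimal illustration, not part of DeepSeek R1 or any official tooling: the helper names `best_per_benchmark` and `score_gaps`, the benchmark and model keys, and every number in `example` are hypothetical placeholders, and real scores would have to be filled in from the published results.

```python
from typing import Dict

# benchmark name -> model name -> score (higher is assumed to be better)
Scores = Dict[str, Dict[str, float]]


def best_per_benchmark(scores: Scores) -> Dict[str, str]:
    """Return the top-scoring model for each benchmark that has any scores."""
    return {
        bench: max(by_model, key=by_model.get)
        for bench, by_model in scores.items()
        if by_model
    }


def score_gaps(scores: Scores, baseline: str, other: str) -> Dict[str, float]:
    """Return other - baseline per benchmark, skipping benchmarks missing either model."""
    return {
        bench: by_model[other] - by_model[baseline]
        for bench, by_model in scores.items()
        if baseline in by_model and other in by_model
    }


# Placeholder values only: these are NOT real benchmark results.
example: Scores = {
    "benchmark_a": {"model_x": 1.0, "model_y": 2.0},
    "benchmark_b": {"model_x": 3.0, "model_y": 1.5},
}

if __name__ == "__main__":
    print(best_per_benchmark(example))
    print(score_gaps(example, baseline="model_x", other="model_y"))
```

Benchmarks often use different metrics (for example a pass rate on one and a rating on another), so a gap is only meaningful within a single benchmark and should not be summed or averaged across benchmarks.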
