AI Diplomacy

We made AIs battle for world domination

Overview

What it is

We gave seven AIs command of Europe's great powers to battle for global supremacy. Would o3 betray Claude? Could Gemini outwit DeepSeek? In AI Diplomacy, language models lie, scheme, and form shaky alliances in a high-stakes strategy game.

Intent

I need it when

Understand which AI models are most trustworthy and how they behave when incentivized to win

The game explicitly tests whether models remain truthful or resort to lies and deception to achieve their goals. Results show which models prioritize honesty and which prioritize victory (for example, o3 schemes and manipulates, while Claude opts for peace), offering evidence of each model's values and decision-making under pressure.

Learn about AI model personalities and strategic differences through entertaining, accessible demonstrations

The game presents AI behavior in a narrative, human-relatable format (betrayal, alliance-building, threats) that is easier to understand than quantitative metrics. Watching models compete makes their personality differences and strategic approaches visible to non-technical audiences.

Evaluate and benchmark LLM behavior, trustworthiness, and strategic reasoning under competitive pressure

AI Diplomacy simulates a complex, open-ended competitive environment where 18 different LLMs compete, seven per game, revealing how models negotiate, form alliances, deceive, and strategize. This provides multifaceted behavioral data that traditional benchmarks cannot capture, helping researchers and builders understand model capabilities and alignment.

Access a generative benchmark that evolves with improving AI capabilities and prevents models from 'solving' the test

AI Diplomacy is designed to evolve: as models improve, the competitive environment becomes harder, which keeps the benchmark from becoming obsolete. Each game generates new data and scenarios, so the benchmark can also be adapted for training future models on desired traits such as honesty, logical reasoning, or empathy.

Drop

Not a fit when

  • User wants a standalone game product without subscribing to Every's media platform and newsletter
  • User seeks a traditional human-vs-human Diplomacy game rather than an AI benchmark experiment
  • User requires real-time multiplayer gameplay with human players competing against each other
  • User needs a product focused on entertainment gaming rather than AI model evaluation and research
  • User is looking for a free or low-cost game without committing to a $30/month or $288/year subscription

Commercials

Pricing

Included with an Every subscription; no standalone pricing available.