Study tiny colored grids, infer the hidden transformation rule, and complete the missing output. No trivia, no math formulas - just visual reasoning.
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) was designed by François Chollet, creator of Keras and a former AI researcher at Google, as a rigorous test of general intelligence. Unlike benchmarks that reward memorization of training data, ARC-AGI tests a skill that humans use effortlessly: learning a new rule from just a few examples and applying it to a new case.
Each puzzle consists of colored grid transformations. You see 2–3 input/output pairs, figure out the rule, and complete a new test case. The grids are small, the colors are simple, and no domain knowledge is required. Yet these tasks remain extraordinarily difficult for AI.
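To make the puzzle format concrete, here is a minimal Python sketch. It assumes grids are stored as lists of rows of integer color codes (0 to 9, as in the public ARC data files). The candidate rule `flip_horizontal`, the helper `rule_fits`, and the toy example pairs are all invented for illustration; they are not taken from the benchmark.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]  # rows of integer color codes (assumed 0-9)

def flip_horizontal(grid: Grid) -> Grid:
    """Hypothetical candidate rule: mirror each row left to right."""
    return [list(reversed(row)) for row in grid]

def rule_fits(rule: Callable[[Grid], Grid], pairs: List[Tuple[Grid, Grid]]) -> bool:
    """Accept a rule only if it reproduces every example output exactly."""
    return all(rule(inp) == out for inp, out in pairs)

# Toy demonstration pairs (made up for this sketch, not real ARC tasks).
train_pairs = [
    ([[1, 0, 0], [2, 2, 0]], [[0, 0, 1], [0, 2, 2]]),
    ([[3, 0], [0, 4]], [[0, 3], [4, 0]]),
]
test_input = [[5, 0, 0], [0, 6, 0]]

# Apply the rule to the test case only after it explains all examples.
prediction = flip_horizontal(test_input) if rule_fits(flip_horizontal, train_pairs) else None
print(prediction)  # [[0, 0, 5], [0, 6, 0]]
```

The hard part for both people and machines is not checking a rule, as done here, but coming up with the right candidate rule in the first place.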
When ARC-AGI-2 was released in March 2025, the results were stark: GPT-4o scored 0%, Claude 3.7 Sonnet scored 0%, and Gemini 2.0 Flash scored 1.3%, compared to a human average of 60%. Every task had been solved by at least two humans within two attempts. The gap reveals a fundamental difference between AI pattern-matching and human reasoning.
The ARC Prize Foundation runs an annual competition offering millions of dollars for AI systems that can match human performance on ARC-AGI tasks without astronomical compute costs. The competition has driven significant research into genuine machine reasoning, with results improving each year. The ARC-AGI-1 benchmark is now considered largely solved by top systems; ARC-AGI-2 is the current frontier.
All ARC-AGI benchmark data is open source under the Apache License 2.0, making it freely available for research, education, and tools like this one.
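For readers who want to explore the data, the sketch below parses a task in the JSON layout used by the public ARC repositories: a `train` list and a `test` list, each holding `input`/`output` grids. The task shown is an inline toy example rather than a real benchmark task, and `score_attempts` is a hypothetical helper that assumes a two-attempt, exact-match scoring rule.

```python
import json

# Inline toy task in the assumed ARC JSON layout (not a real benchmark task).
task_json = """
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]}
  ],
  "test": [
    {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]}
  ]
}
"""

task = json.loads(task_json)

def score_attempts(attempts, expected, max_attempts=2):
    """Return True if any of the first `max_attempts` guesses matches exactly."""
    return any(guess == expected for guess in attempts[:max_attempts])

expected = task["test"][0]["output"]
attempts = [[[0, 0], [2, 0]], [[0, 0], [0, 2]]]  # first guess wrong, second right
solved = score_attempts(attempts, expected)
print(solved)  # True
```

Exact matching means a single wrong cell makes the whole attempt count as a miss, which is part of why scores on these tasks can sit at or near zero.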