A fast reasoning challenge built around classic AI blind spots: letter counting, hidden assumptions, math traps, and one visual pattern puzzle.
AI language models such as ChatGPT (GPT-4o) are trained to predict the next token in a sequence, not to reason from first principles. This makes them surprisingly bad at tasks that seem simple to humans, such as counting letters, spotting hidden assumptions, and avoiding math traps.
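Letter counting is a useful illustration of the gap. The sketch below is not how any particular model works; it contrasts a character-level count, which is trivial for a program, with a made-up token split of the kind a model might see instead of individual letters. The word `strawberry`, the `count_letter` helper, and the `hypothetical_tokens` split are all assumptions chosen for illustration, and real tokenizers split text differently.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


# Hypothetical token split, for illustration only; real tokenizers differ.
hypothetical_tokens = ["str", "aw", "berry"]

# A program sees every character, so the count is exact.
word = "".join(hypothetical_tokens)
total = count_letter(word, "r")

# A token-based model sees chunks like these rather than single letters,
# so the answer has to be assembled from pieces it never inspects directly.
per_token = [count_letter(tok, "r") for tok in hypothetical_tokens]

print(word, total, per_token)
```

The per-token counts only sum to the right answer here because the code can look inside each chunk; a model that treats each token as an opaque unit has no such direct view of the letters.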
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) was designed by deep learning pioneer François Chollet as a benchmark for general intelligence — not just memorized knowledge. Each puzzle shows a few input→output grid examples, and you must infer the rule to complete a new test case.
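To make the puzzle format concrete, here is a minimal sketch of a toy ARC-style task, assuming grids are stored as lists of integer rows. The example grids, the two candidate rules, and the helper `fits_all` are hypothetical and invented for illustration; they are not taken from the ARC-AGI dataset, and real puzzles need far richer rules than a fixed list of candidates.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]
Rule = Callable[[Grid], Grid]


def flip_horizontal(grid: Grid) -> Grid:
    """Candidate rule: mirror each row left to right."""
    return [list(reversed(row)) for row in grid]


def transpose(grid: Grid) -> Grid:
    """Candidate rule: swap rows and columns."""
    return [list(col) for col in zip(*grid)]


def fits_all(rule: Rule, examples: List[Tuple[Grid, Grid]]) -> bool:
    """Return True if the rule maps every example input to its output."""
    return all(rule(inp) == out for inp, out in examples)


# Hypothetical training pairs (input grid, output grid).
train_examples: List[Tuple[Grid, Grid]] = [
    ([[1, 0, 0], [0, 2, 0]], [[0, 0, 1], [0, 2, 0]]),
    ([[3, 3, 0], [0, 0, 4]], [[0, 3, 3], [4, 0, 0]]),
]
test_input: Grid = [[5, 0], [0, 6]]

candidates: Dict[str, Rule] = {
    "flip_horizontal": flip_horizontal,
    "transpose": transpose,
}

# Keep only the rules consistent with every training pair,
# then apply the surviving rule to the unseen test grid.
matching = [name for name, rule in candidates.items()
            if fits_all(rule, train_examples)]
prediction = candidates[matching[0]](test_input)

print(matching, prediction)
```

Here the rule is picked from two hand-written candidates. The hard part of a real puzzle is that the solver must invent the rule from the examples alone, which is what the benchmark is meant to test.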
When ARC-AGI-2 launched in March 2025, every frontier AI model — GPT-4o, Claude 3.7, Gemini 2.0 Flash — scored between 0% and 1.3%. The human average is 60%. Only with extremely expensive multi-attempt scaffolding (costing $30–$77 per question) did AI systems approach human performance. The ARC-AGI dataset is released under the Apache 2.0 license by the ARC Prize Foundation.
GPT-4o (without extended thinking) scores approximately 1/10 on the questions in this challenge, based on documented research and widely reported failure cases from 2024–2025. The one question it reliably gets right (the 28-day riddle) is now so well known that it likely appears in the model's training data. On novel variants of all the other questions, failure rates range from 40% to 100%.
Most humans who approach these questions carefully score 6–8/10. The ARC puzzle and the "portrait" logic riddle are the hardest — they trip up humans and AI alike, though for different reasons.
5 original ARC-AGI-style puzzles — the benchmark that stumps frontier AI. Humans average 60%. Base AI scores ~0%. Free.