December 11, 2024

Beyond Skills: Redefining Intelligence with ARC-AGI

ARC-AGI challenges AI to move beyond task-specific skills, emphasizing adaptability and reasoning.

Daniel Gieseler

CW's Cognition Engineer


In a previous article [ref-1], we discussed the reasoning abilities of large language models (LLMs) in the context of Tic-tac-toe. We’re now stepping into a more advanced arena: the ARC-AGI benchmark (Abstraction and Reasoning Corpus for Artificial General Intelligence). We chose to explore ARC-AGI because it is arguably the best challenge to push us toward the next step in achieving AGI.

ARC was introduced five years ago in the seminal paper, On the Measure of Intelligence, by François Chollet, the influential researcher and creator of the Keras deep-learning library. In the paper, Chollet addresses what he sees as a categorical error in our understanding of intelligence. The error lies in conflating intelligence, the adaptive process itself, with skill, the output of that process. One is about acquiring skills; the other is about having skills. Let’s clarify this distinction.

Intelligence as the amount of task-specific skills

Chollet considers that focusing on task-specific skills is a "useful goal, but an incorrect measure of intelligence." Its usefulness is clear in today’s AI ecosystem, where most developments aim at solving specific problems. Such a narrow focus has undeniably impacted society and, as AI continues to advance in mastering task-specific skills, we may eventually reach a point where a significant portion of economically valuable work becomes automated, task by task. Such a shift would undoubtedly transform our relationship with labor in ways we have never experienced.

All of this is to say that aiming for task-specific skills has merit. However, something important is missing, and the approach presents a couple of problems:

The goalpost keeps moving. This is a familiar observation in AI, where benchmarks keep saturating to human-level performance, yet we’re left unconvinced about the intelligence status of these systems. This trend has become even more pronounced in recent years, with LLMs making remarkable strides across various benchmarks, from coding tasks to graduate-level exams.

However, even after five years, the ARC benchmark has shown a unique resilience to this trend (see graph below). As Chollet expresses it, “ARC-AGI is still the only benchmark that was designed to resist memorization.” To ensure this, ARC strictly prevents public exposure of its evaluation set of tasks and designs tasks that are distinct, requiring test-takers to demonstrate a deeper understanding rather than relying solely on pattern matching against publicly available tasks.

Comparison between ARC and other benchmarks [ref-2]

You get what you optimize for. Once an objective is defined, any optimization process is prone to exploiting shortcuts to achieve it. This is a recurring challenge in machine learning, where models can often solve problems by relying on superficial statistics rather than developing the rich representations intended by their developers. For example, a model trained to recognize animals might rely solely on the sandy background to identify a camel rather than learning the animal's defining features. Typically, the developer wants their model to generalize beyond the training dataset, but this is especially tricky when optimizing for task-specific skills. Ideally, as Chollet suggests, “one must find a way to optimize directly for flexibility and generality.”
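The camel example above can be made concrete with a toy sketch. The feature values and the threshold classifier below are entirely hypothetical, invented for illustration: when a spurious feature (background brightness) happens to separate the training classes as cleanly as the true feature (hump size), nothing in the optimization process prefers the right one, and the shortcut fails out of distribution.

```python
# Hypothetical toy data: each sample is (background_brightness, hump_size),
# with label 1 = camel. In training, background correlates perfectly with label.
train = [
    ((0.9, 0.8), 1),  # camel on sand: bright background, large hump
    ((0.9, 0.7), 1),
    ((0.2, 0.1), 0),  # cow on grass: dark background, no hump
    ((0.3, 0.0), 0),
]

def fit_threshold(data, feature):
    """Fit a 1-D threshold classifier on a single feature index:
    midpoint between the lowest positive and highest negative value."""
    pos = [x[feature] for x, y in data if y == 1]
    neg = [x[feature] for x, y in data if y == 0]
    return (min(pos) + max(neg)) / 2

# On the training set alone, both features separate the classes perfectly,
# so optimizing for training accuracy cannot tell shortcut from substance.
bg_thresh = fit_threshold(train, 0)    # spurious background feature
hump_thresh = fit_threshold(train, 1)  # genuine animal feature

# Out of distribution: a camel photographed on dark grass.
camel_on_grass = (0.25, 0.8)
print(camel_on_grass[0] > bg_thresh)    # shortcut wrongly says "not a camel"
print(camel_on_grass[1] > hump_thresh)  # true feature correctly says "camel"
```

The shortcut classifier is indistinguishable from the genuine one by training accuracy alone; only data that breaks the spurious correlation reveals the difference.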

Intelligence as the efficiency of acquiring skills

Chollet defines intelligence as “a measure of [a system’s] skill-acquisition efficiency over a range of tasks, considering its prior knowledge, experience, and the difficulty of generalizing.” Notice the emphasis on controlling for priors and experience. This is essential to the measure of intelligence. To clarify, think of priors as the behavior hard-coded by the developer, and experience as the data used for machine learning.

As Chollet observes, “unlimited priors or unlimited training data allow developers to ‘buy’ a system’s level of skill.” This is the norm under the skill-focused paradigm, which leads to narrow AI systems that excel at task-specific skills but are ineffective outside their specialized domains. Such systems give a misleading impression of intelligence because they are excellent at what they do, but their performance relies heavily on predefined knowledge and experience rather than the kind of generality seen in humans. For Chollet, in contrast, true intelligence is about efficiently generalizing from familiar situations to unfamiliar ones (see image below).

Contrast between two degrees of intelligence [ref-3]

Control for priors. ARC tasks assume that the solver has priors as outlined by the Theory of Core Knowledge. These are evolutionarily ancient and widely shared across species, particularly among non-human primates. They are divided into four main categories. 1) Objectness: objects can be distinguished through spatial and color contiguity; and they can interact. 2) Goal-directedness: some objects are "agents" and their behavior can be organized according to goals. 3) Numbers: objects can be counted or sorted. 4) Geometry: objects can be mirrored, rotated, translated, combined, etc.
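The Geometry priors in particular map directly onto small grid operations. As a minimal sketch (the function names and grid values are our own, not part of ARC's specification), ARC-style grids can be modeled as 2-D lists of color codes, with the geometric priors expressed as transforms over them:

```python
from typing import List

# An ARC-style grid: a small 2-D array of integer color codes.
Grid = List[List[int]]

def mirror(grid: Grid) -> Grid:
    """Geometry prior: reflect a grid left-to-right."""
    return [row[::-1] for row in grid]

def rotate90(grid: Grid) -> Grid:
    """Geometry prior: rotate a grid 90 degrees clockwise."""
    return [list(col) for col in zip(*reversed(grid))]

def translate(grid: Grid, dr: int, dc: int, fill: int = 0) -> Grid:
    """Geometry prior: shift contents by (dr, dc), padding with `fill`."""
    h, w = len(grid), len(grid[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                out[nr][nc] = grid[r][c]
    return out
```

A solver equipped with such priors starts from the same vocabulary of object manipulations that the Theory of Core Knowledge attributes to humans, rather than having to learn them from data.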

Control for experience. Every task follows a few-shot setup: a few demonstration pairs (input, output) are given for an unknown transformation, and the solver is expected to apply that transformation to a test input. For example, in the image below three examples are given, and the solver is expected to have a prior understanding of the concepts of “line extrapolation”, “turning on obstacles”, and “efficiently reaching a goal”.

A maze-like task from ARC [ref-3]
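The public ARC tasks are distributed as JSON objects with a "train" list of demonstration pairs and a "test" list of inputs. The skeleton below sketches the few-shot setup in that format; the three candidate transforms and the toy task are our own illustrative inventions, far simpler than anything a real ARC solver would need:

```python
from typing import Callable, Dict, List, Optional

Grid = List[List[int]]

# A tiny, hypothetical hypothesis space of whole-grid transforms.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": lambda g: [row[:] for row in g],
    "mirror":   lambda g: [row[::-1] for row in g],
    "rotate90": lambda g: [list(c) for c in zip(*reversed(g))],
}

def solve(task: dict) -> Optional[Grid]:
    """Few-shot setup: pick the first candidate consistent with every
    demonstration pair, then apply it to the test input."""
    for name, transform in CANDIDATES.items():
        if all(transform(pair["input"]) == pair["output"]
               for pair in task["train"]):
            return transform(task["test"][0]["input"])
    return None  # no hypothesis explains the demonstrations

# A toy task in the ARC JSON layout: both demonstrations are left-right mirrors.
toy_task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[0, 1], [2, 0]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[5, 0, 7]]}],
}
```

Here `solve(toy_task)` correctly mirrors the test input. Real ARC tasks defeat this kind of fixed enumeration precisely because each task's transformation is novel, which is what forces skill acquisition rather than skill retrieval.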

ARC forces test-takers to demonstrate a sophisticated form of reasoning that current AI systems struggle to replicate. The limited number of examples is especially effective at exposing the stark training inefficiency of modern deep learning, which has long been criticized for requiring vast amounts of data, equivalent to many human lifetimes, to adapt to new tasks. Only time will reveal how resilient ARC remains to exploitation by machine learning, but for now, it stands as a much-needed challenge for the AI community. This challenge compels us to explore deeper, better ideas for implementing true intelligence. In the next part of this series, we will present some of the key ideas that have already emerged from state-of-the-art solutions for ARC.

References:
[ref-1] https://www.cloudwalk.io/ai/consciousness-reasoning-and-llms-playing-tic-tac-toe
[ref-2] https://arcprize.org/
[ref-3] Chollet, François. "On the Measure of Intelligence." 2019. https://arxiv.org/abs/1911.01547