Summary of Key Points
A new paper by Richard Sutton, the Turing Award winner and reinforcement learning pioneer, and Rafiee argues that today's mainstream AI systems (such as large language models and vision-only models) rely on “passive representation”: they try to understand the environment by building internal models of the world from static data. According to the authors, this approach cannot cope with the dynamic, complex real world. They propose that AI adopt “generative cognition” instead, in which intelligence is not a static copy of the world but emerges through interaction with the environment, physical action, and autonomous evaluation. Generative cognition rests on four pillars: experience, the integration of perception and action, autonomy, and embodiment. Reinforcement learning comes close to this idea, but it still needs to move beyond externally defined rewards and modular designs before AI can truly understand the world.
Breakdown and Interpretation
#### 1. Why can AI write papers yet fail to understand basic realities? – The trap of passive representation
Current AI systems are like bookworms: they can memorize vast amounts of text and visual patterns, but they have not personally experienced the real world. For example, a large language model (LLM) may state that “hot water is scalding,” yet it has never touched hot water and does not understand the sensation of pain; a video generation model can create fake videos, but it won’t reach out to catch an object that suddenly falls because its cognition is based on static data, not real-world interactions.
The root of this issue lies in “representationalism”: AI tries to create a static copy of the world inside itself, but the real world is dynamic (weather changes, people move unexpectedly) and infinitely complex, so no model can fully replicate it. Just as you cannot remember every detail of a city, an AI cannot hold a complete copy of the world.
#### 2. Generative cognition: AI must engage physically to truly understand the world
The core idea of generative cognition is that understanding comes from action, not just observation. For instance, humans learn to ride a bike by practicing and adjusting their posture; we don’t judge whether something is hot by looking at pictures but by touching it and feeling the heat (receiving feedback). For AI, this means it cannot simply sit on servers and process data; it must interact with the real world. For example, a robot should pick up a cup to feel its weight and temperature, or walk on its own to avoid obstacles. Through a cycle of action → feedback → adjustment, AI can develop genuine understanding.
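The action → feedback → adjustment cycle can be sketched in a few lines. This is only an illustration, not code from the paper: the `Cup` class, its `lift` method, and `learn_grip_force` are hypothetical names, and the "world" is reduced to a single number. The point is that the agent never reads the cup's weight directly; it only finds out by acting and correcting.

```python
class Cup:
    """Hypothetical environment. The agent cannot read `_weight` directly."""

    def __init__(self, weight: float) -> None:
        self._weight = weight

    def lift(self, force: float) -> float:
        """Try to lift with `force` and return slip feedback.

        Positive feedback means the grip was too weak, negative too strong.
        """
        return self._weight - force


def learn_grip_force(cup: Cup, steps: int = 50, step_size: float = 0.3) -> float:
    """Find a suitable grip force purely through interaction."""
    force = 0.0  # the agent starts out knowing nothing about the cup
    for _ in range(steps):
        feedback = cup.lift(force)      # action, then feedback
        force += step_size * feedback   # adjustment toward what was felt
    return force


if __name__ == "__main__":
    print(learn_grip_force(Cup(weight=2.0)))
```

A model trained only on descriptions of cups would have no equivalent of the `lift` call; here the estimate improves only because each action produces feedback the agent can use.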
#### 3. The four pillars of generative cognition: making AI learn like living beings
Generative cognition is based on four key principles that reflect how living organisms process information:
- Experience ≠ data: Experience is the result of direct interaction, not labeled data provided by others. Supervised learning relies on human-provided examples, and even reinforcement learning's trial and error is not enough on its own; AI should learn continuously from its environment, like an animal foraging for food.
- Perception and action are inseparable: When we see or hear something, we move our eyes and head; when we touch something, we use our fingers. AI should also integrate perception (seeing, hearing) with action to obtain more accurate information.
- Autonomy: Living organisms have their own goals (finding food, avoiding predators), which are not imposed by others. Future AI systems should have intrinsic motivations, such as knowing when they need to recharge.
- Embodiment: Our physical bodies influence our perception; for example, an ant sees a chair as a barrier, while a human sees it as a seat. AI should have a physical form (e.g., a robot) to truly understand the world—its arm length determines whether it can reach high objects, and sensor placement affects its field of vision.
#### 4. Reinforcement learning is three steps away from creating “living” AI
Reinforcement learning is the AI approach that comes closest to generative cognition because it emphasizes action and feedback, but it still has shortcomings:
- Rewards are external: In games, rewards are set by humans, not determined by the AI’s own needs. Future systems should generate their own sense of reward (e.g., feeling uncomfortable when running out of power and feeling relieved after recharging).
- Perception and action are separated: Many RL systems run a strict pipeline (perceive the environment, then decide, and only then act), which disconnects perception from action. We want AI to act instinctively, like humans who reach for a cup as soon as they see it.
- The body is a tool, not the core: Current robot bodies are merely hardware for executing commands; in the future, the body should shape AI’s cognition—flexible joints would enable more diverse actions and a deeper understanding of the world.
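To make the first shortcoming concrete, here is a minimal sketch of a self-generated reward, assuming a single internal variable (battery level). It is not the paper's method: `discomfort`, `intrinsic_reward`, the `ACTIONS` table, and all the numbers are hypothetical. The reward is computed by the agent from its own state, so no outside party hands out points.

```python
# Hypothetical effect of each action on the battery level (range 0.0 to 1.0).
ACTIONS = {"explore": -0.2, "recharge": 0.5}


def discomfort(battery: float, setpoint: float = 1.0) -> float:
    """Internal 'discomfort' grows as the battery falls below the setpoint."""
    return (setpoint - battery) ** 2


def intrinsic_reward(battery_before: float, battery_after: float) -> float:
    """Reward is the reduction in discomfort, derived from the agent's own state."""
    return discomfort(battery_before) - discomfort(battery_after)


def step(battery: float, action: str) -> float:
    """Apply an action and clamp the battery level to [0.0, 1.0]."""
    return min(1.0, max(0.0, battery + ACTIONS[action]))


def choose_action(battery: float) -> str:
    """Pick the action whose predicted outcome feels best to the agent."""
    return max(ACTIONS, key=lambda a: intrinsic_reward(battery, step(battery, a)))


if __name__ == "__main__":
    print(choose_action(0.3))
```

With only this one drive the agent always prefers to recharge; a fuller agent would have to weigh several internal needs (for example, a drive to explore) against each other, which is where the design gets interesting.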
#### 5. The future of AI: moving from theory to practice
This paper outlines the direction for AI development:
- AI should not be trained solely on data but should interact with the real world.
- It should have a physical form rather than existing as a cloud-based model.
- It should have autonomous goals, independent of human commands.
- It should learn through action rather than passively receiving data.
Only by adopting these principles can AI evolve from mimicking humans to truly understanding the world and move closer to achieving artificial general intelligence (AGI).
The significance of this paper lies in its challenge to the traditional notion that larger models are always better. It highlights that the essence of AI is not the amount of data processed but its ability to interact with the real world, just as human intelligence develops through practical experiences in life.