Huxiu (虎嗅)

New Architecture Model HRM-Text Sets a Record: 1B Parameters, $1,000, and Even a Turing Award Winner Has Gotten Involved Personally

Original title: 新架构模型HRM-Text创新纪录,1B参数、1000美元,图灵奖得主都亲自下场了

Summary of Key Points

HRM-Text is a 1-billion-parameter (1B) AI model that was trained for only $1,500 (16 H100 GPUs for less than two days). Despite its size, it outperforms many 2B to 7B models on benchmarks such as mathematical reasoning (MATH: 56.2) and grade-school math word problems (GSM8K: 84.5). Its key innovation lies in abandoning the traditional approach of "accumulating more parameters, data, and computing power." Instead, it is pretrained from scratch with a redesigned architecture (hierarchical recurrent computation) and a training objective focused on answering questions, using very little data: only 40B unique tokens, about 1/225 of the data used to train Llama 3.2 3B. It is meant as a proof of concept, showing that architectural innovation can improve efficiency even with limited resources. Turing Award winner Yoshua Bengio has also followed up with similar research, which opens new paths for the development of large models.
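The figures above can be sanity-checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it treats "less than two days" as an upper bound of 48 hours, and it reads the 1/225 ratio as a ratio of training tokens. The variable names are illustrative and not from the original article.

```python
# Back-of-the-envelope check of the figures quoted in the summary.
GPUS = 16              # H100 GPUs (from the article)
MAX_HOURS = 48         # "less than two days", so 48 h is an upper bound (assumption)
COST_USD = 1_500       # quoted training cost
UNIQUE_TOKENS = 40e9   # 40B unique tokens
DATA_RATIO = 225       # HRM-Text used about 1/225 of the comparison model's data

gpu_hours_upper = GPUS * MAX_HOURS                 # at most this many GPU-hours
implied_rate_lower = COST_USD / gpu_hours_upper    # so at least this price per GPU-hour
comparison_tokens = UNIQUE_TOKENS * DATA_RATIO     # implied size of the comparison corpus

print(f"GPU-hours (upper bound): {gpu_hours_upper}")
print(f"Implied $/GPU-hour (lower bound): {implied_rate_lower:.2f}")
print(f"Implied comparison corpus: {comparison_tokens / 1e12:.1f}T tokens")
```

Under these assumptions the run used at most 768 GPU-hours, which implies a rental price of roughly $2 per GPU-hour, and the comparison corpus works out to about 9 trillion tokens.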

Detailed Breakdown

1. Why Can Smaller Models Outperform Larger Ones? – Not by Accumulating Resources, but by Ingenious Computation

The traditional philosophy behind large models is "the bigger, the better": more parameters, more data, and more computing power are assumed to yield higher intelligence. HRM-Text takes a different approach. With just 1B parameters (fewer than the 2B to 7B models it is compared against), a training cost of $1,500 (far below the training budgets of typical large models), and minimal data, it still achieves excellent results. The secret lies in computational efficiency: it uses fewer parameters but performs more internal computation before producing an output. Given the same ingredients, an ordinary chef might make a simple dish, while a skilled chef turns them into something exquisite. HRM-Text is that skilled chef.

2. Architectural Innovation: Making the Model Think More Before Outputting

Conventional Transformer models work in a "pipeline" manner, where input passes through each layer sequentially, with each layer processing the data only once. HRM-Text uses an iterative approach:

  • It consists of two modules: H (the higher-level module, which updates slowly and handles the overall context, such as remembering the core of the problem) and L (the lower-level module, which updates quickly and makes local adjustments).
  • Before each output, the model repeatedly updates its internal state between the two modules (for example, 6 L updates + 2 H updates before predicting a word), effectively making it "think more" before answering.
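The two-speed loop described above might look roughly like the following toy sketch. It is not the HRM-Text implementation: the states are single numbers rather than vectors, the `step` function is a made-up stand-in for a learned module, and the split of 3 L updates per H update is an assumption chosen only to match the "6 L updates + 2 H updates" example.

```python
import math

def step(state, *inputs):
    """Toy update: squash the average of the state and its inputs into (-1, 1)."""
    total = state + sum(inputs)
    return math.tanh(total / (1 + len(inputs)))

def hierarchical_forward(x, h_updates=2, l_per_h=3):
    """Run l_per_h fast L updates for every slow H update, then return both states."""
    z_h, z_l = 0.0, 0.0
    counts = {"L": 0, "H": 0}
    for _ in range(h_updates):
        for _ in range(l_per_h):
            # L refines its state using the input and the current (fixed) H context.
            z_l = step(z_l, z_h, x)
            counts["L"] += 1
        # H updates once, absorbing the result of the L refinements.
        z_h = step(z_h, z_l)
        counts["H"] += 1
    return z_h, z_l, counts

z_h, z_l, counts = hierarchical_forward(x=0.5)
print(counts)  # {'L': 6, 'H': 2}
```

The point of the sketch is the control flow: the same two small modules are reused eight times before anything is output, so the amount of computation grows without adding parameters.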

To prevent the model from collapsing due to excessive iteration, HRM-Text employs two techniques:

  • MagicNorm: keeps the magnitude of the internal state under control during repeated computation, so the results do not blow up.
  • Progressive Responsibility: Initially, the model is held responsible only for the most recent steps in its reasoning process; as training becomes more stable, it gradually takes responsibility for earlier steps as well (similar to a teacher who grades the latest assignment first and only then goes back to check earlier ones).
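One way to picture Progressive Responsibility is as a schedule that decides which internal steps receive a training signal at a given point in training. The sketch below is purely illustrative: the article does not give the actual schedule, so the linear ramp, the function name `steps_with_gradient`, and the choice of 8 internal steps (6 L + 2 H) are all assumptions.

```python
def steps_with_gradient(total_steps, progress):
    """Return indices of the internal steps that receive a training signal.

    progress is the fraction of training completed, in [0, 1].
    Hypothetical linear schedule: start with the last step only,
    end with every step included.
    """
    progress = min(max(progress, 0.0), 1.0)
    n_active = 1 + int(progress * (total_steps - 1))
    return list(range(total_steps - n_active, total_steps))

for p in (0.0, 0.5, 1.0):
    print(p, steps_with_gradient(8, p))
```

At the start only the final step is included, halfway through the last four steps are, and at the end all eight are.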

3. Training Objectives: Focusing on Answering Questions, Not Copying Them

Traditional models are trained to predict the next word across the entire text, including the question itself. HRM-Text instead computes its training loss only on the answer. Given a math problem, for example, it does not need to learn how to restate the question, only how to produce the correct solution. It also uses PrefixLM, which lets the model read the whole question at once (every question token can see every other question token) while the answer is still generated one token at a time. Together these make training more targeted and more efficient.
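The two ideas in this section can be sketched in a few lines. This is a simplified illustration, not the HRM-Text training code: the function names and the per-token loss values are made up, and a real implementation would work on tensors rather than Python lists.

```python
def prefix_lm_mask(prefix_len, total_len):
    """mask[i][j] is True when position i may attend to position j.

    Prefix (question) positions are visible to every position;
    answer positions are visible only to themselves and later answer positions.
    """
    return [
        [j < prefix_len or j <= i for j in range(total_len)]
        for i in range(total_len)
    ]

def answer_only_loss(token_losses, prefix_len):
    """Average the per-token losses over answer positions only."""
    answer_losses = token_losses[prefix_len:]
    return sum(answer_losses) / len(answer_losses)

# 3 question tokens followed by 2 answer tokens (made-up loss values).
mask = prefix_lm_mask(prefix_len=3, total_len=5)
loss = answer_only_loss([2.0, 1.5, 1.0, 0.4, 0.2], prefix_len=3)

for row in mask:
    print(["x" if allowed else "." for allowed in row])
print(f"loss over answer tokens: {loss:.2f}")
```

In the printed mask, the first three columns are fully visible to every row (the question is read bidirectionally), while the last two columns form a lower triangle (the answer is generated causally). The losses on the three question tokens are simply ignored.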

4. Weaknesses and Future Directions

HRM-Text performs well on reasoning tasks but struggles on tests that require extensive knowledge (such as MMLU, which assesses knowledge across many subjects). This is because its limited training data and small parameter count leave it little capacity to store factual knowledge. The future direction for such models is to "decouple reasoning from knowledge": HRM focuses on computation, while external databases or retrieval systems supply the knowledge (similar to how humans consult reference materials). The team has made early progress in this area but has not yet made the results public.

5. Industry Implications: Opening Up New Competition Paths for Large Models

The AI industry has traditionally focused on increasing model size and computing power, which keeps raising the barrier to entry. HRM-Text shows that optimizing the computational process can also improve performance. This is analogous to the automotive industry, where efficiency gains can be achieved without simply increasing engine displacement. Bengio's research lends further support to this approach, and may encourage smaller teams to innovate without relying solely on massive resource investment.

Conclusion

HRM-Text does not aim to replace large models but offers a new path towards lower-cost and higher-performance AI solutions. Its value lies in demonstrating that advancements in large models can come from more innovative approaches to computation, rather than simply increasing their size. This breakthrough signals a shift away from the belief that scale alone determines success in the industry.