虎嗅

Microsoft's MAI-Base-1 MFU: Why Does It Appear to Be Only Half of DeepSeek-V3's?

原文:微软MAI-Base-1的MFU ,为什么看上去仅有DeepSeek-V3的一半

Summary of Key Points

This article examines the Model FLOPs Utilization (MFU) of Microsoft’s trillion-parameter large model, MAI-Base-1. Its main conclusion is that although MAI-Base-1’s MFU is only around 20%, well below DeepSeek-V3’s 39%-44%, this does not indicate a lack of technical skill on Microsoft’s part. Instead, it reflects the cost of continually extending a complex MoE (Mixture of Experts) model: each added feature first lowers system efficiency, and optimization then wins it back. The article also analyzes the key factors behind the gap and argues that competition among large models ultimately comes down to how effectively compute is used.

1. Understanding MFU: What exactly is it?

MFU (Model FLOPs Utilization) measures how much of the hardware’s theoretical peak compute is actually spent on the model’s own training computation. To illustrate:

Imagine you have a supercomputer that can solve 100 math problems per second (its theoretical peak), but during model training it solves only 20 per second (MFU = 20%). The remaining 80% of its capacity goes to waiting on data transfers or to work that is not model computation. Note that MFU is not the same as GPU utilization: a GPU can show as busy while doing little useful model math. MFU counts only the share of peak compute that goes into the model’s computation, which makes it a key indicator of overall system efficiency.
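The definition above can be written as a short calculation. This is a minimal sketch, not a formula taken from the article: the function and parameter names are illustrative, and the "6 FLOPs per active parameter per token" figure is a common rule of thumb that ignores attention-score FLOPs.

```python
def approx_train_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb training cost: about 6 FLOPs per active parameter per token.

    Assumption: ignores attention-score FLOPs; for an MoE model, count only
    the parameters a token actually passes through, not the full total.
    """
    return 6.0 * active_params


def mfu(flops_per_token: float, tokens_per_second: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Share of the cluster's theoretical peak FLOP/s spent on model math."""
    achieved_flops_per_second = flops_per_token * tokens_per_second
    peak_flops_per_second = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# The article's analogy: 20 problems solved per second out of a possible 100.
analogy_mfu = mfu(flops_per_token=1.0, tokens_per_second=20.0,
                  num_gpus=1, peak_flops_per_gpu=100.0)
print(f"{analogy_mfu:.0%}")  # prints 20%
```

For a real run, `flops_per_token` would come from the model's own FLOP count and `tokens_per_second` from measured throughput. The result depends on which peak figure (for example BF16 or FP8) is used in the denominator.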

2. MAI-Base-1’s 20%: Not a sign of weakness, but the growing pains of a complex model

MAI-Base-1 is a trillion-parameter MoE model. It works like a team of experts, with each expert handling only a subset of the data. As the model evolved from v1 to v5, each new feature first pulled MFU down, and optimization then recovered it:

  • Version v2: Used 4096 GPUs. The model became more complex and MFU initially fell to 18%; communication and code optimizations raised it to 22%.
  • Version v3: Switched to a more efficient routing method that drops no tokens (dropless routing). This increased synchronization costs, but optimization held MFU at 22%.
  • Version v4: Increased the number of experts from 192 to 512 and the number of experts each token is routed to from 4 to 8, and expanded to 8192 GPUs. MFU dropped to 16%; kernel and CPU-side optimizations brought it back to 20%.
  • Version v5: Increased parameters from 600B to 1T. ZeRO-3 was tried first but slowed communication; switching to ZeRO-2 and disabling unnecessary activation values kept MFU at 20%.
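To make the routing changes in v3 and v4 concrete, here is a small sketch of top-k expert routing. It assumes the v3 change means that no token is discarded when an expert is full (often called dropless routing) and that the "4 to 8" change in v4 is the number of experts chosen per token. The function, the `capacity` parameter, and the scores are all hypothetical and are not MAI-Base-1's actual router.

```python
def route_top_k(scores, k, capacity=None):
    """Assign each token to its k highest-scoring experts.

    scores: one list of per-expert router scores per token.
    capacity: max tokens per expert; None means dropless routing.
    Returns (assignments, dropped); assignments[e] lists the token ids
    sent to expert e, dropped counts discarded token-expert pairs.
    """
    num_experts = len(scores[0])
    assignments = [[] for _ in range(num_experts)]
    dropped = 0
    for token_id, token_scores in enumerate(scores):
        ranked = sorted(range(num_experts),
                        key=lambda e: token_scores[e], reverse=True)
        for expert in ranked[:k]:
            if capacity is not None and len(assignments[expert]) >= capacity:
                dropped += 1  # expert is full, so this pair is discarded
                continue
            assignments[expert].append(token_id)
    return assignments, dropped


# Hypothetical router scores: every token prefers expert 0.
scores = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.7, 0.0, 0.3, 0.0],
    [0.6, 0.0, 0.0, 0.4],
]
capped, dropped = route_top_k(scores, k=2, capacity=2)
dropless, none_dropped = route_top_k(scores, k=2)
print(dropped, [len(tokens) for tokens in capped])         # 2 [2, 2, 1, 1]
print(none_dropped, [len(tokens) for tokens in dropless])  # 0 [4, 2, 1, 1]
```

In the dropless case nothing is lost, but expert 0 receives four tokens while the others receive one or two. Every GPU then waits for the busiest expert, which is one way the extra synchronization cost described above can arise.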

This 20% represents a balance between enhancing model capabilities and maintaining system efficiency. It reflects not a lack of optimization but the high coordination costs inherent in MoE models.

3. DeepSeek-V3’s higher MFU: Forced by hardware constraints

DeepSeek-V3 is also an MoE model with an MFU of 39% (causal) or 44% (non-causal). The reasons for this are:

  • Hardware constraints: It was trained on H800 GPUs, which are less powerful than the GB200 GPUs Microsoft uses and have limited interconnect bandwidth. The DeepSeek team therefore had to squeeze the most out of the hardware by cutting wasted data transfer, lowering memory usage, and restructuring the code to fit the hardware better.
  • Extensive optimizations: The team made deep changes to model architecture, numeric precision (BF16), routing, and parallelization strategy, rather than only adapting the software to the hardware as Microsoft did.

In summary, when hardware is not as powerful, teams rely on meticulous optimization to maximize compute efficiency.
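The two figures quoted for DeepSeek-V3 (39% causal, 44% non-causal) are most plausibly two ways of counting attention FLOPs for the same run. That reading is an assumption, since the article does not spell it out. With a causal mask only about half of the attention score matrix is needed, so counting the full matrix credits the run with more FLOPs and yields a higher MFU. The sketch below uses hypothetical numbers purely to show the direction of the effect.

```python
def attention_matmul_flops(seq_len: int, width: int, causal: bool) -> float:
    """Forward FLOPs of the two attention matmuls (QK^T and scores x V)
    for one layer and one sequence; `width` is heads * head_dim."""
    full = 4.0 * seq_len * seq_len * width  # 2*s*s*width per matmul
    # Causal counting credits only the lower triangle, roughly half.
    return full / 2.0 if causal else full


def mfu_for_convention(other_flops, seq_len, width, causal,
                       step_seconds, peak_flops_per_second):
    """MFU of one step when attention FLOPs are counted either way."""
    counted = other_flops + attention_matmul_flops(seq_len, width, causal)
    return counted / (step_seconds * peak_flops_per_second)


# Hypothetical numbers chosen only to show the direction of the effect.
run = dict(other_flops=3.0e12, seq_len=4096, width=4096,
           step_seconds=1.0, peak_flops_per_second=1.0e13)
mfu_causal = mfu_for_convention(causal=True, **run)
mfu_non_causal = mfu_for_convention(causal=False, **run)
print(f"causal: {mfu_causal:.1%}, non-causal: {mfu_non_causal:.1%}")
```

The hardware and the step time are identical in both calls; only the FLOP-counting convention changes. MFU figures from different teams are therefore comparable only when the convention is known.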

4. Why such a difference in MFU? Analysis of key factors

MFU is influenced by five main factors:

  • Model architecture: Dense models spend most of their time in large matrix multiplications, which GPUs run efficiently, whereas MoE models add routing and coordination costs across experts, leading to lower MFU.
  • GPU scale: More GPUs increase data transfer complexity and synchronization time, potentially reducing MFU.
  • Parallelization strategies: The way data and models are divided affects transfer speed (faster within the same rack, slower across racks).
  • Precision formats: Different precision levels (FP8/BF16) impact performance and memory usage, making direct comparisons difficult.
  • Software tools: Optimized kernels and kernel-writing tools such as FlashAttention and Triton improve efficiency by cutting memory traffic and making custom fused kernels easier to write.
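Several of these factors act through the same mechanism: time in which GPUs wait on communication instead of computing. The toy model below is a deliberate simplification, not something taken from the article, and its function names and numbers are hypothetical. It shows how the useful-compute share of a training step changes when communication is hidden behind computation.

```python
def step_time(compute_s: float, comm_s: float, overlap_fraction: float) -> float:
    """Step time when a fraction of communication runs behind compute."""
    hidden = min(comm_s * overlap_fraction, compute_s)
    exposed = comm_s - hidden  # communication the GPUs must wait for
    return compute_s + exposed


def compute_share(compute_s: float, comm_s: float, overlap_fraction: float) -> float:
    """Fraction of the step spent on model math (an upper bound on MFU)."""
    return compute_s / step_time(compute_s, comm_s, overlap_fraction)


# Hypothetical step: 1.0 s of compute and 1.0 s of communication.
for overlap in (0.0, 0.5, 1.0):
    print(overlap, round(compute_share(1.0, 1.0, overlap), 3))
```

With no overlap, half of the step is spent waiting; with full overlap, none is. More GPUs, cross-rack traffic, and extra expert routing all lengthen `comm_s`, which is why they push MFU down unless the added communication can be overlapped.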

5. Industry implications: The competition among models is about compute efficiency

Training large models is extremely costly (e.g., training GPT-3 can cost millions of dollars). Teams that improve MFU can train better models faster and with fewer resources. For example, DeepSeek’s optimizations let it compete with Microsoft despite using less advanced hardware.
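The link between MFU and cost is simple arithmetic: for a fixed amount of training computation on fixed hardware, wall-clock time is inversely proportional to MFU. The sketch below uses hypothetical figures for the FLOP budget, GPU count, and per-GPU peak; none of them come from the article.

```python
SECONDS_PER_DAY = 86_400


def training_days(total_flops: float, num_gpus: int,
                  peak_flops_per_gpu: float, mfu: float) -> float:
    """Wall-clock days needed to finish a run at a given MFU."""
    useful_flops_per_second = num_gpus * peak_flops_per_gpu * mfu
    return total_flops / useful_flops_per_second / SECONDS_PER_DAY


# Hypothetical budget: same model, same cluster, two MFU levels.
budget = dict(total_flops=1.0e25, num_gpus=8192, peak_flops_per_gpu=1.0e15)
days_at_20 = training_days(mfu=0.20, **budget)
days_at_40 = training_days(mfu=0.40, **budget)
print(f"{days_at_20:.1f} days at 20% MFU, {days_at_40:.1f} days at 40% MFU")
```

Doubling MFU halves the training time, and with it the GPU-hours paid for, without changing the model or the hardware.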

In conclusion: The competition among large models has shifted from parameter counts to how effectively compute is used.