Huxiu (虎嗅)

Stop Asking Whether They've Caught Up: Where the Real Gap Between Chinese and US Large Models Lies

Original title: 别再问追没追上：中美大模型的真实差距在这里

2026-06-08

Summary of Key Points

By mid-2026, Chinese large language models are no longer "a generation behind across the board" relative to their American counterparts; the competition has instead entered a phase of scenario differentiation. Leading Chinese models have approached or even slightly surpassed American models in open-source ecosystems, local deployment, Chinese-language adaptation, cost-effectiveness, OCR/document understanding, and short-video generation. American closed-source models still hold significant advantages in high-stability long-term programming agents, complex tool invocation, enterprise-level fault tolerance, multimodal GUI automation, global trust, and product ecosystems. The key difference lies not in raw intelligence but in stability on complex tasks and the ability to ship a complete, market-ready product.

Detailed Analysis

1. Scenario Differentiation: Each Has Its Strengths

Chinese models excel in areas that are more practical and cost-effective:

Chinese Language Context: Models like DouBao and DeepSeek provide a better experience in Chinese dialogue, learning, and summarization. DouBao has 155 million daily active users (the highest in China); users switch to these models because they are free, offer privacy features, and are well adapted to Chinese.
Open Source and Local Deployment: Small models from Qwen and DeepSeek (e.g., Qwen3-0.6B/4B) have millions of downloads and can run on ordinary computers or gaming devices. Developers appreciate their controllability and privacy.
OCR/Document Understanding: Qwen2.5-VL offers similar accuracy to GPT-4o in Chinese document and table extraction, with a better cost-performance ratio.
Short-Video Generation: Models like Kling and Seedance are strong in image-to-video conversion, facial consistency, and cost control, making them competitive on a global scale.

American models are stronger in terms of stability when handling complex tasks:

Long-Term Programming Agents: Models like GPT-5.5 and Claude Sonnet 4.6 can perform multi-step tasks such as cross-file editing and toolchain iteration, while Chinese models often lose information or encounter errors during tool invocation.
Enterprise-Level Deployment: ChatGPT remains the most widely used AI product globally, and Claude is more trusted for enterprise compliance and low failure rates.
GUI Automation: American models can reliably operate computer interfaces (e.g., browsers, IDEs), whereas Chinese models often get stuck in loops or make coordinate errors.

2. Small Models and Open Source: China's Hidden Ace

Small models (with fewer than 40B parameters) are a strength for Chinese developers:

Why do users choose them? Not because they are the most intelligent, but because they are controllable, affordable, and private. For example, Qwen3-30B-A3B can run on a computer with 12GB of RAM at about 12 tokens per second, making it suitable for processing sensitive data locally.
Open Source Impact: Chinese models account for 41% of open-source model downloads on Hugging Face, and DeepSeek has more token traffic on OpenRouter than Meta and Mistral. Microsoft's inclusion of DeepSeek R1 in its Azure cloud platform indicates that Chinese open-source models have entered the Western corporate ecosystem.
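
The local-deployment point above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates the size of a model's weights from its parameter count and bits per weight. The function name and the bit widths are illustrative assumptions, the parameter counts are read off the model name ("30B" total, "A3B" for roughly 3B active per token), and the estimate ignores the KV cache and runtime overhead.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 30e9   # "30B" in the model name (assumed round figure)
ACTIVE_PARAMS = 3e9   # "A3B": parameters active per token (assumed round figure)

for bits in (16, 8, 4, 3):
    total = weight_footprint_gb(TOTAL_PARAMS, bits)
    active = weight_footprint_gb(ACTIVE_PARAMS, bits)
    print(f"{bits:>2}-bit: all weights ~{total:.1f} GB, active per token ~{active:.2f} GB")
```

Under these assumptions the full weights only drop below 12 GB at roughly 3 bits per weight, so a figure like the one quoted presumably relies on aggressive quantization or on keeping part of the model outside the fastest memory. Because only about 3B parameters are active per token, the per-token work stays small, which is consistent with usable speeds on ordinary hardware.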

However, being open-source does not equate to global dominance. ChatGPT still receives 2.7 times the web traffic of Gemini, and American closed-source models continue to dominate the consumer and enterprise payment markets.

3. Stability: A More Critical Factor Than Intelligence

Real user feedback shows that the main problem with Chinese models is that they cannot complete tasks reliably:

Long-Term Tasks: They often make mistakes when handling large contexts (e.g., 32K), such as losing track of directories or forgetting the goal.
Toolchain Bugs: The models generate correct tool invocation commands, but the parsers may misinterpret numbers or break on chat templates.
Quantization Impact: Low-bit quantization (e.g., Q4/Q5) degrades their tool invocation and reasoning capabilities, whereas American models remain stable even after quantization.
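
To make the quantization point concrete, here is a minimal sketch of why fewer bits mean larger rounding error. It uses simple symmetric per-tensor quantization on random stand-in weights; real Q4/Q5 formats are block-wise and more elaborate, so the function names, the data, and the scheme itself are simplifying assumptions rather than how any particular runtime works.

```python
import random

def quantize_roundtrip(values, bits):
    """Symmetric uniform quantization: map to signed integers, then back to floats."""
    levels = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]

def mean_abs_error(values, bits):
    """Average absolute difference between original and round-tripped values."""
    restored = quantize_roundtrip(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # stand-in for real weights

for bits in (8, 5, 4):
    print(f"{bits}-bit mean absolute error: {mean_abs_error(weights, bits):.4f}")
```

On this toy data the error grows as the bit width shrinks, which is the general mechanism behind the degradation described above. How much that hurts tool invocation in a real model is an empirical question this sketch does not answer.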

American models are known for being less prone to unexpected errors, allowing users to entrust them with complex tasks (e.g., fixing large codebases) over the long term.
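
The toolchain bugs mentioned above (a correct tool call that a parser then misreads) can be illustrated with a small sketch. It parses a JSON tool call and checks argument types without coercing them. The schema, the tool name `update_order`, and the argument names are hypothetical, chosen only for illustration; the article does not describe any specific parser.

```python
import json

# Hypothetical tool schema: argument name -> expected Python type.
SCHEMA = {"order_id": int, "quantity": int, "note": str}

def parse_tool_call(raw: str) -> dict:
    """Parse a JSON tool call and check argument types without coercing them."""
    call = json.loads(raw)                     # ints stay ints, floats stay floats
    args = call["arguments"]
    for name, expected in SCHEMA.items():
        value = args.get(name)
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            raise TypeError(f"{name}: expected {expected.__name__}, got {value!r}")
    return args

raw = ('{"name": "update_order", "arguments": '
       '{"order_id": 9007199254740993, "quantity": 2, "note": "rush"}}')
args = parse_tool_call(raw)
print(args["order_id"])                        # exact integer preserved

# A parser that routes every number through float() silently changes the ID:
print(int(float(9007199254740993)))            # no longer equal to the original
```

The design choice here is to fail loudly on a type mismatch rather than guess: a silent conversion such as the `float()` round trip produces a call that looks valid but targets the wrong record, which is exactly the kind of error that is hard to notice in a long multi-step task.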

4. Multi-Modal Abilities: Strong in OCR, Weak in GUI, Closing the Gap in Video

OCR/Document Understanding: Chinese models are leading, with Qwen2.5-VL achieving 75% accuracy in extracting information from 1,000 documents (comparable to GPT-4o).
GUI Automation: There is a clear gap; Chinese models can describe screens but struggle with stable operations (e.g., coordinate errors), while American models can handle multi-step tasks in browsers and IDEs.
Video Generation: This is where China is closest to catching up. Models like Kling are strong in image-to-video conversion, while American products like Veo have better audio quality. Western products also have problems of their own (e.g., instability with Luma Dream Machine), so the overall gap is smaller than in other areas.
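
As an illustration of how a GUI coordinate error can arise, the sketch below shows one generic cause in GUI agents: the model predicts a click position on a downscaled screenshot, and that position must be mapped back to real screen pixels. This is an assumption for illustration, not a cause the article attributes to any specific model, and the function name and resolutions are hypothetical.

```python
def to_screen_coords(x, y, model_size, screen_size):
    """Map a point from the screenshot the model saw to real screen pixels."""
    model_w, model_h = model_size
    screen_w, screen_h = screen_size
    sx = round(x * screen_w / model_w)
    sy = round(y * screen_h / model_h)
    # Clamp so a slightly out-of-range prediction cannot click off-screen.
    return min(max(sx, 0), screen_w - 1), min(max(sy, 0), screen_h - 1)

# Model saw a 1280x720 downscaled screenshot of a 2560x1440 display.
print(to_screen_coords(640, 360, (1280, 720), (2560, 1440)))  # (1280, 720)
```

If the scaling step is skipped, the click in this example would land at (640, 360) on the real display, a quarter of the way across the screen instead of at its center.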

5. Underlying Differences: Technology, Data, Ecosystems, and External Factors

Technical Approaches: Chinese models focus on efficiency, quantizability, and local deployment (suitable for open-source dissemination), while American models use large-scale closed-loop training to optimize stability.
Data Availability: China has an advantage in Chinese-language content, but the U.S. has more English technical documents, corporate codebases, and SaaS tools.
Ecosystem Positioning: Chinese open-source models are being integrated by global developers and tools (e.g., Ollama, LM Studio, and Cursor IDE using Kimi as a foundation for powerful programming agents), while American models remain anchored in closed product ecosystems.
External Factors: U.S. chip restrictions have forced Chinese models to optimize their compatibility with domestic chips, but this also limits cutting-edge training. Regulatory and data storage considerations (Chinese models storing data domestically) affect international user trust.

Future Trends: Key Indicators for Catch-Up

To truly assess the situation, we should focus on:

1. Real-User Adoption: Whether a large number of users switch from Claude/GPT to Chinese models for complex tasks.

2. Long-Term Task Performance: Scores from professional tests like Terminal-Bench and SWE-bench Pro.

3. Toolchain Bug Rates: Reduction in errors with Chinese model parsers and streaming processes.

4. Western Product Adoption: Whether more American products (e.g., IDEs, agent platforms) use Chinese open-source foundations.

5. Video Quality: Whether Chinese models can match the quality of Veo/Runway in terms of audio and long-duration video processing.

In summary, Chinese models have made significant progress in practical applications, but they still need time to catch up in areas requiring high stability and global trust. Ordinary dialogue and small-scale tasks are already covered, and short-video generation and OCR are on track. Catching up in complex programming agents is expected to take 1-2 years, and enterprise-level global adoption 2-4 years.
