Yicai (第一财经)

The National Data Administration has issued an implementation plan to promote the construction of high-quality datasets in various industries.

Original article: 国家数据局印发实施方案,推进行业高质量数据集建设行动

Summary of Key Points

The National Data Administration has introduced a plan to supply “high-quality fuel” for artificial intelligence (AI): high-quality datasets built specifically for training AI models. The plan sets out six initiatives, including strengthening foundational capabilities, enhancing data quality and efficiency, facilitating application integration, improving management services, and realizing the value of data. The goal is to establish a collection of datasets covering key sectors by 2028 and to cultivate the enterprises and talent that build them, creating a virtuous cycle of “data → models → applications → more data” that empowers industries and drives new growth in the intelligent economy.

Detailed Explanation

1. High-Quality Datasets are Crucial for AI Development

If AI models are cars, data is the fuel. Ordinary data that is scattered and unprocessed cannot power a model efficiently; only high-quality datasets let AI perform accurately. For example, training AI for cancer diagnosis requires a large volume of annotated medical records and imaging data, while training for autonomous driving requires structured data on road scenarios and vehicle behavior. Many industry datasets today are fragmented or of poor quality, which holds back the practical application of AI. The plan seeks to resolve this “fuel crisis” so that AI can be integrated effectively across all sectors.

2. Focusing on Critical Areas: People’s Livelihoods and Emerging Technologies

The plan specifies the areas for which datasets should be developed, divided into two categories:

  • Essential for People’s Livelihoods: Industrial manufacturing (factory equipment data), agriculture and rural areas (soil, crop growth data), healthcare (medical records, imaging), education (teaching resources), finance (risk assessment data), etc. AI applications in these fields directly impact people’s daily lives.
  • Emerging Technologies: Low-altitude economy (drone data), autonomous driving, embodied intelligence (robot interaction data), biomanufacturing, etc., which represent future growth areas for the intelligent economy.

In both categories, datasets must be tailored to specific needs; for instance, agricultural datasets should support AI in predicting pests and diseases, and medical datasets should assist in disease diagnosis.

3. Upgrading Data Annotation Processes

Data annotation means attaching labels to data (e.g., marking which images contain a cat, or tagging a medical record with a diabetes diagnosis) so that AI can learn from it. The plan upgrades annotation in three ways:

  • Automated Initial Annotation: Machines perform initial labeling, followed by manual verification to improve accuracy.
  • Expert Input: Specialized experts are involved in annotating data from critical fields (e.g., medical and legal datasets) to ensure precision.
  • Industry-Led Development: The state will support the development of annotation services in seven pilot cities first and then expand to other regions. It will back leading companies (e.g., those specializing in medical data annotation) and train professionals through education programs and vocational certifications, creating new job opportunities.
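
The first two points above (machine pre-annotation followed by manual verification, with experts handling critical fields) can be sketched in a few lines of Python. This is a minimal illustration only, not the workflow specified in the plan: the `Record` type, the keyword rule inside `prelabel`, the `CRITICAL_DOMAINS` set, and the reviewer callback are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Assumption: "critical" domains mirror the examples given in the text.
CRITICAL_DOMAINS = {"medical", "legal"}


@dataclass
class Record:
    domain: str
    text: str
    machine_label: Optional[str] = None  # set by the automated first pass
    reviewer: Optional[str] = None       # "expert" or "general"
    final_label: Optional[str] = None    # set after manual verification


def prelabel(record: Record) -> str:
    """Toy stand-in for an automated pre-annotation model (hypothetical rule)."""
    return "diabetes-related" if "glucose" in record.text.lower() else "other"


def route(record: Record) -> str:
    """Send critical-domain records to specialists, everything else to general staff."""
    return "expert" if record.domain in CRITICAL_DOMAINS else "general"


def annotate(records: List[Record], review: Callable[[Record], str]) -> List[Record]:
    """Machine labels first; a human then confirms or corrects every label."""
    for record in records:
        record.machine_label = prelabel(record)
        record.reviewer = route(record)
        record.final_label = review(record)
    return records


def simulated_review(record: Record) -> str:
    """Stand-in for a human reviewer: corrects one known machine mistake."""
    if "insulin" in record.text.lower():
        return "diabetes-related"
    return record.machine_label or "other"


records = annotate(
    [
        Record("medical", "Fasting glucose elevated on two visits"),
        Record("medical", "Patient started on insulin therapy"),
        Record("retail", "Customer asked about delivery times"),
    ],
    simulated_review,
)
for r in records:
    print(r.reviewer, r.machine_label, "->", r.final_label)
```

In this sketch the second record is mislabeled by the machine pass and corrected during review, which is the situation the manual verification step is meant to catch.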

4. Turning Data into a Valuable Asset

Data should not remain static but be transformed into a valuable resource:

  • Data Circulation: Real-world scenarios create demand for data, and that demand drives dataset creation. Models are trained on these datasets, and the deployed models in turn generate new data that is used to refine them (e.g., AI running in a factory produces equipment data that feeds the next round of model improvement).
  • Business Models: Data can be traded on platforms, offered as subscriptions, accessed through APIs, or even sold in smaller units (e.g., precise industry terms).
  • Assetization: Datasets can be treated as assets, for example pledged as collateral for loans or contributed as equity in investments, generating financial value.

5. Comprehensive Implementation with National Coordination and Security Measures

Success requires:

  • National and Local Collaboration: The National Data Administration will coordinate efforts, while local authorities should tailor initiatives to regional industrial characteristics to avoid duplication.
  • Financial Support: Financial institutions and industry funds should be encouraged to invest, with local governments providing dedicated funding.
  • Security Safeguards: Measures must be in place to prevent data breaches and data contamination, for example by ensuring that only accurate data enters a dataset.
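
As a rough illustration of what “ensuring that only accurate data enters a dataset” can look like in practice, the Python sketch below applies a simple quality gate before records are admitted. The checks shown (non-empty text, a label from an agreed set, no exact duplicates) and every name in the code, including `ALLOWED_LABELS`, are assumptions made for the example; the plan itself does not prescribe specific checks.

```python
import hashlib

# Hypothetical label set for an agricultural dataset.
ALLOWED_LABELS = {"healthy", "pest", "disease"}


def fingerprint(text):
    """Hash of the normalized text, used to detect exact duplicates."""
    normalized = text.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def quality_gate(records):
    """Split records into (accepted, rejected); rejected items carry a reason."""
    seen = set()
    accepted, rejected = [], []
    for record in records:
        text = record.get("text") or ""
        if not text.strip():
            rejected.append((record, "empty text"))
            continue
        if record.get("label") not in ALLOWED_LABELS:
            rejected.append((record, "unknown label"))
            continue
        digest = fingerprint(text)
        if digest in seen:
            rejected.append((record, "duplicate"))
            continue
        seen.add(digest)
        accepted.append(record)
    return accepted, rejected


accepted, rejected = quality_gate(
    [
        {"text": "Leaf spots observed", "label": "disease"},
        {"text": "leaf spots observed ", "label": "disease"},
        {"text": "", "label": "pest"},
        {"text": "Aphids on stems", "label": "cat"},
    ]
)
print(len(accepted), [reason for _, reason in rejected])
```

A real pipeline would add domain-specific validation and access controls on top of this; the point here is only that contamination checks run before data is admitted, and that rejected records keep a reason for later audit.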

The plan aims to empower AI by supplying the high-quality data it needs, enabling its integration into more sectors and bringing intelligent technologies closer to daily life: more accurate medical diagnoses, smarter factories, and safer autonomous vehicles.