Huxiu (虎嗅)

Gemma 4 has matched the top closed-source models of a year and a half ago: a 5-billion-parameter model needs only 2GB of video memory. The technical ambition behind Gemma 4.

Original title (Chinese): Gemma4已经追平一年半前的顶尖闭源模型：50亿参数模型只需2GB显存，Gemma4背后的技术野心

2026-06-03

Summary of Key Points

Gemma 4 is the latest open-source AI model released by Google DeepMind. Although its parameter count (about 30 billion) is roughly the same as its predecessor's, it significantly raises "intelligence density per parameter" through techniques such as the E2B design (in Gemma naming the "E" stands for "effective", that is, about 2 billion effective parameters). Notable points include: a 5-billion-parameter model that needs only 2GB of memory to run on mobile devices and the Raspberry Pi; a small team that coordinated with over 50 partners to complete the release; multimodal capabilities covering audio, images, and short videos; support for 140 languages; and a discussion of the boundary between small and large models, fine-tuning trends, and the pros and cons of the MoE (Mixture of Experts) architecture. Overall, Gemma 4 is an important move by Google in the open-source AI ecosystem and in on-device deployment, with the goal of making AI more accessible to ordinary users and developers.

I. E2B Architecture: Empowering Small Models on Mobile Devices

The most notable technique in Gemma 4 is E2B's parameter offloading, which splits the model into two parts: frequently used parameters are placed on the GPU for speed, while less frequently used ones are kept on the CPU or on disk to save space. Conventionally, all of a model's parameters must be loaded into GPU memory, which makes such models unsuitable for small devices. Gemma 4's 5-billion-parameter model needs only 2GB of memory because it keeps 3 billion parameters on the CPU or disk and loads only the remaining 2 billion frequently used parameters onto the GPU. This is like keeping the most commonly used pages of a dictionary at hand and leaving the rest on a shelf: it saves space without sacrificing speed. The design is optimized specifically for small devices such as phones and the Raspberry Pi. Larger models (with hundreds of billions of parameters) require more compact architectures or MoE (Mixture of Experts) models. The Gemini Nano preinstalled on some Pixel phones and high-end Samsung smartphones is an example of a Gemma-based model that users can use out of the box.
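The 2GB figure can be sanity-checked with simple arithmetic. The sketch below is only an illustration, not Gemma 4's actual loader: it assumes weights quantized to 8 bits per parameter (the article does not state the precision), and the helper name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Return the size of the weights in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 5e9       # from the article: 5 billion parameters in total
RESIDENT_PARAMS = 2e9    # from the article: 2 billion kept on the GPU
OFFLOADED_PARAMS = TOTAL_PARAMS - RESIDENT_PARAMS  # 3 billion on CPU or disk

BITS = 8  # assumption: 8-bit quantized weights; not stated in the article

resident_gb = weight_footprint_gb(RESIDENT_PARAMS, BITS)
offloaded_gb = weight_footprint_gb(OFFLOADED_PARAMS, BITS)
full_gb = weight_footprint_gb(TOTAL_PARAMS, BITS)

print(f"resident on GPU:       {resident_gb:.1f} GB")
print(f"offloaded to CPU/disk: {offloaded_gb:.1f} GB")
print(f"if fully loaded:       {full_gb:.1f} GB")
```

Under that assumption the resident weights come to 2 GB, consistent with the figure quoted above, while loading everything would take 5 GB. In practice, activations and caches would add to the resident total.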

II. How a Small Team Collaborated with Over 50 Partners

The Gemma team is relatively small: 2-3 product managers, 1 marketing specialist, plus engineers and researchers. Yet it coordinated with nearly 50 external partners (such as llama.cpp, Ollama, Hugging Face, and Nvidia) and internal teams (such as Google Cloud and Android) to complete the release. Collaboration on this scale is needed because open-source models depend on a supporting ecosystem: llama.cpp runs models on local machines, Ollama simplifies deployment, and Hugging Face provides a hosting platform. Gemma 4 was also integrated directly into Android Studio, so developers can get help writing Android code offline without calling a remote API. The goal is for Gemma 4 to reach a wide range of use cases quickly, from mobile devices to development tools, and thereby build a strong open-source ecosystem.

III. Small Models vs. Large Models: Knowledge as the Final Barrier

Gemma 4 has caught up with the top closed-source models of a year and a half ago (such as early versions of GPT-4) and can handle agentic tasks, function calling, and conversation. The remaining gap between it and large models (like Gemini) lies in knowledge storage. Small models have fewer parameters and cannot retain as many facts (for example, who was the president of a given country 25 years ago), whereas large models can store more. Omar predicts that within 1-2 years, mobile devices will be able to run Gemini 3 Pro-level models locally, so that most daily tasks (such as chatting, coding, and image processing) can be completed offline. Only extremely complex tasks (such as long-document analysis and high-precision reasoning) will still require large models. Small and large models therefore complement rather than compete with each other: small models handle everyday tasks, while large models handle specialized ones.

IV. Multi-modal and Multilingual Capabilities

Built on Gemini 3 technology, Gemma 4 supports multimodal tasks, including understanding audio (speech recognition, transcription, question answering), images (object detection, description), and short videos (30-60 seconds). It has some limitations: it cannot perform image segmentation (for example, cutting a cat out of an image) or process video and audio at the same time. On the language side, Gemma supports 140 languages, thanks to a high-quality tokenizer that breaks text into units the model can understand. For example, when the model is fine-tuned for Vietnamese, the tokenizer helps it capture the nuances of the language.
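To make "breaks text into units" concrete, here is a toy greedy longest-match tokenizer. It is not Gemma's tokenizer (production tokenizers learn their vocabularies from data); the function `tokenize` and both vocabularies are hypothetical, chosen only to show how better vocabulary coverage of a language yields fewer, more meaningful tokens.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match subword tokenizer (toy illustration).

    At each position, take the longest vocabulary entry that matches;
    fall back to a single character when nothing in the vocabulary matches.
    """
    max_len = max([len(entry) for entry in vocab] + [1])
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


text = "xin ch\u00e0o"  # Vietnamese greeting "xin chào"

# Hypothetical vocabularies: one with poor coverage, one with good coverage.
poor_vocab = {"xin", "ch"}
rich_vocab = {"xin", " ", "ch\u00e0o"}

poor_tokens = tokenize(text, poor_vocab)
rich_tokens = tokenize(text, rich_vocab)

print(len(poor_tokens), poor_tokens)
print(len(rich_tokens), rich_tokens)
```

With the poorer vocabulary the greeting splits into 5 pieces, including isolated characters; with the richer one it splits into 3, keeping the word "chào" whole.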

V. Is Fine-tuning Obsolete? The Pros and Cons of MoE Models

In the past, fine-tuning (further training a general model on industry-specific data) was common practice, but given Gemma 4's out-of-the-box performance, many partners found fine-tuning unnecessary for their vision use cases. Only specialized fields such as finance and healthcare still require it. Gemma comes in two versions of similar size: a 31B dense model (which uses all of its parameters for every input) and a 27B MoE model (which activates only a subset of its parameters). MoE models are fast at inference but hard to fine-tune, because the routing mechanism that decides which parameters to use must be adjusted as well, which adds many variables. The current trend is to use pre-trained models as-is for general tasks and to fine-tune only for specific use cases. MoE models suit scenarios where speed is critical, but fine-tuning them takes specialized expertise.
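The difference between "using all parameters" and "activating a subset" can be sketched with generic top-k routing. This is a minimal illustration of the general MoE idea, not Gemma 4's router: the names `moe_forward` and `make_scaling_expert`, the four toy experts, and the router weights are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_weights, experts, top_k=2):
    """Route input x to the top_k highest-scoring experts and mix their outputs.

    x: list of floats (one token's features)
    router_weights: one weight vector per expert; score = dot(weights, x)
    experts: list of callables, each mapping a feature list to a feature list
    Returns (output, chosen_expert_indices).
    """
    scores = [sum(w * xi for w, xi in zip(weights, x)) for weights in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in chosen])  # renormalize over chosen experts only
    output = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](x)  # only the chosen experts are evaluated
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen


def make_scaling_expert(factor):
    """Toy expert: multiplies every feature by a fixed factor."""
    return lambda x: [factor * v for v in x]


experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.1, 0.0], [2.0, 0.0], [0.5, 0.0], [2.0, 0.0]]

output, chosen = moe_forward([1.0, 1.0], router_weights, experts, top_k=2)
print(chosen, output)
```

Here only two of the four experts run for this input, which is where the inference speedup comes from. Fine-tuning would have to update `router_weights` along with the experts, and a change in routing alters which experts see which inputs; this is one way to read the difficulty described above.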

Conclusion

The release of Gemma 4 marks a significant milestone in Google’s open-source AI and mobile deployment strategy, as it enables powerful AI models to run on ordinary phones and lowers the barriers to usage through ecosystem collaboration. In the next 1-2 years, as phones become capable of running large models, our daily experiences (e.g., offline AI assistants, local image processing) will change significantly. Google is leveraging the Gemma series to gain a competitive advantage in the open-source AI landscape and differentiate itself from closed-source models like GPT-4.