Moonshot AI, the Chinese startup behind Kimi, used NVIDIA GTC 2026 on March 18 to give its clearest public account yet of how it scaled Kimi K2.5 and why it is reworking core Transformer plumbing rather than only shipping new chatbot features. Chinese media coverage of founder Yang Zhilin’s talk said the company framed its roadmap around token efficiency, long context and “agent swarms,” while spotlighting a new Attention Residuals (AttnRes) paper and open-source repository released on March 16. The bigger significance is that a Chinese frontier-model company chose NVIDIA’s flagship global developer event to argue for an architecture-level change with measurable training and performance claims.
GTC gave Moonshot a global stage, not just a domestic media spike
The conference context matters here. NVIDIA says GTC 2026 is running from March 16 to March 19 in San Jose, with more than 30,000 attendees from over 190 countries and more than 1,000 sessions across the AI stack. That scale makes GTC more than another conference appearance: it is one of the industry’s biggest stages for infrastructure, models and deployment strategy. When a Chinese model company uses that setting to discuss residual connections, scaling bottlenecks and training efficiency, it is trying to enter a global engineering conversation rather than only a Chinese product-news cycle, much like NVIDIA’s broader push to turn GTC-week model releases into infrastructure narratives.
That is also why the story stands out from many recent China AI headlines. In recent news cycles, the louder stories were about policy rollouts, automaker hardware, or earnings-driven product updates. Moonshot’s GTC moment is different. It is not mainly about a new chatbot feature, a funding rumor, or a benchmark screenshot. It is about a Chinese AI company using an international venue to say its next edge may come from changing how large models are built at the architectural level.
The March 16 paper moved the discussion from application layer to model internals
The key technical anchor is the March 16 arXiv paper “Attention Residuals.” In the abstract, the Kimi Team says residual connections with PreNorm are standard in modern large language models, but they accumulate all layer outputs with fixed unit weights. According to the paper, that uniform aggregation can lead to uncontrolled hidden-state growth as depth increases and can dilute the contribution of each individual layer. AttnRes replaces that fixed accumulation with softmax attention over preceding layer outputs, letting each layer selectively combine earlier representations using learned, input-dependent weights.
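To make the mechanism concrete, here is a minimal sketch of the contrast the paper draws: a standard PreNorm residual stream sums all preceding layer outputs with fixed unit weights, while an AttnRes-style layer mixes them with softmax weights that depend on the input. The query and key projections, shapes and naming below are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 8  # toy hidden size

def standard_residual(outputs):
    # Standard PreNorm residual stream: every preceding layer output is
    # accumulated with a fixed weight of 1.0, so the hidden-state norm
    # can grow with depth and each layer's share is diluted.
    return np.sum(outputs, axis=0)

def attn_res(outputs, query, keys):
    # AttnRes (sketch): softmax attention over preceding layer outputs.
    # `query` stands in for a projection of the current hidden state and
    # `keys` for per-layer projections -- both hypothetical here.
    scores = keys @ query / np.sqrt(d)   # one score per stored layer output
    weights = softmax(scores)            # learned, input-dependent mixing weights
    return weights @ outputs             # convex combination, bounded scale

rng = np.random.default_rng(0)
layer_outputs = rng.normal(size=(5, d))  # outputs of 5 preceding layers
q = rng.normal(size=d)
K = rng.normal(size=(5, d))

fixed = standard_residual(layer_outputs)   # fixed unit-weight accumulation
mixed = attn_res(layer_outputs, q, K)      # selective, input-dependent combination
```

Because the softmax weights sum to one, the combined representation stays on the scale of the individual layer outputs rather than growing with depth, which is the dilution-and-growth problem the paper attributes to fixed unit weights.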
That may sound like an abstract systems tweak, but it matters because residual connections are one of the deep assumptions built into today’s Transformer families. Moonshot is effectively arguing that one of the architecture’s default habits, not just model size or data scale, has become a bottleneck. In a market where many AI companies compete by layering agents, search, coding or enterprise workflows on top of existing backbones, Moonshot’s message is that some of the bigger gains may still come from touching the base machinery itself.
Block AttnRes is the bridge from research claim to deployable engineering
The second hard fact is that Moonshot did not stop at publishing a concept paper. The arXiv report says attending over all preceding layer outputs creates memory and communication overhead in large-scale training, so the team introduced Block AttnRes, which partitions layers into blocks and attends over block-level representations instead. The paper describes it as a practical drop-in replacement for standard residual connections with minimal overhead when combined with cache-based pipeline communication and a two-phase computation strategy.
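The block-level variant can be sketched the same way: instead of attending over every preceding layer output, layers are grouped into blocks and attention runs over one summary vector per block. The mean-pooled block summaries and the single key projection below are illustrative assumptions for this sketch; the paper's actual block representation, caching and two-phase computation are more involved.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def block_attn_res(layer_outputs, query, block_size, key_proj):
    # Block AttnRes (sketch): partition the stored layer outputs into
    # contiguous blocks, summarize each block (mean pooling here, an
    # illustrative choice), and attend over the block summaries. The
    # attention cost becomes O(num_blocks) instead of O(num_layers).
    num_layers, d = layer_outputs.shape
    blocks = layer_outputs.reshape(num_layers // block_size, block_size, d)
    block_reprs = blocks.mean(axis=1)                 # one vector per block
    scores = (block_reprs @ key_proj) @ query / np.sqrt(d)
    weights = softmax(scores)                         # per-block mixing weights
    return weights @ block_reprs

rng = np.random.default_rng(1)
d, L, B = 8, 12, 4                     # hidden size, stored layers, block size
outs = rng.normal(size=(L, d))         # 12 layer outputs -> 3 blocks of 4
W_k = rng.normal(size=(d, d))          # hypothetical key projection
q = rng.normal(size=d)
combined = block_attn_res(outs, q, B, W_k)
```

The design trade-off this illustrates is the one the repository describes: full attention over every layer is too expensive at scale, while the block version keeps only a handful of summaries live, which is what makes a near-drop-in replacement plausible.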
Moonshot’s GitHub repository pushes the engineering claim further. The repository summary says Full AttnRes is too expensive for practical use at scale, while Block AttnRes gets close to the effect of a baseline trained with 1.25x more compute at near-zero additional burden. That is the kind of statement that gives conference-week discussion real traction, because it translates a theoretical idea into the terms model builders actually care about: compute efficiency, training cost and whether a new mechanism can be slotted into production-scale systems without blowing up the budget.
Moonshot is packaging Kimi as a roadmap, not just a paper drop
The third anchor comes from Chinese media coverage of Yang Zhilin’s GTC talk, titled “How We Scaled Kimi K2.5.” Sina Finance, citing China Fund News, said this was the first time Moonshot systematically laid out Kimi’s technical roadmap after the late-January release of Kimi K2.5. In that account, Yang grouped the company’s scaling logic into three interacting tracks: token efficiency, long context and agent swarms. He also pointed to Kimi Linear and Attention Residuals as examples of how the team wants to reopen assumptions around attention mechanisms and residual design.
That packaging matters as much as the individual research artifact. A paper and GitHub repository can circulate on their own, but a roadmap gives investors, developers and rivals a frame for where the company thinks the next round of model improvement will come from. Moonshot is not only saying, “Here is a new paper.” It is saying, “Here is the direction we believe frontier-model scaling is taking, and here are the modules we think matter.” That is a much more ambitious positioning move, especially when read alongside recent coverage of Moonshot’s funding ambitions, which framed the company mainly as a capital-markets story rather than an architecture story.
The paper’s own numbers make the story more concrete
Moonshot also gave the market enough concrete numbers to keep the story from sounding hand-wavy. The arXiv abstract says AttnRes was integrated into the Kimi Linear architecture, described there as a 48B-total, 3B-activated-parameter model, and pre-trained on 1.4 trillion tokens. The authors say AttnRes helped mitigate PreNorm dilution, produced more uniform output magnitudes and gradient distributions across depth, and improved downstream performance across all evaluated tasks. Those are still self-reported results, not independent benchmarks from the broader community, but they are materially more specific than the vague “better reasoning” or “smarter agent” claims that dominate many AI launch posts.
The combination of those numbers with the GitHub efficiency claim is what makes this story stronger than a routine conference mention. Moonshot is tying together a public paper, an open repository, a named model family and a global event stage. For international readers, that is the sharper angle: a Chinese AI company is not only showing an application built on somebody else’s infrastructure, but also presenting itself as a contributor to the design of next-generation model infrastructure.
The safest version of the story is still the best one
There is also value in being strict about what is verified. Some Chinese tech outlets amplified the story with the claim that Elon Musk praised the work. That may be useful as a signal of how the topic spread on Chinese social media, but it should not be the backbone of the article, because no stable original public link has yet surfaced to anchor the claim. The firmer reporting rests on four clearer pieces: NVIDIA’s GTC schedule and scale, the March 16 arXiv paper, the MoonshotAI repository, and Chinese media coverage of Yang’s March 18 talk.
That caution actually improves the story. It shifts the focus away from celebrity endorsement and back to the more meaningful development: Moonshot used GTC to translate a Chinese lab’s internal architectural work into a public global narrative. If the company wanted only short-term attention, it could have leaned on virality. Instead, the stronger evidence points to a deliberate effort to build credibility through a paper, a repo and a roadmap-level talk.
What changed, and what could happen next
What changed this week is that Moonshot AI turned Kimi from a product name into a more explicit research-and-systems story on a global stage. By pairing a GTC talk with the Attention Residuals paper and repository, the company signaled that its next competitive argument is not just user growth or agent UX, but deeper control over how models scale, remember and coordinate across depth. For a Chinese model company, making that case at NVIDIA’s flagship event is itself a statement about confidence and ambition.
What could happen next is more important than the conference splash. If Block AttnRes or related ideas show up in future Kimi releases, attract outside replication, or influence how developers think about residual design in long-context and agent-heavy systems, Moonshot will have strengthened its position as more than another chatbot maker. If that does not happen, this will still stand as a smart GTC-week reveal, but mostly as a signaling exercise. Either way, Moonshot has already changed the frame: the conversation is no longer only about what Kimi can do, but about what Kimi thinks modern model architecture should become.
Sources
- NVIDIA Newsroom — “NVIDIA CEO Jensen Huang and Global Technology Leaders to Showcase Age of AI at GTC 2026”
  https://nvidianews.nvidia.com/news/nvidia-ceo-jensen-huang-and-global-technology-leaders-to-showcase-age-of-ai-at-gtc-2026
- NVIDIA GTC 2026 official page
  https://www.nvidia.com/gtc/
- arXiv — “Attention Residuals” (Kimi Team)
  https://arxiv.org/abs/2603.15031
- MoonshotAI GitHub — “Attention-Residuals”
  https://github.com/MoonshotAI/Attention-Residuals
- Sina Finance / China Fund News — report on Yang Zhilin’s GTC 2026 talk
  https://finance.sina.com.cn/wm/2026-03-18/doc-inhrkwsv9335574.shtml