
Maia 200: The AI accelerator built for inference


Today, we’re proud to introduce Maia 200, a breakthrough inference accelerator engineered to dramatically improve the economics of AI token generation. Maia 200 is an AI inference powerhouse: an accelerator built on TSMC’s 3nm process with native FP8/FP4 tensor cores, a redesigned memory system with 216GB of HBM3e at 7 TB/s and 272MB of on-chip SRAM, plus data movement engines that keep large models fed, fast and highly utilized. This makes Maia 200 the most performant first-party silicon from any hyperscaler, with three times the FP4 performance of the third-generation Amazon Trainium, and FP8 performance above Google’s seventh-generation TPU. Maia 200 is also the most efficient inference system Microsoft has ever deployed, with 30% better performance per dollar than the latest generation of hardware in our fleet today.

Maia 200 is part of our heterogeneous AI infrastructure and will serve a number of models, including the latest GPT-5.2 models from OpenAI, bringing a performance-per-dollar advantage to Microsoft Foundry and Microsoft 365 Copilot. The Microsoft Superintelligence team will use Maia 200 for synthetic data generation and reinforcement learning to improve next-generation in-house models. For synthetic data pipeline use cases, Maia 200’s unique design helps accelerate the rate at which high-quality, domain-specific data can be generated and filtered, feeding downstream training with fresher, more targeted signals.
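As a purely illustrative sketch of the generate-and-filter loop such a pipeline implies (the `generate` and `quality_score` callables are hypothetical placeholders, not part of any Microsoft API):

```python
# Illustrative generate-then-filter loop for a synthetic data pipeline.
# `generate` and `quality_score` are hypothetical placeholders standing in for
# an inference-served generator model and a filtering/reward model.
from typing import Callable, Iterable

def synthetic_batch(prompts: Iterable[str],
                    generate: Callable[[str], str],
                    quality_score: Callable[[str, str], float],
                    threshold: float = 0.8) -> list[tuple[str, str]]:
    kept = []
    for prompt in prompts:
        candidate = generate(prompt)                  # inference-heavy step
        if quality_score(prompt, candidate) >= threshold:
            kept.append((prompt, candidate))          # goes to downstream training
    return kept
```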

Maia 200 is deployed in our US Central datacenter region near Des Moines, Iowa, with the US West 3 datacenter region near Phoenix, Arizona, coming next and future regions to follow. Maia 200 integrates seamlessly with Azure, and we’re previewing the Maia SDK, a complete set of tools to build and optimize models for Maia 200, including PyTorch integration, a Triton compiler and optimized kernel library, and access to Maia’s low-level programming language. This gives developers fine-grained control when needed while enabling easy model porting across heterogeneous hardware accelerators.
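To make the PyTorch path concrete, here is a minimal sketch of what targeting Maia 200 from an existing model could look like. The package name `torch_maia` and the `"maia"` device string are assumptions for illustration, not the documented SDK surface.

```python
# Minimal sketch of running an existing PyTorch model on Maia 200.
# Assumptions: the Maia SDK ships a hypothetical `torch_maia` package that
# registers a "maia" device backend with PyTorch; neither name is documented here.
import torch
import torch_maia  # hypothetical import that registers the "maia" device

model = torch.nn.TransformerEncoderLayer(d_model=4096, nhead=32, batch_first=True)
model = model.to("maia").eval()                   # move weights to the accelerator

x = torch.randn(1, 128, 4096, device="maia")      # batch of 1, sequence of 128
with torch.inference_mode():
    y = model(x)
print(y.shape)                                    # torch.Size([1, 128, 4096])
```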


Engineered for AI inference

Fabricated on TSMC’s cutting-edge 3-nanometer process, each Maia 200 chip contains over 140 billion transistors and is tailored for large-scale AI workloads while also delivering efficient performance per dollar. On both fronts, Maia 200 is built to excel. It’s designed for the latest models using low-precision compute, with each Maia 200 chip delivering over 10 petaFLOPS of 4-bit (FP4) and over 5 petaFLOPS of 8-bit (FP8) performance, all within a 750W SoC TDP envelope. In practical terms, Maia 200 can effortlessly run today’s largest models, with plenty of headroom for even larger models in the future.
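To put those peak numbers in perspective, a quick back-of-envelope calculation (using only the figures quoted above; sustained efficiency will vary with workload) gives the implied peak performance per watt:

```python
# Implied peak efficiency from the published figures (peak, not sustained).
fp4_pflops = 10.0   # over 10 petaFLOPS at FP4
fp8_pflops = 5.0    # over 5 petaFLOPS at FP8
tdp_watts = 750.0   # SoC TDP envelope

# 1 petaFLOPS = 1,000 teraFLOPS, so efficiency in TFLOPS per watt is:
print(f"FP4: ~{fp4_pflops * 1e3 / tdp_watts:.1f} TFLOPS/W peak")  # ~13.3
print(f"FP8: ~{fp8_pflops * 1e3 / tdp_watts:.1f} TFLOPS/W peak")  # ~6.7
```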

A close-up of the Maia 200 AI accelerator chip.

Crucially, FLOPS aren’t the only ingredient for faster AI. Feeding data is equally important. Maia 200 attacks this bottleneck with a redesigned memory subsystem centered on narrow-precision datatypes, a dedicated DMA engine, on-die SRAM and a specialized NoC fabric for high-bandwidth data movement, increasing token throughput.
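A rough roofline-style calculation, using only the headline numbers above, shows why narrow precision and on-chip reuse matter so much for token throughput:

```python
# Roofline-style back-of-envelope using the headline numbers above.
peak_fp4_flops = 10e15   # over 10 petaFLOPS (FP4, peak)
hbm_bandwidth = 7e12     # 216GB HBM3e at 7 TB/s

# Arithmetic intensity needed to be compute bound rather than memory bound:
ridge = peak_fp4_flops / hbm_bandwidth
print(f"~{ridge:.0f} FLOPs per HBM byte to saturate FP4 compute")  # ~1429

# Memory-bound floor: streaming the FP4 weights (0.5 bytes/param) of a
# hypothetical 200B-parameter model once per decoded token would take roughly:
weight_bytes = 200e9 * 0.5
print(f"~{weight_bytes / hbm_bandwidth * 1e3:.0f} ms per token "
      "(ignoring KV cache, batching and SRAM reuse)")              # ~14 ms
```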

A table with the title “Industry-leading capability” shows peak specifications for Azure Maia 200, AWS Trainium 3 and Google TPU v7.

Optimized AI systems

At the systems level, Maia 200 introduces a novel, two-tier scale-up network design built on standard Ethernet. A custom transport layer and tightly integrated NIC unlock performance, strong reliability and significant cost advantages without relying on proprietary fabrics.

Each accelerator exposes:

  • 2.8 TB/s of bidirectional, dedicated scale-up bandwidth
  • Predictable, high-performance collective operations across clusters of up to 6,144 accelerators

This architecture delivers scalable performance for dense inference clusters while reducing power usage and overall TCO across Azure’s global fleet.
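As a rough illustration (not a documented or measured figure), the per-accelerator bandwidth above puts a simple lower bound on collective time; the sketch below assumes a plain ring all-reduce and ignores latency, topology and protocol overhead:

```python
# Lower-bound estimate for a ring all-reduce over the scale-up fabric.
# Assumptions (not documented figures): a plain ring algorithm, ~1.4 TB/s
# usable per direction out of the 2.8 TB/s bidirectional figure, and no
# latency, switching or protocol overhead.
def ring_allreduce_seconds(tensor_bytes: float, n_ranks: int,
                           per_dir_bandwidth: float = 1.4e12) -> float:
    traffic = 2 * (n_ranks - 1) / n_ranks * tensor_bytes  # bytes sent per link
    return traffic / per_dir_bandwidth

# Example: 1 GiB of activations all-reduced across a 64-accelerator group.
print(f"~{ring_allreduce_seconds(2**30, 64) * 1e3:.2f} ms lower bound")  # ~1.5 ms
```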

Inside each tray, four Maia accelerators are fully connected with direct, non-switched links, keeping high-bandwidth communication local for maximum inference efficiency. The same communication protocols are used for intra-rack and inter-rack networking via the Maia AI transport protocol, enabling seamless scaling across nodes, racks and clusters of accelerators with minimal network hops. This unified fabric simplifies programming, improves workload flexibility and reduces stranded capacity while maintaining consistent performance and cost efficiency at cloud scale.
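As a small illustration of the tray topology described above (purely a sketch; accelerator IDs are arbitrary), the four accelerators form a fully connected graph where every pair is one direct hop apart:

```python
# Sketch of the intra-tray topology: four accelerators, fully connected,
# so every pair is exactly one direct, non-switched hop apart.
from itertools import combinations

tray = range(4)                                   # arbitrary accelerator IDs
links = {frozenset(pair) for pair in combinations(tray, 2)}

print(f"{len(links)} direct links inside the tray")          # 6
assert all(frozenset((a, b)) in links                        # one hop per pair
           for a, b in combinations(tray, 2))
```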

A top-down view of the Maia 200 server blade.

A cloud-native development approach

A core principle of Microsoft’s silicon development programs is to validate as much of the end-to-end system as possible ahead of final silicon availability.

A sophisticated pre-silicon environment guided the Maia 200 architecture from its earliest stages, modeling the computation and communication patterns of LLMs with high fidelity. This early co-development environment enabled us to optimize silicon, networking and system software as a unified whole, long before first silicon.

We also designed Maia 200 for fast, seamless availability in the datacenter from the start, building out early validation of some of the most complex system elements, including the backend network and our second-generation, closed-loop, liquid-cooling Heat Exchanger Unit. Native integration with the Azure control plane delivers security, telemetry, diagnostics and management capabilities at both the chip and rack levels, maximizing reliability and uptime for production-critical AI workloads.

As a result of these investments, AI models were running on Maia 200 silicon within days of the first packaged parts arriving. Time from first silicon to first datacenter rack deployment was reduced to less than half that of comparable AI infrastructure programs. And this end-to-end approach, from chip to software to datacenter, translates directly into higher utilization, faster time to production and sustained improvements in performance per dollar and per watt at cloud scale.

A view of the Maia 200 rack and the HXU cooling unit.

Sign up for the Maia SDK preview

The era of large-scale AI is just beginning, and infrastructure will define what’s possible. Our Maia AI accelerator program is designed to be multi-generational. As we deploy Maia 200 across our global infrastructure, we’re already designing for future generations and expect each generation will continually set new benchmarks for what’s possible and deliver ever greater performance and efficiency for the most important AI workloads.

Today, we’re inviting developers, AI startups and academics to begin exploring early model and workload optimization with the new Maia 200 software development kit (SDK). The SDK includes a Triton compiler, support for PyTorch, low-level programming in NPL and a Maia simulator and cost calculator to optimize for efficiencies earlier in the code lifecycle. Sign up for the preview here.
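Kernels targeted at the SDK’s Triton compiler are written as ordinary Triton code; the sketch below is a generic, standard open-source Triton kernel of the kind that compiler ingests, with any Maia-specific compilation options and NPL interop omitted since those would follow the SDK documentation:

```python
# A generic Triton kernel of the kind the SDK's Triton compiler consumes.
# This is standard open-source Triton; Maia-specific lowering happens in the
# SDK's backend and is not shown here.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements                  # guard the tensor's tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
    return out
```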

Get more photos, video and resources on our Maia 200 website and read more details.

Scott Guthrie is responsible for hyperscale cloud computing solutions and services including Azure, Microsoft’s cloud computing platform, generative AI solutions, data platforms, and information and cybersecurity. These platforms and services help organizations worldwide solve urgent challenges and drive long-term transformation.
