Microsoft is contributing new standards across power, cooling, sustainability, security, networking, and fleet resiliency to advance innovation.
In the transition from building computing infrastructure for cloud scale to building cloud and AI infrastructure for frontier scale, the world of computing has experienced tectonic shifts in innovation. Throughout this journey, Microsoft has shared its learnings and best practices, optimizing our cloud infrastructure stack in cross-industry forums such as the Open Compute Project (OCP) Global Foundation.
Today, we see that the next phase of cloud infrastructure innovation is poised to be the most consequential period of transformation yet. In just the last year, Microsoft has added more than 2 gigawatts of new capacity and launched the world's most powerful AI datacenter, which delivers 10x the performance of today's fastest supercomputer. Yet this is just the beginning.
Delivering AI infrastructure at the highest performance and lowest cost requires a systems approach, with optimizations across the stack to drive quality, speed, and resiliency at a level that can provide a consistent experience to our customers. In the quest to deliver resilient, sustainable, secure, and broadly scalable technology that can handle the breadth of AI workloads, we are embarking on an ambitious new journey: one not just of redefining infrastructure innovation at every layer, from silicon to systems, but one of tightly integrated industry alignment on standards that offer a model for global interoperability and standardization.
At this year's OCP Global Summit, Microsoft is contributing new standards across power, cooling, sustainability, security, networking, and fleet resiliency to further advance innovation in the industry.
Redefining power distribution for the AI era
As AI workloads scale globally, hyperscale datacenters are experiencing unprecedented power density and distribution challenges.
Last year, at the OCP Global Summit, we partnered with Meta and Google on the development of Mt. Diablo, a disaggregated power architecture. This year, we are building on that innovation with the next step in our full-stack transformation of datacenter power systems: solid-state transformers. Solid-state transformers simplify the power chain with new conversion technologies and protection schemes that can accommodate future rack voltage requirements.
Training large models across thousands of GPUs also introduces variable and intense power draw patterns that can strain the grid, the utility, and traditional power delivery systems. These fluctuations not only risk hardware reliability and operational efficiency but also create challenges for capacity planning and sustainability goals.
Together with key industry partners, Microsoft is leading a power stabilization initiative to address this challenge. In a recently published paper with OpenAI and NVIDIA, "Power Stabilization for AI Training Datacenters," we describe how full-stack innovations spanning rack-level hardware, firmware orchestration, predictive telemetry, and facility integration can smooth power spikes, reduce power overshoot by 40%, and mitigate operational risk and cost to enable predictable, scalable power delivery for AI training clusters.
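To make the idea of smoothing power spikes concrete, here is a minimal sketch of one class of mechanism the paper's rack-level hardware and firmware layers could employ: a ramp-rate limiter that clamps how fast a power setpoint may move between control intervals. The function names and the step limit are illustrative, not taken from the paper.

```python
def limit_ramp(requested_watts, previous_watts, max_step_watts):
    """Clamp a new power request so it moves at most max_step_watts
    away from the previous setpoint, smoothing the sharp swings that
    synchronized training steps produce."""
    delta = requested_watts - previous_watts
    if delta > max_step_watts:
        return previous_watts + max_step_watts
    if delta < -max_step_watts:
        return previous_watts - max_step_watts
    return requested_watts

def smooth_trace(trace_watts, start_watts, max_step_watts):
    """Apply the limiter across a time series of power requests,
    returning the smoothed setpoint at each interval."""
    setpoints, prev = [], start_watts
    for w in trace_watts:
        prev = limit_ramp(w, prev, max_step_watts)
        setpoints.append(prev)
    return setpoints
```

Run against an oscillating request trace, the limiter trades a small amount of responsiveness for a bounded rate of change, which is the basic lever behind reducing overshoot at the facility level.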
This year, at the OCP Global Summit, Microsoft is joining forces with industry partners to launch a dedicated power stabilization workgroup. Our goal is to foster open collaboration across hyperscalers and hardware partners, sharing our learnings from full-stack innovation and inviting the community to co-develop new methodologies that address the unique power challenges of AI training datacenters. By building on the insights from our recently published white paper, we aim to accelerate industry-wide adoption of resilient, scalable power delivery solutions for the next generation of AI infrastructure. Read more about our power stabilization efforts.
Cooling innovations for resiliency
As the power profile of AI infrastructure changes, we are also continuing to rearchitect our cooling infrastructure to support evolving needs around energy consumption, space optimization, and overall datacenter sustainability. Diverse cooling solutions must be deployed to support the scale of our expansion: as we build new AI-scale datacenters, we are also using Heat Exchanger Unit (HXU)-based liquid cooling to rapidly deploy new AI capacity within our existing air-cooled datacenter footprint.
Microsoft's next-generation HXU is an upcoming OCP contribution that enables liquid cooling for high-performance AI systems in air-cooled datacenters, supporting global scalability and rapid deployment. The modular HXU design delivers 2X the performance of current models and maintains >99.9% cooling service availability for AI workloads. No datacenter modifications are required, allowing seamless integration and expansion. Learn more about the next-generation HXU here.
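To put a figure like >99.9% availability in perspective, it can be converted into an annual downtime budget. This small calculation is purely illustrative context, not part of the HXU specification:

```python
def annual_downtime_minutes(availability):
    """Minutes of allowable downtime per year for a given
    availability fraction (e.g. 0.999 for 'three nines')."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return (1.0 - availability) * minutes_per_year
```

At three nines, the cooling service may be unavailable for under nine hours across an entire year, which frames why modularity and serviceability matter in the design.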
Meanwhile, we are continuing to innovate across multiple layers of the stack to address changes in power and heat dissipation: employing facility water cooling at datacenter scale, circulating liquid in closed loops from server to chiller, and exploring on-chip cooling innovations like microfluidics to efficiently remove heat directly from the silicon.
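The capacity of a closed liquid loop like the one described above is governed by the standard sensible-heat relation Q = m_dot * c_p * delta_T. A hedged sketch, with water's specific heat as the default and all numbers illustrative rather than Microsoft's actual design points:

```python
def loop_heat_removal_kw(flow_kg_per_s, delta_t_c, specific_heat_j_per_kg_k=4186):
    """Heat removed by a coolant loop, in kW: mass flow (kg/s)
    times specific heat (J/kg/K, default is water) times the
    supply-to-return temperature rise (degrees C)."""
    return flow_kg_per_s * specific_heat_j_per_kg_k * delta_t_c / 1000.0
```

For example, 2 kg/s of water warming by 10 degrees C carries away roughly 84 kW, which illustrates why liquid loops can serve rack densities that air cooling cannot.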
Unified networking solutions for growing infrastructure demands
Scaling hundreds of thousands of GPUs to operate as a single, coherent system poses significant challenges in building rack-scale interconnects that deliver low-latency, high-bandwidth fabrics that are both efficient and interoperable. As AI workloads grow exponentially and infrastructure demands intensify, we are exploring networking optimizations that can support these needs. To that end, we have developed solutions leveraging scale-up, scale-out, and Wide Area Network (WAN) technologies to enable large-scale distributed training.
We partner closely with standards bodies focused on innovation in networking technologies for this critical element of AI systems, such as the Ultra Ethernet Consortium (UEC) and UALink. We are also driving adoption of Ethernet for scale-up networking across the ecosystem and are excited to see the Ethernet for Scale-Up Networking (ESUN) workstream launch under the OCP Networking Project. We look forward to promoting adoption of cutting-edge networking solutions and enabling a multi-vendor ecosystem based on open standards.
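As an illustration of why these fabrics matter for distributed training, the per-GPU traffic of a ring all-reduce, a common gradient-synchronization collective, can be estimated from the gradient size and GPU count. The formula below is the well-known ring-algorithm communication cost; the example numbers are illustrative and not tied to any Microsoft deployment:

```python
def ring_allreduce_bytes_per_gpu(gradient_bytes, num_gpus):
    """Bytes each GPU transmits in one ring all-reduce:
    2 * (N - 1) / N * message size, the classic ring cost
    (a reduce-scatter phase plus an all-gather phase)."""
    return 2 * (num_gpus - 1) / num_gpus * gradient_bytes
```

Because the per-GPU cost approaches twice the gradient size regardless of N, every training step moves on the order of the full model's gradients through the fabric, which is why low-latency, high-bandwidth interconnects dominate scale-up design.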
Security, sustainability, and quality: Fundamental pillars for resilient AI operations
Defense in depth: Trust at every layer
Our comprehensive approach to scaling AI systems responsibly includes embedding trust and security into every layer of our platform. This year, we are introducing new security contributions that build on our existing body of work in hardware security and introduce new protocols uniquely suited to support the scientific breakthroughs that AI has accelerated:
- Building on past years' contributions and Microsoft's collaboration with AMD, Google, and NVIDIA, we have further enhanced Caliptra, our open-source silicon root of trust. The introduction of Caliptra 2.1 extends the hardware root of trust to a full security subsystem. Learn more about Caliptra 2.1 here.
- We have also added Adams Bridge 2.0 to Caliptra, extending the root of trust's support for quantum-resilient cryptographic algorithms.
- Finally, we are contributing OCP Layered Open-source Cryptographic Key Management (L.O.C.K.), a key management block for storage devices that secures media encryption keys in hardware. L.O.C.K. was developed through collaboration between Google, Kioxia, Microsoft, Samsung, and Solidigm.
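Conceptually, a silicon root of trust anchors a measured-boot chain in which each stage is hashed into a running measurement before execution is handed off, so any tampered firmware changes the final digest. The sketch below shows only that chaining idea in Python; it is not Caliptra's actual implementation, and the all-zero starting value and stage names are illustrative:

```python
import hashlib

def extend_measurement(chain_digest, stage_image):
    """Extend the running measurement: hash the current chain digest
    concatenated with the next boot stage's image bytes."""
    return hashlib.sha384(chain_digest + stage_image).digest()

def measure_boot(stages):
    """Fold all boot stages, in order, into one chained digest,
    starting from an all-zero value the width of SHA-384."""
    digest = bytes(48)
    for image in stages:
        digest = extend_measurement(digest, image)
    return digest
```

Because the digest depends on both the content and the order of every stage, a verifier that knows the expected final value can detect substitution or reordering of any firmware component.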
Advancing datacenter-scale sustainability
Sustainability continues to be a major area of opportunity for industry collaboration and standardization through communities such as the Open Compute Project. Working collaboratively as an ecosystem of hyperscalers and hardware partners is one catalyst for addressing the need for sustainable datacenter infrastructure that can scale effectively as compute demands continue to evolve. This year, we are pleased to continue our collaborations as part of OCP's Sustainability workgroup in areas such as carbon reporting, accounting, and circularity:
- Announced at this year's Global Summit, we are partnering with AWS, Google, and Meta to fund the Product Category Rule initiative under the OCP Sustainability workgroup, with the goal of standardizing carbon measurement methodology for devices and datacenter equipment.
- Together with Google, Meta, OCP, Schneider Electric, and the iMasons Climate Accord, we are developing the Embodied Carbon Disclosure Base Specification to establish a common framework for reporting the carbon impact of datacenter equipment.
- Microsoft is advancing the adoption of waste heat reuse (WHR). In partnership with the NetZero Innovation Hub, NREL, and EU and US collaborators, Microsoft has published heat reuse reference designs and is developing an economic modeling tool that gives datacenter operators and waste heat offtakers the cost of building WHR infrastructure under given conditions, such as the size and capacity of the WHR system, season, location, and the WHR mandates and subsidies in place. These region-specific solutions help operators convert excess heat into usable energy, meeting regulatory requirements and unlocking new capacity, especially in regions like Europe where heat reuse is becoming mandatory.
- We have developed an open methodology for Life Cycle Assessment (LCA) at scale across large IT hardware fleets to drive toward a "gold standard" in sustainable cloud infrastructure.
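A standardized carbon methodology ultimately comes down to consistent per-component accounting rolled up across a bill of materials and a fleet. The toy roll-up below only illustrates that structure; the component names and emission factors are invented for the example and do not come from the OCP specifications mentioned above:

```python
def fleet_embodied_carbon_kg(components):
    """Sum embodied carbon across a bill of materials, where each
    component is a (name, unit_kg_co2e, quantity) tuple."""
    return sum(unit_kg_co2e * quantity for _, unit_kg_co2e, quantity in components)
```

The value of a shared Product Category Rule is precisely that the per-unit factors feeding a roll-up like this are measured the same way by every vendor, making totals comparable across suppliers.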
Rethinking node management: Fleet operational resiliency for the frontier era
As AI infrastructure scales at an unprecedented pace, Microsoft is investing in standardizing how diverse compute nodes are deployed, updated, monitored, and serviced across hyperscale datacenters. In collaboration with AMD, Arm, Google, Intel, Meta, and NVIDIA, we are driving a series of Open Compute Project (OCP) contributions focused on streamlining fleet operations, unifying firmware management and manageability interfaces, and enhancing diagnostics, debug, and RAS (Reliability, Availability, and Serviceability) capabilities. This standardized approach to lifecycle management lays the foundation for consistent, scalable node operations during this period of rapid expansion. Read more about our approach to resilient fleet operations.
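One concrete task within unified firmware lifecycle management is scanning a fleet to find nodes whose installed firmware lags the target baseline. The sketch below shows that scan in miniature; the node records, component names, and dotted version scheme are illustrative assumptions, not Microsoft's or OCP's actual data model:

```python
def nodes_needing_update(fleet, target_versions):
    """Return sorted node IDs where any component's installed
    firmware is older than the fleet-wide target.
    fleet: {node_id: {component: installed_version}}
    target_versions: {component: required_version}
    Versions compare as tuples of dotted integers."""
    def as_tuple(version):
        return tuple(int(part) for part in version.split("."))
    stale = []
    for node_id, installed in fleet.items():
        for component, target in target_versions.items():
            if as_tuple(installed.get(component, "0")) < as_tuple(target):
                stale.append(node_id)
                break  # one stale component is enough to flag the node
    return sorted(stale)
```

Standardized manageability interfaces matter here because a scan like this only works when every vendor's node reports its firmware inventory in the same shape.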
Paving the way for frontier-scale AI computing
As we enter a new era of frontier-scale AI development, Microsoft takes pride in leading the advancement of standards that will drive the future of globally deployable AI supercomputing. Our commitment is reflected in our active role in shaping the ecosystem that enables scalable, secure, and reliable AI infrastructure across the globe. We invite attendees of this year's OCP Global Summit to connect with Microsoft at booth #B53 to explore our latest cloud hardware demonstrations. These demonstrations showcase our ongoing collaborations with partners throughout the OCP community, highlighting innovations that support the evolution of AI and cloud technologies.
Connect with Microsoft at the OCP Global Summit 2025 and beyond
