OpenAI Unveils MRC: The High-Performance Networking 'Circulatory System' for AI Supercomputers

OpenAI introduces MRC (Multipath Reliable Connection), a novel networking protocol developed with tech giants to eliminate congestion and ensure microsecond resilience for clusters exceeding 100,000 GPUs.
The Networking Bottleneck in AI Training
In the world of frontier AI training, a single data transfer arriving late can ripple through the entire cluster, causing thousands of GPUs to sit idle. As OpenAI scales toward its ambitious "Stargate" project, traditional dynamic routing and single-path flows have become the primary sources of delay and jitter. To solve this, OpenAI partnered with AMD, Broadcom, Intel, Microsoft, and NVIDIA to develop MRC (Multipath Reliable Connection).
The Core Innovation: Multi-plane and Packet Spraying
MRC fundamentally rethinks how data moves between GPUs. Instead of a single high-speed link, MRC utilizes multi-plane networks, splitting an 800Gb/s interface into eight 100Gb/s parallel planes. This allows for a massive cluster of 131,000 GPUs to be fully connected using only two tiers of switches, drastically reducing power consumption and complexity.
Complementing this is Adaptive Packet Spraying. Unlike traditional protocols that require strict packet order, MRC sprays packets across hundreds of distinct paths simultaneously. If a path becomes congested, MRC swaps it in microseconds. If a packet is lost, it assumes path failure and reroutes immediately, bypassing failure modes that previously took seconds to resolve.
The Shift Toward Proactive Stance
This release signals a strategic move by OpenAI to commoditize and standardize the infrastructure layer. By contributing the MRC specification to the Open Compute Project (OCP), OpenAI is effectively ensuring that its research roadmap is never held hostage by proprietary networking silos. This "lead-by-standardization" approach forces hardware vendors to align with OpenAI’s scaling needs, ensuring that future supercomputers are built with high-resiliency, low-latency fabrics as a baseline requirement. It is a proactive effort to turn the "failure amplifier" of massive clusters into a stable, self-healing environment.
Source: OpenAI MRC Networking