Alibaba unveils the network and datacenter design it uses for large language model training

Alibaba has revealed its datacenter design for LLM training: an Ethernet-based network in which each host contains eight GPUs and nine NICs, each NIC with two 200Gbps ports.
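To put those figures in perspective, the aggregate line rate per host can be worked out with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the port speed is 200 gigabits per second, and the constant and function names are illustrative rather than anything Alibaba published.

```python
# Figures from the article; the unit is assumed to be gigabits per second.
GPUS_PER_HOST = 8
NICS_PER_HOST = 9
PORTS_PER_NIC = 2
PORT_SPEED_GBPS = 200  # assumption: Gbps, not gigabytes per second


def host_bandwidth_gbps(nics: int = NICS_PER_HOST) -> int:
    """Aggregate line rate across all ports of the given number of NICs."""
    return nics * PORTS_PER_NIC * PORT_SPEED_GBPS


total = host_bandwidth_gbps()
print(f"Per NIC: {PORTS_PER_NIC * PORT_SPEED_GBPS} Gbps")
print(f"Per host (all {NICS_PER_HOST} NICs): {total} Gbps = {total / 1000} Tbps")
```

Under those assumptions, each NIC offers 400 Gbps and a host's nine NICs add up to 3.6 Tbps of raw port capacity.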

The tech giant, which also offers one of the best large language models (LLMs) around in Qwen, a model with 110 billion parameters, says this design has been used in production for eight months. It aims to maximize the utilization of a GPU’s PCIe capabilities, increasing the send/receive capacity of the network.

Another feature that increases speed is the use of NVLink for the intra-host network, which provides more bandwidth between the GPUs inside a host. Each port on the NICs is connected to a different top-of-rack (ToR) switch, avoiding a single point of failure, a design that Alibaba calls rail-optimized.
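The fault-tolerance argument can be illustrated with a small model. The sketch below is not Alibaba's tooling: the `Nic` class, the switch names such as `tor3a`, and the assumption that every NIC has its own pair of ToR switches are all hypothetical, chosen only to show why losing one ToR switch does not cut off a NIC when its two ports go to different switches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nic:
    name: str
    tor_ports: tuple[str, str]  # ToR switch each of the two ports plugs into


def build_host(num_nics: int = 9) -> list[Nic]:
    """Wire every NIC's two ports to two different (hypothetical) ToR switches."""
    return [Nic(f"nic{i}", (f"tor{i}a", f"tor{i}b")) for i in range(num_nics)]


def reachable_nics(host: list[Nic], failed_tors: set[str]) -> list[str]:
    """A NIC stays reachable while at least one of its ToR switches is up."""
    return [n.name for n in host if any(t not in failed_tors for t in n.tor_ports)]


host = build_host()
print(len(reachable_nics(host, {"tor3a"})))           # single ToR failure
print(len(reachable_nics(host, {"tor3a", "tor3b"})))  # both ToRs of one NIC fail
```

With one switch down, all nine NICs in this toy host remain reachable; a NIC only drops out when both of its switches fail at once.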

Each pod contains 15,000 GPUs

A new type of network is required because the traffic patterns in LLM training differ from those of general cloud computing: the traffic has low entropy and is bursty. There is also a higher sensitivity to faults and single points of failure.

“Based on the unique characteristics of LLM training, we decided to build a new network architecture specifically for LLM training. We should meet the following goals; scalability, high performance, and single-ToR fault tolerance,” the company said.

Another part of the infrastructure that was revealed was the cooling mechanism. As no vendors could provide a solution to keep chips below 105°C, the temperature at which switches begin to shut down, Alibaba designed and created its own vapor chamber heat sink, which uses more wick pillars at the center of chips to carry heat away more efficiently.

The design for LLM training is encapsulated in pods that contain 15,000 GPUs, and each pod can be located in a single datacenter. “All datacenter buildings in commission in Alibaba Cloud have an overall power constraint of 18MW, and an 18MW building can accommodate approximately 15K GPUs. In conjunction with HPN, each single building perfectly houses an entire Pod, making predominant links inside the same building,” Alibaba wrote.

Alibaba also wrote that it expects model parameters to continue to rise by an order of magnitude in the next several years, from one trillion to 10 trillion parameters, and that its new architecture is planned to support this and scale to 100,000 GPUs.
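The power and scale figures quoted above lend themselves to some quick arithmetic. The sketch below divides the whole-building power limit by the pod size, which gives an all-in budget per GPU (covering everything in the building, not the GPU's own draw), and estimates how many fully populated pods a 100,000-GPU deployment would need. It assumes one pod per building and treats "approximately 15K" as exactly 15,000; the variable names are illustrative.

```python
import math

BUILDING_POWER_MW = 18      # per-building power constraint quoted by Alibaba
GPUS_PER_POD = 15_000       # approximate pod size quoted by Alibaba
TARGET_GPUS = 100_000       # scale the architecture is planned to reach

# Whole-building power divided by GPU count: an all-in budget, not GPU draw alone.
kw_per_gpu = BUILDING_POWER_MW * 1000 / GPUS_PER_POD

# Pods (and, by assumption, buildings) needed if each pod holds about 15,000 GPUs.
pods_needed = math.ceil(TARGET_GPUS / GPUS_PER_POD)

print(f"{kw_per_gpu:.1f} kW per GPU")
print(f"{pods_needed} pods, about {pods_needed * BUILDING_POWER_MW} MW in total")
```

On these assumptions the budget works out to roughly 1.2 kW per GPU, and reaching 100,000 GPUs would take seven pods.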

Via The Register
