- A new approach called DualPipe seems to be the key to DeepSeek’s success
- One expert describes it as an on-GPU virtual DPU that maximizes bandwidth efficiency
- While DeepSeek has so far used only Nvidia GPUs, one wonders how AMD’s Instinct accelerators would fare
China’s DeepSeek AI chatbot has stunned the tech industry, representing a credible alternative to OpenAI’s ChatGPT at a fraction of the cost.
A recent paper revealed DeepSeek V3 was trained on a cluster of 2,048 Nvidia H800 GPUs – crippled versions of the H100 (we can only imagine how much more powerful it would be running on AMD Instinct accelerators!). Training reportedly required 2.79 million GPU-hours – covering pretraining on 14.8 trillion tokens plus subsequent fine-tuning – and cost, according to calculations made by The Next Platform, a mere $5.58 million.
Exactly how DeepSeek’s developers managed this feat likely comes down to a clever hack.
A virtual DPU on the GPU itself
First, some background. DeepSeek is an advanced Mixture-of-Experts (MoE) language model designed to optimize performance by selectively activating only the most relevant parts of its architecture for each task. The third version of the model, DeepSeek-V3, features a total of 671 billion parameters, with only 37 billion activated for any given token prediction. This selective activation massively reduces computational costs while maintaining high performance and accuracy – which you’ll see if you try it.
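The routing idea described above can be sketched in a few lines. This is a minimal illustration of MoE top-k gating, not DeepSeek’s actual implementation: a gate scores every expert for each token, but only the top-k experts are evaluated, so compute scales with k rather than with the total expert count.

```python
# Minimal MoE top-k routing sketch (illustrative, not DeepSeek's code).
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """x: (d,) token vector; gate_w: (n_experts, d); experts: list of callables."""
    scores = gate_w @ x                   # one gating score per expert
    top = np.argsort(scores)[-k:]         # indices of the k highest-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()              # softmax over the chosen k only
    # Only the selected experts run -- the rest stay idle for this token.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
experts = [lambda x, W=rng.standard_normal((d, d)): W @ x for _ in range(n_experts)]
gate_w = rng.standard_normal((n_experts, d))
y = moe_forward(rng.standard_normal(d), gate_w, experts, k=2)
print(y.shape)  # (8,)
```

With 16 experts and k=2, only one eighth of the expert parameters are touched per token – the same principle by which V3 activates 37 billion of its 671 billion parameters.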
It’s easy to be skeptical of DeepSeek and the claims made regarding its training, but the paper reveals some of the magic the developers came up with to make the most of the crippled hardware they had to work with. This includes the creation of the DualPipe algorithm for efficient pipeline parallelism.
According to the information published by DeepSeek, DualPipe overlaps forward and backward computation, reduces latency, and optimizes data movement across GPUs. By efficiently managing communication, it minimizes idle time (pipeline bubbles) and dynamically balances GPU compute cores (Streaming Multiprocessors) between computation and communication, preventing data transfer bottlenecks as the model scales.
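The “pipeline bubble” problem DualPipe attacks can be quantified with a toy model. The sketch below (assumed textbook formula, not DeepSeek’s scheduler) computes the idle fraction of a synchronous pipeline with p stages and m micro-batches: each stage sits idle for p − 1 slots out of m + p − 1 total, which is exactly what overlap-heavy schedules try to shrink.

```python
# Toy pipeline-bubble model (illustrative; not DualPipe itself).
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a naive synchronous pipeline schedule."""
    idle = stages - 1
    total = microbatches + stages - 1
    return idle / total

# Keeping more micro-batches in flight shrinks the bubble; schedules like
# DualPipe that overlap forward and backward passes push utilization further.
print(bubble_fraction(8, 8))   # ≈ 0.467
print(bubble_fraction(8, 64))  # ≈ 0.0986
```

DualPipe goes beyond adding micro-batches: by running forward and backward chunks concurrently and dedicating some SMs to communication, it hides data transfer inside compute time rather than merely diluting the idle slots.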
A commenter on The Next Platform describes DualPipe as “essentially creating a virtual DPU on the GPU itself to handle all-to-all communication,” which highlights its role in optimizing data transfer efficiency.
The paper goes into further detail: “In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink.”
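The payoff of that two-tier IB/NVLink topology can be seen with a simple message-count model. This is a back-of-the-envelope sketch with assumed cluster numbers, not DeepSeek’s kernels: instead of every GPU sending to every remote GPU over InfiniBand, each GPU sends one message per remote node over IB and lets NVLink fan it out to the GPUs inside that node.

```python
# Toy message-count model for flat vs. hierarchical all-to-all (illustrative).
def ib_messages_flat(nodes: int, gpus_per_node: int) -> int:
    """Every GPU sends directly to every GPU on a different node over IB."""
    total = nodes * gpus_per_node
    return total * (total - gpus_per_node)

def ib_messages_hierarchical(nodes: int, gpus_per_node: int) -> int:
    """Every GPU sends one IB message per remote node; NVLink fans out locally."""
    total = nodes * gpus_per_node
    return total * (nodes - 1)

# Assumed shape: 2,048 GPUs as 256 nodes of 8 GPUs each.
print(ib_messages_flat(256, 8))          # 4177920
print(ib_messages_hierarchical(256, 8))  # 522240
```

Cutting IB traffic by a factor of gpus_per_node is why co-designing the kernels with the network topology matters on bandwidth-limited hardware like the H800.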