Nvidia's Blackwell GPUs are among the most complex semiconductor devices ever built, which makes them significantly harder to manufacture, package, and test. According to the Financial Times, citing Doug Lefever, the chief executive of Advantest, testing a Blackwell data center GPU takes three to four times longer than testing a Hopper data center GPU because each unit has to be tested dozens of times on different tools before shipping.
An Nvidia Blackwell B100/B200 GPU consists of two compute chiplets packing 104 billion transistors each (208 billion in total) alongside eight HBM3E memory stacks, all interconnected using TSMC's CoWoS-L packaging technology. This contrasts with Nvidia's Hopper H100 GPU, which has a single 80-billion-transistor compute die and six HBM3 memory stacks.
Typically, as the transistor count grows, test complexity grows almost exponentially, because chips require more test patterns and longer test times. Test protocols must cover high-speed interconnects, stress conditions, thermal conditions (which, in the case of the B200, are extreme), and multiple operational modes (Blackwell adds FP4 support). Since a Blackwell GPU combines two highly complex chiplets with new features and higher thermals, it is natural that testing one takes more than twice as long as testing a Hopper GPU.
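As a rough back-of-the-envelope check on the figures quoted above, the short Python sketch below compares the two packages. It assumes 104 billion transistors per Blackwell compute chiplet (208 billion in total), and the dictionary and function names are purely illustrative. It compares raw counts only; it is not a model of test time, which also depends on factors the counts do not capture.

```python
# Package figures as quoted in the article.
# Assumption: 104 billion transistors per Blackwell compute chiplet.
HOPPER_H100 = {"compute_dies": 1, "transistors_billion": 80, "hbm_stacks": 6}
BLACKWELL_B200 = {"compute_dies": 2, "transistors_billion": 2 * 104, "hbm_stacks": 8}


def ratio(key: str) -> float:
    """Return the Blackwell-to-Hopper ratio for one package attribute."""
    return BLACKWELL_B200[key] / HOPPER_H100[key]


transistor_ratio = ratio("transistors_billion")
die_ratio = ratio("compute_dies")
stack_ratio = ratio("hbm_stacks")

print(f"Transistors:  {transistor_ratio:.2f}x")
print(f"Compute dies: {die_ratio:.2f}x")
print(f"HBM stacks:   {stack_ratio:.2f}x")
```

On raw transistor count alone, Blackwell is well over twice the size of Hopper, which is consistent with test time more than doubling even before the extra packaging-level test steps are counted.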
There's more to this. TSMC's CoWoS-L 2.5D packaging techniques introduce additional test steps (and sometimes multiple test phases) to ensure that each component in the package is functioning correctly and that interconnects are reliable.
With Blackwell, the compute chiplets and memory stacks have to be tested separately (though DRAM makers test the HBM3E stacks themselves), and the GPU then has to be tested multiple times as these chiplets are added to the RDL interposer. We do not know the exact number of times that Nvidia's Blackwell B100 and B200 GPUs and modules are tested during packaging and assembly. Still, these GPUs undergo considerably more testing iterations than Hopper H100 GPUs.
In general, Blackwell's long test time reflects the growing complexity of these AI and HPC GPUs and the requirement for extensive validation to ensure performance and reliability in various data center environments when functioning alongside other components, such as CPUs, DPUs, and network interface cards.