
[News] NVIDIA’s Blackwell GPUs Reportedly Overheat in Server Racks, Sparking Delay Concerns


2024-11-18 Semiconductors editor


Is NVIDIA facing another hitch before Blackwell GPUs officially hit the market? Following the yield issues of a couple of months ago, the AI giant’s Blackwell processors are reportedly overheating when installed in high-capacity server racks, according to a report by The Information, cited by Tom’s Hardware and Reuters.

Notably, these challenges have led to design modifications and delays, sparking concerns from major customers such as Google, Meta, and Microsoft about the timely deployment of Blackwell servers, according to the reports.

According to the reports, insiders told The Information that Blackwell GPUs for AI and high-performance computing (HPC) face overheating issues in servers housing 72 processors, a configuration that can demand up to 120 kW per rack.

Therefore, NVIDIA has reportedly revised its server rack designs multiple times, as overheating not only hampers GPU performance but also risks hardware damage.

NVIDIA’s GPUs are critical to clients like Google, Meta, and Microsoft, which rely on them to train their most advanced large language models. An NVIDIA spokesperson told Reuters that the company is collaborating closely with cloud providers and described the design adjustments as a routine part of the development process.

According to Tom’s Hardware, although such adjustments are common in large-scale technology rollouts, they have contributed to delays and may further postpone expected shipping timelines.

Tom’s Hardware notes that the final revision of Blackwell entered mass production only in late October, indicating that shipments are expected to begin in late January. Whether the latest overheating snag will further delay Blackwell shipments remains to be seen.

This is certainly not the first time NVIDIA has encountered issues with Blackwell. A couple of months ago, the GPUs reportedly suffered from a design flaw affecting processor yields, which is said to have been related to TSMC’s CoWoS advanced packaging and was eventually resolved by making changes to the GPUs’ masks.

In October, however, NVIDIA CEO Jensen Huang dispelled rumors that TSMC was to blame, emphasizing that TSMC helped fix the problem and resume manufacturing “at an incredible pace.” He also described the demand for Blackwell as “insane.”


(Photo credit: NVIDIA)


Please note that this article cites information from The Information, Reuters, and Tom’s Hardware.
