According to a report from Tom’s Hardware, while major tech companies are investing heavily in datacenter GPUs, the lifespan of these GPUs may only be 1 to 3 years, depending on their utilization rates.
The report, citing a general architect at Alphabet, noted that because GPUs carry the heavy workload of AI training and inference, they tend to wear out more quickly than other components.
According to the report, in datacenters operated by cloud service providers (CSPs), the utilization rate of GPUs for AI workloads ranges from approximately 60% to 70%.
Citing the same architect, the report indicated that at this utilization rate a GPU typically survives for 1 to 2 years, or 3 years at most. While the report stated that this claim cannot be considered 100% accurate and requires further confirmation, it highlighted that modern datacenter GPUs for AI and HPC applications consume and dissipate 700W of power or more, which puts significant stress on the chips.
One way to extend the life of the GPUs is to reduce the utilization rate, according to the report. However, a lower utilization rate means that the GPUs depreciate more slowly and take longer to earn back their capital cost, which isn’t ideal for business. Therefore, the report pointed out that most cloud service providers will run their GPUs at a high utilization rate.
The report also references a study conducted by Meta, which describes training its Llama 3 405B model on a cluster powered by 16,384 NVIDIA H100 80GB GPUs. According to the report, the cluster's model FLOPs utilization (MFU) in that study was about 38% (using BF16). During a 54-day pre-training snapshot, out of 419 unforeseen disruptions, 148 (30.1%) were caused by GPU failures (including NVLink failures) and 72 (17.2%) were due to HBM3 memory failures.
According to the report, Meta's results are quite favorable for NVIDIA’s H100 GPUs. If GPUs and their memory keep failing at the rate Meta observed, the annualized failure rate will be about 9%, and the cumulative failure rate over 3 years will be about 27%. However, as the report pointed out, GPUs will likely fail more frequently after a year of heavy use.
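The roughly 9% and 27% figures can be reproduced with simple arithmetic from the numbers in Meta's snapshot. The sketch below is only a back-of-the-envelope check, not the report's own method. It assumes that each of the 220 GPU and HBM3 failures hit a different GPU, that the failure rate stays constant over time, and that a year has 365 days. The function and variable names are made up for illustration.

```python
# Figures reported for Meta's 54-day Llama 3 405B pre-training snapshot.
GPU_COUNT = 16_384      # NVIDIA H100 80GB GPUs in the cluster
SNAPSHOT_DAYS = 54      # length of the observation window
GPU_FAILURES = 148      # GPU failures, including NVLink failures
HBM3_FAILURES = 72      # HBM3 memory failures


def annualized_failure_rate(failures, units, days, days_per_year=365):
    """Scale the failures-per-unit seen in `days` linearly to one year.

    Assumption: a constant failure rate and one failure per unit.
    """
    return failures / units * days_per_year / days


total_failures = GPU_FAILURES + HBM3_FAILURES
afr = annualized_failure_rate(total_failures, GPU_COUNT, SNAPSHOT_DAYS)

# Assumption: the three-year figure is a plain linear multiple of the
# annual rate, which matches the "about 27%" quoted in the report.
three_year_rate = 3 * afr

print(f"Annualized failure rate: {afr:.1%}")
print(f"Three-year failure rate (linear): {three_year_rate:.1%}")
```

Because the extrapolation is linear, it does not capture the report's caveat that failures are likely to become more frequent after a year of heavy use, so the three-year figure is best read as a lower bound.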
(Photo credit: NVIDIA)