How does Meta train its large language models (LLMs)? A new blog post by a trio of Meta engineers sheds some light on how the social networking giant trains its Llama AI models.
While traditional AI training typically involved a massive number of models, each requiring a comparatively small number of GPUs, GenAI flipped things around, according to Adi Gangidi, KR Kishore, and Jenya Lee.
From traditional AI to GenAI
Specifically, training LLMs meant a shift towards fewer jobs that are “incredibly large”. Under the hood, training GenAI at scale also called for a rethink of how software, hardware, and network infrastructure come together, the engineers explained.
“As we increase the number of GPUs in a job, the likelihood of an interruption due to a hardware failure also increases. Also, all of these GPUs still need to communicate on the same high-speed fabric to perform optimally.”
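To get a feel for the scaling the engineers describe, a quick back-of-the-envelope calculation helps. The sketch below is purely illustrative: the per-GPU failure probability is a made-up number, not a figure from the blog post.

```python
# Probability that a job sees at least one hardware failure during a run,
# assuming each GPU fails independently with probability p over that run.
# p is a hypothetical, illustrative value, not a figure reported by Meta.
p = 1e-4

for n_gpus in (128, 2_048, 24_000):
    p_interruption = 1 - (1 - p) ** n_gpus
    print(f"{n_gpus:>6} GPUs -> P(at least one failure) ≈ {p_interruption:.1%}")
```

Even with a tiny per-GPU failure rate, a job spanning tens of thousands of GPUs becomes very likely to be interrupted at least once.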
Considerations such as hardware reliability, fast recovery on failure, and the preservation of training state are vital in this new paradigm. The idea is to minimize failures or, barring that, have the system recover quickly by reducing overhead and re-initializing training in the shortest time possible. Achieving the latter means saving the training state regularly.
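In practice, saving training state regularly comes down to periodic checkpointing. The PyTorch sketch below shows the general idea; the interval, file path, and helper names are illustrative placeholders, not Meta's actual checkpointing code.

```python
import torch

def maybe_checkpoint(step, model, optimizer, interval=500, path="ckpt.pt"):
    """Persist training state every `interval` steps so a failed job can
    resume from the latest checkpoint instead of restarting from scratch."""
    if step % interval == 0:
        torch.save(
            {
                "step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
            },
            path,
        )

def resume(model, optimizer, path="ckpt.pt"):
    """Restore the most recent training state after a restart and return
    the step to resume from."""
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```

The more often state is saved, the less work is lost on a failure, at the cost of checkpointing overhead during normal training.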
Finally, given the need to transfer vast amounts of data between GPUs in a synchronized fashion, slow network infrastructure or slow data exchange between GPUs can compound and slow down the entire process. Solving this problem requires a robust and high-speed network infrastructure, as well as efficient data transfer protocols and algorithms.
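The synchronized exchange in question is typically a collective operation such as an all-reduce that combines gradients across GPUs, which is why a single slow link can stall everyone. The minimal PyTorch sketch below illustrates the idea; it assumes torch.distributed has already been initialized (for example with the NCCL backend) and is not Meta's training stack.

```python
import torch
import torch.distributed as dist

def average_gradients(model):
    """All-reduce each gradient across ranks so every GPU applies the same
    update; every such exchange is a synchronization point, so slow links
    or slow GPUs hold up the whole job."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```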
Maintaining reliability
Failures also grow as more GPUs are deployed. The authors wrote: “The number of failures scales with the size of the cluster, and having a job that spans the cluster makes it necessary to keep adequate spare capacity to restart the job as soon as possible.”
To minimize downtime during hardware failures, the team had to plan for detection and remediation when the systems break down. Fortunately, it’s sometimes possible to take preventive measures to mitigate downtime.
The most frequent failure modes are GPUs not being detected by the server and failed networking cables, both of which occur most often in the “early life” of the server. Another common failure is uncorrectable errors in GPU memory; these are tracked carefully, and a return is initiated for any bad GPU whose error rate exceeds vendor-defined thresholds.
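Counts of uncorrectable ECC errors of the kind the authors mention can be read through NVIDIA's management library. The sketch below uses the pynvml bindings; the threshold is a made-up placeholder, since the vendor-defined limits aren't spelled out in the post.

```python
import pynvml

# Hypothetical threshold; the vendor-defined limits referenced in the post
# are not public, so this number is purely illustrative.
UNCORRECTABLE_ERROR_THRESHOLD = 10

pynvml.nvmlInit()
try:
    for index in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        # Lifetime count of uncorrectable ECC errors on this GPU.
        errors = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_AGGREGATE_ECC,
        )
        if errors > UNCORRECTABLE_ERROR_THRESHOLD:
            print(f"GPU {index}: {errors} uncorrectable ECC errors -> flag for return")
finally:
    pynvml.nvmlShutdown()
```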
The team apparently couldn’t decide between RDMA over Converged Ethernet (RoCE) and InfiniBand fabrics for the requisite high-speed networking. Instead, they built two clusters with 24,000 GPUs each, one utilizing RoCE and the other InfiniBand.
Both were used to train Llama 3, with the RoCE cluster used for training the largest model. According to the authors, both provided equivalent performance for training.
Next up: 100,000 GPUs and up
Moving ahead, Meta says it will be working with hundreds of thousands of GPUs and even larger volumes of data. The additional GPUs will bring longer distances and higher latencies, which means adopting new hardware technologies, new GPUs, and further evolving the infrastructure.
To put this statement into context, Meta’s Yann LeCun in May confirmed that Meta has purchased another 500,000 Nvidia GPUs, giving it an incredible arsenal of a million GPUs for training AI models.
“These challenges will push us to innovate and adapt in ways we can’t fully predict yet. But one thing is certain: We are only at the beginning of this journey. As we continue to navigate the evolving landscape of AI, we remain committed to pushing the boundaries of what’s possible,” they wrote.
Image credit: iStock/Derick Hudson