AI's New Frontier: Training Trillion-Parameter Models with Far Fewer GPUs

This article was originally published on AI Business.

Training a language model the size of OpenAI’s ChatGPT would normally require a sizable supercomputer. But scientists working on the world’s most powerful supercomputer have found techniques to train gigantic models using far less hardware.

In a new research paper, scientists from the famed Oak Ridge National Laboratory trained a one-trillion-parameter model using just a few thousand GPUs in their Frontier supercomputer, the most powerful non-distributed supercomputer in the world and one of only two exascale systems globally.

They used just 3,072 of the 37,888 AMD GPUs housed in Frontier to train the giant large language model. That means the researchers trained a model comparable to ChatGPT’s rumored size of a trillion parameters on just 8% of Frontier's computing power.

The Frontier team achieved this feat by using distributed training strategies to spread the model across the system's parallel architecture. Using techniques like sharded data parallelism to reduce communication between layers of nodes and tensor parallelism to handle memory constraints, the team was able to distribute the training of the model more efficiently.

The researchers also employed pipeline parallelism, which trains the model across various nodes in stages, to improve speed.
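
As a rough illustration of how these three forms of parallelism can fit together, the sketch below lays the GPUs out on a three-dimensional grid whose tensor, pipeline and data degrees multiply to the total GPU count. The degrees chosen here (8, 64 and 6) and the rank ordering are assumptions picked only because they multiply to 3,072; they are not taken from the paper.

```python
from typing import NamedTuple

# Hypothetical parallelism degrees (assumed for illustration only).
TENSOR_PARALLEL = 8     # GPUs that split each layer's matrices
PIPELINE_PARALLEL = 64  # consecutive stages, each holding a slice of layers
DATA_PARALLEL = 6       # replicas of the pipeline that see different data

WORLD_SIZE = TENSOR_PARALLEL * PIPELINE_PARALLEL * DATA_PARALLEL


class Coord(NamedTuple):
    data: int      # which replica this GPU belongs to
    pipeline: int  # which pipeline stage it runs
    tensor: int    # which shard of the layer it holds


def rank_to_coord(rank: int, tp: int, pp: int, dp: int) -> Coord:
    """Map a global GPU rank to grid coordinates.

    Tensor-parallel ranks are placed next to each other, then pipeline
    stages, then data-parallel replicas. This ordering is one common
    convention, assumed here rather than drawn from the paper.
    """
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank is outside the grid")
    tensor = rank % tp
    pipeline = (rank // tp) % pp
    data = rank // (tp * pp)
    return Coord(data=data, pipeline=pipeline, tensor=tensor)


first = rank_to_coord(0, TENSOR_PARALLEL, PIPELINE_PARALLEL, DATA_PARALLEL)
last = rank_to_coord(WORLD_SIZE - 1, TENSOR_PARALLEL, PIPELINE_PARALLEL, DATA_PARALLEL)
print(WORLD_SIZE, first, last)
```

Keeping tensor-parallel neighbors adjacent in rank order is a typical choice because that form of parallelism tends to exchange the most data per step.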

The results showed 100% weak scaling efficiency for models of 175 billion and 1 trillion parameters. The project also achieved strong scaling efficiencies of 89% and 87%, respectively, for these two models.
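
For readers unfamiliar with the two metrics, here is a minimal sketch of how weak and strong scaling efficiency are commonly defined. The GPU counts and step times in the example are hypothetical values chosen to land on 87%; they are not measurements from the paper, and the paper's exact measurement setup may differ.

```python
def weak_scaling_efficiency(base_time: float, scaled_time: float) -> float:
    """Work per GPU stays fixed while GPUs (and total work) grow.

    Ideal behavior is that the time per step does not change, so the
    efficiency is simply base time divided by scaled time.
    """
    return base_time / scaled_time


def strong_scaling_efficiency(
    base_gpus: int, base_time: float, gpus: int, time: float
) -> float:
    """Total work stays fixed while the GPU count grows.

    Ideal behavior is a speedup equal to the increase in GPUs, so the
    efficiency is the achieved speedup divided by that ideal speedup.
    """
    achieved_speedup = base_time / time
    ideal_speedup = gpus / base_gpus
    return achieved_speedup / ideal_speedup


# Hypothetical timings, for illustration only.
weak = weak_scaling_efficiency(base_time=100.0, scaled_time=100.0)
strong = strong_scaling_efficiency(
    base_gpus=1024, base_time=261.0, gpus=3072, time=100.0
)
print(f"weak: {weak:.0%}, strong: {strong:.0%}")
```

In this made-up case, tripling the GPUs gives a 2.61x speedup instead of the ideal 3x, which is what an 87% strong scaling efficiency means.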

A Trillion Parameters

Training a large language model with a trillion parameters is a challenging undertaking. The authors said the model alone requires a minimum of 14 terabytes of memory. By contrast, one MI250X GPU found in Frontier has only 64 gigabytes.
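
A quick back-of-the-envelope calculation shows why a model of this size has to be split across many devices. The sketch below uses only the two figures quoted above and assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes); it ignores activations and any other working memory, so the real requirement would be higher.

```python
# Figures quoted in the article; decimal units are an assumption.
MODEL_STATE_BYTES = 14 * 10**12  # "a minimum of 14 terabytes"
GPU_MEMORY_BYTES = 64 * 10**9    # 64 gigabytes per GPU
PARAMETERS = 10**12              # one trillion parameters


def min_gpus_to_hold(state_bytes: int, gpu_bytes: int) -> int:
    """Smallest number of GPUs whose combined memory fits the state."""
    return -(-state_bytes // gpu_bytes)  # integer ceiling division


min_gpus = min_gpus_to_hold(MODEL_STATE_BYTES, GPU_MEMORY_BYTES)
bytes_per_parameter = MODEL_STATE_BYTES / PARAMETERS

print(min_gpus)             # 219
print(bytes_per_parameter)  # 14.0
```

In other words, at least 219 such GPUs are needed just to hold the model's training state, before any computation takes place.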

Overcoming memory limits at this scale will require further development of methods like the ones the researchers explored.

However, one issue they faced was loss divergence due to large batch sizes. Their paper states that future research into bringing down training time for large-scale systems must see an improvement in large-batch training with smaller per-replica batch sizes.

The researchers also called for more work to be done around AMD GPUs. They wrote that most large-scale model training is done on platforms that support Nvidia solutions. While the researchers created what they called a blueprint for efficient training of LLMs on non-Nvidia platforms, they wrote: “There needs to be more work exploring efficient training performance on AMD GPUs.”

Frontier held onto its crown as the most powerful supercomputer in the most recent Top500 list, pipping the Intel-powered Aurora supercomputer.
