AI Training Strategies Tested on World's Fastest Supercomputer
by Clarence Oxford
Los Angeles CA (SPX) May 16, 2024

Researchers at Oak Ridge National Laboratory (ORNL) have investigated techniques for training very large AI models on the Frontier supercomputer.

The study, led by Sajal Dash, Feiyi Wang, and Prasanna Balaprakash, used Frontier, the world's first exascale supercomputer, for the initial stages of training a large language model. They tested how models with 22 billion, 175 billion, and 1 trillion parameters could run across 128, and later 384, of Frontier's more than 9,400 nodes. The team did not complete the training of a full model.

Large language models aim to mimic human brain patterns in learning and recognizing words and numbers, improving over time with more training. The goal is to create a model that can apply learned knowledge to new, unfamiliar tasks.

Traditionally, the resources needed for such training are held by private companies, limiting research opportunities and verification. Frontier's supercomputing power, however, offers new possibilities for training AI models more efficiently.

"Traditionally, this process has relied on expert knowledge or on trial and error," said Prasanna Balaprakash, ORNL's director of AI programs. "One of the highlights of our work in this study is the automation of identifying high-performing strategies among a vast array of options. We leveraged DeepHyper, an open-source scalable tuning software, to automatically determine the optimal settings. We plan to extend this automated approach to fine-tune system-level performance and enhance efficiency at an...

Training a large language model with a trillion parameters from start to finish without optimizations would take months, even at Frontier's speeds. The ORNL study examined data parallelism, which breaks a large training workload into smaller parts that are processed concurrently, as a way to train models faster and to transfer training across different GPU platforms.
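
In code, data parallelism is commonly expressed with a framework wrapper such as PyTorch's DistributedDataParallel. The sketch below shows that generic pattern, not the ORNL code: the model, loss, and data are placeholders, and it assumes being launched with one process per GPU (for example via torchrun). On Frontier's AMD GPUs, PyTorch's "nccl" backend name maps to RCCL under ROCm.

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # One process per GPU; the launcher supplies rank and world size.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()   # placeholder model
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step in range(10):
        x = torch.randn(8, 1024).cuda()   # each rank sees its own data shard
        loss = model(x).pow(2).mean()     # placeholder loss
        loss.backward()                   # DDP all-reduces gradients here
        optimizer.step()
        optimizer.zero_grad()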

"It's about finding the best combination of training strategies while getting the best throughput," Dash said. "Most deep-learning frameworks target the GPUs made by NVIDIA rather than the GPUs made by AMD that power Frontier. We wanted to see if existing models could run on Frontier, how to make the best use of Frontier's computing power and how to make that level of performance possible across GPU platforms.

"We can't train a model this size on a single GPU or a single node, for example, and every time we cross the barrier between nodes that requires more communication that consumes more time. How do we slice up the model across GPUs so that we can fit and train the model without losing too much time and energy communicating between nodes?"

The researchers found that a blend of parallelism strategies worked best when tailored to the computing platform, but said their work is far from finished.

"The efficiency we achieved on Frontier with this model was decent but not decent enough," Wang said. "At extreme scale, we achieved 30% efficiency - which means we left about 70% of Frontier's computing power on the floor. We need much more optimization to make the machine more efficient at this scale."

Next steps include training a model further with peer-reviewed scientific data across more nodes.

"This study and our findings aren't so much a manual as a potential set of guidelines for users training a large model," Dash said. "They can draw from our experience to decide how to use Frontier's resources to train their particular model and make the most effective use of their allotted computing time."

The study was presented at the International Supercomputing Conference High Performance 2024 in Hamburg, Germany. Collaborators included Isaac Lyngaas, Junqi Yin, Xiao Wang, and Guojing Cong of ORNL and Romain Egele of Paris-Saclay University.

The study focused on optimizing the use of GPUs for training AI, with each of Frontier's nodes relying on four AMD MI250X GPUs.

The training ran for a few hours on about 100 million tokens of test data, a small fraction of the data needed for a trillion-parameter model.
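
For a rough sense of scale, assuming the widely cited rule of thumb of about 20 training tokens per model parameter (an outside assumption, not a figure from the study), a trillion-parameter model would call for on the order of 20 trillion tokens:

    params = 1e12              # trillion-parameter model
    tokens_per_param = 20      # assumed rule of thumb, not from the study
    tokens_used = 100e6        # the test run's roughly 100 million tokens
    fraction = tokens_used / (params * tokens_per_param)
    print(f"fraction of needed data: {fraction:.2e}")   # 5.00e-06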

"This study was largely an exercise to show we can train this particular size of model on Frontier at this particular scale with this particular level of efficiency," Wang said. "We didn't get anywhere near the finish line of a complete large language model."

Research Report: Optimizing Distributed Training on Frontier for Large Language Models

Related Links
Oak Ridge National Laboratory
Innovative and Novel Computational Impact on Theory and Experiment Program
