The Carbon Cost of Training Large AI Models

Training large AI models demands enormous computational power, which translates into high energy consumption and significant carbon emissions. As model sizes grow, so does their environmental footprint, often amounting to tens or hundreds of metric tons of CO2 equivalent.

Recent studies have quantified the emissions from training models such as GPT-3 and BERT, revealing environmental costs comparable to the yearly emissions of entire households or hundreds of commercial flights. At the same time, AI holds considerable potential to support sustainability through smarter energy systems and more efficient resource management.

This dual impact makes it essential to understand the carbon cost of large-scale AI training, the factors influencing it, and the strategies being developed to reduce its footprint while promoting its positive applications.

Understanding the Lifecycle Emissions of AI Model Training

The carbon cost of training large AI models extends beyond the electricity consumed directly by Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs); it also includes the entire infrastructure that supports the process. A lifecycle measurement accounts for everything from the hardware itself to the facilities that keep it running.

The main sources of lifecycle emissions are the following:

  • Compute Hardware Operations: GPU- or TPU-powered training draws large amounts of electricity continuously, often for weeks or even months.
  • Cooling and Data Center Infrastructure: Keeping data center equipment at the required temperature adds a substantial overhead to the energy consumed (a quick illustration follows this list).
  • Electricity Generation Sources: Electricity generated from fossil fuels produces far greater carbon emissions than electricity from renewable sources.
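
Cooling and other facility overhead are usually captured by a metric called Power Usage Effectiveness (PUE): the ratio of total facility energy to the energy drawn by the IT equipment alone. Here is a minimal sketch of that arithmetic, using illustrative numbers rather than figures from any real facility:

```python
# Illustrative only: neither the IT energy nor the PUE below describes a real data center.
it_energy_mwh = 1_000   # electricity drawn by the accelerators and servers themselves
pue = 1.2               # total facility energy / IT energy; 1.0 would mean zero overhead

total_energy_mwh = it_energy_mwh * pue
overhead_mwh = total_energy_mwh - it_energy_mwh

print(f"Total facility energy: {total_energy_mwh:.0f} MWh")       # 1200 MWh
print(f"Cooling and other overhead: {overhead_mwh:.0f} MWh")      # 200 MWh
```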

The amount of greenhouse gas released is usually reported in carbon dioxide equivalents (CO2e). If the question ‘Is using AI bad for the environment?’ has crossed your mind, you can read our blog post to understand the facts about AI and environmental sustainability.

How Much Energy Is Required to Pretrain Large Language Models?

Pretraining large language models means pushing enormous amounts of data through deep neural networks on many compute cores over a prolonged period. The majority of a model’s total energy usage is attributable to this pretraining phase.

Advanced accelerators such as NVIDIA A100 GPUs and Google’s TPUs are more efficient than older hardware but still draw considerable power when deployed in large clusters. Pretraining typically runs for several weeks, and its energy use scales with the following factors (a rough back-of-the-envelope estimate follows the list):

  • Model size (number of parameters)
  • Batch size and sequence length
  • Training dataset size, often hundreds of billions of tokens
  • Repeated experiments and failed runs, which add energy on top of the final training run
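
As promised above, here is a rough estimate of how these factors translate into energy. Every cluster number below (GPU count, average power draw, run length, PUE) is a hypothetical placeholder, not a figure from any specific training run:

```python
# Back-of-the-envelope pretraining energy estimate with hypothetical cluster numbers.
num_gpus = 1024          # accelerators in the training cluster
avg_power_kw = 0.4       # average draw per accelerator (~400 W)
hours = 30 * 24          # a 30-day pretraining run
pue = 1.2                # facility overhead (cooling, power delivery)

energy_mwh = num_gpus * avg_power_kw * hours * pue / 1000
print(f"Estimated training energy: {energy_mwh:,.0f} MWh")   # ~354 MWh
```

Doubling the parameter count, the token count, or the number of failed experiments roughly doubles the accelerator hours, and the energy bill scales with it.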

Reported Carbon Emission Estimates from Notable Models

Determining the carbon footprint of individual AI models helps us understand the environmental costs of their training. Research and public statements by major companies have provided information about the carbon emissions produced by popular AI models.

GPT-3 (OpenAI): It is estimated that training GPT-3 required around 1,287 megawatt-hours (MWh) of electricity, emitting roughly 552 metric tons of carbon dioxide equivalent when calculated using the average carbon intensity of U.S. power generation. This is comparable to the yearly emissions of about 120 passenger cars or the per-passenger emissions of roughly 600 transatlantic flights.
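
As a quick sanity check, multiplying the reported energy by an assumed U.S.-average grid intensity of roughly 0.43 kg CO2e per kWh (an approximation; actual intensity varies by year and region) lands very close to the reported total:

```python
# Rough check: reported energy x assumed U.S.-average grid carbon intensity.
energy_kwh = 1_287_000        # ~1,287 MWh, as reported above
kg_co2e_per_kwh = 0.429       # assumed U.S. grid average; varies by year and region

emissions_tons = energy_kwh * kg_co2e_per_kwh / 1000
print(f"{emissions_tons:,.0f} metric tons CO2e")   # ~552
```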

GPT-4 (OpenAI): A peer-reviewed study estimated GPT-4’s training emissions at about 7,138 metric tons of CO2e, roughly twelve times higher than GPT-3’s, a footprint comparable to the annual emissions of 1,550 U.S. citizens.

BERT (Google): A 2019 study found that training BERT-base with hyperparameter tuning released approximately 1,400 pounds of CO2e into the atmosphere. Training larger variants such as BERT-large can emit over 4,000 pounds of CO2e, depending on the training parameters and hardware used.

PaLM (Google): Training PaLM, a 540-billion-parameter model, was estimated to require about 8.9 million GPU hours, with estimated emissions exceeding 1,000 metric tons of CO2e.

LLaMA 2 (Meta AI): Meta disclosed using 16,000 GPUs to train its LLaMA and Llama 2 models; the unadjusted CO2 emissions from training (before renewable energy matching or offsets) are likely several hundred metric tons.

The exact numbers depend on factors such as hardware efficiency, training duration, and the optimization strategies adopted. The trend, however, is consistent: larger models require more energy to train and therefore produce more CO2 emissions.

Did you know? Data centers, which are essential for training large AI models, consumed about 460 terawatt-hours (TWh) of electricity in 2022, roughly 2% of global electricity use, according to the International Energy Agency (IEA). The IEA projects that this could more than double to over 1,000 TWh by 2026.

Does the Location Where AI Models Are Trained Affect Their Emissions Intensity?

The geographic location where an AI model is trained largely determines the carbon intensity of its emissions, because the electricity powering the data center comes from different sources in different regions.

For instance, training a model using data centers powered by coal (such as in parts of Asia or the U.S. Midwest) produces higher carbon emissions than training in regions with cleaner energy mixes, such as Quebec (hydropower) or Norway (hydro and wind). 
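
To make the difference concrete, the sketch below runs the same hypothetical 1,000 MWh training job against rough, illustrative grid intensities; the per-region values are order-of-magnitude approximations, not official figures:

```python
# Same hypothetical training job on three illustrative grids (kg CO2e per kWh).
energy_kwh = 1_000_000      # a hypothetical 1,000 MWh training run
grid_intensity = {
    "coal-heavy grid": 0.80,
    "average U.S. grid": 0.40,
    "hydro-dominated grid (e.g. Quebec, Norway)": 0.02,
}

for region, kg_per_kwh in grid_intensity.items():
    tons = energy_kwh * kg_per_kwh / 1000
    print(f"{region}: ~{tons:,.0f} metric tons CO2e")
```

The same job can therefore emit tens of times more CO2e simply because of where it runs.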

Additionally, hyperscale cloud providers often distribute workloads across multiple regions, making it difficult to assess emissions unless detailed location data is disclosed. 

Interesting Fact: According to reported figures, Google's data center in Finland operates on about 97% carbon-free energy, whereas some of its data centers in parts of Asia operate on only 4–18% carbon-free energy. This underscores how strongly the local grid’s reliance on fossil fuels shapes a training run’s emissions.

Offset Strategies and Emissions Accounting by AI Companies

AI companies are now responding to growing concerns about their environmental footprint by implementing carbon offsets and sustainability initiatives. Nonetheless, there are wide differences in how much information companies disclose about their initiatives.

OpenAI has not disclosed emissions figures for GPT-4, though it states that it works to optimize the model’s deployment to reduce environmental impact. The organization has not published details of a formal emissions accounting or offset program.

Google DeepMind reports that its AI research runs in carbon-neutral data centers, supported by renewable energy certificates (RECs) and carbon offsets. Applying DeepMind's machine learning algorithms to data center cooling also cut the energy Google uses for cooling by up to 40%.

According to Meta’s model card, training the Llama 3.2 1B and 3B models used just over 581 MWh of energy, roughly half of what GPT-3’s training consumed. The training produced an estimated 240 metric tons of CO2e, but because almost all of the electricity was matched with renewable energy, Meta reports the process as effectively carbon neutral.

Microsoft has committed to becoming carbon negative by 2030 and to accounting for the carbon emissions of cloud-based AI training. It also aims to remove the equivalent of all of its historical carbon emissions from the atmosphere by 2050.

Despite these commitments and efforts, a significant obstacle remains: inconsistency in how emissions are measured and reported. Many organizations rely on grid averages or offsets, which can obscure the real emissions from their operations.

Emerging Research on Efficient Training and Green AI

Green AI means building and using AI in ways that consume less energy and produce fewer carbon emissions, focusing on efficiency rather than sheer scale.

The growing debate over AI’s environmental impact has pushed researchers to find training methods that deliver comparable results with far less energy. Read our blog to explore more about this.

Key areas of advancement include:

Sparse Models: According to research published on arXiv, sparse models such as Google’s Switch Transformer route each input token to only a small subset of the network (a few ‘expert’ sub-layers), leaving the rest inactive. This reduces the compute per token while maintaining output quality, allowing dramatically more energy-efficient training of extremely large models.
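
To show what routing to a subset of the network looks like in practice, here is a minimal, hypothetical sketch of Switch-Transformer-style top-1 routing in PyTorch. It is a simplified illustration (no load-balancing loss or expert capacity limits), not Google’s implementation:

```python
# Minimal top-1 mixture-of-experts layer: each token runs through exactly one expert.
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # scores each token against each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (num_tokens, d_model)
        gate = self.router(x).softmax(dim=-1)    # routing probabilities
        top_p, top_idx = gate.max(dim=-1)        # pick the single best expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i                  # tokens routed to expert i
            if mask.any():
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out

# 512 tokens routed across 8 experts: only 1/8 of the feed-forward weights run per token.
moe = Top1MoE(d_model=256, d_ff=1024, num_experts=8)
y = moe(torch.randn(512, 256))
```

Because only one expert’s weights are exercised per token, the parameter count can grow with the number of experts while the compute (and energy) per token stays roughly flat.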

Low-Precision Arithmetic: Studies have shown that using lower-precision floating-point operations, such as 16-bit or 8-bit formats, can significantly reduce memory requirements and increase computational efficiency. Specialized hardware, such as the Tensor Cores in NVIDIA GPUs and Google’s TPUs, is designed for these operations and can substantially increase training speed.
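
As a concrete illustration, here is a minimal mixed-precision training loop using PyTorch’s automatic mixed precision (AMP). The tiny model and random data are placeholders, and the loop assumes a CUDA-capable GPU is available:

```python
# Minimal mixed-precision training loop with PyTorch AMP (placeholder model and data).
import torch

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()     # rescales gradients to avoid float16 underflow

for step in range(10):                   # stand-in training loop with random data
    inputs = torch.randn(32, 1024, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    # Matrix multiplications run in float16 on Tensor Cores; sensitive ops stay in float32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()        # backward pass on the scaled loss
    scaler.step(optimizer)
    scaler.update()
```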

Efficient Architectures and Model Compression: Neural architecture search (NAS) techniques can discover models that match or outperform hand-designed networks while using fewer computational resources, and knowledge distillation compresses large models into smaller ones. For example, DistilBERT, a distilled version of BERT, retains about 97% of BERT’s language-understanding capability while using 40% fewer parameters and running inference roughly 60% faster.
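
Since DistilBERT comes from knowledge distillation, a short sketch of the distillation loss helps show how a smaller student learns from a larger teacher: it matches the teacher’s softened output distribution in addition to the ordinary hard-label loss. The temperature and weighting below are illustrative choices, not DistilBERT’s exact training recipe:

```python
# Knowledge-distillation loss: soft teacher targets plus standard hard-label loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL divergence between softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                          # rescale so gradients keep their usual magnitude
    # Hard-target term: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example with random logits for a batch of 4 examples and 10 classes.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```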

Do you care about sustainability in tech and everyday living? Choose greener options today. Discover our eco-friendly products for home, self-care, and pets. Make your contribution towards a greener planet today!

We hope you enjoyed this article. Please feel free to leave a comment below if you want to engage in the discussion.

If you want to read more like this, make sure to check out our Blog and follow us on Instagram. If you are interested in truly sustainable products, check out our Shop.

