Ehedrick
2026-05-06

How to Evaluate Extreme Weather Forecasts: Why Traditional Models Still Outperform AI

Step-by-step guide comparing AI and physics-based weather models for extreme events. Explains why traditional models outperform AI for record-breaking forecasts.

Introduction

Extreme weather events—record-breaking heatwaves, cold snaps, and storms—cause hundreds of billions of dollars in damages annually. Early warning systems save lives, but choosing the right forecasting model is critical. While artificial intelligence (AI) models have revolutionized weather prediction for routine forecasts, a recent study in Science Advances reveals that they still fall short for extreme, record-breaking events. This guide explains step by step how to assess and compare physics-based (traditional) and AI-based weather models for forecasting the most severe weather. By the end, you'll understand why traditional models remain essential and how to avoid over-relying on AI for extremes.

Source: www.carbonbrief.org

What You Need

  • A basic understanding of weather models – physics-based (numerical weather prediction) and AI (machine learning based on historical data).
  • Access to model outputs – both from a traditional physics-based model and an AI-based model for the same period.
  • Historical extreme event data – a dataset of record-breaking hot, cold, and windy events (e.g., from a reputable climate agency for years like 2018 and 2020).
  • Statistical analysis tools – software to compare frequency and intensity (e.g., R, Python with libraries like NumPy and Matplotlib).
  • Evaluation metrics – ability to measure forecast error for extremes (e.g., probability of detection, false alarm ratio, or intensity bias).

Step-by-Step Guide

Step 1: Understand the Two Model Types

Before comparing, know the fundamental difference:

  • Physics-based models use complex equations of atmospheric and oceanic processes rooted in physical laws. They simulate weather days or weeks ahead by solving these equations with supercomputers.
  • AI models learn patterns from massive historical weather datasets. They do not solve the governing physical equations; instead, they recognize statistical correlations and make predictions based on their training data.

This distinction explains why AI models can struggle with events that are rare or outside their training range. As study author Prof Sebastian Engelke notes, AI models are “relatively constrained to the range of this dataset.”

Step 2: Identify Record-Breaking Extreme Events

Select a set of record-breaking events from a chosen period (e.g., 2018 and 2020 as in the study). These should be extreme in temperature (hot and cold) and wind. Ensure the events are indeed record-breakers—exceeding previous maxima or minima in the historical record. This dataset will be your benchmark.
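One way to build this benchmark is to scan a station's historical series and flag every value that beats all earlier values. The sketch below is only an illustration: the function name `find_record_breakers` and the sample temperatures are hypothetical, and a real analysis would compare the evaluation years against the full observational record for each location.

```python
def find_record_breakers(values, kind="max"):
    """Return indices where a value beats every earlier value in the series.

    kind="max" flags new highs (heat, wind); kind="min" flags new lows (cold).
    The first value only sets the baseline and is never flagged.
    """
    if kind not in ("max", "min"):
        raise ValueError("kind must be 'max' or 'min'")
    records = []
    if not values:
        return records
    best = values[0]
    for i, value in enumerate(values[1:], start=1):
        is_record = value > best if kind == "max" else value < best
        if is_record:
            records.append(i)
            best = value  # the new record becomes the bar to beat
    return records


# Hypothetical annual maximum temperatures (deg C) at a single station.
annual_max = [35.1, 36.0, 35.4, 36.0, 37.2, 36.8, 38.5]
print(find_record_breakers(annual_max))  # [1, 4, 6]
```

Ties are deliberately not counted as records here (the second 36.0 is skipped); whether a tie should count is a choice to make and document for your own dataset.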

Step 3: Obtain Forecasts from Both Model Types

Run or retrieve forecasts from:

  • A state-of-the-art physics-based model (e.g., ECMWF IFS).
  • A top-performing AI model (e.g., Google's GraphCast, Huawei's Pangu-Weather, or Nvidia's FourCastNet).

Ensure the forecasts cover the exact same time window as the identified extreme events. Record the predicted values for temperature and wind speed at relevant locations.
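Aligning the three data sources can be as simple as keeping only the location and time pairs that appear in all of them. The following is a minimal sketch that assumes each source has already been reduced to a dictionary keyed by `(location, valid_time)`; the helper name `align_forecasts`, the station labels, and the values are all hypothetical.

```python
def align_forecasts(observed, physics, ai):
    """Keep only the (location, valid_time) keys present in all three sources."""
    common = sorted(set(observed) & set(physics) & set(ai))
    return [(key, observed[key], physics[key], ai[key]) for key in common]


# Hypothetical 2 m temperatures (deg C), keyed by (station, valid date).
observed = {("station_a", "2018-07-26"): 38.6, ("station_b", "2020-08-11"): 40.9}
physics = {("station_a", "2018-07-26"): 38.4, ("station_b", "2020-08-11"): 40.5}
ai = {("station_a", "2018-07-26"): 37.8}  # no AI forecast for station_b

for key, obs, phys, ai_value in align_forecasts(observed, physics, ai):
    print(key, obs, phys, ai_value)
```

Only station_a survives the alignment, so neither model is scored on an event the other never forecast.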

Step 4: Compare Frequency of Predicted Extremes

Count how many of the record-breaking events each model correctly forecasts. For any given threshold (e.g., temperature above the previous record), determine:

  • How many events does the physics model predict?
  • How many does the AI model predict?

The study found that AI models underestimate the frequency of record-breaking events—they miss many that physics models capture. If your analysis shows AI missing a larger proportion, that’s a red flag.
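The frequency comparison can be summarized with the probability of detection and false alarm ratio mentioned under "What You Need". The sketch below treats "forecast exceeds the previous record" as the event definition, which is an assumption made for illustration rather than the study's exact procedure. All numbers are hypothetical, and the comparison as written applies to high-side records (heat, wind).

```python
def contingency_counts(observed, forecast, previous_record):
    """Count hits, misses and false alarms for exceeding the previous record."""
    hits = misses = false_alarms = 0
    for obs, fc, record in zip(observed, forecast, previous_record):
        observed_event = obs > record
        forecast_event = fc > record
        if observed_event and forecast_event:
            hits += 1
        elif observed_event:
            misses += 1
        elif forecast_event:
            false_alarms += 1
    return hits, misses, false_alarms


def probability_of_detection(hits, misses):
    """Fraction of observed record-breakers that the model also forecast."""
    total = hits + misses
    return hits / total if total else float("nan")


def false_alarm_ratio(hits, false_alarms):
    """Fraction of forecast record-breakers that did not occur."""
    total = hits + false_alarms
    return false_alarms / total if total else float("nan")


# Hypothetical values (deg C) for four location/date pairs.
previous_record = [38.0, 40.0, 35.0, 41.0]
observed = [38.6, 40.9, 34.0, 41.5]
physics = [38.4, 40.5, 35.2, 40.8]
ai = [37.8, 40.2, 34.5, 40.6]

for name, forecast in (("physics", physics), ("ai", ai)):
    hits, misses, false_alarms = contingency_counts(observed, forecast, previous_record)
    print(name, hits, misses, false_alarms,
          probability_of_detection(hits, misses),
          false_alarm_ratio(hits, false_alarms))
```

In this made-up sample the physics model detects two of the three observed records and the AI model one of three. Reporting the false alarm ratio alongside the detection rate guards against rewarding a model that simply forecasts records everywhere.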

Step 5: Compare Intensity of Predicted Extremes

For the events that both models do predict, compare the magnitude. Compute the difference between the forecast value and the actual observed value for each extreme event. Physics-based models tend to produce more accurate intensities (closer to the record). AI models often underestimate the severity: their forecasts are too low for heat and wind records and too mild for cold records.
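The intensity comparison reduces to two numbers per model: the mean error (bias) and the mean absolute error. This is a minimal sketch using hypothetical values; the function name `intensity_errors` is illustrative and not taken from the study.

```python
from statistics import mean


def intensity_errors(observed, forecast):
    """Return mean error (bias) and mean absolute error, forecast minus observed."""
    errors = [fc - obs for obs, fc in zip(observed, forecast)]
    return {"bias": mean(errors), "mae": mean(abs(e) for e in errors)}


# Hypothetical record-breaking maximum temperatures (deg C).
observed = [40.0, 42.0, 39.0]
physics = [39.5, 42.5, 38.5]
ai = [38.0, 40.0, 38.0]

print("physics", intensity_errors(observed, physics))
print("ai", intensity_errors(observed, ai))
```

For heat and wind records, a negative bias means the model underestimated severity. For cold records the sign flips: a positive bias means the forecast was not cold enough.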

Step 6: Evaluate Dependence on Training Data

AI models are only as good as the data they learn from. If an extreme event is unprecedented (nothing like it appears in the training set), the model has little basis for extrapolating to it. Check the historical period covered by the AI model's training data. If it covers only recent decades, it may lack examples of rare extremes. Physics models, grounded in physical laws, can simulate novel combinations of conditions.
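A quick first check is whether the observed event falls outside the range of values in the training period at all. The sketch below assumes you have the training-period observations for the same location and variable; the function name and numbers are hypothetical, and a single-variable range check is only a rough proxy for how novel an event is.

```python
def outside_training_range(event_value, training_values):
    """Return how far the event lies beyond the training range (0.0 if inside).

    Positive: above the training maximum. Negative: below the training minimum.
    """
    low, high = min(training_values), max(training_values)
    if event_value > high:
        return event_value - high
    if event_value < low:
        return event_value - low
    return 0.0


# Hypothetical training-period daily maxima (deg C) at one location.
training_maxima = [28.0, 31.5, 35.2, 39.9]
print(outside_training_range(41.3, training_maxima))  # positive: beyond the training maximum
print(outside_training_range(30.0, training_maxima))  # 0.0: within the training range
```

Events with a non-zero result are the ones where the AI forecast deserves the closest scrutiny.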

Step 7: Assess Overall Performance for Routine vs. Extreme Forecasts

While AI may excel for everyday weather (e.g., temperature, precipitation patterns), this guide focuses on extremes. Create a performance matrix:

  • For normal weather: AI often outperforms physics models.
  • For extreme events: Physics models are more reliable.

The study labels this AI weakness a “warning shot” against replacing traditional models too quickly. In your analysis, weight the importance of extreme events based on your application (public safety, infrastructure planning).
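The weighting can be made explicit by combining each model's routine and extreme-event error into one score. The sketch below is one simple way to do that, not a method from the study. The error values in `matrix` are invented for illustration, and `extreme_weight` is a hypothetical parameter you would set from your own application's priorities.

```python
def weighted_error(routine_error, extreme_error, extreme_weight):
    """Blend routine and extreme-event error; extreme_weight must lie in [0, 1]."""
    if not 0.0 <= extreme_weight <= 1.0:
        raise ValueError("extreme_weight must be between 0 and 1")
    return (1.0 - extreme_weight) * routine_error + extreme_weight * extreme_error


def best_model(matrix, extreme_weight):
    """Return the model name with the lowest weighted error."""
    return min(
        matrix,
        key=lambda name: weighted_error(
            matrix[name]["routine"], matrix[name]["extreme"], extreme_weight
        ),
    )


# Hypothetical mean absolute errors (deg C); not results from the study.
matrix = {
    "physics": {"routine": 1.2, "extreme": 1.5},
    "ai": {"routine": 1.0, "extreme": 3.0},
}

print(best_model(matrix, extreme_weight=0.1))  # ai
print(best_model(matrix, extreme_weight=0.5))  # physics
```

With these made-up numbers, the preferred model flips once extremes carry enough weight, which is the trade-off the performance matrix is meant to expose.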

Step 8: Consider a Hybrid Approach

Given the complementary strengths, use both models together. For example, run an AI model for rapid, low-cost forecasts under typical conditions, but switch to physics-based for extreme event warnings. Some operational centers are already blending outputs. Test a hybrid ensemble and see if it improves prediction of both frequency and intensity of record-breakers.
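A simple switching rule is one way to prototype the hybrid idea: use the AI forecast by default, and fall back to the physics-based forecast whenever either model comes close to the previous record. This sketch is an assumption made for illustration, not how any operational center blends outputs. The `margin` parameter and the function name are hypothetical, and the rule as written handles high-side records only.

```python
def hybrid_forecast(ai_value, physics_value, previous_record, margin=2.0):
    """Pick the physics forecast when either model is within `margin` of the record."""
    near_record = max(ai_value, physics_value) >= previous_record - margin
    if near_record:
        return physics_value, "physics"
    return ai_value, "ai"


# Hypothetical maximum temperature forecasts (deg C); previous record is 40.0.
print(hybrid_forecast(30.0, 30.5, previous_record=40.0))  # (30.0, 'ai')
print(hybrid_forecast(38.5, 40.6, previous_record=40.0))  # (40.6, 'physics')
```

Checking both forecasts against the margin matters: if the switch depended on the AI value alone, an AI forecast that underestimates the event might never trigger the fallback.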

Tips for Success

  • Always validate against historical extremes – don’t trust AI performance on rare events without testing. Use a separate holdout dataset of record-breakers not seen in training.
  • Monitor model updates – AI models improve rapidly; repeat this evaluation every year or after major releases.
  • Pair with domain experts – meteorologists can interpret why a physics model might beat AI for a particular event (e.g., blocking patterns).
  • Document biases – note if AI consistently underestimates cold snaps more than heatwaves, and adjust warning thresholds accordingly.
  • Do not abandon physics models – as the study shows, they remain essential for record-breaking extremes. Even as AI advances, maintaining both capabilities is prudent.

Remember: The goal is not to declare AI useless for extreme weather, but to use it wisely. By following these steps, you can make informed decisions about which model to trust when lives and property are at stake.