surge-fm-v3: Chronos-2 fine-tuned for every US EIA-930 balancing authority
Full fine-tune of amazon/chronos-2 on 7 years (2019–2025) of hourly load data across 53 balancing authorities: every BA that publishes a demand series to EIA-930, spanning the Eastern, Western, and Texas interconnections. Temperature (NOAA ASOS) and US-calendar features are passed as covariates.
This is the breadth generalist. For the sharper 7-RTO specialist, see surge-fm-v2.
Results on 2025 hold-out
7 RTO/ISOs (same benchmark as v1/v2)
| Model | Test MASE | vs seasonal-naive-24 |
|---|---|---|
| seasonal-naive-24 (baseline) | 1.044 | n/a |
| XGBoost hourly-binned (Roy '25) | 0.901 | −14% |
| N-BEATS (Pelekis '23) | 0.714 | −32% |
| Chronos-Bolt zero-shot | 0.688 | −34% |
| Chronos-2 zero-shot + covariates | 0.567 | −46% |
| surge-fm-v2 (7-BA specialist) | 0.492 | −53% |
| surge-fm-v3 (53-BA generalist), RTO subset | 0.518 | −50% |
Per-RTO day-ahead MAPE on the 2025 hold-out: PJM 1.22%, MISO 1.97%, ERCO 2.50%, CISO 2.60%, SWPP 2.81%, NYIS 2.90%, ISNE 4.37%. Mean 2.62%.
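The last column of the table above is consistent with a simple relative change against the seasonal-naive-24 MASE of 1.044, and the quoted mean MAPE is the unweighted mean of the seven per-RTO values. A minimal sketch that recomputes both from the published numbers (the name `pct_vs_baseline` is illustrative and is not taken from the Surge repo):

```python
BASELINE_MASE = 1.044  # seasonal-naive-24, copied from the table above


def pct_vs_baseline(mase: float, baseline: float = BASELINE_MASE) -> float:
    """Relative MASE change against the baseline, in percent (negative = better)."""
    return (mase / baseline - 1.0) * 100.0


# Per-RTO day-ahead MAPE in percent, copied from the text above.
MAPE_BY_RTO = {
    "PJM": 1.22, "MISO": 1.97, "ERCO": 2.50, "CISO": 2.60,
    "SWPP": 2.81, "NYIS": 2.90, "ISNE": 4.37,
}
mean_mape = sum(MAPE_BY_RTO.values()) / len(MAPE_BY_RTO)

for name, mase in [("surge-fm-v2", 0.492), ("surge-fm-v3 (RTO subset)", 0.518)]:
    print(f"{name}: {pct_vs_baseline(mase):+.0f}%")
print(f"mean MAPE: {mean_mape:.2f}%")
```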
All 53 BAs
| Slice | Test MASE | n BAs | Macro MAE (MW) |
|---|---|---|---|
| All demand-reporting BAs | 0.636 | 53 | 272 |
| 7 RTO/ISOs | 0.518 | 7 | 889 |
| 46 non-RTO utilities | 0.654 | 46 | 178 |
All numbers: 2025 hold-out, rolling 24h-ahead windows at step=24, MASE
denominator = per-BA train-set seasonal-naive (m=24). Per-BA metrics are
shipped alongside the weights in eval_2025_test.json.
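For readers who want to check the metric definition, here is a minimal sketch of MASE with a train-set seasonal-naive denominator (m=24) and of the non-overlapping 24h rolling origins described above. It follows the standard MASE definition; the function names are illustrative, and the repo's evaluation code may aggregate across windows differently.

```python
from typing import List, Sequence


def seasonal_naive_scale(train: Sequence[float], m: int = 24) -> float:
    """In-sample MAE of the seasonal-naive forecast y[t-m] over the training series."""
    if len(train) <= m:
        raise ValueError("training series must be longer than the season length m")
    errors = [abs(train[t] - train[t - m]) for t in range(m, len(train))]
    return sum(errors) / len(errors)


def mase(actual: Sequence[float], forecast: Sequence[float],
         train: Sequence[float], m: int = 24) -> float:
    """Forecast MAE divided by the train-set seasonal-naive MAE."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / seasonal_naive_scale(train, m)


def rolling_origins(n_test_hours: int, horizon: int = 24, step: int = 24) -> List[int]:
    """Start indices of forecast windows that fit entirely inside the test period."""
    return list(range(0, n_test_hours - horizon + 1, step))
```

With `horizon=24` and `step=24`, a test year of 8,760 hours yields 365 non-overlapping windows. A MASE below 1 means the forecast error is smaller than the seasonal-naive error measured on that BA's training data.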
vs. the operators' own forecasts (EIA day-ahead demand forecast)
Every RTO submits a day-ahead load forecast to EIA each morning; this is the production forecast used to commit generation. Scored on the same 2025 window, same 24h horizon, 8,760 hours per BA:
| Region | surge-fm-v3 MAE | Operator (EIA DF) MAE | Ratio | surge MASE | Operator MASE |
|---|---|---|---|---|---|
| PJM | 1,937 MW | 3,297 MW | 1.70× | 0.40 | 0.68 |
| CAISO | 652 MW | 2,098 MW | 3.22× | 0.51 | 1.66 |
| ERCOT | 1,215 MW | 1,366 MW | 1.12× | 0.50 | 0.56 |
| MISO | 1,450 MW | 1,786 MW | 1.23× | 0.45 | 0.55 |
| NYISO | 501 MW | 560 MW | 1.12× | 0.53 | 0.60 |
| ISO-NE | 577 MW | 306 MW | 0.53× | 0.63 | 0.34 |
| SPP | 896 MW | 2,590 MW | 2.89× | 0.61 | 1.77 |
| macro | 1,032 MW | 1,715 MW | 1.66× | 0.52 | 0.88 |
surge-fm-v3 beats the operators on 6 of 7 RTOs. ISO-NE's own forecasting team is the sole exception (MASE 0.34 is elite). CAISO and SPP operator submissions both land above MASE 1.0 on 2025 (worse than a same-as-yesterday baseline), so the macro gap partly reflects operator pipeline variance rather than model strength alone.
Reproduce the operator numbers: the comparison script lives in the repo
at scripts/compare_eia_df.py.
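The Ratio column reads as operator MAE divided by surge-fm-v3 MAE, so values above 1 favor surge-fm-v3. The sketch below only recomputes that column from the published table; it does not fetch EIA data and is not the repo's scripts/compare_eia_df.py. All names in it are illustrative.

```python
# (surge-fm-v3 MAE, operator MAE) in MW, copied from the table above.
MAE_MW = {
    "PJM": (1937, 3297),
    "CAISO": (652, 2098),
    "ERCOT": (1215, 1366),
    "MISO": (1450, 1786),
    "NYISO": (501, 560),
    "ISO-NE": (577, 306),
    "SPP": (896, 2590),
}


def mae_ratio(surge_mae: float, operator_mae: float) -> float:
    """Operator MAE relative to surge-fm-v3 MAE; above 1 means surge-fm-v3 is more accurate."""
    return operator_mae / surge_mae


ratios = {region: mae_ratio(s, o) for region, (s, o) in MAE_MW.items()}
wins = [region for region, r in ratios.items() if r > 1.0]

# Macro row: unweighted mean of each MAE column, then the ratio of those means.
macro_surge = sum(s for s, _ in MAE_MW.values()) / len(MAE_MW)
macro_operator = sum(o for _, o in MAE_MW.values()) / len(MAE_MW)
macro_ratio = macro_operator / macro_surge

for region, r in ratios.items():
    print(f"{region}: {r:.2f}x")
print(f"macro: {macro_ratio:.2f}x, wins: {len(wins)} of {len(MAE_MW)}")
```

On these figures the macro ratio of 1.66 matches the ratio of the two macro MAEs, not the mean of the per-region ratios (about 1.69).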
Use
import torch
from chronos import BaseChronosPipeline
pipe = BaseChronosPipeline.from_pretrained(
    "Tylerbry1/surge-fm-v3",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)
See github.com/tylergibbs1/surge
for the full feature-construction recipe (src/surge/api/forecaster.py),
a FastAPI inference service, and the Next.js playground that renders
forecasts over a 53-BA US map.
Training setup
- Backbone: amazon/chronos-2 (119 M params)
- Mode: full fine-tune, LR 5e-6 (cosine), 3 000 steps, batch 16, bf16, seed 42
- Context: 2 048 hours · horizon: 24 hours
- Loss: quantile regression at the full Chronos-2 21-level quantile grid
- Covariates: 2-m temperature (ASOS), hour-of-day sin/cos, day-of-week sin/cos, is_weekend, is_holiday (US federal)
- Trained in ~7 minutes on a single NVIDIA H100 80 GB
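The calendar covariates listed above can be sketched with the standard library alone. This is an illustration, not the repo's recipe: the feature names, the choice of time zone, and the caller-supplied holiday set are all assumptions, and src/surge/api/forecaster.py remains the authoritative construction.

```python
import math
from datetime import date, datetime
from typing import Dict, Set


def calendar_features(ts: datetime, holidays: Set[date]) -> Dict[str, float]:
    """Cyclical hour/day-of-week encodings plus weekend and holiday flags.

    `holidays` is a caller-supplied set of US federal holiday dates (assumed input).
    """
    hour_angle = 2.0 * math.pi * ts.hour / 24.0
    dow_angle = 2.0 * math.pi * ts.weekday() / 7.0  # Monday = 0
    return {
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "dow_sin": math.sin(dow_angle),
        "dow_cos": math.cos(dow_angle),
        "is_weekend": float(ts.weekday() >= 5),
        "is_holiday": float(ts.date() in holidays),
    }


# Example: 06:00 on Independence Day 2025 (a Friday).
features = calendar_features(datetime(2025, 7, 4, 6), {date(2025, 7, 4)})
```

The sin/cos pairs keep hour 23 adjacent to hour 0 and Sunday adjacent to Monday, which a raw integer encoding would not.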
Data splits
| Split | Period | Rows per BA |
|---|---|---|
| Train | 2019-01 to 2023-12 | ~43 800 |
| Val | 2024 | 8 808 |
| Test (reported above) | 2025 + early 2026 | ~11 300 |
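
A minimal sketch of the chronological split in the table above, assuming the boundaries fall on calendar-year edges (the table gives periods, not exact cut-off timestamps or time zones). The function name `split_for` is illustrative.

```python
from datetime import datetime


def split_for(ts: datetime) -> str:
    """Assign an hourly timestamp to train/val/test by calendar year."""
    if ts.year < 2019:
        raise ValueError("timestamp precedes the first training year (2019)")
    if ts.year <= 2023:
        return "train"
    if ts.year == 2024:
        return "val"
    return "test"  # 2025 and early 2026


print(split_for(datetime(2023, 12, 31, 23)))
print(split_for(datetime(2024, 1, 1, 0)))
print(split_for(datetime(2025, 1, 1, 0)))
```

Splitting by time instead of at random keeps every validation and test hour strictly after the training data, so no future load leaks into training.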
Covered balancing authorities (53)
7 RTO/ISOs: PJM, CISO, ERCO, MISO, NYIS, ISNE, SWPP.
46 non-RTO utilities, federal power admins, and PUDs: SOCO, TVA, DUK,
CPLE, CPLW, LGEE, AECI, SCEG, SC, FPL, FPC, TEC, FMPP, SEC, JEA, TAL, GVL,
HST, BPAT, PACE, PACW, AZPS, SRP, PSCO, NEVP, LDWP, PGE, IPCO, PSEI, SCL,
TPWR, AVA, NWMT, PNM, EPE, TEPC, WACM, WALC, WAUW, BANC, TIDC, IID, DOPD,
CHPD, GCPD, SPA. See src/surge/bas.py in the Surge repo for the full
registry (including the 14 gen-/transmission-only BAs that EIA-930 lists
but which don't publish a demand series).
Limitations
- Not bankable. For research and reference use only. Do not use for regulated bidding, financial settlement, or any decision where a legally-attested forecast is required.
- Heterogeneous weather signal. The 7 original RTO BAs were trained with real ASOS temperature; the 46 new BAs currently use a zero-filled fallback pending an Iowa Mesonet backfill (blocked on their rate-limit cooldown at release time). Expect the non-RTO MASE to improve once the backfill lands and the model is retrained.
- Univariate target. The model forecasts load only; other covariates are passed through but not predicted.
- Weather ideality. Evaluation used ground-truth ASOS temperature as the "future covariate", which is the upper bound. A production pipeline with HRRR/GFS forecast temperature is expected to degrade by 10–20% in MAPE terms.
- Out-of-sample years. Metrics reported on a 2025 hold-out. Novel regimes (extreme heat/cold, large fleet additions, load-growth shifts) can erode accuracy without warning.
License
MIT. Base model (Chronos-2) is Apache 2.0.
Cite
@software{surge_fm_v3,
title = {surge-fm-v3: Chronos-2 fine-tuned for 53-BA US grid load forecasting},
author = {Surge contributors},
year = {2026},
url = {https://github.com/tylergibbs1/surge}
}