surge-fm-v3: Chronos-2 fine-tuned for every US EIA-930 balancing authority

Full fine-tune of amazon/chronos-2 on 7 years (2019–2025) of hourly load data across 53 balancing authorities: every BA that publishes a demand series to EIA-930, spanning the Eastern, Western, and Texas interconnections. Temperature (NOAA ASOS) and US-calendar features are passed as covariates.

This is the breadth generalist. For the sharper 7-RTO specialist, see surge-fm-v2.

Results on 2025 hold-out

7 RTO/ISOs (same benchmark as v1/v2)

| Model | Test MASE | vs seasonal-naive-24 |
|---|---|---|
| seasonal-naive-24 (baseline) | 1.044 | n/a |
| XGBoost hourly-binned (Roy '25) | 0.901 | −14% |
| N-BEATS (Pelekis '23) | 0.714 | −32% |
| Chronos-Bolt zero-shot | 0.688 | −34% |
| Chronos-2 zero-shot + covariates | 0.567 | −46% |
| surge-fm-v2 (7-BA specialist) | 0.492 | −53% |
| surge-fm-v3 (53-BA generalist), RTO subset | 0.518 | −50% |
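
The last column of the table above is the relative change in test MASE against the seasonal-naive-24 baseline. A minimal sketch of that arithmetic follows. The values are copied from the table, and the function name is illustrative rather than taken from the Surge repo.

```python
# Test MASE values copied from the table above.
BASELINE_MASE = 1.044  # seasonal-naive-24

test_mase = {
    "XGBoost hourly-binned": 0.901,
    "N-BEATS": 0.714,
    "Chronos-Bolt zero-shot": 0.688,
    "Chronos-2 zero-shot + covariates": 0.567,
    "surge-fm-v2 (7-BA specialist)": 0.492,
    "surge-fm-v3 (RTO subset)": 0.518,
}


def relative_change_pct(mase: float, baseline: float = BASELINE_MASE) -> float:
    """Percent change in MASE relative to the baseline (negative means better)."""
    return 100.0 * (mase - baseline) / baseline


for name, mase in test_mase.items():
    print(f"{name:35s} {relative_change_pct(mase):+.0f}%")
```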

Per-RTO day-ahead MAPE on the 2025 hold-out: PJM 1.22%, MISO 1.97%, ERCO 2.50%, CISO 2.60%, SWPP 2.81%, NYIS 2.90%, ISNE 4.37%. Mean 2.62%.

All 53 BAs

| Slice | Test MASE | n BAs | Macro MAE (MW) |
|---|---|---|---|
| All demand-reporting BAs | 0.636 | 53 | 272 |
| 7 RTO/ISOs | 0.518 | 7 | 889 |
| 46 non-RTO utilities | 0.654 | 46 | 178 |

All numbers: 2025 hold-out, rolling 24h-ahead windows at step=24, MASE denominator = per-BA train-set seasonal-naive (m=24). Per-BA metrics are shipped alongside the weights in eval_2025_test.json.
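
The protocol above can be sketched in a few lines of NumPy. This is an illustration of the stated definition (train-set seasonal-naive scale with m=24, non-overlapping 24h windows), not the repo's evaluation code. The function names and the synthetic series are assumptions.

```python
import numpy as np

SEASON = 24   # m=24: compare against the same hour on the previous day
HORIZON = 24  # hours forecast per window
STEP = 24     # windows advance one day at a time


def seasonal_naive_scale(train: np.ndarray, m: int = SEASON) -> float:
    """In-sample MAE of the seasonal-naive forecast y[t-m] on the training series."""
    return float(np.mean(np.abs(train[m:] - train[:-m])))


def rolling_mase(y_test, forecasts, scale, horizon=HORIZON, step=STEP) -> float:
    """MAE over rolling windows divided by the train-set scale.

    forecasts[i] is the `horizon`-step forecast for the window starting at i * step.
    """
    abs_errors = []
    for i, fc in enumerate(forecasts):
        start = i * step
        actual = y_test[start:start + horizon]
        abs_errors.append(np.abs(actual - np.asarray(fc)))
    return float(np.mean(np.concatenate(abs_errors)) / scale)


# Synthetic load: daily cycle plus a linear trend (not real EIA-930 data).
hours = np.arange(24 * 10)
load = 1000.0 + 100.0 * np.sin(2 * np.pi * hours / 24) + 0.5 * hours
split = 24 * 7
train, test = load[:split], load[split:]

scale = seasonal_naive_scale(train)
n_windows = len(test) // STEP
# Seasonal-naive forecast for each test day: the previous day's actual values.
naive_forecasts = [
    load[split + i * STEP - SEASON: split + i * STEP - SEASON + HORIZON]
    for i in range(n_windows)
]
naive_mase = rolling_mase(test, naive_forecasts, scale)
perfect_mase = rolling_mase(test, [test[i * STEP:(i + 1) * STEP] for i in range(n_windows)], scale)
print(f"scale={scale:.2f} MW, seasonal-naive MASE={naive_mase:.3f}, perfect MASE={perfect_mase:.3f}")
```

Because the scale comes from the training period, the seasonal-naive baseline need not score exactly 1.0 on the test year. That is consistent with the 1.044 reported for it in the benchmark table.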

vs. the operators' own forecasts (EIA day-ahead demand forecast)

Every RTO submits a day-ahead load forecast to EIA each morning; this is the production forecast used to commit generation. Scored on the same 2025 window, same 24h horizon, 8,760 hours per BA:

| Region | surge-fm-v3 MAE | Operator (EIA DF) MAE | Ratio (operator / surge) | surge MASE | Operator MASE |
|---|---|---|---|---|---|
| PJM | 1,937 MW | 3,297 MW | 1.70× | 0.40 | 0.68 |
| CAISO | 652 MW | 2,098 MW | 3.22× | 0.51 | 1.66 |
| ERCOT | 1,215 MW | 1,366 MW | 1.12× | 0.50 | 0.56 |
| MISO | 1,450 MW | 1,786 MW | 1.23× | 0.45 | 0.55 |
| NYISO | 501 MW | 560 MW | 1.12× | 0.53 | 0.60 |
| ISO-NE | 577 MW | 306 MW | 0.53× | 0.63 | 0.34 |
| SPP | 896 MW | 2,590 MW | 2.89× | 0.61 | 1.77 |
| macro | 1,032 MW | 1,715 MW | 1.66× | 0.52 | 0.88 |

surge-fm-v3 beats the operators on 6 of 7 RTOs. ISO-NE's own forecasting team is the sole exception (an operator MASE of 0.34 is exceptionally strong). CAISO and SPP operator submissions both land above MASE 1.0 on 2025, i.e. worse than a same-as-yesterday baseline, so the macro gap partly reflects operator pipeline variance rather than model strength alone.
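
The ratio column and the macro row can be re-derived from the per-region MAE values. The sketch below assumes the ratio is operator MAE divided by surge-fm-v3 MAE and that the macro row is an unweighted mean across the 7 regions. Both assumptions reproduce the published figures to within rounding.

```python
# MAE in MW copied from the table above: (surge-fm-v3, operator day-ahead forecast).
mae_mw = {
    "PJM": (1937, 3297),
    "CAISO": (652, 2098),
    "ERCOT": (1215, 1366),
    "MISO": (1450, 1786),
    "NYISO": (501, 560),
    "ISO-NE": (577, 306),
    "SPP": (896, 2590),
}

# Ratio > 1 means the operator forecast has the larger error.
ratios = {region: op / surge for region, (surge, op) in mae_mw.items()}
wins = sum(ratio > 1.0 for ratio in ratios.values())

# Unweighted (macro) averages across regions.
macro_surge = sum(surge for surge, _ in mae_mw.values()) / len(mae_mw)
macro_operator = sum(op for _, op in mae_mw.values()) / len(mae_mw)
macro_ratio = macro_operator / macro_surge

for region, ratio in ratios.items():
    print(f"{region:7s} {ratio:.2f}x")
print(f"surge-fm-v3 ahead on {wins} of {len(mae_mw)} regions; macro ratio {macro_ratio:.2f}x")
```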

Reproduce the operator numbers: the comparison script lives in the repo at scripts/compare_eia_df.py.

Use

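import torch
from chronos import BaseChronosPipeline

# Load the fine-tuned weights from the Hub; use the GPU when one is available.
pipe = BaseChronosPipeline.from_pretrained(
    "Tylerbry1/surge-fm-v3",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,  # bf16, matching the training precision
)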
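
Once the pipeline is loaded, a day-ahead forecast needs a context frame (history) and a future frame (known covariates for the next 24 hours). The sketch below shows one way to shape them. It assumes the Chronos-2 pipeline exposes a `predict_df` method with the keyword arguments shown, so check this against the installed chronos-forecasting version. The column names (`id`, `timestamp`, `target`, `temperature`) and both helper functions are illustrative, not the names used in the Surge repo, and only temperature is included here although the model was trained with calendar covariates as well.

```python
import numpy as np
import pandas as pd

CONTEXT_HOURS = 2048  # context length listed under "Training setup"
HORIZON_HOURS = 24    # day-ahead horizon


def build_frames(history: pd.DataFrame, future_covariates: pd.DataFrame, ba: str):
    """Shape inputs as long-format frames, one row per (series id, timestamp).

    `history` needs columns timestamp, target, temperature.
    `future_covariates` needs columns timestamp, temperature for the next 24 hours.
    """
    context_df = history.tail(CONTEXT_HOURS).assign(id=ba)
    future_df = future_covariates.head(HORIZON_HOURS).assign(id=ba)
    return context_df, future_df


def forecast_day_ahead(pipe, context_df: pd.DataFrame, future_df: pd.DataFrame) -> pd.DataFrame:
    """Assumed Chronos-2 DataFrame call; defined here but not executed."""
    return pipe.predict_df(
        context_df,
        future_df=future_df,
        prediction_length=HORIZON_HOURS,
        quantile_levels=[0.1, 0.5, 0.9],
        id_column="id",
        timestamp_column="timestamp",
        target="target",
    )


# Synthetic demo data (not real EIA-930 or ASOS values).
ts = pd.date_range("2025-01-01", periods=CONTEXT_HOURS + HORIZON_HOURS, freq=pd.Timedelta(hours=1))
hour = ts.hour.to_numpy()
temperature = 10.0 + 5.0 * np.sin(2 * np.pi * hour / 24)
load = 90_000.0 + 8_000.0 * np.sin(2 * np.pi * (hour - 6) / 24)

history = pd.DataFrame({
    "timestamp": ts[:CONTEXT_HOURS],
    "target": load[:CONTEXT_HOURS],
    "temperature": temperature[:CONTEXT_HOURS],
})
future = pd.DataFrame({
    "timestamp": ts[CONTEXT_HOURS:],
    "temperature": temperature[CONTEXT_HOURS:],
})

context_df, future_df = build_frames(history, future, ba="PJM")
print(context_df.shape, future_df.shape)
```

With a loaded `pipe`, `forecast_day_ahead(pipe, context_df, future_df)` would then return the quantile forecasts, provided the assumed interface holds.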

See github.com/tylergibbs1/surge for the full feature-construction recipe (src/surge/api/forecaster.py), a FastAPI inference service, and the Next.js playground that renders forecasts over a 53-BA US map.
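
For orientation, the calendar covariates named under "Training setup" can be approximated with pandas alone. This is a sketch, not the recipe in src/surge/api/forecaster.py. The column names are illustrative, and it assumes a timezone-naive hourly index in local time and uses pandas' built-in US federal holiday calendar.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar


def calendar_covariates(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Cyclical hour/day encodings plus weekend and US federal holiday flags."""
    hour = index.hour.to_numpy()
    dow = index.dayofweek.to_numpy()  # Monday=0 ... Sunday=6
    holidays = USFederalHolidayCalendar().holidays(
        start=index.min().normalize(), end=index.max().normalize()
    )
    return pd.DataFrame(
        {
            "hour_sin": np.sin(2 * np.pi * hour / 24),
            "hour_cos": np.cos(2 * np.pi * hour / 24),
            "dow_sin": np.sin(2 * np.pi * dow / 7),
            "dow_cos": np.cos(2 * np.pi * dow / 7),
            "is_weekend": (dow >= 5).astype(np.float32),
            "is_holiday": index.normalize().isin(holidays).astype(np.float32),
        },
        index=index,
    )


# Three days around Independence Day 2025 (a Friday).
idx = pd.date_range("2025-07-03", periods=72, freq=pd.Timedelta(hours=1))
cov = calendar_covariates(idx)
print(cov.loc[pd.Timestamp("2025-07-04 12:00")])
```

The sin/cos pairs keep hour 23 adjacent to hour 0, and Sunday adjacent to Monday, which a raw integer encoding would not.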

Training setup

  • Backbone: amazon/chronos-2 (119 M params)
  • Mode: full fine-tune, LR 5e-6 (cosine), 3 000 steps, batch 16, bf16, seed 42
  • Context: 2 048 hours · horizon: 24 hours
  • Loss: quantile regression at the full Chronos-2 21-level quantile grid
  • Covariates: 2-m temperature (ASOS), hour-of-day sin/cos, day-of-week sin/cos, is_weekend, is_holiday (US federal)
  • Trained in ~7 minutes on a single NVIDIA H100 80 GB
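
The quantile-regression loss in the list above is the pinball loss averaged over quantile levels and time steps. The NumPy sketch below illustrates it. The 21-level grid (0.01, 0.05, 0.10, ..., 0.95, 0.99) is an assumption about Chronos-2's quantile set, and the function is not the training code.

```python
import numpy as np

# Assumed 21-level grid: 0.01, then 0.05..0.95 in steps of 0.05, then 0.99.
QUANTILES = np.concatenate(([0.01], np.arange(1, 20) * 0.05, [0.99]))


def pinball_loss(y_true, y_pred, quantiles=QUANTILES) -> float:
    """Mean quantile (pinball) loss.

    y_true: shape (T,) observed values.
    y_pred: shape (Q, T), one row of predictions per quantile level.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    q = np.asarray(quantiles, dtype=float)[:, None]
    diff = y_true[None, :] - y_pred
    # Under-prediction (diff > 0) costs q * diff; over-prediction costs (1 - q) * |diff|.
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))


# Every quantile under-predicts a 10 MW observation by 2 MW.
under = pinball_loss([10.0], np.full((len(QUANTILES), 1), 8.0))
exact = pinball_loss([10.0], np.full((len(QUANTILES), 1), 10.0))
print(len(QUANTILES), under, exact)
```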

Data splits

| Split | Period | Rows per BA |
|---|---|---|
| Train | 2019-01 → 2023-12 | ~43 800 |
| Val | 2024 | 8 808 |
| Test (reported above) | 2025 + early 2026 | ~11 300 |
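
A chronological split along these boundaries can be sketched with pandas. The function and column names are illustrative. The card does not state the original pipeline's exact boundary handling, so row counts from this sketch differ slightly from the table (for example, a calendar-year 2024 slice has 8 784 hourly rows).

```python
import pandas as pd

TRAIN_START = pd.Timestamp("2019-01-01")
VAL_START = pd.Timestamp("2024-01-01")
TEST_START = pd.Timestamp("2025-01-01")


def split_by_period(df: pd.DataFrame, ts_col: str = "timestamp"):
    """Chronological train/val/test split with half-open [start, end) intervals."""
    ts = df[ts_col]
    train = df[(ts >= TRAIN_START) & (ts < VAL_START)]
    val = df[(ts >= VAL_START) & (ts < TEST_START)]
    test = df[ts >= TEST_START]
    return train, val, test


# Synthetic hourly frame covering 2019 through 2025 (no real load values).
frame = pd.DataFrame({
    "timestamp": pd.date_range("2019-01-01", "2025-12-31 23:00", freq=pd.Timedelta(hours=1)),
})
train, val, test = split_by_period(frame)
print(len(train), len(val), len(test))
```

Splitting on time rather than at random keeps every test hour strictly after the training period, which is what makes the 2025 numbers a genuine hold-out.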

Covered balancing authorities (53)

7 RTO/ISOs: PJM, CISO, ERCO, MISO, NYIS, ISNE, SWPP.

46 non-RTO utilities, federal power admins, and PUDs: SOCO, TVA, DUK, CPLE, CPLW, LGEE, AECI, SCEG, SC, FPL, FPC, TEC, FMPP, SEC, JEA, TAL, GVL, HST, BPAT, PACE, PACW, AZPS, SRP, PSCO, NEVP, LDWP, PGE, IPCO, PSEI, SCL, TPWR, AVA, NWMT, PNM, EPE, TEPC, WACM, WALC, WAUW, BANC, TIDC, IID, DOPD, CHPD, GCPD, SPA. See src/surge/bas.py in the Surge repo for the full registry (including the 14 gen-/transmission-only BAs that EIA-930 lists but which don't publish a demand series).

Limitations

  • Not bankable. For research and reference use only. Do not use for regulated bidding, financial settlement, or any decision where a legally-attested forecast is required.
  • Heterogeneous weather signal. The 7 original RTO BAs were trained with real ASOS temperature; the 46 new BAs currently use a zero-filled fallback pending an Iowa Mesonet backfill (blocked on their rate-limit cooldown at release time). Expect the non-RTO MASE to improve once the backfill lands and the model is retrained.
  • Univariate target. The model forecasts load only; covariates are consumed as inputs and are not themselves predicted.
  • Weather ideality. Evaluation used ground-truth ASOS temperature as the "future covariate", which is an upper bound on accuracy. A production pipeline fed HRRR/GFS forecast temperature should expect a 10–20% degradation in MAPE terms.
  • Out-of-sample years. Metrics reported on a 2025 hold-out. Novel regimes (extreme heat/cold, large fleet additions, load-growth shifts) can erode accuracy without warning.

License

MIT. Base model (Chronos-2) is Apache 2.0.

Cite

@software{surge_fm_v3,
  title  = {surge-fm-v3: Chronos-2 fine-tuned for 53-BA US grid load forecasting},
  author = {Surge contributors},
  year   = {2026},
  url    = {https://github.com/tylergibbs1/surge}
}