surge-fm-v3 — Chronos-2 fine-tuned for every US EIA-930 balancing authority

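Full fine-tune of amazon/chronos-2 on hourly load data from 53 balancing authorities: every BA that publishes a demand series to EIA-930, spanning the Eastern, Western, and Texas interconnections. The dataset covers 7 years (2019–2025), split into train (2019–2023), validation (2024), and test (2025 onward). Temperature (NOAA ASOS) and US-calendar features are passed as covariates; see Limitations for the BAs that currently lack real temperature data.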

This is the breadth generalist. For the sharper 7-RTO specialist, see surge-fm-v2.

Results on 2025 hold-out

7 RTO/ISOs (same benchmark as v1/v2)

Model	Test MASE	vs seasonal-naive-24
seasonal-naive-24 (baseline)	1.044	—
XGBoost hourly-binned (Roy '25)	0.901	−14%
N-BEATS (Pelekis '23)	0.714	−32%
Chronos-Bolt zero-shot	0.688	−34%
Chronos-2 zero-shot + covariates	0.567	−46%
surge-fm-v2 (7-BA specialist)	0.492	−53%
surge-fm-v3 (53-BA generalist) — RTO subset	0.518	−50%
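The last column of the table above is the relative change in Test MASE against the seasonal-naive-24 baseline. A minimal sketch that reproduces it from the published MASE values follows; the dictionary keys and the function name are illustrative and are not taken from the Surge repo.

```python
# Test MASE values copied from the table above.
BASELINE_MASE = 1.044  # seasonal-naive-24

test_mase = {
    "XGBoost hourly-binned": 0.901,
    "N-BEATS": 0.714,
    "Chronos-Bolt zero-shot": 0.688,
    "Chronos-2 zero-shot + covariates": 0.567,
    "surge-fm-v2": 0.492,
    "surge-fm-v3 (RTO subset)": 0.518,
}


def relative_change_pct(mase: float, baseline: float = BASELINE_MASE) -> int:
    """Percent change in MASE relative to the baseline, rounded to an integer."""
    return round((mase / baseline - 1.0) * 100)


# Negative values mean lower (better) MASE than the baseline.
changes = {name: relative_change_pct(m) for name, m in test_mase.items()}
```

Because the inputs are already rounded to three decimals, a recomputed value could differ by a point from a figure derived from unrounded metrics.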

Per-RTO day-ahead MAPE on the 2025 hold-out: PJM 1.22%, MISO 1.97%, ERCO 2.50%, CISO 2.60%, SWPP 2.81%, NYIS 2.90%, ISNE 4.37%. Mean 2.62%.

All 53 BAs

Slice	Test MASE	n BAs	Macro MAE (MW)
All demand-reporting BAs	0.636	53	272
7 RTO/ISOs	0.518	7	889
46 non-RTO utilities	0.654	46	178

All numbers are computed on the 2025 hold-out using rolling 24-hour-ahead windows at step=24. The MASE denominator is the per-BA seasonal-naive (m=24) MAE on the training set. Per-BA metrics ship alongside the weights in eval_2025_test.json.
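The evaluation protocol above can be sketched in a few lines of standard-library Python. This is an illustration of the metric definition, not the repo's evaluation code: the function names are hypothetical, and pooling all forecast hours into one MAE before scaling is an assumption about how windows are aggregated.

```python
from statistics import fmean


def seasonal_naive_scale(train: list[float], m: int = 24) -> float:
    """MAE of the in-sample seasonal-naive forecast y[t] ~ y[t - m]."""
    if len(train) <= m:
        raise ValueError("need more than m training points")
    return fmean([abs(train[t] - train[t - m]) for t in range(m, len(train))])


def mase(actual: list[float], forecast: list[float],
         train: list[float], m: int = 24) -> float:
    """Forecast MAE divided by the train-set seasonal-naive MAE."""
    mae = fmean([abs(a - f) for a, f in zip(actual, forecast)])
    return mae / seasonal_naive_scale(train, m)


def rolling_windows(n_hours: int, horizon: int = 24,
                    step: int = 24) -> list[tuple[int, int]]:
    """(start, end) index pairs of forecast windows over the test period."""
    return [(s, s + horizon) for s in range(0, n_hours - horizon + 1, step)]
```

With horizon=24 and step=24 the windows do not overlap, so each test hour is scored once; a full 8,760-hour year yields 365 windows.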

vs. the operators' own forecasts (EIA day-ahead demand forecast)

Every RTO submits a day-ahead load forecast to EIA each morning — the production forecast used to commit generation. Scored on the same 2025 window, same 24h horizon, 8,760 hours per BA:

Region	surge-fm-v3 MAE	Operator (EIA DF) MAE	Ratio (operator ÷ surge)	surge MASE	Operator MASE
PJM	1,937 MW	3,297 MW	1.70×	0.40	0.68
CAISO	652 MW	2,098 MW	3.22×	0.51	1.66
ERCOT	1,215 MW	1,366 MW	1.12×	0.50	0.56
MISO	1,450 MW	1,786 MW	1.23×	0.45	0.55
NYISO	501 MW	560 MW	1.12×	0.53	0.60
ISO-NE	577 MW	306 MW	0.53×	0.63	0.34
SPP	896 MW	2,590 MW	2.89×	0.61	1.77
macro	1,032 MW	1,715 MW	1.66×	0.52	0.88

surge-fm-v3 beats the operators on 6 of 7 RTOs. ISO-NE's own forecasting team is the sole exception (MASE 0.34 is elite). CAISO and SPP operator submissions both land above MASE 1.0 on 2025, i.e. worse than the same-hour-yesterday baseline that serves as the MASE denominator, so the macro gap partly reflects variance in operator forecasting pipelines, not purely model strength.
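The ratio column and the macro row of the operator table can be rechecked from the per-region MAE values. The sketch below assumes the ratio is operator MAE divided by surge-fm-v3 MAE and that "macro" is the unweighted mean across the 7 regions; both readings are consistent with the published figures. Variable names are illustrative.

```python
# (surge-fm-v3 MAE, operator MAE) in MW, copied from the table above.
mae_mw = {
    "PJM": (1937, 3297),
    "CAISO": (652, 2098),
    "ERCOT": (1215, 1366),
    "MISO": (1450, 1786),
    "NYISO": (501, 560),
    "ISO-NE": (577, 306),
    "SPP": (896, 2590),
}

# Ratio > 1 means the operator forecast has the larger error.
ratios = {r: round(op / surge, 2) for r, (surge, op) in mae_mw.items()}

# Unweighted (macro) means across regions.
macro_surge = sum(s for s, _ in mae_mw.values()) / len(mae_mw)
macro_operator = sum(op for _, op in mae_mw.values()) / len(mae_mw)
macro_ratio = round(macro_operator / macro_surge, 2)

# Regions where surge-fm-v3 has the lower MAE.
wins = [r for r, (surge, op) in mae_mw.items() if surge < op]
```

Since the table entries are rounded to whole megawatts, the recomputed macro MAE agrees with the published row only to within about 1 MW.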

Reproduce the operator numbers: the comparison script lives in the repo at scripts/compare_eia_df.py.

Use

import torch
from chronos import BaseChronosPipeline

pipe = BaseChronosPipeline.from_pretrained(
    "Tylerbry1/surge-fm-v3",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)

See github.com/tylergibbs1/surge for the full feature-construction recipe (src/surge/api/forecaster.py), a FastAPI inference service, and the Next.js playground that renders forecasts over a 53-BA US map.

Training setup

Backbone: amazon/chronos-2 (119 M params)
Mode: full fine-tune, LR 5e-6 (cosine), 3 000 steps, batch 16, bf16, seed 42
Context: 2 048 hours · horizon: 24 hours
Loss: quantile regression at the full Chronos-2 21-level quantile grid
Covariates: 2-m temperature (ASOS), hour-of-day sin/cos, day-of-week sin/cos, is_weekend, is_holiday (US federal)
Trained in ~7 minutes on a single NVIDIA H100 80 GB
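The calendar covariates listed above can be sketched as follows. This is not the repo's feature recipe (that lives in src/surge/api/forecaster.py): the function name, the dictionary keys, the Monday=0 weekday convention, and passing holidays in as a caller-supplied set of dates are all assumptions made for illustration. Whether timestamps should be in local time or UTC is likewise not stated in this card.

```python
import math
from datetime import date, datetime


def calendar_covariates(ts: datetime,
                        holidays: frozenset = frozenset()) -> dict:
    """Cyclical hour/day-of-week encodings plus weekend and holiday flags.

    `holidays` is a caller-supplied set of datetime.date objects
    (for example US federal holidays); it is not computed here.
    """
    hour_angle = 2.0 * math.pi * ts.hour / 24.0
    dow = ts.weekday()  # Monday = 0 ... Sunday = 6
    dow_angle = 2.0 * math.pi * dow / 7.0
    return {
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "dow_sin": math.sin(dow_angle),
        "dow_cos": math.cos(dow_angle),
        "is_weekend": 1.0 if dow >= 5 else 0.0,
        "is_holiday": 1.0 if ts.date() in holidays else 0.0,
    }


# Example: 06:00 on Friday 4 July 2025, flagged as a holiday by the caller.
features = calendar_covariates(
    datetime(2025, 7, 4, 6),
    holidays=frozenset({date(2025, 7, 4)}),
)
```

The sin/cos pairs keep hour 23 adjacent to hour 0 and Sunday adjacent to Monday, which a raw integer encoding would not.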

Data splits

Split	Period	Rows per BA
Train	2019-01 → 2023-12	~43 800
Val	2024	8 808
Test (reported above)	2025 + early 2026	~11 300
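A minimal sketch of the date-based split in the table above, assuming the boundaries fall on calendar-year edges of naive timestamps; the function names are hypothetical, and the repo may handle time zones or boundary hours differently.

```python
from datetime import datetime


def assign_split(ts: datetime) -> str:
    """Map a timestamp to train / val / test by calendar year."""
    if ts.year < 2019:
        raise ValueError("timestamp precedes the start of the dataset")
    if ts.year <= 2023:
        return "train"
    if ts.year == 2024:
        return "val"
    return "test"  # 2025 and early 2026


def hours_between(start: datetime, end: datetime) -> int:
    """Number of whole hours in the half-open interval [start, end)."""
    return int((end - start).total_seconds() // 3600)


# Five calendar years of hourly data, in line with the ~43 800 train rows.
train_hours = hours_between(datetime(2019, 1, 1), datetime(2024, 1, 1))
```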

Covered balancing authorities (53)

7 RTO/ISOs: PJM, CISO, ERCO, MISO, NYIS, ISNE, SWPP.

46 non-RTO utilities, federal power admins, and PUDs: SOCO, TVA, DUK, CPLE, CPLW, LGEE, AECI, SCEG, SC, FPL, FPC, TEC, FMPP, SEC, JEA, TAL, GVL, HST, BPAT, PACE, PACW, AZPS, SRP, PSCO, NEVP, LDWP, PGE, IPCO, PSEI, SCL, TPWR, AVA, NWMT, PNM, EPE, TEPC, WACM, WALC, WAUW, BANC, TIDC, IID, DOPD, CHPD, GCPD, SPA. See src/surge/bas.py in the Surge repo for the full registry (including the 14 gen-/transmission-only BAs that EIA-930 lists but which don't publish a demand series).

Limitations

Not bankable. For research and reference use only. Do not use for regulated bidding, financial settlement, or any decision where a legally-attested forecast is required.
Heterogeneous weather signal. The 7 original RTO BAs were trained with real ASOS temperature; the 46 new BAs currently use a zero-filled fallback pending an Iowa Mesonet backfill (blocked on their rate-limit cooldown at release time). Expect the non-RTO MASE to improve once the backfill lands and the model is retrained.
Univariate target. The model forecasts load only; the covariates are inputs and are not themselves predicted.
Idealized weather inputs. Evaluation used observed (ground-truth) ASOS temperature as the "future covariate", which is the upper bound on accuracy. A production pipeline driven by HRRR/GFS forecast temperature should be expected to degrade by roughly 10–20% in MAPE terms.
Out-of-sample years. Metrics reported on a 2025 hold-out. Novel regimes (extreme heat/cold, large fleet additions, load-growth shifts) can erode accuracy without warning.

License

MIT. Base model (Chronos-2) is Apache 2.0.

Cite

@software{surge_fm_v3,
  title  = {surge-fm-v3 — Chronos-2 fine-tuned for 53-BA US grid load forecasting},
  author = {Surge contributors},
  year   = {2026},
  url    = {https://github.com/tylergibbs1/surge}
}
