# AGILLM-3-Large: Tenstorrent N300 Port
Proof-of-concept port of AGILLM-3-Large training to the Tenstorrent N300 accelerator via TT-XLA/PJRT.
This repo documents one of the first known attempts to train a 698M parameter language model on Tenstorrent Wormhole (N300) hardware.
## Model
| Parameter | Value |
|---|---|
| Parameters | 698M |
| Hidden dim | 1024 |
| Layers | 24 |
| Heads | 16 |
| Rank | 128 |
| Expansion | 2x |
| Vocab | 128,832 (padded to a multiple of 32 for TT tile ops) |
| Architecture | Joint AR + SAT (autoregressive + span-aware transformer) |
| Training mode | AR-only on N300 (SAT requires features not yet supported on TT-XLA) |
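The 32-alignment noted for the vocabulary can be expressed as a small rounding helper. This is a minimal sketch; `pad_to_multiple` is a hypothetical name, not a function from this repo.

```python
def pad_to_multiple(n: int, multiple: int = 32) -> int:
    """Round n up to the next multiple; TT tile ops expect 32-aligned dims."""
    return ((n + multiple - 1) // multiple) * multiple

# The raw vocabulary of 128,815 tokens rounds up to the padded 128,832.
print(pad_to_multiple(128815))  # 128832
```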
## Hardware & Software
- Hardware: Tenstorrent Wormhole N300 (2 chips), hosted on Koyeb `gpu-tenstorrent-n300s`
- Backend: TT-XLA / PJRT (PyTorch XLA with Tenstorrent backend)
- Precision: BF16
- UMD: v2.5.0, Firmware 19.3.1
- Training script: `n_tenstorrent_port.py` (in this repo)
## Results

### What worked
- TT-XLA/PJRT backend initializes correctly on N300 (both chips detected, fabric established)
- Forward pass, backward pass, and optimizer step all execute on TT device
- Gradient checkpointing with `use_reentrant=True` works on XLA
- Checkpointing and resumption work reliably across OOM crashes
- 274+ training steps completed over ~2 weeks with automatic crash recovery
### What didn't work well
- Effective throughput: ~0.067 tokens/sec, roughly 300,000x slower than an RTX 4090
- OOM every ~4 steps: N300 memory management causes periodic crashes
- XLA recompilation on every restart: there is no persistent program cache, so each OOM recovery costs 19-80 minutes of recompilation
- Some N300 machines have broken Ethernet, requiring hardware detection in the run script to cycle past bad machines
- Embedding and AR head are too large for TT L1 memory and must remain on CPU
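The OOM-every-few-steps failure mode is what makes the crash-recovery wrapper essential. Below is a minimal Python sketch of the retry logic in the spirit of `run.sh`; `run_with_recovery` and `train_step` are hypothetical names, and `MemoryError` stands in for a device OOM.

```python
def run_with_recovery(train_step, max_retries=1000):
    """Keep retrying a crash-prone step, mirroring run.sh's OOM retry loop.

    train_step is any callable; MemoryError stands in for a device OOM.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return train_step()
        except MemoryError:
            # In the real wrapper this is where the process is relaunched
            # from the latest checkpoint, re-paying XLA recompilation
            # because there is no persistent program cache.
            continue
    raise RuntimeError(f"gave up after {max_retries} retries")
```

Because every retry triggers a full recompilation, the retry budget matters far less than the per-retry recompilation cost of 19-80 minutes.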
### Key throughput comparison
| Hardware | Effective tok/s | Cost/hr | Notes |
|---|---|---|---|
| RTX 4090 (Vast.ai) | ~20,000 | $0.27 | Main training run, ~14B tokens completed |
| Tenstorrent N300 (Koyeb) | ~0.067 | Free trial | Proof of concept, ~17.5K tokens completed |
## Bugs Fixed During Port
Over ~1 week of debugging, the following issues were resolved to get training running:
- V36: `ar_head.weight` → `ar_head.proj.weight` (ARHead wraps Linear in `.proj`)
- V39: VOCAB=128815 is not 32-aligned, causing an XLA pad/depad reshape crash in TTNN. Fixed by padding VOCAB to 128832.
- V43: Block[0] backward gradients were left unflushed before the optimizer step, causing a TTNN reshape crash. Fixed with an embedding output hook plus a sync barrier.
- V47: `use_reentrant=False` is incompatible with XLA; switched to `use_reentrant=True`
- V49: BF16/F32 dtype mismatch on the CPU detach path; added a `.float()` cast
- V57: Broken N300 Ethernet on some sjc2 machines; the hardware test now catches the timeout
- Vocab mismatch on resume: old checkpoints had shape [128815, 1024] vs model [128832, 1024]; added `_pad_vocab_state()` zero-padding
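The resume fix amounts to zero-padding the checkpoint's embedding rows up to the model's 32-aligned vocab. The sketch below shows the idea on plain nested lists; it is a hypothetical stand-in for the tensor-based `_pad_vocab_state()` in the repo.

```python
def pad_vocab_rows(weight, target_rows):
    """Zero-pad an embedding matrix (list of rows) from the old vocab size
    (e.g. 128815) up to the new 32-aligned one (e.g. 128832).

    Illustrative list-based version; the repo operates on checkpoint tensors.
    """
    width = len(weight[0])
    extra = target_rows - len(weight)
    assert extra >= 0, "checkpoint vocab is larger than model vocab"
    return weight + [[0.0] * width for _ in range(extra)]
```

The padded rows correspond to token IDs that never occur in the data, so initializing them to zero does not perturb the warm-started weights.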
## Checkpoint Files
- `pretrain_step*.pt`: training checkpoints (4.45 GB each for the large preset with block=64)
- `final.pt`: warm-start checkpoint from the main AGILLM-3 training run
- `n_tenstorrent_port.py`: full training script with TT-XLA backend support
- `run.sh`: crash-recovery wrapper with hardware detection, OOM retry (up to 1000 retries), and auto-upload to HF
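The hardware-detection step in the wrapper boils down to running a short probe and treating a hang or failure as a bad machine. A minimal sketch, assuming a `probe_cmd` you supply (hypothetical; the actual check in `run.sh` is shell-based):

```python
import subprocess

def device_healthy(probe_cmd, timeout_s=60):
    """Run a short hardware probe; a hang or nonzero exit marks the machine
    as bad, in the spirit of run.sh's broken-Ethernet detection."""
    try:
        subprocess.run(probe_cmd, check=True, capture_output=True,
                       timeout=timeout_s)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError,
            FileNotFoundError):
        return False
```

If the probe fails, the wrapper can exit nonzero so the platform reschedules the job onto a different machine.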
## Conclusion

The Tenstorrent N300 can train a 698M parameter transformer, but the TT-XLA software stack is not yet mature enough for practical use. The main bottlenecks are memory management (frequent OOM) and XLA recompilation (no persistent cache). If Tenstorrent addresses these issues, the hardware could become competitive for ML training. As of March 2026, an RTX 4090 at $0.27/hr remains orders of magnitude more practical.
## License
Apache 2.0