# AGILLM-3-Large: Tenstorrent N300 Port
Proof-of-concept port of AGILLM-3-Large training to the Tenstorrent N300 accelerator via TT-XLA/PJRT.
This repo documents one of the first known attempts to train a 698M parameter language model on Tenstorrent Wormhole (N300) hardware.
## Model
| Parameter | Value |
|---|---|
| Parameters | 698M |
| Hidden dim | 1024 |
| Layers | 24 |
| Heads | 16 |
| Rank | 128 |
| Expansion | 2x |
| Vocab | 128,832 (padded to a multiple of 32 for TT tile ops) |
| Architecture | Joint AR + SAT (autoregressive + span-aware transformer) |
| Training mode | AR-only on N300 (SAT requires features not yet supported on TT-XLA) |
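The 32-alignment noted for the vocabulary can be expressed as a small rounding helper. This is a minimal sketch; `pad_to_multiple` is a hypothetical name, not a function from this repo.

```python
def pad_to_multiple(n: int, multiple: int = 32) -> int:
    """Round n up to the next multiple; TT tile ops expect 32-aligned dims."""
    return ((n + multiple - 1) // multiple) * multiple

# The raw vocabulary of 128,815 tokens rounds up to the padded 128,832.
print(pad_to_multiple(128815))  # 128832
```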
## Hardware & Software
- Hardware: Tenstorrent Wormhole N300 (2 chips), hosted on Koyeb `gpu-tenstorrent-n300s`
- Backend: TT-XLA / PJRT (PyTorch XLA with Tenstorrent backend)
- Precision: BF16
- UMD: v2.5.0, Firmware 19.3.1
- Training script: `n_tenstorrent_port.py` (in this repo)
## Results

### What worked
- TT-XLA/PJRT backend initializes correctly on N300 (both chips detected, fabric established)
- Forward pass, backward pass, and optimizer step all execute on TT device
- Gradient checkpointing with `use_reentrant=True` works on XLA
- Checkpointing and resumption work reliably across OOM crashes
- 274+ training steps completed over ~2 weeks with automatic crash recovery
### What didn't work well
- Effective throughput: ~0.067 tokens/sec, roughly 300,000x slower than an RTX 4090
- OOM every ~4 steps: N300 memory management causes periodic crashes
- XLA recompilation on every restart: there is no persistent program cache, so each OOM recovery costs 19-80 minutes of recompilation
- Some N300 machines have broken Ethernet, requiring hardware detection in the run script to cycle past bad machines
- Embedding and AR head are too large for TT L1 memory and must remain on CPU
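The OOM-every-few-steps failure mode is what makes the crash-recovery wrapper essential. Below is a minimal Python sketch of the retry logic in the spirit of `run.sh`; `run_with_recovery` and `train_step` are hypothetical names, and `MemoryError` stands in for a device OOM.

```python
def run_with_recovery(train_step, max_retries=1000):
    """Keep retrying a crash-prone step, mirroring run.sh's OOM retry loop.

    train_step is any callable; MemoryError stands in for a device OOM.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return train_step()
        except MemoryError:
            # In the real wrapper this is where the process is relaunched
            # from the latest checkpoint, re-paying XLA recompilation
            # because there is no persistent program cache.
            continue
    raise RuntimeError(f"gave up after {max_retries} retries")
```

Because every retry triggers a full recompilation, the retry budget matters far less than the per-retry recompilation cost of 19-80 minutes.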
### Key throughput comparison
| Hardware | Effective tok/s | Cost/hr | Notes |
|---|---|---|---|
| RTX 4090 (Vast.ai) | ~20,000 | $0.27 | Main training run, ~14B tokens completed |
| Tenstorrent N300 (Koyeb) | ~0.067 | Free trial | Proof of concept, ~17.5K tokens completed |
## Bugs Fixed During Port
Over ~1 week of debugging, the following issues were resolved to get training running:
- V36: `ar_head.weight` → `ar_head.proj.weight` (ARHead wraps Linear in `.proj`)
- V39: VOCAB=128815 is not 32-aligned, causing an XLA pad/depad reshape crash in TTNN. Fixed by padding VOCAB to 128832.
- V43: Block[0] backward gradients were left unflushed before the optimizer step, causing a TTNN reshape crash. Fixed with an embedding output hook plus a sync barrier.
- V47: `use_reentrant=False` is incompatible with XLA; switched to `use_reentrant=True`
- V49: BF16/F32 dtype mismatch on the CPU detach path; added a `.float()` cast
- V57: Broken N300 Ethernet on some sjc2 machines; the hardware test now catches the timeout
- Vocab mismatch on resume: old checkpoints had shape [128815, 1024] vs model [128832, 1024]; added `_pad_vocab_state()` zero-padding
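The resume fix amounts to zero-padding the checkpoint's embedding rows up to the model's 32-aligned vocab. The sketch below shows the idea on plain nested lists; it is a hypothetical stand-in for the tensor-based `_pad_vocab_state()` in the repo.

```python
def pad_vocab_rows(weight, target_rows):
    """Zero-pad an embedding matrix (list of rows) from the old vocab size
    (e.g. 128815) up to the new 32-aligned one (e.g. 128832).

    Illustrative list-based version; the repo operates on checkpoint tensors.
    """
    width = len(weight[0])
    extra = target_rows - len(weight)
    assert extra >= 0, "checkpoint vocab is larger than model vocab"
    return weight + [[0.0] * width for _ in range(extra)]
```

The padded rows correspond to token IDs that never occur in the data, so initializing them to zero does not perturb the warm-started weights.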
## Checkpoint Files
- `pretrain_step*.pt`: training checkpoints (4.45 GB each for the large preset with block=64)
- `final.pt`: warm-start checkpoint from the main AGILLM-3 training run
- `n_tenstorrent_port.py`: full training script with TT-XLA backend support
- `run.sh`: crash-recovery wrapper with hardware detection, OOM retry (up to 1000 retries), and auto-upload to HF
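The hardware-detection step in the wrapper boils down to running a short probe and treating a hang or failure as a bad machine. A minimal sketch, assuming a `probe_cmd` you supply (hypothetical; the actual check in `run.sh` is shell-based):

```python
import subprocess

def device_healthy(probe_cmd, timeout_s=60):
    """Run a short hardware probe; a hang or nonzero exit marks the machine
    as bad, in the spirit of run.sh's broken-Ethernet detection."""
    try:
        subprocess.run(probe_cmd, check=True, capture_output=True,
                       timeout=timeout_s)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError,
            FileNotFoundError):
        return False
```

If the probe fails, the wrapper can exit nonzero so the platform reschedules the job onto a different machine.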
## Conclusion

The Tenstorrent N300 can train a 698M parameter transformer, but the TT-XLA software stack is not yet mature enough for practical use. The main bottlenecks are memory management (frequent OOM) and XLA recompilation (no persistent cache). If Tenstorrent addresses these issues, the hardware could become competitive for ML training. As of March 2026, an RTX 4090 at $0.27/hr remains orders of magnitude more practical.
## License
Apache 2.0