AGILLM-3-Large — Tenstorrent N300 Port

Proof-of-concept port of AGILLM-3-Large training to the Tenstorrent N300 accelerator via TT-XLA/PJRT.

This repo documents one of the first known attempts to train a 698M parameter language model on Tenstorrent Wormhole (N300) hardware.

Model

| Parameter | Value |
|---|---|
| Parameters | 698M |
| Hidden dim | 1024 |
| Layers | 24 |
| Heads | 16 |
| Rank | 128 |
| Expansion | 2x |
| Vocab | 128,832 (padded to 32-aligned for TT tile ops) |
| Architecture | Joint AR + SAT (autoregressive + span-aware transformer) |
| Training mode | AR-only on N300 (SAT requires features not yet supported on TT-XLA) |
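The 32-alignment mentioned for the vocab can be sketched as a simple round-up: Tenstorrent kernels operate on 32x32 tiles, so tensor dimensions that feed tile ops are padded to the next multiple of 32. The helper name below is illustrative, not from the repo.

```python
# Hypothetical helper: round a dimension up to the next multiple of the
# TT tile edge (32), as done for the vocab size in this port.
def pad_to_tile(n: int, tile: int = 32) -> int:
    """Round n up to the next multiple of tile."""
    return ((n + tile - 1) // tile) * tile

print(pad_to_tile(128815))  # -> 128832, the padded vocab used on N300
print(pad_to_tile(128832))  # -> 128832, already aligned
```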

Hardware & Software

  • Hardware: Tenstorrent Wormhole N300 (2 chips), hosted on Koyeb gpu-tenstorrent-n300s
  • Backend: TT-XLA / PJRT (PyTorch XLA with Tenstorrent backend)
  • Precision: BF16
  • UMD: v2.5.0, Firmware 19.3.1
  • Training script: n_tenstorrent_port.py (in this repo)

Results

What worked

  • TT-XLA/PJRT backend initializes correctly on N300 (both chips detected, fabric established)
  • Forward pass, backward pass, and optimizer step all execute on TT device
  • Gradient checkpointing with use_reentrant=True works on XLA
  • Checkpointing and resumption work reliably across OOM crashes
  • 274+ training steps completed over ~2 weeks with automatic crash recovery
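The gradient-checkpointing point above can be illustrated with a minimal sketch. A toy linear block stands in for the real transformer layers; the key detail is `use_reentrant=True`, since the non-reentrant path did not work under TT-XLA in this port.

```python
# Minimal sketch of gradient checkpointing with the reentrant path.
# The toy block here is an assumption standing in for the real layers.
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(8, 8)

def block(x):
    return torch.relu(layer(x))

x = torch.randn(4, 8, requires_grad=True)
# use_reentrant=True: activations are recomputed in backward; this was
# the only variant that ran under TT-XLA (see bug V47 below).
y = checkpoint(block, x, use_reentrant=True)
y.sum().backward()
print(x.grad.shape)  # gradients flow through the checkpointed block
```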

What didn't work well

  • Effective throughput: ~0.067 tokens/sec — roughly 300,000x slower than an RTX 4090
  • OOM every ~4 steps — N300 memory management causes periodic crashes
  • XLA recompilation on every restart — no persistent program cache, so each OOM recovery costs 19–80 minutes of recompilation
  • Some N300 machines have broken Ethernet — required hardware detection in the run script to cycle past bad machines
  • Embedding and AR head too large for TT L1 memory — must remain on CPU
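The CPU/device split in the last bullet can be sketched as follows. Dimensions are shrunk so the sketch runs anywhere (the real model uses a 128,832 x 1024 embedding), and `accel` stands in for the XLA device used in the actual script.

```python
# Sketch of split placement: the large embedding stays on CPU, transformer
# blocks run on the accelerator. Here accel is CPU so the sketch is
# self-contained; on the N300 port it would be the TT-XLA device.
import torch

accel = torch.device("cpu")  # xm.xla_device() in the real script

embed = torch.nn.Embedding(1000, 64)          # stays on CPU (too big for L1)
block = torch.nn.Linear(64, 64).to(accel)     # transformer block on device

tokens = torch.randint(0, 1000, (2, 8))
h = embed(tokens)            # embedding lookup on CPU
h = block(h.to(accel))       # move activations to the accelerator
print(h.shape)               # torch.Size([2, 8, 64])
```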

Key throughput comparison

| Hardware | Effective tok/s | Cost/hr | Notes |
|---|---|---|---|
| RTX 4090 (Vast.ai) | ~20,000 | $0.27 | Main training run, ~14B tokens completed |
| Tenstorrent N300 (Koyeb) | ~0.067 | Free trial | Proof of concept, ~17.5K tokens completed |
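A quick back-of-envelope check of the slowdown factor quoted earlier, using the figures from the table:

```python
# Ratio of the two effective throughputs from the table above.
rtx4090_tps = 20_000    # tokens/sec on RTX 4090
n300_tps = 0.067        # tokens/sec on N300 via TT-XLA
slowdown = rtx4090_tps / n300_tps
print(f"{slowdown:,.0f}x")  # ~298,507x, i.e. roughly 300,000x
```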

Bugs Fixed During Port

Over ~1 week of debugging, the following issues were resolved to get training running:

  1. V36: ar_head.weight → ar_head.proj.weight (ARHead wraps Linear in .proj)
  2. V39: VOCAB=128815 not 32-aligned → XLA pad/depad reshape crash in TTNN. Fixed by padding VOCAB to 128832.
  3. V43: Block[0] backward gradients unflushed before optimizer → TTNN reshape crash. Fixed with embedding output hook + sync barrier.
  4. V47: use_reentrant=False incompatible with XLA → switched to use_reentrant=True
  5. V49: BF16/F32 dtype mismatch on CPU detach path → added .float() cast
  6. V57: Broken N300 Ethernet detection (some sjc2 machines) → hardware test now catches timeout
  7. Vocab mismatch on resume: Old checkpoints had shape [128815, 1024] vs model [128832, 1024] → added _pad_vocab_state() zero-padding
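The resume-time fix in item 7 can be sketched as below: any tensor whose leading dimension matches the old vocab is zero-padded to the new aligned vocab before loading. This mirrors the described `_pad_vocab_state()`; the exact signature in the repo is assumed.

```python
# Sketch (assumed signature) of zero-padding vocab-sized tensors in an
# old checkpoint so they match the 32-aligned model shapes.
import torch

OLD_VOCAB, NEW_VOCAB = 128815, 128832

def _pad_vocab_state(state_dict, new_vocab=NEW_VOCAB):
    out = {}
    for name, t in state_dict.items():
        if t.ndim >= 1 and t.shape[0] == OLD_VOCAB:
            # Append zero rows for the padding token slots.
            pad = torch.zeros((new_vocab - OLD_VOCAB, *t.shape[1:]), dtype=t.dtype)
            t = torch.cat([t, pad], dim=0)
        out[name] = t
    return out

sd = {"embed.weight": torch.randn(OLD_VOCAB, 16)}
print(_pad_vocab_state(sd)["embed.weight"].shape)  # torch.Size([128832, 16])
```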

Checkpoint Files

  • pretrain_step*.pt — Training checkpoints (4.45 GB each for large preset with block=64)
  • final.pt — Warm-start checkpoint from main AGILLM-3 training
  • n_tenstorrent_port.py — Full training script with TT-XLA backend support
  • run.sh — Crash-recovery wrapper with hardware detection, OOM retry (up to 1000 retries), and auto-upload to HF

Conclusion

The Tenstorrent N300 can train a 698M parameter transformer — but the TT-XLA software stack is not yet mature enough for practical use. The main bottlenecks are memory management (frequent OOM) and XLA recompilation (no persistent cache). If Tenstorrent addresses these issues, the hardware could become competitive for ML training. As of March 2026, an RTX 4090 at $0.27/hr remains orders of magnitude more practical.

License

Apache 2.0
