TOBA is a trilingual language model based on the GPT-2 architecture with 1.2 billion parameters, trained on a corpus spanning Indonesian, Batak, and Minangkabau using syllabic-agglutinative tokenization. The architecture integrates an Engram Memory mechanism: an adaptive n-gram memory with a 500,000 x 768 embedding table that captures morphological dependencies through bigram and trigram pathways.
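To make the bigram/trigram pathway idea concrete, here is a minimal toy sketch of an n-gram ("engram") memory: n-grams of token ids are hashed into a shared embedding table and the retrieved vectors are summed per position. This is an illustration only, not the TOBA implementation; the class name, hashing scheme, and the tiny table dimensions are assumptions for demonstration (the model card reports a 500,000 x 768 table).

```python
import random

class EngramMemory:
    """Toy n-gram memory: bigrams and trigrams of token ids are hashed
    into a shared embedding table, and the retrieved vectors are summed
    into one memory vector per position."""

    def __init__(self, table_size=1000, dim=8, seed=0):
        rng = random.Random(seed)
        # TOBA reports a 500,000 x 768 table; kept tiny here for the demo.
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in range(table_size)]
        self.table_size = table_size
        self.dim = dim

    def _slot(self, ngram):
        # Simple deterministic rolling hash of the n-gram into a table slot.
        h = 0
        for t in ngram:
            h = (h * 1000003 + t) % self.table_size
        return h

    def lookup(self, token_ids):
        """One memory vector per position: the sum of the embeddings of the
        bigram and trigram ending there (zeros when the context is too short)."""
        out = []
        for i in range(len(token_ids)):
            vec = [0.0] * self.dim
            for n in (2, 3):  # bigram and trigram pathways
                if i + 1 >= n:
                    emb = self.table[self._slot(tuple(token_ids[i + 1 - n:i + 1]))]
                    vec = [v + e for v, e in zip(vec, emb)]
            out.append(vec)
        return out

mem = EngramMemory()
vectors = mem.lookup([5, 17, 42, 17, 5])  # one 8-dim vector per token
```

Because the same n-gram always hashes to the same slot, recurring morphological patterns retrieve the same memory vector regardless of where they occur in the sequence.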

Model file:

model.safetensors

Install

Install PyTorch first according to your CPU/CUDA environment, then install the repo requirements:

pip install -r requirements.txt

Usage

Single prompt, chat mode:

python infer.py --mode chat --prompt "Horas amang inang saluhutna"

Single prompt, completion mode:

python infer.py --mode completion --prompt "Horas amang inang saluhutna"

Interactive:

python infer.py --interactive --mode chat

Quick examples:

python infer.py --prompt "Horas!"
python infer.py --mode completion --prompt "Patorang ma aha do dalihan natolu "
python infer.py --interactive

Safetensors

Model size: 1B params
Tensor types: I64 · F16 · BOOL
