BERT-small foundation model pre-trained with a masked language modeling objective on RNA gene expression sequences from the MolCrawl dataset.
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("kojima-lab/molcrawl-rna-bert-small")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-rna-bert-small")

# Predict a masked RNA token.
# Use tokenizer.mask_token instead of a hardcoded "[MASK]":
# BERT-style tokenizers vary ("[MASK]", "<mask>", etc.)
if tokenizer.mask_token is None:
    raise ValueError("This tokenizer has no mask_token; masked LM inference is not supported.")

prompt = "AUGCAUGC{MASK}AUGCAUGC".replace("{MASK}", tokenizer.mask_token)
inputs = tokenizer(prompt, return_tensors="pt")

# Locate the mask position in the tokenized input
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Take the highest-scoring token at the masked position
predicted_token_id = logits[0, mask_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)

result = prompt.replace(tokenizer.mask_token, predicted_token)
print(f"Predicted: {result}")
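To inspect alternative candidates rather than only the argmax, `torch.topk` can be applied to the logits at the masked position. The sketch below uses a small synthetic logits tensor and a toy vocabulary so it runs without downloading the model; with the real model, the logits would come from `logits[0, mask_index]` and the vocabulary from the tokenizer.

import torch

# Toy vocabulary standing in for the tokenizer's RNA vocab (illustrative only)
vocab = ["[PAD]", "[MASK]", "A", "U", "G", "C"]

# Synthetic logits for a single masked position (shape: [vocab_size]);
# with the real model this would be logits[0, mask_index].squeeze(0)
mask_logits = torch.tensor([0.1, 0.0, 2.5, 1.2, 0.3, 1.8])

# Convert to probabilities and take the top-3 candidates
probs = torch.softmax(mask_logits, dim=-1)
top = torch.topk(probs, k=3)

for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{vocab[idx]}: {p:.3f}")

Ranking several candidates this way is often more informative than a single prediction, since RNA positions can be genuinely ambiguous.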
This model was trained with the RIKEN Foundation Model pipeline. For more details, please refer to the training configuration files included in this repository.
This model is released under the Apache-2.0 license.
If you use this model, please cite:
@misc{molcrawl_rna_bert_small,
  title={molcrawl-rna-bert-small},
  author={{RIKEN}},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/kojima-lab/molcrawl-rna-bert-small}
}