A GPT-2 Large model (774M parameters) fine-tuned on ProteinGym protein sequence data, starting from the pre-trained molcrawl-protein-sequence-gpt2-large model.
ProteinGym: https://proteingym.org/ (Fine-tuning dataset)
MolCrawl protein sequence dataset: https://github.com/mmai-framework-lab/MolCrawl-HFuploader/blob/main/workflows/hugging_face/run_upload_hf.sh (Pre-training corpus used by the base model)
Model Type: gpt2
Data Type: Protein
Training Date: 2026-04-14
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("kojima-lab/molcrawl-protein-sequence-proteingym-gpt2-large")
tokenizer = AutoTokenizer.from_pretrained("kojima-lab/molcrawl-protein-sequence-proteingym-gpt2-large")

# Generate a protein sequence continuation
prompt = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGT"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        eos_token_id=None,  # HF config.json has legacy eos_token_id=0; disable early stop
        pad_token_id=0,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
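Beyond generation, causal protein language models are often used to score sequences, e.g. ranking variants by likelihood as in ProteinGym-style evaluations. As an illustration (not part of the released pipeline), the average per-token log-likelihood can be computed from the model's logits; `avg_log_likelihood` is a hypothetical helper name introduced here:

```python
import torch
import torch.nn.functional as F

def avg_log_likelihood(logits: torch.Tensor, input_ids: torch.Tensor) -> float:
    """Mean log p(token_t | tokens_<t) for a (1, T, V) logits tensor
    and the (1, T) token ids that produced it."""
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)  # prediction for token t+1 from each prefix
    targets = input_ids[:, 1:]                            # targets shifted left by one position
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.mean().item()

# With the model loaded as above (assumed):
#   with torch.no_grad():
#       score = avg_log_likelihood(model(**inputs).logits, inputs["input_ids"])
```

A higher score means the model finds the sequence more probable; comparing scores of a wild-type and a mutant sequence gives a simple zero-shot fitness proxy.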
This model was trained with the RIKEN Foundation Model pipeline. For more details, please refer to the training configuration files included in this repository.
This model is released under the Apache-2.0 license.
If you use this model, please cite:
@misc{molcrawl_protein_sequence_proteingym_gpt2_large,
  title={molcrawl-protein-sequence-proteingym-gpt2-large},
  author={{RIKEN}},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/kojima-lab/molcrawl-protein-sequence-proteingym-gpt2-large}
}