FishNALM-8L_pretrain

FishNALM-8L_pretrain is a fish-specific foundation DNA language model in the FishNALM family. It was pretrained on fish genome sequences with masked language modeling (MLM) and is intended for genomic sequence representation learning and downstream transfer to fish genomics tasks.

Model description

FishNALM is a family of DNA foundation models developed specifically for fish genomes. Each model in the family uses a BERT-style masked language modeling objective to learn sequence representations from large-scale fish genomic data.

This repository contains the pretrained checkpoint of FishNALM-8L_pretrain.

Model sources

  • Project: FishNALM
  • GitHub: https://github.com/bioinfoihb/FishNALM
  • Manuscript: FishNALM: A Foundation DNA Language Model for Fish Genomes
  • Organization / Author: Institute of Hydrobiology, Chinese Academy of Sciences

Architecture

  • Backbone: BERT-style masked language model
  • Tokenization: 6-mer + BPE
  • Vocabulary size: 4863
  • Maximum input length: 512 tokens
  • Framework: Hugging Face Transformers
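To illustrate the k-mer half of the hybrid 6-mer + BPE scheme, here is a minimal sketch of non-overlapping 6-mer tokenization in plain Python. This is an assumption-laden illustration only: FishNALM's actual tokenizer additionally applies BPE merges on top of the k-mer vocabulary, and its exact splitting rules may differ. The function name is hypothetical.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into non-overlapping k-mers.

    Illustration of the k-mer step only; the real FishNALM tokenizer
    also applies BPE merges, which are not shown here.
    """
    sequence = sequence.upper()
    tokens = [sequence[i:i + k] for i in range(0, len(sequence), k)]
    # A trailing fragment shorter than k is kept as its own token here;
    # a production tokenizer might handle it differently.
    return tokens

print(kmer_tokenize("ATGCGTACGTTAGCTAG"))
```

In practice you would not implement this yourself: loading the repository's tokenizer with `AutoTokenizer.from_pretrained` (shown below) applies the full 6-mer + BPE pipeline.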

Training data

The model was pretrained on curated fish genome sequences collected from publicly available fish genome resources. The training corpus was processed through a unified genome preprocessing pipeline before masked language modeling.


Training objective

This model was pretrained with masked language modeling (MLM). DNA sequences were tokenized using a hybrid 6-mer + BPE strategy and then used to train a BERT-style encoder to recover masked tokens.
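The BERT-style MLM corruption scheme can be sketched as follows: roughly 15% of token positions are selected; of those, 80% are replaced with a [MASK] token, 10% with a random vocabulary token, and 10% are left unchanged, and the model is trained to recover the original tokens at the selected positions. The sketch below is illustrative, not FishNALM's training code; `MASK_ID` and the exact probabilities are assumptions based on standard BERT practice.

```python
import random

MASK_ID = 4        # hypothetical [MASK] token id
VOCAB_SIZE = 4863  # vocabulary size from the Architecture section

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption.

    Returns (inputs, labels): labels hold the original token at each
    selected position and -100 (ignored by the loss) elsewhere.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok            # model must recover this token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID                 # 80%: [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10% keep the original token
    return inputs, labels
```

In a Transformers training loop this corruption is typically handled by `DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)` rather than written by hand.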

Intended uses

This model is intended for:

  • genomic sequence representation learning
  • transfer learning for fish genomics tasks
  • initialization for downstream fine-tuning
  • exploratory analysis of fish genomic regulatory sequences

Limitations

  • This model was primarily developed for fish genomic sequences.
  • Performance outside fish-related genomic contexts may be limited.
  • The model is a research model and should not be used as a clinical or diagnostic system.
  • Performance may vary across species, sequence types, and downstream task definitions.

How to use

Load tokenizer and model

from transformers import AutoTokenizer, AutoModelForMaskedLM

repo_name = "xia-lab/<REPO_NAME>"

tokenizer = AutoTokenizer.from_pretrained(repo_name)
model = AutoModelForMaskedLM.from_pretrained(repo_name)

Example inference

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo_name = "xia-lab/<REPO_NAME>"
sequence = "ATGCGTACGTTAGCTAGCTAGCTAGCTAGCTA"

tokenizer = AutoTokenizer.from_pretrained(repo_name)
model = AutoModelForMaskedLM.from_pretrained(repo_name)

inputs = tokenizer(
    sequence,
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=512,
)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

last_hidden_state = outputs.hidden_states[-1]
logits = outputs.logits

print(last_hidden_state.shape)
print(logits.shape)
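For downstream tasks, the per-token hidden states are commonly pooled into a single fixed-size embedding per sequence, averaging only over real (non-padding) positions. The torch-free sketch below shows the masked mean-pooling logic on plain lists; in a real pipeline you would apply the same idea to `last_hidden_state` using `inputs["attention_mask"]` with tensor operations.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors over non-padding positions.

    hidden_states:  per-token vectors, shape [seq_len][hidden_dim]
    attention_mask: 0/1 flags, 1 for real tokens, 0 for padding
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    return [v / count for v in total]

# Two real tokens followed by one padding position:
emb = masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
print(emb)  # [2.0, 3.0]
```

Mean pooling is one common choice; using the first token's vector (CLS-style pooling) is another, and which works better is task-dependent.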

Files in this repository

Typical files in this repository may include:

  • config.json
  • model.safetensors
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json
  • vocab.txt
  • README.md

Contact

For questions, please contact: xqxia@ihb.ac.cn

Model size

  • Parameters: ~0.3B
  • Tensor type: F32 (stored as safetensors)