FishNALM-8L_pretrain

FishNALM-8L_pretrain is a fish-specific foundation DNA language model in the FishNALM family. It was pretrained on fish genome sequences with masked language modeling (MLM) and is intended for genomic sequence representation learning and downstream transfer to fish genomics tasks.

Model description

FishNALM is a family of DNA foundation models developed specifically for fish genomes. Each model in the family uses a BERT-style masked language modeling objective to learn sequence representations from large-scale fish genomic data.

This repository contains the pretrained checkpoint of FishNALM-8L_pretrain.

Model sources

  • Project: FishNALM
  • GitHub: https://github.com/bioinfoihb/FishNALM
  • Manuscript: FishNALM: A Foundation DNA Language Model for Fish Genomes
  • Organization / Author: Institute of Hydrobiology, Chinese Academy of Sciences

Architecture

  • Backbone: BERT-style masked language model
  • Tokenization: 6-mer + BPE
  • Vocabulary size: 4863
  • Maximum input length: 512 tokens
  • Framework: Hugging Face Transformers
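To illustrate the k-mer half of the hybrid 6-mer + BPE scheme, here is a minimal sketch of non-overlapping 6-mer tokenization in plain Python. This is an assumption-laden illustration only: FishNALM's actual tokenizer additionally applies BPE merges on top of the k-mer vocabulary, and its exact splitting rules may differ. The function name is hypothetical.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into non-overlapping k-mers.

    Illustration of the k-mer step only; the real FishNALM tokenizer
    also applies BPE merges, which are not shown here.
    """
    sequence = sequence.upper()
    tokens = [sequence[i:i + k] for i in range(0, len(sequence), k)]
    # A trailing fragment shorter than k is kept as its own token here;
    # a production tokenizer might handle it differently.
    return tokens

print(kmer_tokenize("ATGCGTACGTTAGCTAG"))
```

In practice you would not implement this yourself: loading the repository's tokenizer with `AutoTokenizer.from_pretrained` (shown below) applies the full 6-mer + BPE pipeline.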

Training data

The model was pretrained on curated fish genome sequences collected from publicly available fish genome resources. The training corpus was processed through a unified genome preprocessing pipeline before masked language modeling.


Training objective

This model was pretrained with masked language modeling (MLM). DNA sequences were tokenized using a hybrid 6-mer + BPE strategy and then used to train a BERT-style encoder to recover masked tokens.
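The BERT-style MLM corruption scheme can be sketched as follows: roughly 15% of token positions are selected; of those, 80% are replaced with a [MASK] token, 10% with a random vocabulary token, and 10% are left unchanged, and the model is trained to recover the original tokens at the selected positions. The sketch below is illustrative, not FishNALM's training code; `MASK_ID` and the exact probabilities are assumptions based on standard BERT practice.

```python
import random

MASK_ID = 4        # hypothetical [MASK] token id
VOCAB_SIZE = 4863  # vocabulary size from the Architecture section

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption.

    Returns (inputs, labels): labels hold the original token at each
    selected position and -100 (ignored by the loss) elsewhere.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok            # model must recover this token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID                 # 80%: [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10% keep the original token
    return inputs, labels
```

In a Transformers training loop this corruption is typically handled by `DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)` rather than written by hand.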

Intended uses

This model is intended for:

  • genomic sequence representation learning
  • transfer learning for fish genomics tasks
  • initialization for downstream fine-tuning
  • exploratory analysis of fish genomic regulatory sequences

Limitations

  • This model was primarily developed for fish genomic sequences.
  • Performance outside fish-related genomic contexts may be limited.
  • The model is a research model and should not be used as a clinical or diagnostic system.
  • Performance may vary across species, sequence types, and downstream task definitions.

How to use

Load tokenizer and model

from transformers import AutoTokenizer, AutoModelForMaskedLM

repo_name = "xia-lab/<REPO_NAME>"

tokenizer = AutoTokenizer.from_pretrained(repo_name)
model = AutoModelForMaskedLM.from_pretrained(repo_name)

Example inference

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo_name = "xia-lab/<REPO_NAME>"
sequence = "ATGCGTACGTTAGCTAGCTAGCTAGCTAGCTA"

tokenizer = AutoTokenizer.from_pretrained(repo_name)
model = AutoModelForMaskedLM.from_pretrained(repo_name)

inputs = tokenizer(
    sequence,
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=512,
)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

last_hidden_state = outputs.hidden_states[-1]
logits = outputs.logits

print(last_hidden_state.shape)
print(logits.shape)
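For downstream tasks, the per-token hidden states are commonly pooled into a single fixed-size embedding per sequence, averaging only over real (non-padding) positions. The torch-free sketch below shows the masked mean-pooling logic on plain lists; in a real pipeline you would apply the same idea to `last_hidden_state` using `inputs["attention_mask"]` with tensor operations.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors over non-padding positions.

    hidden_states:  per-token vectors, shape [seq_len][hidden_dim]
    attention_mask: 0/1 flags, 1 for real tokens, 0 for padding
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    return [v / count for v in total]

# Two real tokens followed by one padding position:
emb = masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
print(emb)  # [2.0, 3.0]
```

Mean pooling is one common choice; using the first token's vector (CLS-style pooling) is another, and which works better is task-dependent.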

Files in this repository

Typical files in this repository may include:

  • config.json
  • model.safetensors
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json
  • vocab.txt
  • README.md

Contact

For questions, please contact: xqxia@ihb.ac.cn

Model size

  • Parameters: ~0.3B
  • Tensor type: F32 (stored as safetensors)