FishNALM-8L_pretrain
FishNALM-8L_pretrain is a fish-specific foundation DNA language model in the FishNALM family. It was pretrained on fish genome sequences with masked language modeling (MLM) and is intended for genomic sequence representation learning and downstream transfer to fish genomics tasks.
Model description
FishNALM is a family of fish-specific DNA foundation models. Each model uses a BERT-style masked language modeling objective to learn sequence representations from large-scale fish genomic data.
This repository contains the pretrained checkpoint of FishNALM-8L_pretrain.
Model sources
- Project: FishNALM
- GitHub: https://github.com/bioinfoihb/FishNALM
- Manuscript: FishNALM: A Foundation DNA Language Model for Fish Genomes
- Organization / Author: Institute of Hydrobiology, Chinese Academy of Sciences
Architecture
- Backbone: BERT-style masked language model
- Tokenization: 6-mer + BPE
- Vocabulary size: 4863
- Maximum input length: 512 tokens
- Framework: Hugging Face Transformers
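The tokenizer combines a k-mer step with BPE subword merging. As a minimal sketch of the k-mer idea only, the snippet below splits a DNA string into overlapping 6-mers; whether FishNALM uses overlapping or non-overlapping 6-mers, and how the BPE merges are applied afterwards, is handled internally by the released tokenizer and is not specified here.

```python
# Illustration of the 6-mer step in a 6-mer + BPE pipeline.
# This is a sketch, not the actual FishNALM tokenizer: the overlap
# convention and the subsequent BPE merging are assumptions.
def kmers(seq: str, k: int = 6) -> list[str]:
    """Split a DNA string into overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmers("ATGCGTACGT"))
# ['ATGCGT', 'TGCGTA', 'GCGTAC', 'CGTACG', 'GTACGT']
```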
Training data
The model was pretrained on curated fish genome sequences collected from publicly available fish genome resources. The training corpus was processed through a unified genome preprocessing pipeline before masked language modeling.
Training objective
This model was pretrained with masked language modeling (MLM). DNA sequences were tokenized using a hybrid 6-mer + BPE strategy and then used to train a BERT-style encoder to recover masked tokens.
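The MLM setup can be sketched with plain tensors: a fraction of token positions is replaced by a mask token, and the loss is computed only at those positions (the standard `-100` ignore-index convention). The token ids, mask id, and 15% masking rate below are illustrative stand-ins, not values taken from the FishNALM training configuration.

```python
import torch

# Toy sketch of the masked language modeling objective.
# MASK_ID and the 15% rate are assumptions for illustration; the real
# ids come from the FishNALM tokenizer vocabulary (4863 entries).
MASK_ID = 4
torch.manual_seed(0)

input_ids = torch.randint(5, 100, (1, 12))   # stand-in for tokenized DNA
labels = input_ids.clone()

mask = torch.rand(input_ids.shape) < 0.15    # choose ~15% of positions
input_ids[mask] = MASK_ID                    # replace them with the mask token
labels[~mask] = -100                         # loss is computed only where masked

print(input_ids)
print(labels)
```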
Intended uses
This model is intended for:
- genomic sequence representation learning
- transfer learning for fish genomics tasks
- initialization for downstream fine-tuning
- exploratory analysis of fish genomic regulatory sequences
Limitations
- This model was primarily developed for fish genomic sequences.
- Performance outside fish-related genomic contexts may be limited.
- The model is a research model and should not be used as a clinical or diagnostic system.
- Performance may vary across species, sequence types, and downstream task definitions.
How to use
Load tokenizer and model
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo_name = "xia-lab/<REPO_NAME>"

tokenizer = AutoTokenizer.from_pretrained(repo_name)
model = AutoModelForMaskedLM.from_pretrained(repo_name)
```
Example inference
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

repo_name = "xia-lab/<REPO_NAME>"
sequence = "ATGCGTACGTTAGCTAGCTAGCTAGCTAGCTA"

tokenizer = AutoTokenizer.from_pretrained(repo_name)
model = AutoModelForMaskedLM.from_pretrained(repo_name)

inputs = tokenizer(
    sequence,
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=512,
)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

last_hidden_state = outputs.hidden_states[-1]
logits = outputs.logits

print(last_hidden_state.shape)
print(logits.shape)
```
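For representation learning, the per-token hidden states are typically reduced to a single sequence embedding, for example by mean pooling over non-padding positions. The sketch below shows this with stand-in tensors; the hidden size of 256 is an assumption for illustration (the real value is `model.config.hidden_size`), and padding positions are faked via the attention mask.

```python
import torch

# Mean-pooling sketch: collapse per-token hidden states into one
# sequence embedding, ignoring padding. Shapes are illustrative:
# batch 1, 512 tokens, hidden size 256 (an assumption).
hidden = torch.randn(1, 512, 256)        # stand-in for last_hidden_state
attention_mask = torch.ones(1, 512)
attention_mask[:, 300:] = 0              # pretend positions 300+ are padding

mask = attention_mask.unsqueeze(-1)      # (1, 512, 1), broadcasts over hidden dim
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(embedding.shape)                   # torch.Size([1, 256])
```

With a real model, `hidden` would be `outputs.hidden_states[-1]` and `attention_mask` would come from the tokenizer output.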
Files in this repository
Typical files in this repository may include:
- config.json
- model.safetensors
- tokenizer.json
- tokenizer_config.json
- special_tokens_map.json
- vocab.txt
- README.md
Contact
For questions, please contact: xqxia@ihb.ac.cn