TrackMAE: Video Representation Learning via Track Mask and Predict
Paper: arXiv:2603.27268
Pretrained and fine-tuned video checkpoints from TrackMAE: Video Representation Learning via Track Mask and Predict. The weights are PyTorch state dictionaries compatible with the official TrackMAE codebase.

Pretrained checkpoints:

| File | Backbone | Dataset | Epochs | Spatial Target |
|---|---|---|---|---|
| `pretrain/vit_b/videomae_cotracker_base_bs_512_mask_tube_800_lambda_0.25_up_14_clip.pth` | ViT-B | Kinetics-400 | 800 | CLIP |
| `pretrain/vit_l/videomae_cotracker_large_bs_1024_mask_tube_800_lambda_0.25_up_14_clip_k700.pth` | ViT-L | Kinetics-700 | 800 | CLIP |

Fine-tuned checkpoints:

| File | Backbone | Pretraining | Fine-tuning |
|---|---|---|---|
| `finetune/vit_b/k400_finetuned_vit_b_with_k400_pretraining_trackmae.pth` | ViT-B | Kinetics-400 | Kinetics-400 |
| `finetune/vit_b/ssv2_finetuned_vit_b_with_k400_pretraining_trackmae.pth` | ViT-B | Kinetics-400 | Something-Something V2 |
| `finetune/vit_l/ssv2_finetuned_vit_l_with_k700_pretraining_trackmae.pth` | ViT-L | Kinetics-700 | Something-Something V2 |
| TBA | ViT-L | Kinetics-700 | Kinetics-400 |

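
The fine-tuned files listed above can be looked up programmatically. The following is a minimal sketch: the paths are copied from the table, while the short keys (`"vit_b"`, `"ssv2"`, and so on) and the helper name `finetuned_filename` are hypothetical conveniences, not part of the TrackMAE codebase.

```python
# Fine-tuned checkpoint paths copied from the table above.
# Keys are (backbone, fine-tuning dataset); the key names are illustrative assumptions.
FINETUNED = {
    ("vit_b", "k400"): "finetune/vit_b/k400_finetuned_vit_b_with_k400_pretraining_trackmae.pth",
    ("vit_b", "ssv2"): "finetune/vit_b/ssv2_finetuned_vit_b_with_k400_pretraining_trackmae.pth",
    ("vit_l", "ssv2"): "finetune/vit_l/ssv2_finetuned_vit_l_with_k700_pretraining_trackmae.pth",
}


def finetuned_filename(backbone, dataset):
    """Return the repository path for a released fine-tuned checkpoint."""
    try:
        return FINETUNED[(backbone, dataset)]
    except KeyError:
        # The ViT-L / Kinetics-400 entry is listed as TBA, so it is deliberately absent.
        raise KeyError(
            f"No released checkpoint for backbone={backbone!r}, dataset={dataset!r}"
        ) from None
```

The returned string can be passed as the `filename` argument of `hf_hub_download`.
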
Download a pretrained checkpoint:
```python
from huggingface_hub import hf_hub_download

checkpoint = hf_hub_download(
    repo_id="rvandeghen/TrackMAE",
    filename="pretrain/vit_b/videomae_cotracker_base_bs_512_mask_tube_800_lambda_0.25_up_14_clip.pth",
)
```
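
Once downloaded, the file can be inspected before it is handed to the TrackMAE codebase. The sketch below is an assumption-laden illustration, not the official loading routine: the wrapper keys `"model"`, `"module"`, and `"state_dict"` are common conventions that may or may not apply to these files, and the helper names are hypothetical.

```python
from collections import Counter


def unwrap_state_dict(checkpoint, candidate_keys=("model", "module", "state_dict")):
    """Return the parameter mapping, unwrapping one container level if present.

    The candidate keys are an assumption; if none match, the input is returned as-is.
    """
    if isinstance(checkpoint, dict):
        for key in candidate_keys:
            inner = checkpoint.get(key)
            if isinstance(inner, dict):
                return inner
    return checkpoint


def summarize_prefixes(state_dict, depth=1):
    """Count parameter names by their leading dotted prefix (e.g. 'encoder')."""
    counts = Counter(".".join(name.split(".")[:depth]) for name in state_dict)
    return dict(counts)


def load_checkpoint(path):
    """Load a checkpoint file on CPU and return its parameter mapping."""
    import torch  # imported lazily so the helpers above work without PyTorch

    # Recent PyTorch versions restrict unpickling by default; if the file stores
    # extra Python objects, loading may need to be adjusted accordingly.
    checkpoint = torch.load(path, map_location="cpu")
    return unwrap_state_dict(checkpoint)
```

Calling `summarize_prefixes(load_checkpoint(checkpoint))` on the path returned by `hf_hub_download` gives a quick view of which top-level modules the file contains.
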
```bibtex
@inproceedings{vandeghen2026trackmae,
  title     = {TrackMAE: Video Representation Learning via Track Mask and Predict},
  author    = {Vandeghen, Renaud and Thoker, Fida Mohammad and Van Droogenbroeck, Marc and Ghanem, Bernard},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026}
}
```