MolmoPoint: Better Pointing for VLMs with Grounding Tokens Paper • 2603.28069 • Published 3 days ago • 6
MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models Paper • 2603.25744 • Published 6 days ago • 12
RealMaster: Lifting Rendered Scenes into Photorealistic Video Paper • 2603.23462 • Published 8 days ago • 31
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing Paper • 2603.12254 • Published 20 days ago • 21
SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning Paper • 2603.22057 • Published 9 days ago • 45
mSFT: Addressing Dataset Mixtures Overfiting Heterogeneously in Multi-task SFT Paper • 2603.21606 • Published 10 days ago • 38
Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection Paper • 2603.21944 • Published 9 days ago • 26
VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding Paper • 2603.22285 • Published 9 days ago • 50
HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning Paper • 2603.17024 • Published 15 days ago • 106
A Subgoal-driven Framework for Improving Long-Horizon LLM Agents Paper • 2603.19685 • Published 13 days ago • 19
Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models Paper • 2603.18002 • Published 14 days ago • 13
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding Paper • 2603.19235 • Published 13 days ago • 93
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs Paper • 2603.18004 • Published 14 days ago • 12
MosaicMem: Hybrid Spatial Memory for Controllable Video World Models Paper • 2603.17117 • Published 15 days ago • 87
MolmoB0T: Large-Scale Simulation Enables Zero-Shot Manipulation Paper • 2603.16861 • Published 15 days ago • 9