VLM
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 109
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
Paper • 2412.19723 • Published • 87
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper • 2412.04467 • Published • 117
PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper • 2412.03555 • Published • 133
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper • 2411.17465 • Published • 89
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
Paper • 2501.04003 • Published • 27
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Paper • 2501.00599 • Published • 46
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper • 2501.06186 • Published • 65
DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests
Paper • 2501.04671 • Published
The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering
Paper • 2502.03628 • Published • 12
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Paper • 2502.14786 • Published • 157
Qwen2.5-VL Technical Report
Paper • 2502.13923 • Published • 212