vision
What matters when building vision-language models?
Paper
• 2405.02246
• Published
• 103
An Introduction to Vision-Language Modeling
Paper
• 2405.17247
• Published
• 90
DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark
Paper
• 2405.19707
• Published
• 8
Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations
Paper
• 2410.08049
• Published
• 8
Task Vectors are Cross-Modal
Paper
• 2410.22330
• Published
• 11
DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
Paper
• 2411.14347
• Published
• 16
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Paper
• 2412.04424
• Published
• 62
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper
• 2412.04467
• Published
• 117
iFormer: Integrating ConvNet and Transformer for Mobile Application
Paper
• 2501.15369
• Published
• 13
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Paper
• 2502.14786
• Published
• 158
Paper
• 2502.17941
• Published
• 10
Paper
• 2508.10104
• Published
• 298
Real-Time Object Detection Meets DINOv3
Paper
• 2509.20787
• Published
• 11
RF-DETR: Neural Architecture Search for Real-Time Detection Transformers
Paper
• 2511.09554
• Published
• 9
YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection
Paper
• 2511.13344
• Published
• 3