Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Abstract
Region-to-Image Distillation enables fine-grained visual perception in MLLMs by internalizing the benefits of iterative zooming at training time, eliminating repeated tool calls and visual re-encoding at inference while maintaining strong performance across multiple benchmarks.
Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where the decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but they incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without any tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA items spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global--regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks and also improve general multimodal cognition on tasks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at https://github.com/inclusionAI/Zooming-without-Zooming.
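The abstract does not spell out implementation details, but the data-construction step it describes (zoom in, query a teacher, distill back to the full image) can be sketched roughly as below. This is a minimal illustration, assuming a hypothetical `teacher_vqa` callable that wraps a strong teacher MLLM and using PIL for cropping; it is not the authors' exact pipeline.

```python
from PIL import Image


def build_r2i_example(image_path, region_box, teacher_vqa):
    """Construct one Region-to-Image Distillation training example.

    `teacher_vqa` is a hypothetical callable that takes a PIL image crop
    and returns a (question, answer) pair produced by a strong teacher MLLM.
    """
    full_image = Image.open(image_path).convert("RGB")

    # 1) Zoom in: micro-crop the region of interest so the teacher sees the
    #    fine-grained evidence at high effective resolution.
    crop = full_image.crop(region_box)  # (left, upper, right, lower)

    # 2) Let the teacher generate region-grounded VQA supervision on the crop.
    question, answer = teacher_vqa(crop)

    # 3) Distill back to the full image: the student is trained to answer the
    #    same question from the uncropped image in a single forward pass.
    return {"image": full_image, "question": question, "answer": answer}
```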
Community
Recent "Thinking-with-Images" methods improve fine-grained perception by iteratively zooming into regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. In this work, we present ZwZ models (4/7/8B), achieving SOTA performance on multimodal perception benchmarks among open-source models. In addition, we present ZoomBench, a hybrid-annotated benchmark of 845 VQA data spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global--regional "zooming gap".
Github: https://github.com/inclusionAI/Zooming-without-Zooming
Huggingface collection: https://huggingface.co/collections/inclusionAI/zooming-without-zooming
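As an illustration only (the paper's exact protocol may differ), the dual-view "zooming gap" can be read as the accuracy difference between answering each question from the zoomed-in regional view and from the global full-image view. A minimal sketch, assuming per-question correctness lists for the two views:

```python
def zooming_gap(global_correct, regional_correct):
    """Dual-view 'zooming gap': regional-view accuracy minus global-view accuracy.

    Both arguments are boolean lists over the same set of questions; a larger
    gap means more of the answer evidence is lost when only the full image is seen.
    """
    acc_global = sum(global_correct) / len(global_correct)
    acc_regional = sum(regional_correct) / len(regional_correct)
    return acc_regional - acc_global
```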
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning (2026)
- GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents (2026)
- RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations (2025)
- Chatting with Images for Introspective Visual Thinking (2026)
- Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning (2026)
- SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning (2026)
- REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation (2025)
