Abstract
Diffusion models are adapted for optical character recognition by using block discrete diffusion to enable faster, parallel processing while maintaining high accuracy.
Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLMs) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents because it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task in which the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential: they introduce structural instabilities that are benign in flexible tasks such as captioning but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference than autoregressive baselines.
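To make the decoding scheme concrete, the following is a minimal sketch of block-wise parallel unmasking, the core idea behind block discrete diffusion decoding. Everything here is illustrative: `block_diffusion_decode`, `predict_fn`, the confidence threshold, and the toy oracle standing in for the VLM are all hypothetical names and simplifications, not the paper's implementation.

```python
MASK = "<mask>"

def block_diffusion_decode(predict_fn, seq_len, block_size, threshold=0.9, max_steps=16):
    """Decode a sequence block by block, left to right. Within a block,
    every masked position is predicted in parallel at each step, and
    predictions whose confidence exceeds `threshold` are committed
    (unmasked). The single most confident prediction is always committed,
    so every step makes progress. `predict_fn(tokens)` returns a
    (token, confidence) pair for every position."""
    tokens = [MASK] * seq_len
    for start in range(0, seq_len, block_size):
        end = min(start + block_size, seq_len)
        for _ in range(max_steps):
            masked = [i for i in range(start, end) if tokens[i] is MASK]
            if not masked:
                break  # block fully decoded
            preds = predict_fn(tokens)  # one parallel forward pass
            best = max(masked, key=lambda i: preds[i][1])
            for i in masked:
                tok, conf = preds[i]
                if conf >= threshold or i == best:
                    tokens[i] = tok
    return tokens

# Toy oracle standing in for the trained VLM: because OCR is nearly
# deterministic, the model is highly confident at most positions, so
# many tokens are committed per step.
target = list("BLOCK DIFFUSION FOR OCR")

def oracle(tokens):
    return [(t, 0.95) for t in target]

out = block_diffusion_decode(oracle, len(target), block_size=8)
print("".join(out))
```

With a confident oracle each block resolves in a single step, illustrating why a near-deterministic task like OCR can commit many tokens per forward pass; a less confident model would simply take more refinement steps per block.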
Community
TL;DR – OCR is an almost deterministic task, which enables, in theory, parallel decoding. We realize this potential by training a Block Masked Diffusion Model. We achieve competitive results with SOTA autoregressive OCR VLMs, while decoding more than 10 tokens per step on average.
The following similar papers were recommended by the Semantic Scholar API (automated message from Librarian Bot):
- VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding (2026)
- Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding (2026)
- d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation (2026)
- Global Context Compression with Interleaved Vision-Text Transformation (2026)
- LinMU: Multimodal Understanding Made Linear (2026)
- Autoregressive Image Generation with Masked Bit Modeling (2026)
- DeepSeek-OCR 2: Visual Causal Flow (2026)