Muon has gone from an experiment to a mainstream optimizer, but does it hold up for fine-tuning? We ran head-to-head tests on Qwen3-4B (10k+ high-quality instruction rows) to find out.
Short story: Pure Muon converged fastest at the start, but its gradient-norm spikes made training unstable. MuonClip (Kimi K2's clipping) stabilizes long pretraining runs, yet in our small-scale fine-tune it underperformed, with lower token accuracy and slower convergence. The winner was the hybrid: Muon for 2D layers + AdamW for 1D layers (sketch below). It delivered the best balance of stability and final performance and even beat vanilla AdamW.
Takeaway: for small-scale fine-tuning, hybrid = practical and reliable.
Next Step: scale to larger models/datasets to see if Muon's spikes become catastrophic or if clipping wins out.
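For anyone who wants to try the hybrid setup, here is a minimal sketch of the parameter split described above. It assumes a Muon implementation with a torch.optim-style interface; the `muon` import, constructor arguments, and learning rates are assumptions, not a confirmed API.

```python
import torch
from torch import nn

# Assumption: a Muon implementation exposing a torch.optim-style class,
# e.g. "from muon import Muon". The package name and constructor
# arguments may differ from whatever implementation you use.
from muon import Muon


def build_hybrid_optimizers(model: nn.Module, muon_lr=0.02, adamw_lr=1e-5):
    """Split parameters as in the post: Muon for 2D weight matrices,
    AdamW for 1D tensors (biases, norm gains, etc.).
    Note: many Muon setups also route embedding / lm_head weights to
    AdamW even though they are 2D; that refinement is omitted here."""
    muon_params, adamw_params = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        (muon_params if p.ndim == 2 else adamw_params).append(p)
    return [
        Muon(muon_params, lr=muon_lr, momentum=0.95),  # assumed signature
        torch.optim.AdamW(adamw_params, lr=adamw_lr, weight_decay=0.01),
    ]


# In the training loop, step both optimizers after each backward pass:
#   for opt in optimizers:
#       opt.step()
#       opt.zero_grad(set_to_none=True)
```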
New Research Alert - ICCV 2025 (Poster)! Title: Is Less More? Exploring Token Condensation as Training-Free Test-Time Adaptation
Description: Token Condensation as Adaptation (TCA) improves the performance and efficiency of Vision Language Models in zero-shot inference by introducing domain anchor tokens.
Authors: Zixin Wang, Dong Gong, Sen Wang, Zi Huang, Yadan Luo
New Research Alert - ICCV 2025 (Oral)! Title: Diving into the Fusion of Monocular Priors for Generalized Stereo Matching
Description: The proposed method enhances stereo matching by efficiently combining unbiased monocular priors from vision foundation models. It addresses misalignment and local-optima issues using a binary local ordering map and pixel-wise linear regression.
New Research Alert - ICCV 2025 (Oral)! Title: Understanding Co-speech Gestures in-the-wild
Description: JEGAL is a tri-modal model that learns from gestures, speech and text simultaneously, enabling devices to interpret co-speech gestures in the wild.
Authors: @sindhuhegde, K R Prajwal, Taein Kwon, and Andrew Zisserman
New Research Alert - ICCV 2025 (Oral)! Title: LoftUp: Learning a Coordinate-based Feature Upsampler for Vision Foundation Models
Description: LoftUp is a coordinate-based transformer that upscales the low-resolution features of VFMs (e.g. DINOv2 and CLIP) using cross-attention and self-distilled pseudo-ground truth (pseudo-GT) from SAM.
Authors: Haiwen Huang, Anpei Chen, Volodymyr Havrylov, Andreas Geiger, and Dan Zhang
Description: The HeLlO framework is a new dataset distillation method that removes the need for large soft labels. It uses a lightweight, online image-to-label projector built on CLIP, adapted with LoRA-style parameter-efficient tuning and initialized from text embeddings.
New Research Alert - ICCV 2025 (Oral)! Title: Variance-based Pruning for Accelerating and Compressing Trained Networks
Description: The one-shot pruning method efficiently compresses networks, reducing computation and memory usage while retaining almost full performance and requiring minimal fine-tuning.
Authors: Uranik Berisha, Jens Mehnert, and Alexandru Paul Condurache
New Research Alert - ICCV 2025 (Oral)! Title: Token Activation Map to Visually Explain Multimodal LLMs
Description: The Token Activation Map (TAM) is an advanced explainability method for multimodal LLMs. Using causal inference and a Rank Gaussian Filter, TAM reveals token-level interactions and eliminates redundant activations. The result is clearer, high-quality visualizations that enhance understanding of object localization, reasoning and multimodal alignment across models.
Authors: Yi Li, Hualiang Wang, Xinpeng Ding, Haonan Wang, and Xiaomeng Li
deepseek-ai/DeepSeek-OCR is out! My take:
> pretty insane that it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding
> very efficient vision-tokens-to-performance ratio
> covers 100 languages
Tokenization is one of the most important processes in AI - yet many would like to kill it.
What's tokenization? The neural networks inside LLMs actually only process numbers, not text: tokenization is the process that makes text readable for them, by converting sentences into lists of numbers.
For instance, "This is tokenization" would be split into "This | is | token | ization", then each of these parts (tokens) is converted to an ID according to a predefined mapping: for instance, "ization" could map to ID 2438. Thus "This is tokenization" can become 1335 | 135 | 2980 | 2438 => now the model can process the sentence!
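If you want to see this on a real tokenizer, here is a quick sketch using Hugging Face transformers. The "gpt2" checkpoint is an arbitrary choice, and its actual splits and IDs will differ from the illustrative numbers above.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works here
text = "This is tokenization"
print(tok.tokenize(text))  # subword pieces, e.g. ['This', 'Ġis', 'Ġtoken', 'ization']
print(tok.encode(text))    # the corresponding list of integer token IDs
```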
Most tokenizers today use pre-specified mappings called "vocabularies", generally built with the Byte-Pair Encoding (BPE) compression algorithm, which learns from a large corpus of text an optimized split that efficiently encodes any text from the same distribution into a list of token IDs.
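As a rough illustration of how BPE builds such a vocabulary (a toy sketch, not a re-implementation of any particular library), the training loop simply keeps merging the most frequent adjacent pair of symbols:

```python
from collections import Counter


def learn_bpe_merges(words, num_merges=10):
    """Toy BPE training: repeatedly merge the most frequent adjacent pair."""
    # Represent each word as a tuple of characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the current vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe_merges(["tokenization", "organization", "token"], num_merges=5))
```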
Now, these current tokenizers have flaws. For instance, the rigidity of their mapping creates losses; the prime example being that a tokenizer designed for English (thus optimized for tokens like "has", "been", "clock", etc.) will not have the right tokens to handle Burmese, making it terribly inefficient at it.
Many alternative approaches have emerged as a result, for instance "tokenizer-free tokenizers". One that I really liked is "entropy-based" splitting: it monitors the stream of text and triggers a split whenever the entropy increases too much, i.e. when something "surprising" happens.
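To make the idea concrete, here is a toy sketch of entropy-based splitting (my own illustrative version, not the method from any specific paper): estimate next-character entropy with a simple bigram model and start a new chunk whenever it spikes above a threshold.

```python
import math
from collections import Counter, defaultdict


def entropy_based_split(text, corpus, threshold=2.5):
    """Toy illustration: split `text` into chunks at points where the
    bigram-estimated next-character entropy exceeds `threshold`."""
    # Build simple bigram counts from a reference corpus: prev char -> next-char counts.
    bigrams = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigrams[prev][nxt] += 1

    def next_char_entropy(prev):
        counts = bigrams.get(prev)
        if not counts:
            return float("inf")  # unseen context = maximally surprising
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    chunks, current = [], text[0]
    for prev, nxt in zip(text, text[1:]):
        if next_char_entropy(prev) > threshold:
            chunks.append(current)  # high uncertainty -> chunk boundary
            current = nxt
        else:
            current += nxt
    chunks.append(current)
    return chunks
```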