new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jan 9

Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation

Automatic radiology report generation has attracted enormous research interest due to its practical value in reducing the workload of radiologists. However, simultaneously establishing global correspondences between the image (e.g., Chest X-ray) and its related report and local alignments between image patches and keywords remains challenging. To this end, we propose an Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments and introduce three novel modules: Latent Space Unifier (LSU), Cross-modal Representation Aligner (CRA) and Text-to-Image Refiner (TIR). Specifically, LSU unifies multimodal data into discrete tokens, making it flexible to learn common knowledge among modalities with a shared network. The modality-agnostic CRA learns discriminative features via a set of orthonormal basis and a dual-gate mechanism first and then globally aligns visual and textual representations under a triplet contrastive loss. TIR boosts token-level local alignment via calibrating text-to-image attention with a learnable mask. Additionally, we design a two-stage training procedure to make UAR gradually grasp cross-modal alignments at different levels, which imitates radiologists' workflow: writing sentence by sentence first and then checking word by word. Extensive experiments and analyses on IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our UAR against varied state-of-the-art methods.

  • 6 authors
·
Mar 28, 2023

StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians

3D Gaussian Splatting (3DGS) excels in photorealistic scene reconstruction but struggles with stylized scenarios (e.g., cartoons, games) due to fragmented textures, semantic misalignment, and limited adaptability to abstract aesthetics. We propose StyleMe3D, a holistic framework for 3D GS style transfer that integrates multi-modal style conditioning, multi-level semantic alignment, and perceptual quality enhancement. Our key insights include: (1) optimizing only RGB attributes preserves geometric integrity during stylization; (2) disentangling low-, medium-, and high-level semantics is critical for coherent style transfer; (3) scalability across isolated objects and complex scenes is essential for practical deployment. StyleMe3D introduces four novel components: Dynamic Style Score Distillation (DSSD), leveraging Stable Diffusion's latent space for semantic alignment; Contrastive Style Descriptor (CSD) for localized, content-aware texture transfer; Simultaneously Optimized Scale (SOS) to decouple style details and structural coherence; and 3D Gaussian Quality Assessment (3DG-QA), a differentiable aesthetic prior trained on human-rated data to suppress artifacts and enhance visual harmony. Evaluated on NeRF synthetic dataset (objects) and tandt db (scenes) datasets, StyleMe3D outperforms state-of-the-art methods in preserving geometric details (e.g., carvings on sculptures) and ensuring stylistic consistency across scenes (e.g., coherent lighting in landscapes), while maintaining real-time rendering. This work bridges photorealistic 3D GS and artistic stylization, unlocking applications in gaming, virtual worlds, and digital art.

  • 10 authors
·
Apr 21, 2025 2

Remote Sensing Large Vision-Language Model: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling

Large Vision and Language Models (LVLMs) have shown strong performance across various vision-language tasks in natural image domains. However, their application to remote sensing (RS) remains underexplored due to significant domain differences in visual appearances, object scales, and semantics. These discrepancies hider the effective understanding of RS scenes, which contain rich, multi-level semantic information spanning from coarse-to-fine levels. Hence, it limits the direct adaptation of existing LVLMs to RS imagery. To address this gap, we propose a novel LVLM framework tailored for RS understanding, incorporating two core components: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling. First, to align multi-level visual features, we introduce the retrieval-based Semantic Augmentation Module which enriches the visual features with relevant semantics across fine-to-coarse levels (e.g., object- and scene-level information). It is designed to retrieve relevant semantic cues from a RS semantic knowledge database, followed by aggregation of semantic cues with user query and multi-level visual features, resulting in semantically enriched representation across multiple levels. Second, for Semantic-aware Expert Modeling, we design semantic experts, where each expert is responsible for processing semantic representation at different levels separately. This enables hierarchical semantic understanding from coarse to fine levels. Evaluations across multiple RS tasks-including scene classification and VQA, etc.-demonstrate that the proposed framework achieves consistent improvements across multiple semantic levels. This highlights its capability and effectiveness in bridging the gap between general LVLMs and unique demands of RS-specific vision-language understanding.

  • 4 authors
·
Jun 26, 2025

FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO

Multi-modal large language models (MLLMs) have shown strong capability in video understanding but still struggle with fine-grained visual comprehension, as pure visual encoders often lose subtle cues essential for precise reasoning. To address this limitation, we propose FaVChat, a Video-MLLM specifically designed for fine-grained facial understanding. FaVChat introduces a multi-level prompt-guided feature extraction mechanism that progressively captures task-relevant information from three complementary stages: low-level transformer layers for textures and motion, medium-level learnable queries for discriminative regions, and high-level adaptive feature weighting for semantic alignment. These enriched features are dynamically fused and fed into the LLM to enable more accurate fine-grained reasoning. To further enhance the model's ability to capture fine-grained facial attributes and maximize the utility of limited data, we propose Date-Efficient GRPO, a novel data-efficient reinforcement learning (RL) algorithm that maximizes the utility of each training sample through per-instance utility estimation and dynamic lifecycle scheduling. Extensive zero-shot evaluations across emotion recognition, explainable reasoning, and textual expression analysis demonstrate that FaVChat achieves finer-grained understanding, stronger accuracy, and better generalization than existing Video-MLLMs, even when trained with only 10K RL samples.

  • 9 authors
·
Mar 12, 2025