Multimodal • Score 85
KGCE: Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models
arXiv:2601.01366v1 Announce Type: new
Abstract: With the rapid adoption of multimodal large language models (MLLMs) in autonomous agents, cross-platform task execution capabilities in educational settings have garnered significant attention. However, existing benchmark frameworks still exhibit notable deficiencies in supporting cross-platform tasks in educational contexts, especially when dealing with school-specific software (such as XiaoYa Intelligent Assistant, HuaShi XiaZi, etc.), where agent efficiency often drops sharply due to a lack of understanding of the structural specifics of this private-domain software. Additionally, current evaluation methods rely heavily on coarse-grained metrics such as goal orientation or trajectory matching, making it difficult to capture the detailed execution and efficiency of agents in complex tasks. To address these issues, we propose KGCE (Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models), a novel benchmarking platform that integrates knowledge-base enhancement with a dual-graph evaluation framework. We first construct a dataset of 104 education-related tasks covering Windows, Android, and cross-platform collaborative tasks. KGCE introduces a dual-graph evaluation framework that decomposes tasks into multiple sub-goals and verifies their completion status, providing fine-grained evaluation metrics. To overcome the execution bottlenecks of existing agents on private-domain tasks, we developed an enhanced agent system incorporating a knowledge base tailored to school-specific software. The code can be found at https://github.com/Kinginlife/KGCE.
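As a rough illustration of the dual-graph idea of decomposing a task into sub-goals and verifying each one, here is a minimal sketch; the task, sub-goal names, and checker functions are hypothetical and not KGCE's actual schema.

```python
# Minimal sketch of sub-goal decomposition and verification (hypothetical task/schema).
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SubGoal:
    name: str
    check: Callable[[dict], bool]              # verifies completion from an execution log
    depends_on: List[str] = field(default_factory=list)


def evaluate(subgoals: Dict[str, SubGoal], log: dict) -> dict:
    """Return per-sub-goal completion and a fine-grained score (fraction completed)."""
    done = {}
    for name, sg in subgoals.items():
        prereqs_ok = all(done.get(d, False) for d in sg.depends_on)
        done[name] = prereqs_ok and sg.check(log)
    return {"per_subgoal": done, "score": sum(done.values()) / len(done)}


# Hypothetical task: "upload homework to the course platform".
subgoals = {
    "open_app":    SubGoal("open_app",    lambda log: log.get("app_opened", False)),
    "find_course": SubGoal("find_course", lambda log: log.get("course_page", False), ["open_app"]),
    "upload_file": SubGoal("upload_file", lambda log: log.get("file_uploaded", False), ["find_course"]),
}

print(evaluate(subgoals, {"app_opened": True, "course_page": True, "file_uploaded": False}))
```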
Fonte: arXiv cs.AI
Multimodal • Score 85
OmniNeuro: A Multimodal HCI Framework for Explainable BCI Feedback via Generative AI and Sonification
Although deep learning has improved the decoding accuracy of brain-computer interfaces (BCIs), clinical adoption is hindered by the 'black-box' nature of these algorithms, leading to user frustration and poor neuroplasticity outcomes. We propose OmniNeuro, a novel HCI framework that transforms the BCI from a silent decoder into a transparent feedback partner.
Fonte: arXiv cs.AI
NLP/LLMs • Score 85
Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications
arXiv:2601.01718v1 Announce Type: new
Abstract: We introduce Yuan3.0 Flash, an open-source Mixture-of-Experts (MoE) Multimodal Large Language Model featuring 3.7B activated parameters and 40B total parameters, specifically designed to enhance performance on enterprise-oriented tasks while maintaining competitive capabilities on general-purpose tasks. To address the overthinking phenomenon commonly observed in Large Reasoning Models (LRMs), we propose Reflection-aware Adaptive Policy Optimization (RAPO), a novel RL training algorithm that effectively regulates overthinking behaviors. In enterprise-oriented tasks such as retrieval-augmented generation (RAG), complex table understanding, and summarization, Yuan3.0 Flash consistently achieves superior performance. Moreover, it also demonstrates strong reasoning capabilities in domains such as mathematics and science, attaining accuracy comparable to frontier models while requiring only approximately 1/4 to 1/2 of the average tokens. Yuan3.0 Flash has been fully open-sourced to facilitate further research and real-world deployment: https://github.com/Yuan-lab-LLM/Yuan3.0.
Fonte: arXiv cs.AI
Multimodal • Score 85
MMP-A*: Multimodal Perception Enhanced Incremental Heuristic Search on Path Planning
arXiv:2601.01910v1 Announce Type: new
Abstract: Autonomous path planning requires a synergy between global reasoning and geometric precision, especially in complex or cluttered environments. While classical A* is valued for its optimality, it incurs prohibitive computational and memory costs in large-scale scenarios. Recent attempts to mitigate these limitations by using Large Language Models for waypoint guidance remain insufficient, as they rely only on text-based reasoning without spatial grounding. As a result, such models often produce incorrect waypoints in topologically complex environments with dead ends, and lack the perceptual capacity to interpret ambiguous physical boundaries. These inconsistencies lead to costly corrective expansions and undermine the intended computational efficiency.
We introduce MMP-A*, a multimodal framework that integrates the spatial grounding capabilities of vision-language models with a novel adaptive decay mechanism. By anchoring high-level reasoning in physical geometry, the framework produces coherent waypoint guidance that addresses the limitations of text-only planners. The adaptive decay mechanism dynamically regulates the influence of uncertain waypoints within the heuristic, ensuring geometric validity while substantially reducing memory overhead. To evaluate robustness, we test the framework in challenging environments characterized by severe clutter and topological complexity. Experimental results show that MMP-A* achieves near-optimal trajectories with significantly reduced operational costs, demonstrating its potential as a perception-grounded and computationally efficient paradigm for autonomous navigation.
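A very rough sketch of how decayed waypoint guidance might enter an A*-style heuristic; the decay schedule, confidence weighting, and function names are illustrative assumptions, not the paper's exact formulation.

```python
import math

def heuristic(node, goal, waypoints, expansions, base_weight=1.0, decay=0.01):
    """Goal distance plus a waypoint attraction term whose influence decays as search
    progresses (a stand-in for an adaptive decay mechanism; illustrative only)."""
    d_goal = math.dist(node, goal)
    if not waypoints:
        return d_goal
    # Influence of VLM-suggested waypoints is damped exponentially with expansions,
    # so uncertain guidance cannot dominate the search indefinitely.
    influence = base_weight * math.exp(-decay * expansions)
    wp, conf = min(waypoints, key=lambda w: math.dist(node, w[0]))   # nearest waypoint
    return d_goal + influence * conf * math.dist(node, wp)

# Example: one confident and one uncertain waypoint suggested by a vision-language model.
waypoints = [((5.0, 5.0), 0.9), ((2.0, 8.0), 0.4)]
print(heuristic((0.0, 0.0), (10.0, 10.0), waypoints, expansions=0))
print(heuristic((0.0, 0.0), (10.0, 10.0), waypoints, expansions=500))
```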
Fonte: arXiv cs.AI
Multimodal • Score 92
A unified multimodal understanding and generation model for interdisciplinary scientific research
Scientific discovery increasingly depends on integrating heterogeneous, high-dimensional data across disciplines. We present FuXi-Uni, a natively unified model that supports scientific understanding and high-fidelity multimodal data generation by aligning interdisciplinary scientific tokens with natural-language tokens and employing a scientific decoder. We validate FuXi-Uni in the Earth sciences and biomedicine.
Fonte: arXiv cs.AI
NLP/LLMs • Score 85
Semantic Alignment of Multilingual Knowledge Graphs via Contextualized Vector Projections
The paper presents our work on a cross-lingual ontology alignment system that uses embedding-based cosine-similarity matching. Ontology entities are contextually enriched with descriptions created using novel techniques. We evaluate our work on the OAEI-2022 multifarm track, achieving a 71% F1 score, indicating the effectiveness of our alignment pipeline.
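A minimal sketch of the embedding-based cosine-similarity matching step; the entity names and random vectors are stand-ins for contextualized description embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
src = {"Conference": rng.normal(size=64), "Author": rng.normal(size=64)}   # e.g. English ontology
tgt = {"Konferenz":  rng.normal(size=64), "Autor":  rng.normal(size=64)}   # e.g. German ontology

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Greedy best match per source entity; a real pipeline would apply a tuned similarity threshold.
for s_name, s_vec in src.items():
    t_name, score = max(((t, cosine(s_vec, t_vec)) for t, t_vec in tgt.items()),
                        key=lambda x: x[1])
    print(f"{s_name} -> {t_name}  (cosine = {score:.2f})")
```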
Fonte: arXiv cs.AI
NLP/LLMs • Score 85
OpenSocInt: A Multi-modal Training Environment for Human-Aware Social Navigation
arXiv:2601.01939v1 Announce Type: new
Abstract: In this paper, we introduce OpenSocInt, an open-source software package providing a simulator for multi-modal social interactions and a modular architecture to train social agents. We describe the software package and demonstrate its utility via an experimental protocol based on the task of social navigation. Our framework allows for exploring the use of different perceptual features, their encoding and fusion, as well as the use of different agents. The software is already publicly available under GPL at https://gitlab.inria.fr/robotlearn/OpenSocInt/.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
Learning Speech Representations with Variational Predictive Coding
arXiv:2601.00100v1 Announce Type: cross
Abstract: Despite being the best known objective for learning speech representations, the HuBERT objective has not been further developed and improved. We argue that it is the lack of an underlying principle that stalls the development, and, in this paper, we show that predictive coding under a variational view is the principle behind the HuBERT objective. Due to its generality, our formulation provides opportunities to improve parameterization and optimization, and we show two simple modifications that bring immediate improvements to the HuBERT objective. In addition, the predictive coding formulation has tight connections to various other objectives, such as APC, CPC, wav2vec, and BEST-RQ. Empirically, the improvement in pre-training brings significant improvements to four downstream tasks: phone classification, f0 tracking, speaker recognition, and automatic speech recognition, highlighting the importance of the predictive coding interpretation.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Comparative Efficiency Analysis of Lightweight Transformer Models: A Multi-Domain Empirical Benchmark for Enterprise NLP Deployment
arXiv:2601.00444v1 Announce Type: new
Abstract: In the rapidly evolving landscape of enterprise natural language processing (NLP), the demand for efficient, lightweight models capable of handling multi-domain text automation tasks has intensified. This study conducts a comparative analysis of three prominent lightweight Transformer models - DistilBERT, MiniLM, and ALBERT - across three distinct domains: customer sentiment classification, news topic classification, and toxicity and hate speech detection. Utilizing datasets from IMDB, AG News, and the Measuring Hate Speech corpus, we evaluated performance using accuracy-based metrics including accuracy, precision, recall, and F1-score, as well as efficiency metrics such as model size, inference time, throughput, and memory usage. Key findings reveal that no single model dominates all performance dimensions. ALBERT achieves the highest task-specific accuracy in multiple domains, MiniLM excels in inference speed and throughput, and DistilBERT demonstrates the most consistent accuracy across tasks while maintaining competitive efficiency. All results reflect controlled fine-tuning under fixed enterprise-oriented constraints rather than exhaustive hyperparameter optimization. These results highlight trade-offs between accuracy and efficiency, recommending MiniLM for latency-sensitive enterprise applications, DistilBERT for balanced performance, and ALBERT for resource-constrained environments.
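A sketch of how accuracy and efficiency metrics (latency, throughput) might be measured for one such lightweight model; the checkpoint, inputs, and batch size below are illustrative assumptions, not the study's exact setup.

```python
# Sketch: joint accuracy/latency/throughput measurement for a small Transformer classifier.
import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

model_name = "distilbert-base-uncased-finetuned-sst-2-english"   # example checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

texts = ["a gripping, beautifully shot film", "a dull and lifeless remake"] * 16
labels = [1, 0] * 16

start = time.perf_counter()
with torch.no_grad():
    enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
    preds = model(**enc).logits.argmax(dim=-1).tolist()
elapsed = time.perf_counter() - start

p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
print(f"acc={accuracy_score(labels, preds):.3f}  f1={f1:.3f}  "
      f"latency={elapsed / len(texts) * 1e3:.1f} ms/sample  "
      f"throughput={len(texts) / elapsed:.1f} samples/s")
```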
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach
arXiv:2601.00388v1 Announce Type: new
Abstract: Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.
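A minimal sketch of a Haversine-distance-based reward for predicted coordinates; the exponential shaping and the scale constant are assumptions for illustration, not necessarily Geo-R's exact reward.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def geo_reward(pred, target, scale_km=1000.0):
    """Reward in (0, 1]: 1 for an exact hit, decaying smoothly with distance."""
    return math.exp(-haversine_km(*pred, *target) / scale_km)

# Paris predicted for a ground-truth photo taken in Berlin (~878 km apart).
print(geo_reward((48.8566, 2.3522), (52.52, 13.405)))
```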
Fonte: arXiv cs.CL
NLP/LLMs • Score 96
Fine-Tuning Large Language Models for Automated Depression Screening in Nigerian Pidgin: The GENSCORE Pilot Study
Depression is a major contributor to Nigeria's mental health burden, yet screening coverage is limited by poor access to clinicians, stigma, and language barriers. This study presents a novel approach to automated depression screening using large language models (LLMs) fine-tuned for conversational Nigerian Pidgin.
Fonte: arXiv cs.AI
NLP/LLMs • Score 94
Explicit Abstention Controls for Predictable Reliability in Video Question Answering
Deploying vision-language models (VLMs) in critical settings requires selective prediction, where systems abstain when uncertain to avoid costly errors. We investigate whether confidence-based abstention offers reliable control over error rates in video question answering and whether that control remains robust under distribution shift.
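A minimal sketch of confidence-thresholded abstention and the resulting risk/coverage trade-off on synthetic scores; it illustrates the idea of selective prediction, not the paper's evaluation protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
confidence = rng.uniform(0.3, 1.0, size=1000)       # model confidence per question (synthetic)
correct = rng.uniform(size=1000) < confidence        # higher confidence -> more often correct

for tau in (0.5, 0.7, 0.9):
    answered = confidence >= tau                      # abstain below the threshold
    coverage = answered.mean()
    risk = 1.0 - correct[answered].mean() if answered.any() else 0.0
    print(f"tau={tau:.1f}  coverage={coverage:.2f}  risk={risk:.2f}")
```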
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
FreeText: Training-Free Text Rendering in Diffusion Transformers via Attention Localization and Spectral Glyph Injection
arXiv:2601.00535v1 Announce Type: new
Abstract: Large-scale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically require costly retraining or rigid external layout constraints, which can degrade aesthetics and limit flexibility. We propose FreeText, a training-free, plug-and-play framework that improves text rendering by exploiting intrinsic mechanisms of Diffusion Transformer (DiT) models. FreeText decomposes the problem into where to write and what to write. For where to write, we localize writing regions by reading token-wise spatial attribution from endogenous image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For what to write, we introduce Spectral-Modulated Glyph Injection (SGMI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with modest inference overhead.
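A toy sketch of frequency-domain band-pass modulation applied to a glyph image, the kind of operation named for SGMI; the band limits and gain are illustrative, not the method's actual parameters.

```python
import numpy as np

def bandpass_modulate(glyph, low=0.05, high=0.45, gain=1.5):
    """Boost mid-frequency structure of a 2D glyph array via an FFT band-pass mask."""
    f = np.fft.fftshift(np.fft.fft2(glyph))
    h, w = glyph.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)       # normalized spatial frequency
    band = (radius >= low) & (radius <= high)
    f = np.where(band, gain * f, f)                        # amplify the pass band only
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))

glyph = np.zeros((64, 64)); glyph[20:44, 28:36] = 1.0      # crude vertical stroke
print(bandpass_modulate(glyph).shape)
```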
Fonte: arXiv cs.CV
NLP/LLMs • Score 96
FCMBench: A Comprehensive Multimodal Financial Credit Benchmark for Real-World Applications
As multimodal AI becomes widely used for credit risk assessment and document review, a domain-specific benchmark is urgently needed. We present FCMBench-V1.0, a large-scale multimodal financial credit benchmark covering 18 core certificate types, with 4,043 privacy-compliant images and 8,446 QA samples.
Fonte: arXiv cs.AI
Vision • Score 95
OmniVaT: Single Domain Generalization for Multimodal Visual-Tactile Learning
arXiv:2601.00352v1 Announce Type: new
Abstract: Visual-tactile learning (VTL) enables embodied agents to perceive the physical world by integrating visual (VIS) and tactile (TAC) sensors. However, VTL still suffers from modality discrepancies between VIS and TAC images, as well as domain gaps caused by non-standardized tactile sensors and inconsistent data collection procedures. We formulate these challenges as a new task, termed single domain generalization for multimodal VTL (SDG-VTL). In this paper, we propose an OmniVaT framework that, for the first time, successfully addresses this task. On the one hand, OmniVaT integrates a multimodal fractional Fourier adapter (MFFA) to map VIS and TAC embeddings into a unified embedding-frequency space, thereby effectively mitigating the modality gap without multi-domain training data or careful cross-modal fusion strategies. On the other hand, it also incorporates a discrete tree generation (DTG) module that obtains diverse and reliable multimodal fractional representations through a hierarchical tree structure, thereby enhancing its adaptivity to fluctuating domain shifts in unseen domains. Extensive experiments demonstrate the superior cross-domain generalization performance of OmniVaT on the SDG-VTL task.
Fonte: arXiv cs.CV
Vision • Score 95
A Comprehensive Dataset for Human vs. AI Generated Image Detection
arXiv:2601.00553v1 Announce Type: new
Abstract: Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI-generated image detection consisting of 96,000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.
Fonte: arXiv cs.CV
Vision • Score 95
GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval
arXiv:2601.00584v1 Announce Type: new
Abstract: Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video using a natural language query without relying on task-specific training data. The primary challenge in this setting lies in the mismatch in semantic granularity between textual queries and visual content. Previous studies in ZVMR have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. However, these approaches failed to balance the semantic granularity between the pre-trained knowledge provided by each modality for a given scene. As a result, despite the high quality of each modality's representations, the mismatch in granularity led to inaccurate retrieval. In this paper, we propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges this gap between coarse and fine semantic representations. Our approach introduces two complementary techniques: granularity-based query rewriting to generate varied semantic granularities, and query-aware caption generation to embed query intent into video content. By pairing multi-level queries with both query-agnostic and query-aware captions, we effectively resolve semantic mismatches. As a result, our method sets a new state-of-the-art across all three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), with a notable 3.23% mAP@avg improvement on the challenging QVHighlights dataset.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning
arXiv:2601.00215v1 Announce Type: cross
Abstract: Reinforcement learning (RL) has emerged as a promising approach for eliciting reasoning chains before generating final answers. However, multimodal large language models (MLLMs) generate reasoning that lacks integration of visual information. This limits their ability to solve problems that demand accurate visual perception, such as visual puzzles. We show that visual perception is the key bottleneck in such tasks: converting images into textual descriptions significantly improves performance, yielding gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7.
To address this, we investigate reward-driven RL as a mechanism to unlock long visual reasoning in open-source MLLMs without requiring costly supervision. We design and evaluate six reward functions targeting different reasoning aspects, including image understanding, thinking steps, and answer accuracy. Using group relative policy optimization (GRPO), our approach explicitly incentivizes longer, structured reasoning and mitigates bypassing of visual information. Experiments on Qwen-2.5-VL-7B achieve 5.56% improvements over the base model, with consistent gains across both in-domain and out-of-domain settings.
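A minimal sketch of the group-relative advantage at the core of GRPO-style training: rewards for a group of rollouts sampled from the same prompt are standardized within the group.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within a group of rollouts from the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled responses to one visual puzzle, scored by e.g. image-grounding and answer rewards.
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))
```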
Fonte: arXiv cs.CL
Vision • Score 95
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
arXiv:2601.00393v1 Announce Type: new
Abstract: In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks. Our project page is available at https://neoverse-4d.github.io
Fonte: arXiv cs.CV
NLP/LLMs • Score 96
FaithSCAN: Single-Pass, Model-Based Hallucination Detection for Faithful Visual Question Answering
Faithfulness hallucinations in VQA occur when vision-language models produce fluent but visually ungrounded answers, undermining their reliability in safety-critical applications. We propose FaithSCAN, a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, overcoming the efficiency and detection-performance limitations of existing methods.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding
arXiv:2601.00264v1 Announce Type: new
Abstract: Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts. Technical validation demonstrates that this enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment. S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.
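A sketch of computing an image-text alignment (CLIP) score with Hugging Face transformers, the kind of metric cited for the 18.21% improvement; the checkpoint and inputs are illustrative, not the paper's recaptioning pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), color="white")        # stand-in for a figure panel
captions = ["a heatmap of gene expression", "a photo of a dog"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
    # Scaled cosine similarity between the image and each caption, normalized across captions.
    print(out.logits_per_image.softmax(dim=-1))
```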
Fonte: arXiv cs.CV
Vision • Score 92
TimeColor: Flexible Reference Colorization via Temporal Concatenation
arXiv:2601.00296v1 Announce Type: new
Abstract: Most colorization models condition only on a single reference, typically the first frame of the scene. However, this approach ignores other sources of conditional data, such as character sheets, background images, or arbitrary colorized frames. We propose TimeColor, a sketch-based video colorization model that supports heterogeneous, variable-count references with the use of explicit per-reference region assignment. TimeColor encodes references as additional latent frames which are concatenated temporally, permitting them to be processed concurrently in each diffusion step while keeping the model's parameter count fixed. TimeColor also uses spatiotemporal correspondence-masked attention to enforce subject-reference binding in addition to modality-disjoint RoPE indexing. These mechanisms mitigate shortcutting and cross-identity palette leakage. Experiments on SAKUGA-42M under both single- and multi-reference protocols show that TimeColor improves color fidelity, identity consistency, and temporal stability over prior baselines.
Fonte: arXiv cs.CV
NLP/LLMs • Score 96
DA-DPO: Difficulty-Aware and Cost-Efficient Preference Optimization for Reducing Hallucinations in MLLMs
Direct Preference Optimization (DPO) has shown great potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing approaches often suffer from overfitting due to difficulty imbalance in the preference data. We propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework that rebalances the learning process.
Fonte: arXiv cs.AI
Vision • Score 95
SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting
arXiv:2601.00285v1 Announce Type: new
Abstract: Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step. However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object's motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, our model enables smooth motion interpolation while preserving learned geometric details. Experiments on synthetic datasets show that our method outperforms existing approaches under sparse observations by up to 34% in PSNR, and achieves comparable performance to dense monocular video methods on real-world datasets despite using significantly fewer frames. Moreover, we demonstrate that the input initial static reconstruction can be replaced by a diffusion-based generative prior, making our method more practical for real-world scenarios.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
TotalFM: An Organ-Separated Framework for 3D-CT Vision Foundation Models
arXiv:2601.00260v1 Announce Type: new
Abstract: While foundation models in radiology are expected to be applied to various clinical tasks, computational cost constraints remain a major challenge when training on 3D-CT volumetric data. In this study, we propose TotalFM, a radiological foundation model that efficiently learns the correspondence between 3D-CT images and linguistic expressions based on the concept of organ separation, utilizing a large-scale dataset of 140,000 series. By automating the creation of organ volume and finding-sentence pairs through segmentation techniques and Large Language Model (LLM)-based radiology report processing, and by combining self-supervised pre-training via VideoMAE with contrastive learning using volume-text pairs, we aimed to balance computational efficiency and representation capability. In zero-shot organ-wise lesion classification tasks, the proposed model achieved higher F1 scores in 83% (5/6) of organs compared to CT-CLIP and 64% (9/14) of organs compared to Merlin. These results suggest that the proposed model exhibits high generalization performance in a clinical evaluation setting using actual radiology report sentences. Furthermore, in zero-shot finding-wise lesion classification tasks, our model achieved a higher AUROC in 83% (25/30) of finding categories compared to Merlin. We also confirmed performance comparable to existing Vision-Language Models (VLMs) in radiology report generation tasks. Our results demonstrate that the organ-separated learning framework can serve as a realistic and effective design guideline for the practical implementation of 3D-CT foundation models.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Focal-RegionFace: Generating Fine-Grained Multi-attribute Descriptions for Arbitrarily Selected Face Focal Regions
arXiv:2601.00156v1 Announce Type: new
Abstract: In this paper, we introduce an underexplored problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, covering facial action units (AUs), emotional states, and age estimation, for arbitrarily selected face regions (termed FaceFocalDesc). We argue that the system's ability to focus on individual facial areas leads to better understanding and control. To achieve this capability, we construct a new multi-attribute description dataset for arbitrarily selected face regions, providing rich region-level annotations and natural language descriptions. Further, we propose a fine-tuned vision-language model based on Qwen2.5-VL, called Focal-RegionFace, for facial state analysis, which incrementally refines its focus on localized facial features through multiple progressive fine-tuning stages, resulting in interpretable age estimation, AU detection, and emotion detection. Experimental results show that Focal-RegionFace achieves the best performance on the new benchmark in terms of traditional and widely used metrics, as well as newly proposed metrics. This fully verifies its effectiveness and versatility in fine-grained multi-attribute face region-focal analysis scenarios.
Fonte: arXiv cs.CV
RL • Score 95
AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models
arXiv:2601.00561v1 Announce Type: new
Abstract: The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (i.e., Assessing Editing, Generation, Interpretation-Understanding for Super-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually annotated questions spanning 21 topics (including STEM, humanities, daily life, etc.) and 6 reasoning types. To concretely evaluate the performance of UMMs in the world-knowledge scope without ambiguous metrics, we further propose Deterministic Checklist-based Evaluation (DCE), a protocol that replaces ambiguous prompt-based scoring with atomic "Y/N" judgments, to enhance evaluation reliability. Our extensive experiments reveal that most UMMs exhibit severe world knowledge deficits and that performance degrades significantly with complex reasoning. Additionally, simple plug-in reasoning modules can partially mitigate these vulnerabilities, highlighting a promising direction for future research. These results highlight the importance of world-knowledge-based reasoning as a critical frontier for UMMs.
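A minimal sketch of checklist-style evaluation in the spirit of DCE: each criterion is an atomic yes/no judgment and the item score is the fraction satisfied. The checklist and the example output below are hypothetical.

```python
def dce_score(output, checklist):
    """Aggregate atomic yes/no judgments into a fraction-satisfied score."""
    verdicts = {name: bool(check(output)) for name, check in checklist}
    return verdicts, sum(verdicts.values()) / len(verdicts)

# Hypothetical checklist for one visual-understanding question.
checklist = [
    ("names_correct_country", lambda o: "france" in o.lower()),
    ("gives_century",         lambda o: "19th" in o.lower()),
    ("no_hedging",            lambda o: "probably" not in o.lower()),
]
verdicts, score = dce_score("The painting is French and dates to the 19th century.", checklist)
print(verdicts, score)
```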
Fonte: arXiv cs.CV
Vision • Score 95
ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Patching
arXiv:2601.00267v1 Announce Type: new
Abstract: Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant concerns regarding safety, copyright, and ethical implications. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most of them rely on data-intensive and computationally expensive fine-tuning, which poses a critical limitation. To overcome these challenges, inspired by the observation that the model's activations are predominantly composed of generic concepts, with only a minimal component representing the target concept, we propose a novel training-free method (ActErase) for efficient concept erasure. Specifically, the proposed method operates by identifying activation difference regions via prompt-pair analysis, extracting target activations, and dynamically replacing input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrate that our training-free method achieves state-of-the-art (SOTA) erasure performance while effectively preserving the model's overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.
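A minimal sketch of activation patching with a PyTorch forward hook, where a module's output is replaced by a stored activation during the forward pass; the model and layer below are toy stand-ins, not a diffusion model.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
layer = model[0]

with torch.no_grad():
    clean_act = layer(torch.randn(1, 8))    # activation recorded from a "safe" reference input

def patch_hook(module, inputs, output):
    return clean_act                         # returning a value replaces the module's output

handle = layer.register_forward_hook(patch_hook)
_ = model(torch.randn(1, 8))                 # this forward pass now uses the patched activation
handle.remove()
```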
Fonte: arXiv cs.CV
Vision • Score 95
A Spatially Masked Adaptive Gated Network for multimodal post-flood water extent mapping using SAR and incomplete multispectral data
arXiv:2601.00123v1 Announce Type: new
Abstract: Mapping water extent during a flood event is essential for effective disaster management throughout all phases: mitigation, preparedness, response, and recovery. In particular, during the response stage, when timely and accurate information is important, Synthetic Aperture Radar (SAR) data are primarily employed to produce water extent maps. Recently, leveraging the complementary characteristics of SAR and multispectral imagery (MSI) data through a multimodal approach has emerged as a promising strategy for advancing water extent mapping using deep learning models. This approach is particularly beneficial when timely post-flood observations, acquired during or shortly after the flood peak, are limited, as it enables the use of all available imagery for more accurate post-flood water extent mapping. However, the adaptive integration of partially available MSI data into the SAR-based post-flood water extent mapping process remains underexplored. To bridge this research gap, we propose the Spatially Masked Adaptive Gated Network (SMAGNet), a multimodal deep learning model that utilizes SAR data as the primary input for post-flood water extent mapping and integrates complementary MSI data through feature fusion. In experiments on the C2S-MS Floods dataset, SMAGNet consistently outperformed other multimodal deep learning models in prediction performance across varying levels of MSI data availability. Furthermore, we found that even when MSI data were completely missing, the performance of SMAGNet remained statistically comparable to that of a U-Net model trained solely on SAR data. These findings indicate that SMAGNet enhances model robustness to missing data as well as the applicability of multimodal deep learning in real-world flood management scenarios.
Fonte: arXiv cs.CV
Vision • Score 95
TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model
arXiv:2601.00051v1 Announce Type: new
Abstract: World models aim to endow AI systems with the ability to represent, generate, and interact with dynamic environments in a coherent and temporally consistent manner. While recent video generation models have demonstrated impressive visual quality, they remain limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, hindering their evolution into practical world models. In this report, we present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. TeleWorld introduces a novel generation-reconstruction-guidance paradigm, where generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to maintain spatial, temporal, and physical consistency. To support long-horizon generation with low latency, we employ an autoregressive diffusion-based video model enhanced with Macro-from-Micro Planning (MMPL), a hierarchical planning method that reduces error accumulation from the frame level to the segment level, alongside efficient Distribution Matching Distillation (DMD), enabling real-time synthesis under practical computational budgets. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible systems. Extensive experiments demonstrate that TeleWorld achieves strong performance in both static and dynamic world understanding, long-term consistency, and real-time generation efficiency, positioning it as a practical step toward interactive, memory-enabled world models for multimodal generation and embodied intelligence.
Fonte: arXiv cs.CV
Vision • Score 95
All-in-One Video Restoration under Smoothly Evolving Unknown Weather Degradations
arXiv:2601.00533v1 Announce Type: new
Abstract: All-in-one image restoration aims to recover clean images from diverse unknown degradations using a single model. But extending this task to videos faces unique challenges. Existing approaches primarily focus on frame-wise degradation variation, overlooking the temporal continuity that naturally exists in real-world degradation processes. In practice, degradation types and intensities evolve smoothly over time, and multiple degradations may coexist or transition gradually. In this paper, we introduce the Smoothly Evolving Unknown Degradations (SEUD) scenario, where both the active degradation set and degradation intensity change continuously over time. To support this scenario, we design a flexible synthesis pipeline that generates temporally coherent videos with single, compound, and evolving degradations. To address the challenges in the SEUD scenario, we propose an all-in-One Recurrent Conditional and Adaptive prompting Network (ORCANet). First, a Coarse Intensity Estimation Dehazing (CIED) module estimates haze intensity using physical priors and provides coarse dehazed features as initialization. Second, a Flow Prompt Generation (FPG) module extracts degradation features. FPG generates both static prompts that capture segment-level degradation types and dynamic prompts that adapt to frame-level intensity variations. Furthermore, a label-aware supervision mechanism improves the discriminability of static prompt representations under different degradations. Extensive experiments show that ORCANet achieves superior restoration quality, temporal consistency, and robustness over image and video-based baselines. Code is available at https://github.com/Friskknight/ORCANet-SEUD.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
MotionPhysics: Learnable Motion Distillation for Text-Guided Simulation
arXiv:2601.00504v1 Announce Type: new
Abstract: Accurately simulating existing 3D objects and a wide variety of materials often demands expert knowledge and time-consuming physical parameter tuning to achieve the desired dynamic behavior. We introduce MotionPhysics, an end-to-end differentiable framework that infers plausible physical parameters from a user-provided natural language prompt for a chosen 3D scene of interest, removing the need for guidance from ground-truth trajectories or annotated videos. Our approach first utilizes a multimodal large language model to estimate material parameter values, which are constrained to lie within plausible ranges. We further propose a learnable motion distillation loss that extracts robust motion priors from pretrained video diffusion models while minimizing appearance and geometry inductive biases to guide the simulation. We evaluate MotionPhysics across more than thirty scenarios, including real-world, human-designed, and AI-generated 3D objects, spanning a wide range of materials such as elastic solids, metals, foams, sand, and both Newtonian and non-Newtonian fluids. We demonstrate that MotionPhysics produces visually realistic dynamic simulations guided by natural language, surpassing the state of the art while automatically determining physically plausible parameters. The code and project page are available at: https://wangmiaowei.github.io/MotionPhysics.github.io/.
Fonte: arXiv cs.CV
Vision • Score 95
ReMA: A Training-Free Plug-and-Play Mixing Augmentation for Video Behavior Recognition
arXiv:2601.00311v1 Announce Type: new
Abstract: Video behavior recognition demands stable and discriminative representations under complex spatiotemporal variations. However, prevailing data augmentation strategies for videos remain largely perturbation-driven, often introducing uncontrolled variations that amplify non-discriminative factors, which ultimately weakens intra-class distributional structure and causes representation drift, with inconsistent gains across temporal scales. To address these problems, we propose Representation-aware Mixing Augmentation (ReMA), a plug-and-play augmentation strategy that formulates mixing as a controlled replacement process to expand representations while preserving class-conditional stability. ReMA integrates two complementary mechanisms. First, the Representation Alignment Mechanism (RAM) performs structured intra-class mixing under distributional alignment constraints, suppressing irrelevant intra-class drift while enhancing statistical reliability. Then, the Dynamic Selection Mechanism (DSM) generates motion-aware spatiotemporal masks to localize perturbations, guiding them away from discrimination-sensitive regions and promoting temporal coherence. By jointly controlling how and where mixing is applied, ReMA improves representation robustness without additional supervision or trainable parameters. Extensive experiments on diverse video behavior benchmarks demonstrate that ReMA consistently enhances generalization and robustness across different spatiotemporal granularities.
Fonte: arXiv cs.CV
Vision • Score 95
Sparse Additive Contextual Bandits: A Nonparametric Approach to Online Decision-Making with High-Dimensional Covariates
Personalized services are central to today's digital economy, and their sequential decisions are often modeled as contextual bandits. Modern applications face two main challenges: high-dimensional covariates and the need for nonparametric models to capture complex reward-covariate relationships. We propose a contextual bandit algorithm based on a sparse additive reward model that addresses both challenges.
Fonte: arXiv stat.ML
Evaluation/Benchmarks • Score 96
A Sparse-Attention Deep Learning Model Integrating Heterogeneous Multimodal Features for Parkinson's Disease Severity Profiling
Characterizing the heterogeneous presentation of Parkinson's disease (PD) requires integrating biological and clinical markers within a unified predictive framework. We propose the Class-Weighted Sparse-Attention Fusion Network (SAFN), an interpretable deep learning framework for robust multimodal profiling that overcomes limitations in interpretability and class imbalance.
Fonte: arXiv cs.LG
Vision • Score 95
It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
arXiv:2601.00090v1 Announce Type: new
Abstract: Contemporary text-to-image models exhibit a surprising degree of mode collapse, as can be seen when sampling several images given the same text prompt. While previous work has attempted to address this issue by steering the model using guidance mechanisms, or by generating a large pool of candidates and refining them, in this work we take a different direction and aim for diversity in generations via noise optimization. Specifically, we show that a simple noise optimization objective can mitigate mode collapse while preserving the fidelity of the base model. We also analyze the frequency characteristics of the noise and show that alternative noise initializations with different frequency profiles can improve both optimization and search. Our experiments demonstrate that noise optimization yields superior results in terms of generation quality and variety.
Fonte: arXiv cs.CV
Vision • Score 96
Real-Time Human Detection in Aerially Captured Video Sequences via Deep Models
Human detection in video plays an important role in various real-world applications. Traditional approaches rely on hand-crafted features, which are problem- and task-specific. This work uses automatic feature learning methods, combining optical flow with three different deep models for human detection in videos captured with a non-static camera on aerial platforms.
Fonte: arXiv cs.LG
MLOps/Systems • Score 96
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Talking-head generation creates realistic avatars from static portraits for virtual communication and content creation. However, current models fail to convey a sense of truly interactive communication, generating one-sided responses that lack emotional engagement. We propose Avatar Forcing, a novel avatar generation framework that models real-time interactions between users and avatars through diffusion forcing.
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
CPPO: Contrastive Perception for Vision Language Policy Optimization
arXiv:2601.00501v1 Announce Type: new
Abstract: We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult, requiring extra LLMs, ground-truth data, forced separation of perception from reasoning by the policy model, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods while avoiding extra models, making training more efficient and scalable.
Fonte: arXiv cs.CV
Multimodal • Score 89
uGMM-NN: Univariate Gaussian Mixture Model Neural Network
This paper introduces the Univariate Gaussian Mixture Model Neural Network (uGMM-NN), a novel neural architecture that embeds probabilistic reasoning directly into the computational units of deep networks. Each uGMM-NN node parameterizes its activations as a univariate Gaussian mixture, enabling richer representations and capturing multimodality and uncertainty.
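A toy sketch of a univariate-Gaussian-mixture unit in PyTorch; the dimensions, mixture size, and use of a linear projection are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class UGMMUnit(nn.Module):
    """Each output unit returns the log-density of its input projection under a
    learnable 1-D Gaussian mixture (illustrative sketch)."""
    def __init__(self, in_features, out_features, components=3):
        super().__init__()
        self.proj = nn.Linear(in_features, out_features)
        self.logits = nn.Parameter(torch.zeros(out_features, components))
        self.means = nn.Parameter(torch.randn(out_features, components))
        self.log_stds = nn.Parameter(torch.zeros(out_features, components))

    def forward(self, x):
        z = self.proj(x).unsqueeze(-1)                        # (batch, out, 1)
        log_w = torch.log_softmax(self.logits, dim=-1)        # mixture weights
        comp = torch.distributions.Normal(self.means, self.log_stds.exp())
        return torch.logsumexp(log_w + comp.log_prob(z), dim=-1)   # (batch, out)

print(UGMMUnit(16, 4)(torch.randn(2, 16)).shape)   # torch.Size([2, 4])
```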
Fonte: arXiv stat.ML
NLP/LLMs • Score 95
Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark
arXiv:2601.00092v1 Announce Type: new
Abstract: 4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal Large Language Models (MLLMs) achieve human-level 4D spatial intelligence? In this work, we present Spatial4D-Bench, a versatile 4D spatial intelligence benchmark designed to comprehensively assess the 4D spatial reasoning abilities of MLLMs. Unlike existing spatial intelligence benchmarks that are often small-scale or limited in diversity, Spatial4D-Bench provides a large-scale, multi-task evaluation benchmark consisting of ~40,000 question-answer pairs covering 18 well-defined tasks. We systematically organize these tasks into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning and spatiotemporal reasoning. Spatial4D-Bench thereby offers a structured and comprehensive benchmark for evaluating the spatial cognition abilities of MLLMs, covering a broad spectrum of tasks that parallel the versatility of human spatial intelligence. We benchmark various state-of-the-art open-source and proprietary MLLMs on Spatial4D-Bench and reveal their substantial limitations in a wide variety of 4D spatial reasoning aspects, such as route plan, action recognition, and physical plausibility reasoning. We hope that the findings provided in this work offer valuable insights to the community and that our benchmark can facilitate the development of more capable MLLMs toward human-level 4D spatial intelligence. More resources can be found on our project page.
Fonte: arXiv cs.CV
Evaluation/Benchmarks • Score 93
Latent Flow Matching for Expressive Singing Voice Synthesis
Singing voice synthesis based on conditional variational autoencoders (cVAEs) offers efficient inference and high audio quality by learning a score-conditioned latent space and a recording-conditioned posterior latent space. However, imperfect matching between the distributions can degrade fine-grained expressiveness such as vibrato and micro-prosody. We propose FM-Singer, which introduces conditional flow matching (CFM) in the latent space.
Fonte: arXiv cs.AI
Vision • Score 95
DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection
arXiv:2601.00303v1 Announce Type: new
Abstract: Speech is a scalable and non-invasive biomarker for early mental health screening. However, widely used depression datasets like DAIC-WOZ exhibit strong coupling between linguistic sentiment and diagnostic labels, encouraging models to learn semantic shortcuts. As a result, model robustness may be compromised in real-world scenarios, such as Camouflaged Depression, where individuals maintain socially positive or neutral language despite underlying depressive states. To mitigate this semantic bias, we propose DepFlow, a three-stage depression-conditioned text-to-speech framework. First, a Depression Acoustic Encoder learns speaker- and content-invariant depression embeddings through adversarial training, achieving effective disentanglement while preserving depression discriminability (ROC-AUC: 0.693). Second, a flow-matching TTS model with FiLM modulation injects these embeddings into synthesis, enabling control over depressive severity while preserving content and speaker identity. Third, a prototype-based severity mapping mechanism provides smooth and interpretable manipulation across the depression continuum. Using DepFlow, we construct a Camouflage Depression-oriented Augmentation (CDoA) dataset that pairs depressed acoustic patterns with positive/neutral content from a sentiment-stratified text bank, creating acoustic-semantic mismatches underrepresented in natural data. Evaluated across three depression detection architectures, CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming conventional augmentation strategies in depression detection. Beyond enhancing robustness, DepFlow provides a controllable synthesis platform for conversational systems and simulation-based evaluation, where real clinical data remains limited by ethical and coverage constraints.
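A minimal sketch of FiLM (feature-wise linear modulation), the standard conditioning mechanism named in the abstract, where a conditioning vector produces a per-channel scale and shift; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim, feature_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * feature_dim)

    def forward(self, features, condition):
        gamma, beta = self.to_scale_shift(condition).chunk(2, dim=-1)
        return (1 + gamma) * features + beta      # per-channel modulation of the features

film = FiLM(cond_dim=8, feature_dim=64)
frames = torch.randn(4, 100, 64)                   # (batch, time, channels)
depression_embedding = torch.randn(4, 8)           # stand-in conditioning vector
print(film(frames, depression_embedding.unsqueeze(1)).shape)   # torch.Size([4, 100, 64])
```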
Fonte: arXiv cs.CL
MLOps/Systems • Score 96
Benchmarking de Métodos de Pré-processamento e Integração em Genômica de Células Únicas
A análise de dados de células únicas tem o potencial de revolucionar a medicina personalizada ao caracterizar mudanças moleculares associadas a doenças em nível celular. Este estudo examina um pipeline geral para análise de dados de células únicas, avaliando diferentes métodos de normalização, redução de dimensionalidade e integração, utilizando seis conjuntos de dados variados.
Fonte: arXiv cs.AI
Vision • Score 95
Statistical laws and linguistics inform meaning in naturalistic and fictional conversation
arXiv:2512.18072v1 Announce Type: new
Abstract: Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type, with some generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps' law, which holds that vocabulary size scales with document length. Little work on Heaps' law has examined conversation or considered how language features affect scaling. We measure Heaps' law for conversations recorded in two distinct media: (1) strangers brought together on video chat and (2) fictional characters in movies. We find that the scaling of vocabulary size differs by part of speech. We discuss these findings through behavioral and linguistic frameworks.
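A small sketch of estimating the Heaps' law exponent (vocabulary size V(n) ~ K * n^beta) by a log-log linear fit; the token stream below is synthetic Zipf-distributed data, used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.zipf(1.3, size=20000)            # synthetic Zipf-distributed token IDs

ns, vocab, seen = [], [], set()
for i, tok in enumerate(tokens, start=1):
    seen.add(int(tok))
    if i % 500 == 0:                          # record vocabulary growth at checkpoints
        ns.append(i)
        vocab.append(len(seen))

beta, log_k = np.polyfit(np.log(ns), np.log(vocab), deg=1)
print(f"Heaps exponent beta ~ {beta:.2f}, K ~ {np.exp(log_k):.1f}")
```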
Fonte: arXiv cs.CL
NLP/LLMs • Score 96
Stable and Efficient Single-Rollout Reinforcement Learning for Multimodal Reasoning
Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm for improving the reasoning capabilities of Multimodal Large Language Models (MLLMs). We introduce MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves stable optimization and effective performance on multimodal reasoning.
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
Toward Reasoning-Preserving Unlearning in Multimodal Large Language Models
Machine unlearning aims to erase requested data from trained models without full retraining. For Reasoning Multimodal Large Language Models (RMLLMs) this is challenging, since intermediate reasoning steps can leak sensitive information. We present RMLLMU-Bench, the first benchmark for RMLLM unlearning that evaluates reasoning leakage and reasoning retention.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset
arXiv:2512.17915v1 Announce Type: new
Abstract: The recently published Loquacious dataset aims to be a replacement for established English automatic speech recognition (ASR) datasets such as LibriSpeech or TED-Lium. The main goal of the Loquacious dataset is to provide properly defined training and test partitions across many acoustic and language domains, with an open license suitable for both academia and industry. To further promote the benchmarking and usability of this new dataset, we present additional resources in the form of n-gram language models (LMs), a grapheme-to-phoneme (G2P) model and pronunciation lexica, with open and public access. Utilizing those additional resources we show experimental results across a wide range of ASR architectures with different label units and topologies. Our initial experimental results indicate that the Loquacious dataset offers a valuable study case for a variety of common challenges in ASR.
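A tiny sketch of the kind of n-gram language model released alongside these resources: a bigram model with add-k smoothing over a toy transcript. The corpus and smoothing constant are illustrative, not the published models.

```python
from collections import Counter
import math

corpus = "the talk was about speech recognition and the talk was recorded".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

def bigram_logprob(w1, w2, k=0.5):
    """Add-k smoothed conditional log-probability log P(w2 | w1)."""
    return math.log((bigrams[(w1, w2)] + k) / (unigrams[w1] + k * len(vocab)))

sentence = "the talk was recorded".split()
logp = sum(bigram_logprob(a, b) for a, b in zip(sentence, sentence[1:]))
print(f"log P(sentence) = {logp:.2f}")
```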
Fonte: arXiv cs.CL
NLP/LLMs • Score 96
ChronoDreamer: An Action-Conditioned World Model as an Online Simulator for Robotic Planning
We present ChronoDreamer, an action-conditioned world model for contact-rich robotic manipulation. Given a history of egocentric RGB frames, contact maps, actions, and joint states, ChronoDreamer predicts future video frames, contact distributions, and joint angles through a spatiotemporal transformer trained with MaskGIT-style masked prediction.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
GeoSense-AI: Fast Location Inference from Crisis Microblogs
arXiv:2512.18225v1 Announce Type: new
Abstract: This paper presents an applied AI pipeline for real-time geolocation from noisy microblog streams, unifying statistical hashtag segmentation, part-of-speech-driven proper-noun detection, dependency parsing around disaster lexicons, lightweight named-entity recognition, and gazetteer-grounded disambiguation to infer locations directly from text rather than sparse geotags. The approach operationalizes information extraction under streaming constraints, emphasizing low-latency NLP components and efficient validation against geographic knowledge bases to support situational awareness during emergencies. In head-to-head comparisons with widely used NER toolkits, the system attains strong F1 while being engineered for orders-of-magnitude faster throughput, enabling deployment in live crisis informatics settings. A production map interface demonstrates end-to-end AI functionality (ingest, inference, and visualization), surfacing locational signals at scale for floods, outbreaks, and other fast-moving events. By prioritizing robustness to informal text and streaming efficiency, GeoSense-AI illustrates how domain-tuned NLP and knowledge grounding can elevate emergency response beyond conventional geotag reliance.
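A sketch of statistical hashtag segmentation via dynamic programming over a unigram frequency table; the hashtag, the frequencies, and the penalty for unknown substrings are toy assumptions, not the system's actual model.

```python
import math

# Toy unigram frequency table standing in for a corpus-derived word model.
freq = {"lagos": 500, "flood": 800, "relief": 400, "lag": 20, "os": 30}
total = sum(freq.values())

def word_cost(w):
    return -math.log(freq.get(w, 0.01) / total)   # unknown substrings are heavily penalized

def segment(s, max_word_len=12):
    """Dynamic programming: best[i] holds the cheapest segmentation of s[:i]."""
    best = [(0.0, [])]
    for i in range(1, len(s) + 1):
        candidates = [(best[j][0] + word_cost(s[j:i]), best[j][1] + [s[j:i]])
                      for j in range(max(0, i - max_word_len), i)]
        best.append(min(candidates))
    return best[-1][1]

print(segment("lagosfloodrelief"))   # expected: ['lagos', 'flood', 'relief']
```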
Fonte: arXiv cs.CL
NLP/LLMs • Score 96
ESearch-R1: Learning Cost-Aware MLLM Agents for Interactive Embodied Search via Reinforcement Learning
Multimodal Large Language Models (MLLMs) have empowered embodied agents with remarkable planning and reasoning abilities. However, when facing ambiguous natural-language instructions, current agents often fail to balance the high cost of physical exploration against the cognitive cost of human interaction. To bridge this gap, we propose ESearch-R1, a cost-aware embodied reasoning framework.
Fonte: arXiv cs.AI
Vision • Score 96
A Multimodal Bayesian Network for Robust Casualty Assessment in Autonomous Triage
Mass-casualty incidents can overwhelm emergency medical systems, and delays or errors in casualty assessment can result in preventable deaths. We present a decision-support framework that combines the outputs of multiple computer vision models, which estimate signs of severe hemorrhage, respiratory distress, physical alertness, and visible trauma, within a Bayesian network built entirely from expert-defined rules.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
FC-MIR: A Mobile Screen-Awareness Framework for Intent-Aware Recommendation Based on Frame-Compressed Multimodal Trajectory Reasoning
Identifying user intent from mobile UI operation trajectories is crucial for advancing UI understanding and enabling task-automation agents. We propose the FC-MIR framework, which uses keyframe sampling and adaptive concatenation to reduce visual redundancy and improve inference efficiency, integrating state-of-the-art MLLMs for trajectory summarization and intent prediction.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction
Large language models (LLMs) frequently generate hallucinated content that lacks factual or contextual grounding, limiting their reliability in critical applications. LLM-CAS frames real-time hallucination correction as a hierarchical reinforcement learning problem, enabling adaptive corrections without permanently modifying model parameters.
Fonte: arXiv cs.CL
Vision • Score 95
Data-driven particle dynamics: Structure-preserving coarse-graining for emergent behavior in non-equilibrium systems
arXiv:2508.12569v2 Announce Type: replace-cross
Abstract: Multiscale systems are ubiquitous in science and technology, but are notoriously challenging to simulate as short spatiotemporal scales must be appropriately linked to emergent bulk physics. When expensive high-dimensional dynamical systems are coarse-grained into low-dimensional models, the entropic loss of information leads to emergent physics which are dissipative, history-dependent, and stochastic. To machine learn coarse-grained dynamics from time-series observations of particle trajectories, we propose a framework using the metriplectic bracket formalism that preserves these properties by construction; most notably, the framework guarantees discrete notions of the first and second laws of thermodynamics, conservation of momentum, and a discrete fluctuation-dissipation balance crucial for capturing non-equilibrium statistics. We introduce the mathematical framework abstractly before specializing to a particle discretization. As labels are generally unavailable for entropic state variables, we introduce a novel self-supervised learning strategy to identify emergent structural variables. We validate the method on benchmark systems and demonstrate its utility on two challenging examples: (1) coarse-graining star polymers at challenging levels of coarse-graining while preserving non-equilibrium statistics, and (2) learning models from high-speed video of colloidal suspensions that capture coupling between local rearrangement events and emergent stochastic dynamics. We provide open-source implementations in both PyTorch and LAMMPS, enabling large-scale inference and extensibility to diverse particle-based systems.
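For context, the metriplectic (GENERIC) bracket structure this framework builds on can be written in a standard textbook form; the sketch below shows that generic structure, not the paper's specific parameterization:

```latex
% Metriplectic (GENERIC) form of coarse-grained dynamics for state z:
% reversible transport driven by energy E, irreversible relaxation driven by entropy S.
\begin{equation*}
  \dot{z} \;=\; L(z)\,\nabla E(z) \;+\; M(z)\,\nabla S(z)
\end{equation*}
% with L skew-symmetric, M symmetric positive semi-definite, and the degeneracy conditions
\begin{equation*}
  L(z)\,\nabla S(z) = 0, \qquad M(z)\,\nabla E(z) = 0,
\end{equation*}
% which enforce the first law (dE/dt = 0) and second law (dS/dt >= 0); in the
% stochastic setting a noise term consistent with M supplies the
% fluctuation-dissipation balance mentioned in the abstract.
```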
Fonte: arXiv stat.ML
NLP/LLMs • Score 96
HARBOR: A Holistic and Adaptive Risk Assessment Model for Behavioral Healthcare
Risk assessment in behavioral healthcare remains challenging due to the multimodal nature of patient data and the temporal dynamics of mood and affective disorders. In this work, we present HARBOR, a behavioral-health-aware language model designed to predict a discrete mood and risk score, termed the Harbor Risk Score (HRS).
Fonte: arXiv cs.AI
RL • Score 95
Sampling Multimodal Distributions with Warm Starting Points: Non-Asymptotic Bounds for the Reweighted Annealed Leap-Point Sampler
Sampling from multimodal distributions is a central challenge in Bayesian inference and machine learning. This work introduces Reweighted ALPS (Re-ALPS), a modified version of the Annealed Leap-Point Sampler (ALPS) that removes the Gaussian-approximation assumption and provides a polynomial-time bound in a general setting.
Fonte: arXiv stat.ML
RL • Score 96
Training Large Multimodal Reasoning Models Requires Better Ideas: A Three-Stage Framework for Synthesizing and Selecting Long Chains of Thought
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks through long Chains of Thought (CoT). Extending these successes to multimodal reasoning is challenging due to the complexity of integrating diverse input modalities and the scarcity of high-quality training data. In this paper, we propose SynSelect, a novel three-stage Synthesis-Selection framework for generating high-quality long CoT data targeted at multimodal reasoning tasks.
Fonte: arXiv cs.AI
Evaluation/Benchmarks • Score 93
Social Comparison without Explicit Inference of Others' Reward Values: A Constructive Approach Using a Probabilistic Generative Model
Social comparison, the process of evaluating one's own rewards relative to those of others, plays a fundamental role in primate social cognition. This study investigates whether monkeys recognize only objective reward differences or infer others' subjective reward evaluations, using three computational models with different degrees of social information processing.
Fonte: arXiv cs.AI
Evaluation/Benchmarks • Score 96
Clustering-Based Transfer Learning for Dynamic Multimodal Multi-Objective Evolutionary Algorithms
Dynamic multimodal multi-objective optimization faces the challenge of simultaneously tracking multiple equivalent Pareto-optimal sets while maintaining population diversity in changing environments. This paper presents a new suite of test functions and a novel algorithm based on a clustering-based autoencoder dynamic response mechanism, aiming to improve diversity and convergence in evolutionary algorithms.
Fonte: arXiv cs.AI
MLOps/Systems • Score 96
Modality-Dependent Memory Mechanisms in Cross-Modal Neuromorphic Computing
Spiking neural networks (SNNs) with memory promise energy-efficient neuromorphic computing, but their generalization across sensory modalities remains unexplored. We present the first comprehensive cross-modal ablation study of memory mechanisms in SNNs, evaluating Hopfield networks, Hierarchical Gated Recurrent Networks (HGRNs), and supervised contrastive learning (SCL) on visual (N-MNIST) and auditory (SHD) neuromorphic datasets.
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
arXiv:2512.17012v1 Announce Type: new
Abstract: Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model
arXiv:2512.17313v1 Announce Type: new
Abstract: Despite the impressive zero-shot capabilities of Vision-Language Models (VLMs), they often struggle in downstream tasks with distribution shifts from the pre-training data. Few-Shot Adaptation (FSA-VLM) has emerged as a key solution, typically using Parameter-Efficient Fine-Tuning (PEFT) to adapt models with minimal data. However, these PEFT methods are constrained by their reliance on fixed, handcrafted prompts, which are often insufficient to understand the semantics of classes. While some studies have proposed leveraging image-induced prompts to provide additional clues for classification, they introduce prohibitive computational overhead at inference. Therefore, we introduce Auxiliary Descriptive Knowledge (ADK), a novel framework that efficiently enriches text representations without compromising efficiency. ADK first leverages a Large Language Model to generate a rich set of descriptive prompts for each class offline. These pre-computed features are then deployed in two ways: (1) as Compositional Knowledge, an averaged representation that provides rich semantics, especially beneficial when class names are ambiguous or unfamiliar to the VLM; and (2) as Instance-Specific Knowledge, where a lightweight, non-parametric attention mechanism dynamically selects the most relevant descriptions for a given image. This approach provides two additional types of knowledge alongside the handcrafted prompt, thereby facilitating category distinction across various domains. Also, ADK acts as a parameter-free, plug-and-play component that enhances existing PEFT methods. Extensive experiments demonstrate that ADK consistently boosts the performance of multiple PEFT baselines, setting a new state-of-the-art across various scenarios.
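A minimal numpy sketch of the two knowledge types is given below, with toy embeddings and hypothetical dimensions: compositional knowledge as an average over pre-computed description embeddings, and instance-specific knowledge as a lightweight similarity-weighted selection of descriptions for the given image:

```python
import numpy as np

def l2norm(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

rng = np.random.default_rng(0)
d, n_desc = 16, 5                                   # toy sizes (hypothetical)
desc_emb = l2norm(rng.normal(size=(n_desc, d)))     # offline LLM-generated descriptions
prompt_emb = l2norm(rng.normal(size=d))             # handcrafted prompt embedding
image_emb = l2norm(rng.normal(size=d))

# (1) Compositional knowledge: averaged description embedding for the class.
compositional = l2norm(desc_emb.mean(axis=0))

# (2) Instance-specific knowledge: non-parametric attention over descriptions,
#     weighted by their similarity to this particular image.
attn = np.exp(desc_emb @ image_emb)
attn /= attn.sum()
instance_specific = l2norm(attn @ desc_emb)

# The class representation combines all three text-side cues; the image is then
# scored by cosine similarity, CLIP-style (combination rule assumed here).
class_rep = l2norm(prompt_emb + compositional + instance_specific)
print(float(class_rep @ image_emb))
```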
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
CheXPO-v2: Preference Optimization for Chest X-ray VLMs with Knowledge Graph Consistency
arXiv:2512.17213v1 Announce Type: new
Abstract: Medical Vision-Language Models (VLMs) are prone to hallucinations, compromising clinical reliability. While reinforcement learning methods like Group Relative Policy Optimization (GRPO) offer a low-cost alignment solution, their reliance on sparse, outcome-based rewards inadvertently encourages models to "overthink" -- generating verbose, convoluted, and unverifiable Chain-of-Thought reasoning to justify answers. This focus on outcomes obscures factual errors and poses significant safety risks. To address this, we propose CheXPO-v2, a novel alignment framework that shifts from outcome to process supervision. Our core innovation is a Knowledge Graph Consistency Reward mechanism driven by Entity-Relation Matching. By explicitly parsing reasoning steps into structured "Disease, Relation, Anatomy" triplets, we provide fine-grained supervision that penalizes incoherent logic and hallucinations at the atomic level. Integrating this with a hard-example mining strategy, our approach significantly outperforms GRPO and state-of-the-art models on benchmarks like MIMIC-CXR-VQA. Crucially, CheXPO-v2 achieves new state-of-the-art accuracy using only 5k samples, demonstrating exceptional data efficiency while producing clinically sound and verifiable reasoning. The project source code is publicly available at: https://github.com/ecoxial2007/CheX-Phi4MM.
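The reward itself can be illustrated as a simple set-overlap F1 between triplets parsed from the model's reasoning and those from a reference graph; the example triplets below are illustrative stand-ins, not the paper's entity-relation matcher:

```python
def triplet_f1(predicted, reference):
    """F1 overlap between predicted and reference (disease, relation, anatomy) triplets."""
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    precision, recall = tp / len(pred), tp / len(ref)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

# Triplets extracted from a model's chain-of-thought vs. the report-derived graph
# (hypothetical examples).
predicted = [("pleural effusion", "located_in", "right lower lobe"),
             ("cardiomegaly", "affects", "heart")]
reference = [("pleural effusion", "located_in", "right lower lobe"),
             ("atelectasis", "located_in", "left lung base")]

reward = triplet_f1(predicted, reference)   # used as a process-level RL reward signal
print(reward)  # 0.5
```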
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
arXiv:2512.17221v1 Announce Type: new
Abstract: While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.
Fonte: arXiv cs.CV
MLOps/Systems • Score 95
Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus
arXiv:2411.07892v2 Announce Type: replace
Abstract: Podcasts provide highly diverse content to a massive listener base through a unique on-demand modality. However, limited data has prevented large-scale computational analysis of the podcast ecosystem. To fill this gap, we introduce a massive dataset of over 1.1M podcast transcripts that is largely comprehensive of all English language podcasts available through public RSS feeds from May and June of 2020. This data is not limited to text, but rather includes audio features and speaker turns for a subset of 370K episodes, and speaker role inferences and other metadata for all 1.1M episodes. Using this data, we also conduct a foundational investigation into the content, structure, and responsiveness of this ecosystem. Together, our data and analyses open the door to continued computational research of this popular and impactful medium.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering
arXiv:2512.17396v1 Announce Type: cross
Abstract: In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters
arXiv:2505.14886v2 Announce Type: replace
Abstract: Winning competitive debates requires sophisticated reasoning and argument skills. Competitive debate poses unique challenges: (1) time constraints force debaters to make strategic choices about which points to pursue rather than covering all possible arguments; (2) the persuasiveness of the debate relies on the back-and-forth interaction between arguments, which a single final game status cannot evaluate. To address these challenges, we propose TreeDebater, a novel debate framework that excels in competitive debate. We introduce two tree structures: the Rehearsal Tree and the Debate Flow Tree. The Rehearsal Tree anticipates attacks and defenses to evaluate the strength of each claim, while the Debate Flow Tree tracks the debate status to identify the active actions. TreeDebater allocates its time budget among candidate actions and uses the speech time controller and feedback from a simulated audience to revise its statement. Human evaluation on both stage-level and debate-level comparisons shows that TreeDebater outperforms the state-of-the-art multi-agent debate system, with a +15.6% improvement in stage-level persuasiveness with DeepSeek and a +10% win in debate-level opinion shift. Further investigation shows that TreeDebater allocates time to important debate actions more strategically, aligning with the strategies of human debate experts.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Generating Completions for Broca's Aphasic Sentences Using Large Language Models
arXiv:2412.17669v2 Announce Type: replace
Abstract: Broca's aphasia is a type of aphasia characterized by non-fluent, effortful, and agrammatic speech production with relatively good comprehension. Since traditional aphasia treatment methods are often time-consuming, labour-intensive, and do not reflect real-world conversations, applying natural language processing-based approaches such as Large Language Models (LLMs) could potentially contribute to improving existing treatment approaches. To this end, we explore the use of sequence-to-sequence LLMs for completing Broca's aphasic sentences. We first generate synthetic Broca's aphasic data using a rule-based system designed to mirror the linguistic characteristics of Broca's aphasic speech. Using this synthetic data (without authentic aphasic samples), we then fine-tune four pre-trained LLMs on the task of completing agrammatic sentences. We evaluate our fine-tuned models on both synthetic and authentic Broca's aphasic data. We demonstrate LLMs' capability for reconstructing agrammatic sentences, with the models showing improved performance with longer input utterances. Our results highlight the LLMs' potential in advancing communication aids for individuals with Broca's aphasia and possibly other clinical populations.
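A toy illustration of the kind of rule-based transformation used to synthesize agrammatic training pairs is sketched below; the specific word list and rule are assumptions for illustration, not the paper's actual rule system:

```python
import re

# Function words and auxiliaries that agrammatic speech tends to omit
# (this list is illustrative, not the paper's rule set).
DROP = {"the", "a", "an", "is", "are", "was", "were", "to", "of", "will"}

def to_agrammatic(sentence):
    """Crude Broca's-style transformation: strip function words and punctuation."""
    tokens = re.findall(r"[A-Za-z']+", sentence.lower())
    return " ".join(t for t in tokens if t not in DROP)

# A (synthetic agrammatic input, fluent target) pair for fine-tuning a
# sequence-to-sequence model.
target = "The boy is going to the store."
source = to_agrammatic(target)
print(source)   # "boy going store"
print(target)
```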
Fonte: arXiv cs.CL
NLP/LLMs • Score 96
Investigating the Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the capacity to autonomously conceive, investigate, and reason across scientific domains, is still missing. We present an operational definition of SGI grounded in the Practical Inquiry Model (PIM) and operationalize it through four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?
arXiv:2512.17394v1 Announce Type: new
Abstract: Theory of Mind (ToM) -- the ability to attribute beliefs, desires, and emotions to others -- is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, where human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse theory of mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.
Fonte: arXiv cs.CL
Multimodal • Score 95
Peeking Into The Future For Contextual Biasing
arXiv:2512.17657v1 Announce Type: new
Abstract: While end-to-end (E2E) automatic speech recognition (ASR) models excel at general transcription, they struggle to recognize rare or unseen named entities (e.g., contact names, locations), which are critical for downstream applications like virtual assistants. In this paper, we propose a contextual biasing method for attention based encoder decoder (AED) models using a list of candidate named entities. Instead of predicting only the next token, we simultaneously predict multiple future tokens, enabling the model to "peek into the future" and score potential candidate entities in the entity list. Moreover, our approach leverages the multi-token prediction logits directly without requiring additional entity encoders or cross-attention layers, significantly reducing architectural complexity. Experiments on Librispeech demonstrate that our approach achieves up to 50.34% relative improvement in named entity word error rate compared to the baseline AED model.
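A minimal sketch of the entity-scoring idea, assuming a hypothetical multi-token prediction head and toy token ids, is shown below: each candidate entity is scored by the length-normalized sum of its token log-probabilities across the peeked-ahead positions:

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

rng = np.random.default_rng(0)
vocab_size, k_future = 100, 3
# Logits for the next K positions from a multi-token prediction head (toy values).
future_logits = rng.normal(size=(k_future, vocab_size))
future_logp = log_softmax(future_logits)

# Candidate entities from the biasing list, as token-id sequences (hypothetical).
candidates = {"Anna Karlsson": [12, 47], "Annapolis": [12, 63, 9]}

def entity_score(token_ids):
    """Length-normalized sum of the entity's token log-probs over the peeked positions."""
    steps = min(len(token_ids), k_future)
    return sum(future_logp[t, token_ids[t]] for t in range(steps)) / steps

scores = {name: entity_score(ids) for name, ids in candidates.items()}
print(max(scores, key=scores.get))  # the entity the decoder should be biased toward
```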
Fonte: arXiv cs.CL
Vision • Score 95
EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories
arXiv:2512.17320v1 Announce Type: new
Abstract: The widespread adoption of text-to-image (T2I) generation has raised concerns about privacy, bias, and copyright violations. Concept erasure techniques offer a promising solution by selectively removing undesired concepts from pre-trained models without requiring full retraining. However, these methods are often evaluated on a limited set of concepts, relying on overly simplistic and direct prompts. To test the boundaries of concept erasure techniques, and assess whether they truly remove targeted concepts from model representations, we introduce EMMA, a benchmark that evaluates five key dimensions of concept erasure over 12 metrics. EMMA goes beyond standard metrics like image quality and time efficiency, testing robustness under challenging conditions, including indirect descriptions, visually similar non-target concepts, and potential gender and ethnicity bias, providing a socially aware analysis of method behavior. Using EMMA, we analyze five concept erasure methods across five domains (objects, celebrities, art styles, NSFW, and copyright). Our results show that existing methods struggle with implicit prompts (i.e., generating the erased concept when it is indirectly referenced) and visually similar non-target concepts (i.e., failing to generate non-targeted concepts resembling the erased one), while some amplify gender and ethnicity bias compared to the original model.
Fonte: arXiv cs.CV
NLP/LLMs • Score 96
V-Agent: An Interactive Video Search System Using Vision-Language Models
We present V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) on a small video preference dataset and augmenting it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
Simulstream: Open-Source Toolkit for Evaluation and Demonstration of Streaming Speech-to-Text Translation Systems
arXiv:2512.17648v1 Announce Type: new
Abstract: Streaming Speech-to-Text Translation (StreamST) requires producing translations concurrently with incoming speech, imposing strict latency constraints and demanding models that balance partial-information decision-making with high translation quality. Research efforts on the topic have so far relied on the SimulEval repository, which is no longer maintained and does not support systems that revise their outputs. In addition, it has been designed for simulating the processing of short segments, rather than long-form audio streams, and it does not provide an easy method to showcase systems in a demo. As a solution, we introduce simulstream, the first open-source framework dedicated to unified evaluation and demonstration of StreamST systems. Designed for long-form speech processing, it supports not only incremental decoding approaches, but also re-translation methods, enabling for their comparison within the same framework both in terms of quality and latency. In addition, it also offers an interactive web interface to demo any system built within the tool.
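For context, latency in streaming translation is commonly reported with Average Lagging (Ma et al., 2019); the sketch below computes that standard metric on a toy read/write schedule and is offered as background on the quality/latency trade-off such toolkits measure, not as simulstream's own implementation:

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (Ma et al., 2019): g[i] is the number of source tokens
    read before emitting target token i+1 (0-indexed list of length tgt_len)."""
    gamma = tgt_len / src_len
    # tau: first target index (1-based) emitted after the full source was read.
    tau = next((i + 1 for i, gi in enumerate(g) if gi >= src_len), tgt_len)
    return sum(g[i] - i / gamma for i in range(tau)) / tau

# A wait-3-style policy on a 10-token source producing 8 target tokens (toy values).
g = [3, 4, 5, 6, 7, 8, 9, 10]
print(average_lagging(g, src_len=10, tgt_len=8))  # about 2.1 source tokens of lag
```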
Fonte: arXiv cs.CL
NLP/LLMs • Score 96
MMRAG-RFT: Two-Stage Reinforcement Fine-Tuning for Explainable Multimodal Retrieval-Augmented Generation
Multimodal Retrieval-Augmented Generation (MMRAG) enables highly reliable generation by integrating external multimodal knowledge, demonstrating impressive performance in complex scenarios. However, existing methods fail to clarify the reasoning logic behind retrieval and answer generation. We propose introducing reinforcement learning to enhance the reasoning capabilities of multimodal language models.
Fonte: arXiv cs.AI
Vision • Score 95
Endo-SemiS: Toward Robust Semi-Supervised Image Segmentation for Endoscopic Video
In this paper, we present Endo-SemiS, a semi-supervised segmentation framework that provides reliable segmentation of endoscopic video frames with limited annotation. Endo-SemiS uses four strategies to improve performance, effectively exploiting all available data, especially unlabeled data.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs
arXiv:2512.17319v1 Announce Type: new
Abstract: Multimodal large language models (MLLMs) demonstrate strong perception and reasoning performance on existing remote sensing (RS) benchmarks. However, most prior benchmarks rely on low-resolution imagery, and some high-resolution benchmarks suffer from flawed reasoning-task designs. We show that text-only LLMs can perform competitively with multimodal vision-language models on RS reasoning tasks without access to images, revealing a critical mismatch between current benchmarks and the intended evaluation of visual understanding. To enable faithful assessment, we introduce RSHR-Bench, a super-high-resolution benchmark for RS visual understanding and reasoning. RSHR-Bench contains 5,329 full-scene images with a long side of at least 4,000 pixels, with up to about 3 x 10^8 pixels per image, sourced from widely used RS corpora and UAV collections. We design four task families: multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation. These tasks cover nine perception categories and four reasoning types, supporting multi-turn and multi-image dialog. To reduce reliance on language priors, we apply adversarial filtering with strong LLMs followed by rigorous human verification. Overall, we construct 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or verified single-image evaluation VQA pairs. Evaluations across open-source, closed-source, and RS-specific VLMs reveal persistent performance gaps in super-high-resolution scenarios. Code: https://github.com/Yunkaidang/RSHR
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Can Synthetic Images Serve as Effective and Efficient Class Prototypes?
Vision-Language Models (VLMs) have demonstrated strong performance on zero-shot image classification tasks. However, existing methods such as Contrastive Language-Image Pre-training (CLIP) depend on annotated image-text pairs, which raises the cost and precision requirements of preparing high-quality datasets. We present the LGCLIP framework, which uses a Large Language Model (LLM) to generate class-specific prompts, enabling lightweight and efficient synthesis of reference images.
Fonte: arXiv cs.CV
NLP/LLMs • Score 96
InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
Accurate and efficient discrete video tokenization is essential for processing long video sequences. This paper presents InfoTok, a principled framework for adaptive video tokenization, proving that existing training methods are suboptimal and introducing a new ELBO-based algorithm that approaches theoretical optimality.
Fonte: arXiv cs.AI
Vision • Score 95
DESSERT: Diffusion-based Event-driven Single-frame Synthesis via Residual Training
arXiv:2512.17323v1 Announce Type: new
Abstract: Video frame prediction extrapolates future frames from previous frames, but suffers from prediction errors in dynamic scenes due to the lack of information about the next frame. Event cameras address this limitation by capturing per-pixel brightness changes asynchronously with high temporal resolution. Prior research on event-based video frame prediction has leveraged motion information from event data, often by predicting event-based optical flow and reconstructing frames via pixel warping. However, such approaches introduce holes and blurring when pixel displacement is inaccurate. To overcome this limitation, we propose DESSERT, a diffusion-based event-driven single-frame synthesis framework via residual training. Leveraging a pre-trained Stable Diffusion model, our method is trained on inter-frame residuals to ensure temporal consistency. The training pipeline consists of two stages: (1) an Event-to-Residual Alignment Variational Autoencoder (ER-VAE) that aligns the event frame between anchor and target frames with the corresponding residual, and (2) a diffusion model that denoises the residual latent conditioned on event data. Furthermore, we introduce Diverse-Length Temporal (DLT) augmentation, which improves robustness by training on frame segments of varying temporal lengths. Experimental results demonstrate that our method outperforms existing event-based reconstruction, image-based video frame prediction, event-based video frame prediction, and one-sided event-based video frame interpolation methods, producing sharper and more temporally consistent frame synthesis.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Vision-Language Model Guided Image Restoration
arXiv:2512.17292v1 Announce Type: new
Abstract: Many image restoration (IR) tasks require both pixel-level fidelity and high-level semantic understanding to recover realistic photos with fine-grained details. However, previous approaches often struggle to effectively leverage both the visual and linguistic knowledge. Recent efforts have attempted to incorporate Vision-language models (VLMs), which excel at aligning visual and textual features, into universal IR. Nevertheless, these methods fail to utilize the linguistic priors to ensure semantic coherence during the restoration process. To address this issue, in this paper, we propose the Vision-Language Model Guided Image Restoration (VLMIR) framework, which leverages the rich vision-language priors of VLMs, such as CLIP, to enhance IR performance through improved visual perception and semantic understanding. Our approach consists of two stages: VLM-based feature extraction and diffusion-based image restoration. In the first stage, we extract complementary visual and linguistic representations of input images by condensing the visual perception and high-level semantic priors through VLMs. Specifically, we align the embeddings of captions from low-quality and high-quality images using a cosine similarity loss with LoRA fine-tuning, and employ a degradation predictor to decompose degradation and clean image content embeddings. These complementary visual and textual embeddings are then integrated into a diffusion-based model via cross-attention mechanisms for enhanced restoration. Extensive experiments and ablation studies demonstrate that VLMIR achieves superior performance across both universal and degradation-specific IR tasks, underscoring the critical role of integrated visual and linguistic knowledge from VLMs in advancing image restoration capabilities.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs
arXiv:2512.17189v1 Announce Type: new
Abstract: Medical Vision-Language Models (MedVLMs) show immense promise in clinical applicability. However, their reliability is hindered by hallucinations, where models often fail to derive answers from visual evidence, instead relying on learned textual priors. Existing mitigation strategies for MedVLMs have distinct limitations: training-based methods rely on costly expert annotations, limiting scalability, while training-free interventions like contrastive decoding, though data-efficient, apply a global, untargeted correction whose effects in complex real-world clinical settings can be unreliable. To address these challenges, we introduce Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our module leverages an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model's focus onto specified regions, reinforcing anatomical understanding and suppressing factually incorrect outputs. Extensive experiments across diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, demonstrate our method's effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.
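At the logits level, one standard way to realize such contrastive decoding is to amplify what a region-guided pass supports relative to an unguided pass; the sketch below shows that generic combination with toy numbers, while the paper's full three-tier re-weighting (token and attention levels) is not reproduced here:

```python
import numpy as np

def contrastive_logits(region_logits, plain_logits, alpha=1.0):
    """Standard logit-level contrastive decoding: boost tokens the region-guided
    pass supports and the unguided, prior-driven pass does not."""
    return (1.0 + alpha) * region_logits - alpha * plain_logits

def decode_step(region_logits, plain_logits, alpha=1.0):
    return int(np.argmax(contrastive_logits(region_logits, plain_logits, alpha)))

# Toy next-token logits: index 0 = "effusion", 1 = "normal", 2 = "fracture".
region_logits = np.array([2.5, 1.0, 0.2])   # pass guided by the anatomical mask
plain_logits = np.array([1.0, 2.0, 0.2])    # unguided pass, dominated by text priors
print(decode_step(region_logits, plain_logits))   # 0: the region evidence wins
```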
Fonte: arXiv cs.CV
NLP/LLMs • Score 96
Lights, Camera, Consistency: A Multi-Stage Pipeline for AI Video Stories with Stable Characters
Generating long, cohesive video stories with consistent characters is a significant challenge for current text-to-video AI. We present a method that approaches video generation the way a filmmaker would, using a pipeline that involves creating a detailed script and generating consistent visuals for each character.
Fonte: arXiv cs.AI
Vision • Score 95
PhysFire-WM: A Physics-Informed World Model for Emulating Fire Spread Dynamics
arXiv:2512.17152v1 Announce Type: new
Abstract: Fine-grained fire prediction plays a crucial role in emergency response. Infrared images and fire masks provide complementary thermal and boundary information, yet current methods are predominantly limited to binary mask modeling with inherent signal sparsity, failing to capture the complex dynamics of fire. While world models show promise in video generation, their physical inconsistencies pose significant challenges for fire forecasting. This paper introduces PhysFire-WM, a Physics-informed World Model for emulating Fire spread dynamics. Our approach internalizes combustion dynamics by encoding structured priors from a Physical Simulator to rectify physical discrepancies, coupled with a Cross-task Collaborative Training strategy (CC-Train) that alleviates the issue of limited information in mask-based modeling. Through parameter sharing and gradient coordination, CC-Train effectively integrates thermal radiation dynamics and spatial boundary delineation, enhancing both physical realism and geometric accuracy. Extensive experiments on a fine-grained multimodal fire dataset demonstrate the superior accuracy of PhysFire-WM in fire spread prediction. Validation underscores the importance of physical priors and cross-task collaboration, providing new insights for applying physics-informed world models to disaster prediction.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
arXiv:2512.16978v1 Announce Type: new
Abstract: Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both, and while some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and a graded rubric for interpretable and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: https://github.com/mbzuai-oryx/longshot.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning
arXiv:2512.17227v1 Announce Type: new
Abstract: Multimodal Large Language Models (MLLMs) demonstrate significant potential but remain brittle in complex, long-chain visual reasoning tasks. A critical failure mode is "visual forgetting", where models progressively lose visual grounding as reasoning extends, a phenomenon aptly described as "think longer, see less". We posit this failure stems from current training paradigms prematurely entangling two distinct cognitive skills: (1) abstract logical reasoning ("how-to-think") and (2) strategic visual perception ("when-to-look"). This creates a foundational cold-start deficiency -- weakening abstract reasoning -- and a strategic perception deficit, as models lack a policy for when to perceive. In this paper, we propose a novel curriculum-based framework to disentangle these skills. First, we introduce a disentangled Supervised Fine-Tuning (SFT) curriculum that builds a robust abstract reasoning backbone on text-only data before anchoring it to vision with a novel Perception-Grounded Chain-of-Thought (PG-CoT) paradigm. Second, we resolve the strategic perception deficit by formulating timing as a reinforcement learning problem. We design a Pivotal Perception Reward that teaches the model when to look by coupling perceptual actions to linguistic markers of cognitive uncertainty (e.g., "wait", "verify"), thereby learning an autonomous grounding policy. Our contributions include the formalization of these two deficiencies and the development of a principled, two-stage framework to address them, transforming the model from a heuristic-driven observer to a strategic, grounded reasoner. Code: https://github.com/gaozilve-max/learning-when-to-look.
Fonte: arXiv cs.CV
Theory/Optimization • Score 92
Weighted Stochastic Differential Equation to Implement Wasserstein-Fisher-Rao Gradient Flow
arXiv:2512.17878v1 Announce Type: cross
Abstract: Score-based diffusion models currently constitute the state of the art in continuous generative modeling. These methods are typically formulated via overdamped or underdamped Ornstein--Uhlenbeck-type stochastic differential equations, in which sampling is driven by a combination of deterministic drift and Brownian diffusion, resulting in continuous particle trajectories in the ambient space. While such dynamics enjoy exponential convergence guarantees for strongly log-concave target distributions, it is well known that their mixing rates deteriorate exponentially in the presence of nonconvex or multimodal landscapes, such as double-well potentials. Since many practical generative modeling tasks involve highly non-log-concave target distributions, considerable recent effort has been devoted to developing sampling schemes that improve exploration beyond classical diffusion dynamics.
A promising line of work leverages tools from information geometry to augment diffusion-based samplers with controlled mass reweighting mechanisms. This perspective leads naturally to Wasserstein--Fisher--Rao (WFR) geometries, which couple transport in the sample space with vertical (reaction) dynamics on the space of probability measures. In this work, we formulate such reweighting mechanisms through the introduction of explicit correction terms and show how they can be implemented via weighted stochastic differential equations using the Feynman--Kac representation. Our study provides a preliminary but rigorous investigation of WFR-based sampling dynamics, and aims to clarify their geometric and operator-theoretic structure as a foundation for future theoretical and algorithmic developments.
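As background, a standard mass-preserving form of the WFR gradient flow and the Feynman-Kac weight carried by a weighted-SDE particle implementation are sketched below; this is the generic structure of such dynamics, not necessarily the paper's exact correction terms:

```latex
% WFR gradient flow of a functional F[\rho]: transport (Wasserstein) part plus
% reaction (Fisher--Rao) part; the mean subtraction keeps total mass fixed.
\begin{equation*}
  \partial_t \rho
  \;=\;
  \nabla \cdot \Big( \rho \, \nabla \frac{\delta F}{\delta \rho} \Big)
  \;-\;
  \rho \left( \frac{\delta F}{\delta \rho}
      - \int \frac{\delta F}{\delta \rho} \, \mathrm{d}\rho \right).
\end{equation*}
% In a weighted-SDE implementation, particles X_t follow the drift-diffusion
% dynamics while carrying Feynman--Kac weights
\begin{equation*}
  w_t \;=\; \exp\!\Big( - \int_0^t V(X_s)\, \mathrm{d}s \Big),
\end{equation*}
% where V plays the role of the (mean-centered) reaction rate above.
```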
Fonte: arXiv stat.ML
Vision • Score 95
Disentangled representations via score-based variational autoencoders
arXiv:2512.17127v1 Announce Type: new
Abstract: We present the Score-based Autoencoder for Multiscale Inference (SAMI), a method for unsupervised representation learning that combines the theoretical frameworks of diffusion models and VAEs. By unifying their respective evidence lower bounds, SAMI formulates a principled objective that learns representations through score-based guidance of the underlying diffusion process. The resulting representations automatically capture meaningful structure in the data: it recovers ground truth generative factors in our synthetic dataset, learns factorized, semantic latent dimensions from complex natural images, and encodes video sequences into latent trajectories that are straighter than those of alternative encoders, despite training exclusively on static images. Furthermore, SAMI can extract useful representations from pre-trained diffusion models with minimal additional training. Finally, the explicitly probabilistic formulation provides new ways to identify semantically meaningful axes in the absence of supervised labels, and its mathematical exactness allows us to make formal statements about the nature of the learned representation. Overall, these results indicate that implicit structural information in diffusion models can be made explicit and interpretable through synergistic combination with a variational autoencoder.
Fonte: arXiv stat.ML
NLP/LLMs • Score 96
Adversarially Robust Detection of Harmful Online Content: A Computational Design Science Approach
arXiv:2512.17367v1 Announce Type: new
Abstract: Social media platforms are plagued by harmful content such as hate speech, misinformation, and extremist rhetoric. Machine learning (ML) models are widely adopted to detect such content; however, they remain highly vulnerable to adversarial attacks, wherein malicious users subtly modify text to evade detection. Enhancing adversarial robustness is therefore essential, requiring detectors that can defend against diverse attacks (generalizability) while maintaining high overall accuracy. However, simultaneously achieving both optimal generalizability and accuracy is challenging. Following the computational design science paradigm, this study takes a sequential approach that first proposes a novel framework (Large Language Model-based Sample Generation and Aggregation, LLM-SGA) by identifying the key invariances of textual adversarial attacks and leveraging them to ensure that a detector instantiated within the framework has strong generalizability. Second, we instantiate our detector (Adversarially Robust Harmful Online Content Detector, ARHOCD) with three novel design components to improve detection accuracy: (1) an ensemble of multiple base detectors that exploits their complementary strengths; (2) a novel weight assignment method that dynamically adjusts weights based on each sample's predictability and each base detector's capability, with weights initialized using domain knowledge and updated via Bayesian inference; and (3) a novel adversarial training strategy that iteratively optimizes both the base detectors and the weight assignor. We addressed several limitations of existing adversarial robustness enhancement research and empirically evaluated ARHOCD across three datasets spanning hate speech, rumor, and extremist content. Results show that ARHOCD offers strong generalizability and improves detection accuracy under adversarial conditions.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
The Effect of Negation on CLIP in Medical Imaging: Limitations of Contrastive Language-Image Pretraining
arXiv:2512.17121v1 Announce Type: new
Abstract: Large vision-language models like CLIP are increasingly used in medical imaging tasks due to their ability to align images and text without the need for extensive labeled data. This makes them particularly useful for applications like image retrieval, report generation, and classification in clinical settings. A potential issue with this approach is that CLIP-based models often underperform when interpreting negated phrases, which is especially problematic in the context of medical diagnosis. In this study, we evaluate the Stanford AIMI CheXagent model on its ability to correctly retrieve chest X-ray images using prompts with and without negation. The goal of this project is to understand where this model fails and then use it as a base model to improve its retrieval accuracy via fine-tuning methods outlined in previous work. Results from this study show improved handling of negation in the CLIP model with a slight decrease in accuracy on positive prompt evaluation. Alongside retrieval accuracy, we examined internal model behavior through token attribution, t-SNE projection, and attention-head ablation to better characterize how each fine-tuning approach reshaped the text encoder's representation of negated clinical language. Through this work, we hope to better understand the internal behavior of CLIP and improve its handling of negation using clinically relevant language, thereby improving its reliability in medical AI devices.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
Atom: Efficient On-Device Video-Language Pipelines Through Modular Reuse
arXiv:2512.17108v1 Announce Type: new
Abstract: Recent advances in video-language models have enabled powerful applications like video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging due to redundant model loads and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and reuses them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency without sacrificing performance. On commodity smartphones, Atom achieves 27--33% faster execution compared to non-reuse baselines, with only marginal performance drop ($\leq$ 2.3 Recall@1 in retrieval, $\leq$ 1.5 CIDEr in captioning). These results position Atom as a practical, scalable approach for efficient video-language understanding on edge devices.
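The reuse-centric design can be sketched as a small module registry that loads each heavy component once and shares it across subtasks; the class and function names below are illustrative, not Atom's API:

```python
from functools import lru_cache

# Stand-ins for the heavy modules; in the real system these are slices of a
# billion-parameter video-language model (all names here are hypothetical).
class VisualEncoder:
    def encode(self, frames):
        return [f"feat({f})" for f in frames]

class LanguageDecoder:
    def generate(self, features, task):
        return f"{task}: {len(features)} frame features"

@lru_cache(maxsize=None)
def get_module(name):
    """Load each module once and reuse it across subtasks (no repeated loads)."""
    return {"vision": VisualEncoder, "language": LanguageDecoder}[name]()

def run_pipeline(frames):
    feats = get_module("vision").encode(frames)     # shared visual features
    decoder = get_module("language")                # same decoder instance reused
    return {task: decoder.generate(feats, task)
            for task in ("caption", "reasoning", "index")}

print(run_pipeline(("f0", "f1", "f2")))
```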
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
Incorporating Noise-Error-Level Embeddings to Improve LLM-Assisted Robustness in Persian Speech Recognition
Automatic Speech Recognition (ASR) systems suffer significant performance degradation in noisy environments, especially for low-resource languages such as Persian. This study presents a robust noise-sensitive ASR error-correction framework that combines multiple hypotheses with noise-aware modeling, demonstrating substantial improvements in Word Error Rate (WER).
Fonte: arXiv cs.AI
MLOps/Systems • Score 96
UmniBench: An Omni-Dimensional Benchmark for Unified Understanding and Generation Models
UmniBench is a benchmark designed for Unified Multimodal Models (UMMs), enabling omni-dimensional evaluation. It assesses understanding, generation, and editing capabilities in a single process, using human-reviewed prompts and QA pairs. The benchmark covers 13 major domains and more than 200 concepts, providing a comprehensive and objective view of unified models.
Fonte: arXiv cs.AI
Vision • Score 95
Infinite Homography as Robust Conditioning for Camera-Controlled Video Generation
Recent progress in video diffusion models has generated growing interest in camera-controlled novel-view video generation for dynamic scenes. A key challenge is ensuring fidelity to the specified camera pose while maintaining view consistency and reasoning about occluded geometry from limited observations. We present InfCam, a camera-controlled video-to-video generation framework with high pose fidelity.
Fonte: arXiv cs.CV
Vision • Score 95
ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching
arXiv:2512.17178v1 Announce Type: new
Abstract: Contrastive Language-Image Pretraining (CLIP) has achieved remarkable performance in various multimodal tasks. However, it still struggles with compositional image-text matching, particularly in accurately associating objects with their corresponding attributes, because its inherent global representation often overlooks fine-grained semantics for attribute binding. Existing methods often require additional training or extensive hard negative sampling, yet they frequently show limited generalization to novel compositional concepts and fail to fundamentally address the drawbacks of global representations. In this paper, we propose ABE-CLIP, a novel training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. Specifically, we employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in the text, thereby mitigating attribute confusion and improving semantic precision. We further introduce a Local Token-Patch Alignment strategy that computes similarity scores between refined textual tokens and their most relevant image patches. By aggregating localized similarity scores, ABE-CLIP computes the final image-text similarity. Experiments on multiple datasets demonstrate that ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training.
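A numpy sketch of the Local Token-Patch Alignment step is given below, assuming toy embeddings and a mean-over-tokens aggregation; the exact refinement and aggregation in ABE-CLIP may differ:

```python
import numpy as np

def l2norm(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

rng = np.random.default_rng(0)
d, n_tokens, n_patches = 32, 4, 49
token_emb = l2norm(rng.normal(size=(n_tokens, d)))   # refined object/attribute tokens
patch_emb = l2norm(rng.normal(size=(n_patches, d)))  # image patch embeddings

def local_alignment_score(tokens, patches):
    """For each textual token, take its best-matching patch; average over tokens."""
    sim = tokens @ patches.T              # (n_tokens, n_patches) cosine similarities
    per_token_best = sim.max(axis=1)      # localized token-to-patch match
    return float(per_token_best.mean())   # aggregated image-text similarity

print(local_alignment_score(token_emb, patch_emb))
```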
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Mitty: Diffusion-Based Human-to-Robot Video Generation
Learning directly from human demonstration videos is an important milestone for scalable and generalizable robot learning. We present Mitty, a Diffusion Transformer that enables video In-Context Learning for Human2Robot video generation, building on a pre-trained video diffusion model and requiring no action labels.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos
arXiv:2512.17229v1 Announce Type: new
Abstract: Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which can also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending the model's context length, they may miss useful information or require considerable computation. In fact, answering a given question typically requires only a small amount of crucial information. Therefore, we propose an efficient question-aware memory mechanism, enabling MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies this task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy is employed by introducing a few special memory tokens to achieve purposeful compression. This allows models to effectively seek critical clues while reducing visual tokens. Then, because the history context can have a significant impact, we recurrently aggregate and store these memory tokens to update the history context, which is reused for subsequent sub-segments. Furthermore, to more effectively measure a model's long video understanding ability, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset, which features grounding critical and concrete clues scattered throughout entire videos. Experimental results demonstrate that our method enables MLLMs with a limited context length of 32K to efficiently process 100K tokens (3,600 frames, an hour-long video sampled at 1 fps), requiring only 2 minutes and 37 GB of GPU memory. Evaluation results across multiple long video benchmarks illustrate that our method can more effectively seek critical clues from massive information.
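At a high level, the recurrent memory mechanism can be sketched as follows; the similarity-weighted compression below is a toy stand-in for the learned question-aware memory tokens:

```python
import numpy as np

def l2norm(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

rng = np.random.default_rng(0)
d, n_segments, frames_per_seg, n_memory = 16, 5, 8, 2
question = l2norm(rng.normal(size=d))

def compress(segment_feats, question, n_memory):
    """Question-aware compression: keep pooled features around the frames most
    relevant to the question (toy stand-in for learned memory tokens)."""
    relevance = segment_feats @ question
    top = np.argsort(relevance)[-n_memory:]
    return segment_feats[top]

memory = np.zeros((0, d))
for _ in range(n_segments):                      # recurrent pass over sub-segments
    segment = l2norm(rng.normal(size=(frames_per_seg, d)))
    context = np.concatenate([memory, segment])  # history memory + current frames
    memory = np.concatenate([memory, compress(context, question, n_memory)])
    memory = memory[-3 * n_memory:]              # keep the carried history bounded

print(memory.shape)   # a small set of clues carried into answer generation
```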
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Democratizing Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling
arXiv:2512.17326v1 Announce Type: new
Abstract: Vision-language models (VLMs) have the potential to become co-pilots for pathologists. However, most VLMs either focus on small regions of interest within whole-slide images, provide only static slide-level outputs, or rely on data that is not publicly available, limiting reproducibility. Furthermore, training data containing WSIs paired with detailed clinical reports is scarce, restricting progress toward transparent and generalisable VLMs. We address these limitations with three main contributions. First, we introduce Polysome, a standardised tool for synthetic instruction generation. Second, we apply Polysome to the public HISTAI dataset, generating HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, we use HISTAI-Instruct to train ANTONI-α, a VLM capable of visual question answering (VQA). We show that ANTONI-α outperforms MedGemma on WSI-level VQA tasks of tissue identification, neoplasm detection, and differential diagnosis. We also compare the performance of multiple incarnations of ANTONI-α trained with different amounts of data. All methods, data, and code are publicly available.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images
arXiv:2512.17306v1 Announce Type: new
Abstract: Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.
Fonte: arXiv cs.CV