Computer Vision

Browse the articles under this label, with PT-BR translations.


Articles

👁️ Computer Vision: 177 article(s) found


NLP/LLMs • Score 85

Toward Auditable Neuro-Symbolic Reasoning in Pathology: SQL as an Explicit Trace of Evidence

arXiv:2601.01875v1 Announce Type: new Abstract: Automated pathology image analysis is central to clinical diagnosis, but clinicians still ask which slide features drive a model's decision and why. Vision-language models can produce natural language explanations, but these are often correlational and lack verifiable evidence. In this paper, we introduce an SQL-centered agentic framework that enables both feature measurement and reasoning to be auditable. Specifically, after extracting human-interpretable cellular features, Feature Reasoning Agents compose and execute SQL queries over feature tables to aggregate visual evidence into quantitative findings. A Knowledge Comparison Agent then evaluates these findings against established pathological knowledge, mirroring how pathologists justify diagnoses from measurable observations. Extensive experiments on two pathology visual question answering datasets demonstrate that our method improves interpretability and decision traceability while producing executable SQL traces that link cellular measurements to diagnostic conclusions.
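
To make the SQL-as-evidence idea concrete, here is a minimal sketch of the kind of auditable trace a Feature Reasoning Agent could produce: an executable query that aggregates per-cell measurements into a quantitative finding. The table schema, feature names, and the circularity cutoff are hypothetical illustrations, not taken from the paper.

```python
import sqlite3

# Hypothetical per-cell feature table, as a Feature Reasoning Agent might see it.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE cell_features (
    cell_id INTEGER PRIMARY KEY,
    nucleus_area REAL,        -- in square micrometers
    nucleus_circularity REAL, -- 1.0 = perfect circle
    mitotic INTEGER           -- 1 if a mitotic figure was detected
)""")
conn.executemany(
    "INSERT INTO cell_features VALUES (?, ?, ?, ?)",
    [(1, 42.0, 0.91, 0), (2, 88.5, 0.55, 1), (3, 95.2, 0.48, 1), (4, 40.1, 0.91, 0)],
)

# The agent composes an executable query that aggregates visual evidence into
# a quantitative finding; the SQL text itself is the auditable trace.
query = """
SELECT COUNT(*)                      AS n_cells,
       AVG(nucleus_area)             AS mean_nucleus_area,
       SUM(mitotic) * 1.0 / COUNT(*) AS mitotic_fraction
FROM cell_features
WHERE nucleus_circularity < 0.6      -- irregular nuclei (hypothetical cutoff)
"""
print(query.strip())
print(conn.execute(query).fetchone())  # -> (2, 91.85, 1.0)
```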

Source: arXiv cs.AI

NLP/LLMs • Score 85

XAI-MeD: Explainable Knowledge Guided Neuro-Symbolic Framework for Domain Generalization and Rare Class Detection in Medical Imaging

arXiv:2601.02008v1 Announce Type: new Abstract: Explainability, domain generalization, and rare-class reliability are critical challenges in medical AI, where deep models often fail under real-world distribution shifts and exhibit bias against infrequent clinical conditions. This paper introduces XAI-MeD, an explainable medical AI framework that integrates clinically accurate expert knowledge into deep learning through a unified neuro-symbolic architecture. XAI-MeD is designed to improve robustness under distribution shift, enhance rare-class sensitivity, and deliver transparent, clinically aligned interpretations. The framework encodes clinical expertise as logical connectives over atomic medical propositions, transforming them into machine-checkable, class-specific rules. Their diagnostic utility is quantified through weighted feature-satisfaction scores, enabling a symbolic reasoning branch that complements neural predictions. A confidence-weighted fusion integrates symbolic and deep outputs, while a Hunt-inspired adaptive routing mechanism guided by Entropy Imbalance Gain (EIG) and Rare-Class Gini mitigates class imbalance, high intra-class variability, and uncertainty. We evaluate XAI-MeD across diverse modalities on four challenging tasks, including (i) Seizure Onset Zone (SOZ) localization from rs-fMRI and (ii) Diabetic Retinopathy grading across six multicenter datasets. Results demonstrate substantial performance improvements, including 6 percent gains in cross-domain generalization and a 10 percent improvement in rare-class F1 score, far outperforming state-of-the-art deep learning baselines. Ablation studies confirm that the clinically grounded symbolic components act as effective regularizers, ensuring robustness to distribution shifts. XAI-MeD thus provides a principled, clinically faithful, and interpretable approach to multimodal medical AI.
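
As a rough illustration of the symbolic branch, the sketch below scores weighted rule satisfaction and fuses it with neural class probabilities. The rules, weights, and fusion form are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def rule_satisfaction(features: dict, rules: list) -> float:
    """Weighted fraction of class-specific rules satisfied by the features.
    Each rule is (predicate, weight); predicates are machine-checkable."""
    total = sum(w for _, w in rules)
    hit = sum(w for pred, w in rules if pred(features))
    return hit / total if total else 0.0

# Hypothetical atomic medical propositions composed with logical connectives.
rules_class_a = [
    (lambda f: f["lesion_count"] >= 5 and f["hemorrhage"], 2.0),
    (lambda f: f["vessel_tortuosity"] > 0.7, 1.0),
]

features = {"lesion_count": 7, "hemorrhage": True, "vessel_tortuosity": 0.6}
p_neural = np.array([0.55, 0.45])   # deep branch probabilities, classes [A, B]
s_symbolic = np.array([rule_satisfaction(features, rules_class_a), 0.2])

alpha = 0.6  # confidence weight (assumed fixed here; the paper learns fusion)
fused = alpha * p_neural + (1 - alpha) * s_symbolic / s_symbolic.sum()
print(fused / fused.sum())
```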

Source: arXiv cs.AI

NLP/LLMs • Score 85

OpenSocInt: A Multi-modal Training Environment for Human-Aware Social Navigation

arXiv:2601.01939v1 Announce Type: new Abstract: In this paper, we introduce OpenSocInt, an open-source software package providing a simulator for multi-modal social interactions and a modular architecture for training social agents. We describe the software package and showcase its use via an experimental protocol based on the task of social navigation. Our framework allows for exploring the use of different perceptual features, their encoding and fusion, as well as the use of different agents. The software is publicly available under the GPL at https://gitlab.inria.fr/robotlearn/OpenSocInt/.

Source: arXiv cs.AI

Vision • Score 92

DynaDrag: Dynamic Drag-Style Image Editing by Motion Prediction

arXiv:2601.00542v1 Announce Type: new Abstract: To achieve pixel-level image manipulation, drag-style image editing, which edits images using points or trajectories as conditions, is attracting widespread attention. Most previous methods follow a move-and-track framework, in which missed tracking and ambiguous tracking are unavoidable challenges. Other methods under different frameworks suffer from various problems, such as the large gap between the source image and the target edited image, as well as unreasonable intermediate points, which can lead to low editability. To avoid these problems, we propose DynaDrag, the first dragging method under a predict-and-move framework. In DynaDrag, Motion Prediction and Motion Supervision are performed iteratively: in each iteration, Motion Prediction first predicts where the handle points should move, and then Motion Supervision drags them accordingly. We also propose dynamically adjusting the valid handle points to further improve performance. Experiments on face and human datasets showcase its superiority over previous works.
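
A schematic of one predict-and-move iteration, assuming callable `predictor` and `supervisor` objects that stand in for the paper's learned modules; the update rule and demo values below are illustrative only.

```python
import torch

def dynadrag_step(latent, handles, targets, predictor, supervisor, lr=0.01):
    """One predict-and-move iteration (schematic): Motion Prediction proposes
    where each handle point should move next; Motion Supervision then drags
    the latent so features at the handles follow that prediction."""
    with torch.no_grad():
        next_pts = predictor(latent, handles, targets)   # predicted waypoints
    latent = latent.detach().requires_grad_(True)
    loss = supervisor(latent, handles, next_pts)         # feature-matching loss
    loss.backward()
    with torch.no_grad():
        latent -= lr * latent.grad                       # move latent toward the edit
    return latent.detach(), next_pts

# Toy demo with stand-in modules (not the paper's networks):
lat = torch.zeros(1, 4, 8, 8)
pred = lambda l, h, t: (h + t) / 2                       # midpoint as "prediction"
sup = lambda l, h, p: (l.mean() - p.mean()) ** 2         # dummy supervision loss
lat, waypoints = dynadrag_step(lat, torch.tensor([[2.0, 2.0]]),
                               torch.tensor([[6.0, 6.0]]), pred, sup)
print(waypoints)  # tensor([[4., 4.]])
```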

Source: arXiv cs.CV

Vision • Score 96

From Clay to Code: Typological and Material Reasoning in AI Interpretations of Iranian Pigeon Towers

This study investigates how generative AI systems interpret the architectural intelligence embedded in vernacular form. Using the Iranian pigeon tower as a case study, the research tests three diffusion models, Midjourney v6, DALL-E 3, and DreamStudio based on Stable Diffusion XL (SDXL), across three prompting stages: referential, adaptive, and speculative.

Source: arXiv cs.AI

Vision • Score 95

Compressed Map Priors for 3D Perception

arXiv:2601.00139v1 Announce Type: new Abstract: Human drivers rarely travel where no person has gone before. After all, thousands of drivers use busy city roads every day, and only one can claim to be the first. The same holds for autonomous computer vision systems: the vast majority of the deployment area of an autonomous vision system will have been visited before. Yet most autonomous vehicle vision systems act as if they are encountering each location for the first time. In this work, we present Compressed Map Priors (CMP), a simple but effective framework for learning spatial priors from historic traversals. The map priors use a binarized hashmap that requires only $32\text{KB}/\text{km}^2$, a $20\times$ reduction compared to dense storage. Compressed Map Priors integrate easily into leading 3D perception systems at little to no extra computational cost, and lead to a significant and consistent improvement in 3D object detection on the nuScenes dataset across several architectures.
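
The storage figure implies roughly one bit per small ground cell: 32 KB is 262,144 bits, i.e. a 512 x 512 binary grid per square kilometer, or about 2 m x 2 m cells. A minimal sketch of such a binarized prior follows; the tiling and bit-packing layout are our assumptions, not the paper's exact scheme.

```python
import numpy as np

GRID = 512  # 512 x 512 bits = 32 KB per 1 km x 1 km tile (assumed layout)

class CompressedMapPrior:
    def __init__(self):
        self.tiles = {}  # (tile_x, tile_y) -> packed bit array

    def _index(self, x_m, y_m):
        tile = (int(x_m // 1000), int(y_m // 1000))
        gx = int((x_m % 1000) / 1000 * GRID)
        gy = int((y_m % 1000) / 1000 * GRID)
        return tile, gx * GRID + gy

    def mark_traversed(self, x_m, y_m):
        tile, bit = self._index(x_m, y_m)
        arr = self.tiles.setdefault(tile, np.zeros(GRID * GRID // 8, np.uint8))
        arr[bit // 8] |= 1 << (bit % 8)

    def prior(self, x_m, y_m) -> bool:
        tile, bit = self._index(x_m, y_m)
        arr = self.tiles.get(tile)
        return bool(arr is not None and arr[bit // 8] >> (bit % 8) & 1)

cmp_prior = CompressedMapPrior()
cmp_prior.mark_traversed(123.4, 987.6)
print(cmp_prior.prior(123.4, 987.6), cmp_prior.prior(500.0, 500.0))  # True False
```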

Source: arXiv cs.CV

NLP/LLMs • Score 96

A Vision- and Knowledge-Enhanced Large Language Model for Generalizable Pedestrian Crossing Behavior Inference

Existing paradigms for inferring pedestrian crossing behavior show limited generalization and inadequate performance at new locations. This study introduces the Pedestrian Crossing LLM (PedX-LLM), an enhanced framework that shifts crossing inference from location-specific patterns to generalizable behavioral reasoning, achieving 82.0% balanced accuracy and outperforming traditional methods.

Source: arXiv cs.AI

Vision • Score 89

Streaming Sliced Optimal Transport

Sliced optimal transport (SOT), also known as the sliced Wasserstein (SW) distance, is widely recognized for its statistical and computational scalability. In this work, we further improve computational scalability by proposing the first method for estimating SW from sample streams, called streaming sliced Wasserstein (Stream-SW).
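
For intuition, here is a toy Monte Carlo estimate of the sliced Wasserstein distance computed from buffered sample chunks. This sketch does not reproduce the paper's constant-memory streaming estimator; the projection count and chunk sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_proj = 3, 64
thetas = rng.normal(size=(n_proj, d))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # directions on the sphere

def sw2_from_samples(xs, ys, thetas):
    """Monte Carlo sliced 2-Wasserstein: project both sample sets onto each
    direction and average the 1-D W2^2 between sorted projections."""
    px = np.sort(xs @ thetas.T, axis=0)   # shape (n, n_proj)
    py = np.sort(ys @ thetas.T, axis=0)
    return np.mean((px - py) ** 2)

# Two toy "streams" consumed in chunks; here we simply buffer the chunks.
xs = np.concatenate([rng.normal(size=(200, d)) for _ in range(5)])
ys = np.concatenate([rng.normal(loc=0.5, size=(200, d)) for _ in range(5)])
print(np.sqrt(sw2_from_samples(xs, ys, thetas)))
```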

Source: arXiv stat.ML

NLP/LLMs • Score 94

Explicit Abstention Controls for Predictable Reliability in Video Question Answering

Deploying vision-language models (VLMs) in high-stakes settings requires selective prediction, where systems abstain when uncertain to avoid costly errors. We investigate whether confidence-based abstention offers reliable control over error rates in video question answering, and whether this control remains robust under distribution shift.
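
A minimal sketch of confidence-based selective prediction: calibrate a confidence threshold on held-out data to target a given error rate, then abstain below it. The calibration rule here is a simple illustration, not the paper's protocol.

```python
import numpy as np

def calibrate_threshold(conf, correct, target_risk=0.05):
    """Smallest confidence threshold whose accepted subset has empirical
    error <= target_risk on the calibration set (greedy prefix rule)."""
    order = np.argsort(-conf)                  # most confident first
    errs = np.cumsum(~correct[order])
    risk = errs / (np.arange(len(conf)) + 1)   # running error among accepted
    ok = np.where(risk <= target_risk)[0]
    return conf[order][ok[-1]] if len(ok) else np.inf

rng = np.random.default_rng(1)
conf = rng.uniform(size=1000)
correct = rng.uniform(size=1000) < conf        # toy: higher confidence, more correct
tau = calibrate_threshold(conf, correct, target_risk=0.10)
accept = conf >= tau
print(f"tau={tau:.3f} coverage={accept.mean():.2f} "
      f"risk={1 - correct[accept].mean():.3f}")
```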

Source: arXiv cs.AI

Vision • Score 93

Custom Spiking Neural Networks with Ferroelectric Synapses for EEG Signal Processing

Electroencephalography (EEG)-based brain-computer interfaces (BCIs) are strongly affected by non-stationary neural signals, which limits the generalization of subject-independent models. This work demonstrates that spiking neural networks (SNNs) can be implemented on ferroelectric memristive synaptic devices for adaptive decoding of EEG-based motor imagery, even under device constraints.

Source: arXiv cs.AI

Vision • Score 95

Simulation as Supervision: Mechanistic Pretraining for Scientific Discovery

Scientific modeling faces a trade-off between the interpretability of mechanistic theory and the predictive power of machine learning. We present Simulation-Grounded Neural Networks (SGNNs), a framework that embeds domain knowledge in the training data, allowing the model to learn broad patterns of physical possibility and to be more robust to model misspecification.

Source: arXiv stat.ML

Vision • Score 95

DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection

arXiv:2601.00303v1 Announce Type: new Abstract: Speech is a scalable and non-invasive biomarker for early mental health screening. However, widely used depression datasets like DAIC-WOZ exhibit strong coupling between linguistic sentiment and diagnostic labels, encouraging models to learn semantic shortcuts. As a result, model robustness may be compromised in real-world scenarios, such as Camouflaged Depression, where individuals maintain socially positive or neutral language despite underlying depressive states. To mitigate this semantic bias, we propose DepFlow, a three-stage depression-conditioned text-to-speech framework. First, a Depression Acoustic Encoder learns speaker- and content-invariant depression embeddings through adversarial training, achieving effective disentanglement while preserving depression discriminability (ROC-AUC: 0.693). Second, a flow-matching TTS model with FiLM modulation injects these embeddings into synthesis, enabling control over depressive severity while preserving content and speaker identity. Third, a prototype-based severity mapping mechanism provides smooth and interpretable manipulation across the depression continuum. Using DepFlow, we construct a Camouflage Depression-oriented Augmentation (CDoA) dataset that pairs depressed acoustic patterns with positive/neutral content from a sentiment-stratified text bank, creating acoustic-semantic mismatches underrepresented in natural data. Evaluated across three depression detection architectures, CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming conventional augmentation strategies in depression detection. Beyond enhancing robustness, DepFlow provides a controllable synthesis platform for conversational systems and simulation-based evaluation, where real clinical data remains limited by ethical and coverage constraints.
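
The FiLM modulation used to inject the depression embedding amounts to a conditioning vector predicting a per-channel scale and shift applied to hidden features. A generic sketch, with placeholder dimensions that are not taken from the paper:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: a conditioning vector predicts a
    per-channel scale (gamma) and shift (beta) applied to hidden features."""
    def __init__(self, cond_dim: int, n_channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * n_channels)

    def forward(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return (1 + gamma).unsqueeze(1) * h + beta.unsqueeze(1)  # (B, T, C)

film = FiLM(cond_dim=16, n_channels=80)
h = torch.randn(2, 100, 80)           # e.g. mel-frame features (placeholder)
dep_embedding = torch.randn(2, 16)    # depression embedding (placeholder)
print(film(h, dep_embedding).shape)   # torch.Size([2, 100, 80])
```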

Source: arXiv cs.CL

Vision • Score 93

Learning to Be Reproducible: Custom Loss Function Design for Robust Neural Networks

To improve the reproducibility and reliability of deep learning models, we address a critical gap in current training methodologies: the lack of mechanisms that guarantee consistent, robust performance across runs. Our empirical analysis reveals that, even under controlled conditions, model accuracy can exhibit significant variability.

Source: arXiv cs.LG

NLP/LLMs • Score 95

RoLID-11K: A Dashcam Dataset for Small-Object Roadside Litter Detection

arXiv:2601.00398v1 Announce Type: new Abstract: Roadside litter poses environmental, safety and economic challenges, yet current monitoring relies on labour-intensive surveys and public reporting, providing limited spatial coverage. Existing vision datasets for litter detection focus on street-level still images, aerial scenes or aquatic environments, and do not reflect the unique characteristics of dashcam footage, where litter appears extremely small, sparse and embedded in cluttered road-verge backgrounds. We introduce RoLID-11K, the first large-scale dataset for roadside litter detection from dashcams, comprising over 11k annotated images spanning diverse UK driving conditions and exhibiting pronounced long-tail and small-object distributions. We benchmark a broad spectrum of modern detectors, from accuracy-oriented transformer architectures to real-time YOLO models, and analyse their strengths and limitations on this challenging task. Our results show that while CO-DETR and related transformers achieve the best localisation accuracy, real-time models remain constrained by coarse feature hierarchies. RoLID-11K establishes a challenging benchmark for extreme small-object detection in dynamic driving scenes and aims to support the development of scalable, low-cost systems for roadside-litter monitoring. The dataset is available at https://github.com/xq141839/RoLID-11K.

Source: arXiv cs.CV

NLP/LLMs • Score 95

Comparative Efficiency Analysis of Lightweight Transformer Models: A Multi-Domain Empirical Benchmark for Enterprise NLP Deployment

arXiv:2601.00444v1 Announce Type: new Abstract: In the rapidly evolving landscape of enterprise natural language processing (NLP), the demand for efficient, lightweight models capable of handling multi-domain text automation tasks has intensified. This study conducts a comparative analysis of three prominent lightweight Transformer models - DistilBERT, MiniLM, and ALBERT - across three distinct domains: customer sentiment classification, news topic classification, and toxicity and hate speech detection. Utilizing datasets from IMDB, AG News, and the Measuring Hate Speech corpus, we evaluated performance using accuracy-based metrics including accuracy, precision, recall, and F1-score, as well as efficiency metrics such as model size, inference time, throughput, and memory usage. Key findings reveal that no single model dominates all performance dimensions. ALBERT achieves the highest task-specific accuracy in multiple domains, MiniLM excels in inference speed and throughput, and DistilBERT demonstrates the most consistent accuracy across tasks while maintaining competitive efficiency. All results reflect controlled fine-tuning under fixed enterprise-oriented constraints rather than exhaustive hyperparameter optimization. These results highlight trade-offs between accuracy and efficiency, recommending MiniLM for latency-sensitive enterprise applications, DistilBERT for balanced performance, and ALBERT for resource-constrained environments.
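
As a reference point, latency and throughput numbers of this kind are typically collected with a warm-up-then-timed loop over a fixed batch; the toy model below is a placeholder, not one of the benchmarked architectures.

```python
import time
import torch

@torch.no_grad()
def benchmark(model, example, n_warmup=10, n_runs=100):
    """Mean latency (ms) and throughput (samples/s) for a fixed input batch."""
    model.eval()
    for _ in range(n_warmup):          # warm up caches / lazy kernels
        model(example)
    start = time.perf_counter()
    for _ in range(n_runs):
        model(example)
    elapsed = time.perf_counter() - start
    latency_ms = 1000 * elapsed / n_runs
    throughput = n_runs * example.shape[0] / elapsed
    return latency_ms, throughput

toy = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.ReLU(),
                          torch.nn.Linear(256, 4))
print(benchmark(toy, torch.randn(32, 128)))
```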

Source: arXiv cs.CL

NLP/LLMs • Score 96

Toward Automated Differential Diagnosis of Skin Diseases Using Deep Learning and Imbalance-Aware Strategies

As dermatological conditions become increasingly common and the availability of dermatologists remains limited, there is a growing need for intelligent tools that support patients and clinicians in the timely, accurate diagnosis of skin diseases. In this project, we develop a deep learning-based model for the classification and diagnosis of skin conditions.

Source: arXiv cs.AI

NLP/LLMs • Score 95

Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach

arXiv:2601.00388v1 Announce Type: new Abstract: Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.
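
The coordinate-aligned reward can be grounded in the Haversine distance between predicted and true coordinates. The Haversine formula below is standard; the exponential decay and its 500 km scale are our assumptions for illustration, not the paper's reward.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, r_earth=6371.0):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r_earth * math.asin(math.sqrt(a))

def geo_reward(pred, target, scale_km=500.0):
    """Coordinate-aligned reward: 1 at zero error, decaying with distance."""
    return math.exp(-haversine_km(*pred, *target) / scale_km)

print(geo_reward((48.8566, 2.3522), (48.8606, 2.3376)))  # Paris vs. the Louvre
```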

Source: arXiv cs.CL

NLP/LLMs • Score 95

Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations

arXiv:2601.00454v1 Announce Type: new Abstract: Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S, a training paradigm that fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations rather than complete dialogue histories. We provide a formal complexity analysis showing that M2S reduces training cost from $O(n^2)$ to $O(n)$ for $n$-turn conversations. Empirically, on our training dataset (779 samples, avg. 10.6 turns), M2S requires only 169K tokens compared to 15.7M tokens for the multi-turn baseline -- a 93$\times$ reduction. We evaluate Defensive M2S across three guardrail model families (LlamaGuard, Nemotron, Qwen3Guard) and three compression templates (hyphenize, numberize, pythonize) on SafeDialBench, a comprehensive multi-turn jailbreak benchmark. Our best configuration, Qwen3Guard with hyphenize compression, achieves 93.8% attack detection recall while reducing inference tokens by 94.6% (from 3,231 to 173 tokens per conversation). This represents a 38.9 percentage point improvement over the baseline while dramatically reducing both training and inference costs. Our findings demonstrate that M2S compression can serve as an effective efficiency technique for guardrail deployment, enabling scalable safety screening of long multi-turn conversations.
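
A minimal sketch of the hyphenize template: flattening a multi-turn conversation into one bulleted single-turn prompt, which the guardrail model then scores once instead of re-reading the growing history each turn. The header wording is our guess, not the paper's exact template.

```python
def hyphenize(turns):
    """Compress a multi-turn conversation into a single-turn prompt by
    listing each user turn as a hyphen bullet (one of the paper's three
    M2S templates, reconstructed schematically)."""
    bullets = "\n".join(f"- {t}" for t in turns)
    return ("Please answer the following list of questions "
            "as a single request:\n" + bullets)

conversation = [
    "Hi, can you tell me about chemistry?",
    "What household chemicals are dangerous when mixed?",
    "How exactly would someone combine them?",
]
print(hyphenize(conversation))  # single prompt for the guardrail to screen
```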

Source: arXiv cs.CL

Vision • Score 92

HarmoniAD: Harmonizing Local Structures and Global Semantics for Anomaly Detection

arXiv:2601.00327v1 Announce Type: new Abstract: Anomaly detection is crucial in industrial product quality inspection. Failing to detect tiny defects often leads to serious consequences. Existing methods face a structure-semantics trade-off: structure-oriented models (such as frequency-based filters) are noise-sensitive, while semantics-oriented models (such as CLIP-based encoders) often miss fine details. To address this, we propose HarmoniAD, a frequency-guided dual-branch framework. Features are first extracted by the CLIP image encoder, then transformed into the frequency domain, and finally decoupled into high- and low-frequency paths for complementary modeling of structure and semantics. The high-frequency branch is equipped with a fine-grained structural attention module (FSAM) to enhance textures and edges for detecting small anomalies, while the low-frequency branch uses a global structural context module (GSCM) to capture long-range dependencies and preserve semantic consistency. Together, these branches balance fine detail and global semantics. HarmoniAD further adopts a multi-class joint training strategy, and experiments on MVTec-AD, VisA, and BTAD show state-of-the-art performance with both sensitivity and robustness.

Source: arXiv cs.CV

Vision • Score 89

Joint Geometry-Appearance Human Reconstruction in a Unified Latent Space via Bridge Diffusion

arXiv:2601.00328v1 Announce Type: new Abstract: Achieving consistent and high-fidelity geometry and appearance reconstruction of 3D digital humans from a single RGB image is inherently a challenging task. Existing studies typically resort to decoupled pipelines for geometry estimation and appearance synthesis, often hindering unified reconstruction and causing inconsistencies. This paper introduces \textbf{JGA-LBD}, a novel framework that unifies the modeling of geometry and appearance into a joint latent representation and formulates the generation process as bridge diffusion. Observing that directly integrating heterogeneous input conditions (e.g., depth maps, SMPL models) leads to substantial training difficulties, we unify all conditions into the 3D Gaussian representations, which can be further compressed into a unified latent space through a shared sparse variational autoencoder (VAE). Subsequently, the specialized form of bridge diffusion enables the generation process to start with a partial observation of the target latent code and focus solely on inferring the missing components. Finally, a dedicated decoding module extracts the complete 3D human geometric structure and renders novel views from the inferred latent representation. Experiments demonstrate that JGA-LBD outperforms current state-of-the-art approaches in terms of both geometry fidelity and appearance quality, including challenging in-the-wild scenarios. Our code will be made publicly available at https://github.com/haiantyz/JGA-LBD.

Source: arXiv cs.CV

Vision • Score 95

BHaRNet: Reliability-Aware Body-Hand Modality Expertized Networks for Fine-grained Skeleton Action Recognition

arXiv:2601.00369v1 Announce Type: new Abstract: Skeleton-based human action recognition (HAR) has achieved remarkable progress with graph-based architectures. However, most existing methods remain body-centric, focusing on large-scale motions while neglecting subtle hand articulations that are crucial for fine-grained recognition. This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expertized learning under uncertainty across both intra-skeleton and cross-modal domains. The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion cues in a unified cross-modal formulation. Comprehensive evaluations across multiple benchmarks (NTU RGB+D~60/120, PKU-MMD, N-UCLA) and a newly defined hand-centric benchmark exhibit consistent improvements and robustness under noisy and heterogeneous conditions.

Source: arXiv cs.CV

NLP/LLMs • Score 95

Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning

arXiv:2601.00791v1 Announce Type: cross Abstract: We present a training-free method for detecting valid mathematical reasoning in large language models through spectral analysis of attention patterns. By treating attention matrices as adjacency matrices of dynamic graphs over tokens, we extract four interpretable spectral diagnostics, the Fiedler value (algebraic connectivity), high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy, that exhibit statistically significant differences between valid and invalid mathematical proofs. Experiments across seven transformer models from four independent architectural families (Meta Llama, Alibaba Qwen, Microsoft Phi, and Mistral AI) demonstrate that this spectral signature produces effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling 85.0--95.6\% classification accuracy under rigorous evaluation, with calibrated thresholds reaching 93--95\% on the full dataset. The method requires no training data, fine-tuning, or learned classifiers: a single threshold on a spectral metric suffices for high accuracy. Through systematic label correction, we discover that the spectral method detects logical coherence rather than compiler acceptance, identifying mathematically valid proofs that formal verifiers reject due to technical failures. We further identify an architectural dependency: Mistral-7B's Sliding Window Attention shifts the discriminative signal from HFER to late-layer Smoothness ($d = 2.09$, $p_{\text{MW}} = 1.16 \times 10^{-48}$), revealing that attention mechanism design affects which spectral features capture reasoning validity. These findings establish spectral graph analysis as a principled framework for reasoning verification with immediate applications to hallucination detection and AI safety monitoring.
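
A sketch of computing two of the named diagnostics, the Fiedler value and spectral entropy, from an attention matrix treated as a weighted token graph. The symmetrization and the entropy-over-eigenvalues definition are standard spectral graph theory choices and may differ from the paper's exact normalizations.

```python
import numpy as np

def spectral_diagnostics(attn: np.ndarray):
    """Treat a (symmetrized) attention matrix as a weighted graph over tokens
    and compute Laplacian-based diagnostics."""
    w = (attn + attn.T) / 2                   # undirected weighted adjacency
    deg = w.sum(axis=1)
    lap = np.diag(deg) - w                    # combinatorial graph Laplacian
    evals = np.sort(np.linalg.eigvalsh(lap))
    fiedler = evals[1]                        # algebraic connectivity
    p = evals / evals.sum()                   # eigenvalue distribution
    p = p[p > 0]                              # drop the (near-)zero mode
    spectral_entropy = -(p * np.log(p)).sum()
    return fiedler, spectral_entropy

rng = np.random.default_rng(0)
a = rng.uniform(size=(16, 16))
a /= a.sum(axis=1, keepdims=True)             # row-stochastic, like attention
print(spectral_diagnostics(a))
```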

Source: arXiv cs.CL

NLP/LLMs • Score 95

From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning

arXiv:2601.00215v1 Announce Type: cross Abstract: Reinforcement learning (RL) has emerged as a promising approach for eliciting reasoning chains before generating final answers. However, multimodal large language models (MLLMs) generate reasoning that lacks integration of visual information. This limits their ability to solve problems that demand accurate visual perception, such as visual puzzles. We show that visual perception is the key bottleneck in such tasks: converting images into textual descriptions significantly improves performance, yielding gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7. To address this, we investigate reward-driven RL as a mechanism to unlock long visual reasoning in open-source MLLMs without requiring costly supervision. We design and evaluate six reward functions targeting different reasoning aspects, including image understanding, thinking steps, and answer accuracy. Using group relative policy optimization (GRPO), our approach explicitly incentivizes longer, structured reasoning and mitigates bypassing of visual information. Experiments on Qwen-2.5-VL-7B achieve 5.56% improvements over the base model, with consistent gains across both in-domain and out-of-domain settings.

Source: arXiv cs.CL

Vision • Score 95

NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos

arXiv:2601.00393v1 Announce Type: new Abstract: In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks. Our project page is available at https://neoverse-4d.github.io

Source: arXiv cs.CV

NLP/LLMs • Score 95

Intelligent Traffic Surveillance for Real-Time Vehicle Detection, License Plate Recognition, and Speed Estimation

arXiv:2601.00344v1 Announce Type: new Abstract: Speeding is a major contributor to road fatalities, particularly in developing countries such as Uganda, where road safety infrastructure is limited. This study proposes a real-time intelligent traffic surveillance system tailored to such regions, using computer vision techniques to address vehicle detection, license plate recognition, and speed estimation. The study collected a rich dataset using a speed gun, a Canon camera, and a mobile phone to train the models. License plate detection using YOLOv8 achieved a mean average precision (mAP) of 97.9%. For character recognition of the detected license plate, the CNN model achieved a character error rate (CER) of 3.85%, while the transformer model significantly reduced the CER to 1.79%. Speed estimation used source and target regions of interest, yielding good performance with a 10 km/h margin of error. Additionally, a database was established to correlate user information with vehicle detection data, enabling automated ticket issuance by SMS through Africa's Talking API. This system addresses critical traffic management needs in resource-constrained environments and shows potential to reduce road accidents through automated traffic enforcement in developing countries where such interventions are urgently needed.
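
Speed estimation from source and target regions of interest reduces to timing a tracked vehicle between two calibrated lines in the scene. A minimal sketch, with all values illustrative rather than from the paper:

```python
def estimate_speed_kmh(frame_in: int, frame_out: int, fps: float,
                       zone_length_m: float) -> float:
    """Speed from the time a tracked vehicle takes to cross a calibrated
    zone: frame index entering the source region, frame index leaving the
    target region, camera frame rate, and real-world zone length."""
    seconds = (frame_out - frame_in) / fps
    return zone_length_m / seconds * 3.6   # m/s -> km/h

# A car crossing a 20 m zone in 24 frames at 30 fps travels at 90 km/h.
print(estimate_speed_kmh(frame_in=100, frame_out=124, fps=30.0,
                         zone_length_m=20.0))
```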

Source: arXiv cs.CV

NLP/LLMs • Score 96

FaithSCAN: Single-Pass Model-Based Hallucination Detection for Faithful Visual Question Answering

Faithfulness hallucinations in VQA occur when vision-language models produce fluent but visually ungrounded answers, undermining their reliability in safety-critical applications. We propose FaithSCAN, a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, overcoming the efficiency and detection-performance limitations of existing methods.

Source: arXiv cs.AI

NLP/LLMs • Score 95

Application Research of a Deep Learning Model Integrating CycleGAN and YOLO in PCB Infrared Defect Detection

arXiv:2601.00237v1 Announce Type: new Abstract: This paper addresses the critical bottleneck of infrared (IR) data scarcity in Printed Circuit Board (PCB) defect detection by proposing a cross-modal data augmentation framework integrating CycleGAN and YOLOv8. Unlike conventional methods relying on paired supervision, we leverage CycleGAN to perform unpaired image-to-image translation, mapping abundant visible-light PCB images into the infrared domain. This generative process synthesizes high-fidelity pseudo-IR samples that preserve the structural semantics of defects while accurately simulating thermal distribution patterns. Subsequently, we construct a heterogeneous training strategy that fuses generated pseudo-IR data with limited real IR samples to train a lightweight YOLOv8 detector. Experimental results demonstrate that this method effectively enhances feature learning under low-data conditions. The augmented detector significantly outperforms models trained on limited real data alone and approaches the performance benchmarks of fully supervised training, proving the efficacy of pseudo-IR synthesis as a robust augmentation strategy for industrial inspection.

Source: arXiv cs.CV

Vision • Score 96

Multi-Agent Reinforcement Learning for Liquidity Games

This work explores the use of swarm methods for modeling financial market liquidity, bringing together Liquidity Games and Rational Swarms. The research proposes a theoretical model in which independent agents maximize market liquidity without the need for coordination, contributing to both market efficiency and individual profitability.

Source: arXiv cs.AI

Vision • Score 92

CropNeRF: A Neural Radiance Field-Based Framework for Crop Counting

arXiv:2601.00207v1 Announce Type: new Abstract: Rigorous crop counting is crucial for effective agricultural management and informed intervention strategies. However, in outdoor field environments, partial occlusions combined with inherent ambiguity in distinguishing clustered crops from individual viewpoints pose an immense challenge for image-based segmentation methods. To address these problems, we introduce a novel crop counting framework designed for exact enumeration via 3D instance segmentation. Our approach utilizes 2D images captured from multiple viewpoints and associates independent instance masks for neural radiance field (NeRF) view synthesis. We introduce crop visibility and mask consistency scores, which are incorporated alongside 3D information from a NeRF model. This results in an effective segmentation of crop instances in 3D and highly accurate crop counts. Furthermore, our method eliminates the dependence on crop-specific parameter tuning. We validate our framework on three agricultural datasets consisting of cotton bolls, apples, and pears, and demonstrate consistent counting performance despite major variations in crop color, shape, and size. A comparative analysis against the state of the art highlights superior performance on crop counting tasks. Lastly, we contribute a cotton plant dataset to advance further research on this topic.

Source: arXiv cs.CV

Vision • Score 96

Adaptive Causal Coordination Detection for Social Media: A Memory-Guided Framework with Semi-Supervised Learning

Detecting coordinated inauthentic behavior on social media is a critical challenge. We propose the Adaptive Causal Coordination Detection (ACCD) framework, which uses a progressive three-stage architecture to learn and retain optimized detection configurations. ACCD improves the identification of causal relationships and reduces the need for manual labeling, achieving an F1-score of 87.3% in detecting coordinated attacks.

Source: arXiv cs.AI

NLP/LLMs • Score 95

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

arXiv:2601.00264v1 Announce Type: new Abstract: Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts. Technical validation demonstrates that this enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment. S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.

Source: arXiv cs.CV

Vision • Score 92

TimeColor: Flexible Reference Colorization via Temporal Concatenation

arXiv:2601.00296v1 Announce Type: new Abstract: Most colorization models condition only on a single reference, typically the first frame of the scene. However, this approach ignores other sources of conditional data, such as character sheets, background images, or arbitrary colorized frames. We propose TimeColor, a sketch-based video colorization model that supports heterogeneous, variable-count references with the use of explicit per-reference region assignment. TimeColor encodes references as additional latent frames which are concatenated temporally, permitting them to be processed concurrently in each diffusion step while keeping the model's parameter count fixed. TimeColor also uses spatiotemporal correspondence-masked attention to enforce subject-reference binding in addition to modality-disjoint RoPE indexing. These mechanisms mitigate shortcutting and cross-identity palette leakage. Experiments on SAKUGA-42M under both single- and multi-reference protocols show that TimeColor improves color fidelity, identity consistency, and temporal stability over prior baselines.

Source: arXiv cs.CV

NLP/LLMs • Score 96

The Trojan Horse in the Vocabulary: Subtle Sabotage of LLM Composition

The open-weights LLM ecosystem is increasingly defined by model composition techniques that remix capabilities from diverse sources. A critical prerequisite for applying these methods is tokenizer transplantation, which aligns incompatible vocabularies to a shared embedding space. We demonstrate that this interoperability step introduces a supply-chain vulnerability.

Source: arXiv cs.LG

Vision • Score 95

DichroGAN: Towards Restoration of in-air Colours of Seafloor from Satellite Imagery

arXiv:2601.00194v1 Announce Type: new Abstract: Recovering the in-air colours of seafloor from satellite imagery is a challenging task due to the exponential attenuation of light with depth in the water column. In this study, we present DichroGAN, a conditional generative adversarial network (cGAN) designed for this purpose. DichroGAN employs two-step simultaneous training: first, two generators utilise a hyperspectral image cube to estimate diffuse and specular reflections, thereby obtaining atmospheric scene radiance. Next, a third generator receives as input the generated scene radiance containing the features of each spectral band, while a fourth generator estimates the underwater light transmission. These generators work together to remove the effects of light absorption and scattering, restoring the in-air colours of seafloor based on the underwater image formation equation. DichroGAN is trained on a compact dataset derived from PRISMA satellite imagery, comprising RGB images paired with their corresponding spectral bands and masks. Extensive experiments on both satellite and underwater datasets demonstrate that DichroGAN achieves competitive performance compared to state-of-the-art underwater restoration techniques.

Source: arXiv cs.CV

NLP/LLMs • Score 96

DA-DPO: Difficulty-Aware, Cost-Efficient Preference Optimization for Reducing Hallucinations in MLLMs

Direct Preference Optimization (DPO) has shown great potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing approaches often suffer from overfitting due to difficulty imbalance in the preference data. We propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework that rebalances the learning process.

Source: arXiv cs.AI

Vision • Score 95

SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting

arXiv:2601.00285v1 Announce Type: new Abstract: Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step. However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object's motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, our model enables smooth motion interpolation while preserving learned geometric details. Experiments on synthetic datasets show that our method outperforms existing approaches under sparse observations by up to 34% in PSNR, and achieves comparable performance to dense monocular video methods on real-world datasets despite using significantly fewer frames. Moreover, we demonstrate that the input initial static reconstruction can be replaced by a diffusion-based generative prior, making our method more practical for real-world scenarios.

Source: arXiv cs.CV

Vision • Score 96

Neural Brain Fields: A NeRF-Inspired Approach to Generating Nonexistent EEG Electrodes

Electroencephalography (EEG) data present unique modeling challenges due to variable recording lengths, very low signal-to-noise ratios, and significant differences between participants. This work presents a new method inspired by Neural Radiance Fields (NeRF) for processing EEG signals, enabling continuous visualization of brain activity and the simulation of data from nonexistent electrodes.

Source: arXiv cs.AI

NLP/LLMs • Score 95

MotionPhysics: Learnable Motion Distillation for Text-Guided Simulation

arXiv:2601.00504v1 Announce Type: new Abstract: Accurately simulating existing 3D objects and a wide variety of materials often demands expert knowledge and time-consuming physical parameter tuning to achieve the desired dynamic behavior. We introduce MotionPhysics, an end-to-end differentiable framework that infers plausible physical parameters from a user-provided natural language prompt for a chosen 3D scene of interest, removing the need for guidance from ground-truth trajectories or annotated videos. Our approach first utilizes a multimodal large language model to estimate material parameter values, which are constrained to lie within plausible ranges. We further propose a learnable motion distillation loss that extracts robust motion priors from pretrained video diffusion models while minimizing appearance and geometry inductive biases to guide the simulation. We evaluate MotionPhysics across more than thirty scenarios, including real-world, human-designed, and AI-generated 3D objects, spanning a wide range of materials such as elastic solids, metals, foams, sand, and both Newtonian and non-Newtonian fluids. We demonstrate that MotionPhysics produces visually realistic dynamic simulations guided by natural language, surpassing the state of the art while automatically determining physically plausible parameters. The code and project page are available at: https://wangmiaowei.github.io/MotionPhysics.github.io/.

Source: arXiv cs.CV

NLP/LLMs • Score 95

TotalFM: An Organ-Separated Framework for 3D-CT Vision Foundation Models

arXiv:2601.00260v1 Announce Type: new Abstract: While foundation models in radiology are expected to be applied to various clinical tasks, computational cost constraints remain a major challenge when training on 3D-CT volumetric data. In this study, we propose TotalFM, a radiological foundation model that efficiently learns the correspondence between 3D-CT images and linguistic expressions based on the concept of organ separation, utilizing a large-scale dataset of 140,000 series. By automating the creation of organ volume and finding-sentence pairs through segmentation techniques and Large Language Model (LLM)-based radiology report processing, and by combining self-supervised pre-training via VideoMAE with contrastive learning using volume-text pairs, we aimed to balance computational efficiency and representation capability. In zero-shot organ-wise lesion classification tasks, the proposed model achieved higher F1 scores in 83% (5/6) of organs compared to CT-CLIP and 64% (9/14) of organs compared to Merlin. These results suggest that the proposed model exhibits high generalization performance in a clinical evaluation setting using actual radiology report sentences. Furthermore, in zero-shot finding-wise lesion classification tasks, our model achieved a higher AUROC in 83% (25/30) of finding categories compared to Merlin. We also confirmed performance comparable to existing Vision-Language Models (VLMs) in radiology report generation tasks. Our results demonstrate that the organ-separated learning framework can serve as a realistic and effective design guideline for the practical implementation of 3D-CT foundation models.

Source: arXiv cs.CV

Vision • Score 95

TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model

arXiv:2601.00051v1 Announce Type: new Abstract: World models aim to endow AI systems with the ability to represent, generate, and interact with dynamic environments in a coherent and temporally consistent manner. While recent video generation models have demonstrated impressive visual quality, they remain limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, hindering their evolution into practical world models. In this report, we present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. TeleWorld introduces a novel generation-reconstruction-guidance paradigm, where generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to maintain spatial, temporal, and physical consistency. To support long-horizon generation with low latency, we employ an autoregressive diffusion-based video model enhanced with Macro-from-Micro Planning (MMPL), a hierarchical planning method that reduces error accumulation from frame-level to segment-level, alongside efficient Distribution Matching Distillation (DMD), enabling real-time synthesis under practical computational budgets. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible systems. Extensive experiments demonstrate that TeleWorld achieves strong performance in both static and dynamic world understanding, long-term consistency, and real-time generation efficiency, positioning it as a practical step toward interactive, memory-enabled world models for multimodal generation and embodied intelligence.

Source: arXiv cs.CV

NLP/LLMs • Score 95

Focal-RegionFace: Generating Fine-Grained Multi-attribute Descriptions for Arbitrarily Selected Face Focal Regions

arXiv:2601.00156v1 Announce Type: new Abstract: In this paper, we introduce an underexplored problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, containing facial action units (AUs), emotional states, and age estimation, for arbitrarily selected face regions (termed FaceFocalDesc). We argue that the system's ability to focus on individual facial areas leads to better understanding and control. To achieve this capability, we construct a new multi-attribute description dataset for arbitrarily selected face regions, providing rich region-level annotations and natural language descriptions. Further, we propose a fine-tuned vision-language model based on Qwen2.5-VL, called Focal-RegionFace for facial state analysis, which incrementally refines its focus on localized facial features through multiple progressively fine-tuning stages, resulting in interpretable age estimation, FAU and emotion detection. Experimental results show that Focal-RegionFace achieves the best performance on the new benchmark in terms of traditional and widely used metrics, as well as new proposed metrics. This fully verifies its effectiveness and versatility in fine-grained multi-attribute face region-focal analysis scenarios.

Source: arXiv cs.CV

Vision • Score 95

ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Patching

arXiv:2601.00267v1 Announce Type: new Abstract: Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant concerns regarding safety, copyright, and ethical implications. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most of them rely on data-intensive and computationally expensive fine-tuning, which poses a critical limitation. To overcome these challenges, inspired by the observation that the model's activations are predominantly composed of generic concepts, with only a minimal component representing the target concept, we propose a novel training-free method (ActErase) for efficient concept erasure. Specifically, the proposed method operates by identifying activation difference regions via prompt-pair analysis, extracting target activations and dynamically replacing input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrate that our training-free method achieves state-of-the-art (SOTA) erasure performance, while effectively preserving the model's overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.

Source: arXiv cs.CV

Vision • Score 95

All-in-One Video Restoration under Smoothly Evolving Unknown Weather Degradations

arXiv:2601.00533v1 Announce Type: new Abstract: All-in-one image restoration aims to recover clean images from diverse unknown degradations using a single model. But extending this task to videos faces unique challenges. Existing approaches primarily focus on frame-wise degradation variation, overlooking the temporal continuity that naturally exists in real-world degradation processes. In practice, degradation types and intensities evolve smoothly over time, and multiple degradations may coexist or transition gradually. In this paper, we introduce the Smoothly Evolving Unknown Degradations (SEUD) scenario, where both the active degradation set and degradation intensity change continuously over time. To support this scenario, we design a flexible synthesis pipeline that generates temporally coherent videos with single, compound, and evolving degradations. To address the challenges in the SEUD scenario, we propose an all-in-One Recurrent Conditional and Adaptive prompting Network (ORCANet). First, a Coarse Intensity Estimation Dehazing (CIED) module estimates haze intensity using physical priors and provides coarse dehazed features as initialization. Second, a Flow Prompt Generation (FPG) module extracts degradation features. FPG generates both static prompts that capture segment-level degradation types and dynamic prompts that adapt to frame-level intensity variations. Furthermore, a label-aware supervision mechanism improves the discriminability of static prompt representations under different degradations. Extensive experiments show that ORCANet achieves superior restoration quality, temporal consistency, and robustness over image and video-based baselines. Code is available at https://github.com/Friskknight/ORCANet-SEUD.

Source: arXiv cs.CV

Vision • Score 95

Towards Syn-to-Real IQA: A Novel Perspective on Reshaping Synthetic Data Distributions

arXiv:2601.00225v1 Announce Type: new Abstract: Blind Image Quality Assessment (BIQA) has advanced significantly through deep learning, but the scarcity of large-scale labeled datasets remains a challenge. While synthetic data offers a promising solution, models trained on existing synthetic datasets often show limited generalization ability. In this work, we make a key observation that representations learned from synthetic datasets often exhibit a discrete and clustered pattern that hinders regression performance: features of high-quality images cluster around reference images, while those of low-quality images cluster based on distortion types. Our analysis reveals that this issue stems from the distribution of synthetic data rather than model architecture. Consequently, we introduce a novel framework SynDR-IQA, which reshapes synthetic data distribution to enhance BIQA generalization. Based on theoretical derivations of sample diversity and redundancy's impact on generalization error, SynDR-IQA employs two strategies: distribution-aware diverse content upsampling, which enhances visual diversity while preserving content distribution, and density-aware redundant cluster downsampling, which balances samples by reducing the density of densely clustered areas. Extensive experiments across three cross-dataset settings (synthetic-to-authentic, synthetic-to-algorithmic, and synthetic-to-synthetic) demonstrate the effectiveness of our method. The code is available at https://github.com/Li-aobo/SynDR-IQA.

Source: arXiv cs.CV

Vision • Score 95

ReMA: A Training-Free Plug-and-Play Mixing Augmentation for Video Behavior Recognition

arXiv:2601.00311v1 Announce Type: new Abstract: Video behavior recognition demands stable and discriminative representations under complex spatiotemporal variations. However, prevailing data augmentation strategies for videos remain largely perturbation-driven, often introducing uncontrolled variations that amplify non-discriminative factors, ultimately weakening intra-class distributional structure and causing representation drift, with inconsistent gains across temporal scales. To address these problems, we propose Representation-aware Mixing Augmentation (ReMA), a plug-and-play augmentation strategy that formulates mixing as a controlled replacement process to expand representations while preserving class-conditional stability. ReMA integrates two complementary mechanisms. First, the Representation Alignment Mechanism (RAM) performs structured intra-class mixing under distributional alignment constraints, suppressing irrelevant intra-class drift while enhancing statistical reliability. Then, the Dynamic Selection Mechanism (DSM) generates motion-aware spatiotemporal masks to localize perturbations, guiding them away from discrimination-sensitive regions and promoting temporal coherence. By jointly controlling how and where mixing is applied, ReMA improves representation robustness without additional supervision or trainable parameters. Extensive experiments on diverse video behavior benchmarks demonstrate that ReMA consistently enhances generalization and robustness across different spatiotemporal granularities.

Source: arXiv cs.CV

Vision • Score 95

A Spatially Masked Adaptive Gated Network for multimodal post-flood water extent mapping using SAR and incomplete multispectral data

arXiv:2601.00123v1 Announce Type: new Abstract: Mapping water extent during a flood event is essential for effective disaster management throughout all phases: mitigation, preparedness, response, and recovery. In particular, during the response stage, when timely and accurate information is important, Synthetic Aperture Radar (SAR) data are primarily employed to produce water extent maps. Recently, leveraging the complementary characteristics of SAR and MSI data through a multimodal approach has emerged as a promising strategy for advancing water extent mapping using deep learning models. This approach is particularly beneficial when timely post-flood observations, acquired during or shortly after the flood peak, are limited, as it enables the use of all available imagery for more accurate post-flood water extent mapping. However, the adaptive integration of partially available MSI data into the SAR-based post-flood water extent mapping process remains underexplored. To bridge this research gap, we propose the Spatially Masked Adaptive Gated Network (SMAGNet), a multimodal deep learning model that utilizes SAR data as the primary input for post-flood water extent mapping and integrates complementary MSI data through feature fusion. In experiments on the C2S-MS Floods dataset, SMAGNet consistently outperformed other multimodal deep learning models in prediction performance across varying levels of MSI data availability. Furthermore, we found that even when MSI data were completely missing, the performance of SMAGNet remained statistically comparable to that of a U-Net model trained solely on SAR data. These findings indicate that SMAGNet enhances the model robustness to missing data as well as the applicability of multimodal deep learning in real-world flood management scenarios.
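
A schematic reading of the masked adaptive gating idea: a learned gate fuses MSI features into the primary SAR stream but is zeroed wherever MSI is missing, so the model degrades gracefully to SAR-only behavior. The module shape and the residual form are our assumptions, not SMAGNet's published architecture.

```python
import torch
import torch.nn as nn

class MaskedGatedFusion(nn.Module):
    """Fuse primary SAR features with optional MSI features via a learned
    gate that is forced to zero wherever MSI is missing."""
    def __init__(self, ch: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * ch, ch, 1), nn.Sigmoid())

    def forward(self, f_sar, f_msi, msi_valid_mask):
        g = self.gate(torch.cat([f_sar, f_msi], dim=1)) * msi_valid_mask
        return f_sar + g * f_msi   # residual fusion keeps SAR as the backbone

fuse = MaskedGatedFusion(ch=8)
f_sar, f_msi = torch.randn(1, 8, 32, 32), torch.randn(1, 8, 32, 32)
mask = torch.zeros(1, 1, 32, 32)   # MSI completely missing -> pure SAR output
print(torch.allclose(fuse(f_sar, f_msi, mask), f_sar))  # True
```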

Source: arXiv cs.CV

NLP/LLMs • Score 96

MethConvTransformer: A Deep Learning Framework for Multi-Tissue Alzheimer's Disease Detection

A doença de Alzheimer (AD) é um distúrbio neurodegenerativo multifatorial caracterizado por declínio cognitivo progressivo. O MethConvTransformer é um framework de deep learning baseado em transformer que integra perfis de metilação de DNA de tecidos cerebrais e periféricos, permitindo a descoberta de biomarcadores. O modelo supera as abordagens convencionais de machine learning, oferecendo biomarcadores epigenéticos robustos e interpretabilidade multi-resolução.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

SSI-GAN: Swin-Inspired Semi-Supervised Generative Adversarial Networks for Neural Spike Classification

Mosquitoes are the primary vectors of arboviral diseases. Manual classification of their neural spike patterns is labor-intensive and costly. To address the scarcity of labeled data, we propose a novel Generative Adversarial Network (GAN) architecture called SSI-GAN, which achieved 99.93% classification accuracy with only 3% labeled data.

Fonte: arXiv cs.AI

Vision • Score 95

Context-Aware Pesticide Recommendation via Few-Shot Pest Recognition for Precision Agriculture

arXiv:2601.00243v1 Announce Type: new Abstract: Effective pest management is crucial for enhancing agricultural productivity, especially for crops such as sugarcane and wheat that are highly vulnerable to pest infestations. Traditional pest management methods depend heavily on manual field inspections and the use of chemical pesticides. These approaches are often costly, time-consuming, labor-intensive, and can have a negative impact on the environment. To overcome these challenges, this study presents a lightweight framework for pest detection and pesticide recommendation, designed for low-resource devices such as smartphones and drones, making it suitable for use by small and marginal farmers. The proposed framework includes two main components. The first is a Pest Detection Module that uses a compact, lightweight convolutional neural network (CNN) combined with prototypical meta-learning to accurately identify pests even when only a few training samples are available. The second is a Pesticide Recommendation Module that incorporates environmental factors like crop type and growth stage to suggest safe and eco-friendly pesticide recommendations. To train and evaluate our framework, a comprehensive pest image dataset was developed by combining multiple publicly available datasets. The final dataset contains samples with different viewing angles, pest sizes, and background conditions to ensure strong generalization. Experimental results show that the proposed lightweight CNN achieves high accuracy, comparable to state-of-the-art models, while significantly reducing computational complexity. The Decision Support System additionally improves pest management by reducing dependence on traditional chemical pesticides and encouraging sustainable practices, demonstrating its potential for real-time applications in precision agriculture.
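
At episode level, the prototypical meta-learning step in the pest detection module reduces to nearest-prototype classification in embedding space. A minimal sketch follows; embedding dimensions, episode sizes, and the distance metric are illustrative assumptions.

```python
import numpy as np

def prototype_classify(support, support_labels, query):
    """Nearest-prototype classification as used in prototypical meta-learning.

    support: (N, D) embeddings of the few labelled pest images,
    support_labels: (N,) integer class ids, query: (M, D) embeddings.
    Returns predicted class ids for the queries.
    """
    classes = np.unique(support_labels)
    # one prototype per class: the mean embedding of its support samples
    protos = np.stack([support[support_labels == c].mean(axis=0) for c in classes])
    # Euclidean distance from every query to every prototype
    dists = np.linalg.norm(query[:, None, :] - protos[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]

# toy 3-way 5-shot episode with 8-dimensional embeddings
rng = np.random.default_rng(0)
sup, lab = rng.normal(size=(15, 8)), np.repeat([0, 1, 2], 5)
print(prototype_classify(sup, lab, rng.normal(size=(4, 8))))
```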

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

An Empirical Evaluation of LLM-Based Approaches to Code Vulnerability Detection: RAG, SFT, and Dual-Agent Systems

The rapid advancement of Large Language Models (LLMs) presents new opportunities for automated software vulnerability detection, a task crucial to the security of modern codebases. This paper presents a comparative study of the effectiveness of LLM-based techniques for detecting software vulnerabilities, evaluating three approaches: Retrieval-Augmented Generation (RAG), Supervised Fine-Tuning (SFT), and a Dual-Agent LLM framework.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

IntraStyler: Exemplar-based Style Synthesis for Cross-modality Domain Adaptation

arXiv:2601.00212v1 Announce Type: new Abstract: Image-level domain alignment is the de facto approach for unsupervised domain adaptation, where unpaired image translation is used to minimize the domain gap. Prior studies mainly focus on the domain shift between the source and target domains, whereas the intra-domain variability remains under-explored. To address the latter, an effective strategy is to diversify the styles of the synthetic target domain data during image translation. However, previous methods typically require intra-domain variations to be pre-specified for style synthesis, which may be impractical. In this paper, we propose an exemplar-based style synthesis method named IntraStyler, which can capture diverse intra-domain styles without any prior knowledge. Specifically, IntraStyler uses an exemplar image to guide the style synthesis such that the output style matches the exemplar style. To extract the style-only features, we introduce a style encoder to learn styles discriminatively based on contrastive learning. We evaluate the proposed method on the largest public dataset for cross-modality domain adaptation, CrossMoDA 2023. Our experiments show the efficacy of our method in controllable style synthesis and the benefits of diverse synthetic data for downstream segmentation. Code is available at https://github.com/han-liu/IntraStyler.

Fonte: arXiv cs.CV

Vision • Score 96

Evaluating Anomaly Detectors for Simulated, Highly Imbalanced Industrial Classification Problems

Machine learning offers potential solutions to current problems in industrial systems, such as quality control and predictive maintenance, but faces unique barriers in industrial applications. This paper presents a comprehensive evaluation of anomaly detection algorithms using a simulated dataset that reflects real-world engineering constraints.

Fonte: arXiv cs.AI

Vision • Score 95

It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models

arXiv:2601.00090v1 Announce Type: new Abstract: Contemporary text-to-image models exhibit a surprising degree of mode collapse, as can be seen when sampling several images given the same text prompt. While previous work has attempted to address this issue by steering the model using guidance mechanisms, or by generating a large pool of candidates and refining them, in this work we take a different direction and aim for diversity in generations via noise optimization. Specifically, we show that a simple noise optimization objective can mitigate mode collapse while preserving the fidelity of the base model. We also analyze the frequency characteristics of the noise and show that alternative noise initializations with different frequency profiles can improve both optimization and search. Our experiments demonstrate that noise optimization yields superior results in terms of generation quality and variety.
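
A toy sketch of what optimizing initial noise for diversity can look like. This is not the paper's exact objective: the stand-in generator, pairwise-distance diversity term, and unit-variance prior penalty below are all assumptions for illustration.

```python
import torch

def diversify_noise(generator, z, steps=50, lr=0.05, prior_weight=0.1):
    """Illustrative noise-optimization loop.

    generator: any differentiable map from noise to images; `z` is a batch of
    initial latents for the same prompt. We push generated samples apart while
    a soft prior term keeps each latent near the standard normal shell.
    """
    z = z.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        imgs = generator(z).flatten(1)                  # (B, pixels)
        pdist = torch.cdist(imgs, imgs)                 # pairwise L2 distances
        diversity = pdist.sum() / (len(z) * (len(z) - 1))
        prior = (z.pow(2).mean() - 1.0).pow(2)          # unit-variance penalty
        loss = -diversity + prior_weight * prior
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach()

# toy "generator": a fixed random linear decoder standing in for the model
dec = torch.nn.Linear(32, 256)
z_star = diversify_noise(lambda z: dec(z), torch.randn(4, 32), steps=10)
print(z_star.shape)  # torch.Size([4, 32])
```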

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Robust Uncertainty Quantification for Factual Generation of Large Language Models

arXiv:2601.00348v1 Announce Type: new Abstract: The rapid advancement of large language model (LLM) technology has facilitated its integration into various domains of professional and daily life. However, the persistent challenge of LLM hallucination has emerged as a critical limitation, significantly compromising the reliability and trustworthiness of AI-generated content. This challenge has garnered significant attention within the scientific community, prompting extensive research efforts in hallucination detection and mitigation strategies. Current methodological frameworks reveal a critical limitation: traditional uncertainty quantification approaches demonstrate effectiveness primarily within conventional question-answering paradigms, yet exhibit notable deficiencies when confronted with non-canonical or adversarial questioning strategies. This performance gap raises substantial concerns regarding the dependability of LLM responses in real-world applications requiring robust critical thinking capabilities. This study aims to fill this gap by proposing an uncertainty quantification scenario for the task of generating with multiple facts. We have meticulously constructed a set of trap questions containing fake names. Based on this scenario, we propose a novel and robust uncertainty quantification method (RU). A series of experiments was conducted to verify its effectiveness. The results show that the constructed set of trap questions performs excellently. Moreover, when compared with the baseline methods on four different models, our proposed method demonstrates strong performance, with an average increase of 0.1-0.2 in ROC-AUC values over the best-performing baseline method, providing new insights and methods for addressing the hallucination issue of LLMs.

Fonte: arXiv cs.CL

Vision • Score 96

Real-Time Human Detection in Aerially Captured Video Sequences via Deep Models

Human detection in videos plays an important role in various real-world applications. Traditional approaches rely on hand-crafted features, which are problem- and task-specific. This work uses automatic feature learning methods, combining optical flow with three different deep models for human detection in videos captured with a non-static camera on aerial platforms.

Fonte: arXiv cs.LG

Vision • Score 95

Boosting Segment Anything Model to Generalize Visually Non-Salient Scenarios

arXiv:2601.00537v1 Announce Type: new Abstract: Segment Anything Model (SAM), known for its remarkable zero-shot segmentation capabilities, has garnered significant attention in the community. Nevertheless, its performance is challenged when dealing with what we refer to as visually non-salient scenarios, where there is low contrast between the foreground and background. In these cases, existing methods often cannot capture accurate contours and fail to produce promising segmentation results. In this paper, we propose Visually Non-Salient SAM (VNS-SAM), aiming to enhance SAM's perception of visually non-salient scenarios while preserving its original zero-shot generalizability. We achieve this by effectively exploiting SAM's low-level features through two designs: Mask-Edge Token Interactive decoder and Non-Salient Feature Mining module. These designs help the SAM decoder gain a deeper understanding of non-salient characteristics with only marginal parameter increments and computational requirements. The additional parameters of VNS-SAM can be optimized within 4 hours, demonstrating its feasibility and practicality. In terms of data, we established VNS-SEG, a unified dataset for various VNS scenarios, with more than 35K images, in contrast to previous single-task adaptations. It is designed to make the model learn more robust VNS features and comprehensively benchmark the model's segmentation performance and generalizability on VNS scenarios. Extensive experiments across various VNS segmentation tasks demonstrate the superior performance of VNS-SAM, particularly under zero-shot settings, highlighting its potential for broad real-world applications. Codes and datasets are publicly available at https://guangqian-guo.github.io/VNS-SAM.

Fonte: arXiv cs.CV

Vision • Score 94

Task-Oriented Kernel Flows: Label-Rank Compression and Laplacian Spectral Filtering

We present a theory of feature learning in L2-regularized wide networks, showing that supervised learning is inherently compressive. We derive a kernel ODE that predicts a 'water-filling' spectral evolution and prove that, for any stable steady state, the kernel rank is bounded by the number of classes ($C$).

Fonte: arXiv cs.LG

Vision • Score 96

IMBWatch -- a Spatial-Temporal Graph Neural Network Approach to Detecting Illicit Massage Businesses

Illicit Massage Businesses (IMBs) are a covert and persistent form of organized exploitation operating under the guise of legitimate wellness services. Detecting IMBs is difficult due to coded digital advertisements and frequent changes of personnel and locations. We present IMBWatch, a spatial-temporal graph neural network (ST-GNN) framework for large-scale IMB detection that combines graph convolution operations with temporal attention mechanisms.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

FCMBench: A Comprehensive Multimodal Financial Credit Benchmark for Real-World Applications

As multimodal AI becomes widely used for credit risk assessment and document review, a domain-specific benchmark is urgently needed. We present FCMBench-V1.0, a large-scale multimodal financial credit benchmark covering 18 core certificate types, with 4,043 privacy-compliant images and 8,446 QA samples.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers

arXiv:2601.00359v1 Announce Type: new Abstract: In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer - an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text-based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time requirements, operating at 26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin. Additionally, we show qualitative results that highlight the effectiveness and possible use cases in real-world applications. Overall, our method serves as a drop-in replacement for traditional segmentation approaches while enabling flexible natural-language querying and seamless integration into 3D mapping pipelines for mobile robotics.
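
Since the student is trained to match teacher embeddings pixel-wise, the core distillation term can be as simple as a cosine objective. A sketch follows, assuming precomputed teacher maps; DVEFormer's full training loss may contain additional terms.

```python
import torch
import torch.nn.functional as F

def dense_embedding_distillation(student, teacher):
    """Pixel-wise cosine distillation of dense visual embeddings.

    student: (B, D, H, W) predicted embeddings; teacher: (B, D, H, W)
    text-aligned embeddings (e.g. from Alpha-CLIP), assumed precomputed.
    Minimizing 1 - cos pushes every student pixel onto its teacher direction.
    """
    cos = F.cosine_similarity(student, teacher, dim=1)  # (B, H, W)
    return (1.0 - cos).mean()

s = torch.randn(2, 64, 32, 32, requires_grad=True)
t = torch.randn(2, 64, 32, 32)
loss = dense_embedding_distillation(s, t)
loss.backward()
print(float(loss))
```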

Fonte: arXiv cs.CV

Vision • Score 96

Optimized Hybrid Feature Engineering for Resource-Efficient Arrhythmia Detection in ECG Signals: An Optimization Framework

Cardiovascular diseases, especially arrhythmias, remain a leading cause of global mortality, demanding continuous monitoring via the Internet of Medical Things (IoMT). This study proposes a data-centric, resource-efficient framework that prioritizes feature engineering over model complexity, achieving high diagnostic accuracy with a lightweight model.

Fonte: arXiv cs.LG

Vision • Score 96

Unknown-Aware AI-Generated Content Attribution

The rapid advancement of photorealistic generative models has made it crucial to attribute the origin of synthetic content, moving from binary real-versus-fake detection to identifying the specific model that produced an image. This study investigates distinguishing the outputs of a target generator model (e.g., OpenAI DALL-E 3) from other sources.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

CPPO: Contrastive Perception for Vision Language Policy Optimization

arXiv:2601.00501v1 Announce Type: new Abstract: We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult, requiring extra LLMs, ground-truth data, forced separation of perception from reasoning by policy model, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods, while avoiding extra models, making training more efficient and scalable.
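
A rough sketch of a CPL-style regularizer built from the two perturbation types: consistency under information-preserving edits, sensitivity under information-removing ones. The entropy-shift detection of perception tokens is omitted, and the margin, masking, and KL formulation below are assumptions rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def contrastive_perception_loss(logits, logits_preserve, logits_remove,
                                margin=1.0, tokens=None):
    """Illustrative CPL-style term.

    logits*: (B, T, V) model outputs for the clean image, an information-
    preserving perturbation, and an information-removing one. `tokens` is an
    optional (B, T) boolean mask of detected perception tokens.
    """
    p = F.log_softmax(logits, dim=-1)
    kl_keep = F.kl_div(F.log_softmax(logits_preserve, -1), p,
                       log_target=True, reduction="none").sum(-1)   # (B, T)
    kl_drop = F.kl_div(F.log_softmax(logits_remove, -1), p,
                       log_target=True, reduction="none").sum(-1)
    # stay consistent under preserving edits, diverge under removing edits
    loss = kl_keep + F.relu(margin - kl_drop)
    if tokens is not None:
        loss = loss[tokens]
    return loss.mean()

B, T, V = 2, 5, 11
out = contrastive_perception_loss(torch.randn(B, T, V), torch.randn(B, T, V),
                                  torch.randn(B, T, V))
print(float(out))
```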

Fonte: arXiv cs.CV

Vision • Score 96

A Comparative Study of Adaptation Strategies for Time-Series Foundation Models in Anomaly Detection

Time-series anomaly detection is essential for the reliable operation of complex systems, but most existing methods require extensive, task-specific training. This study investigates whether time-series foundation models (TSFMs), pre-trained on large heterogeneous data, can serve as universal backbones for anomaly detection.

Fonte: arXiv cs.LG

Vision • Score 95

Sparse Additive Contextual Bandits: A Nonparametric Approach to Online Decision-Making with High-Dimensional Covariates

Personalized services are central to today's digital economy, and their sequential decisions are often modeled as contextual bandits. Modern applications face two key challenges: high-dimensional covariates and the need for nonparametric models to capture complex reward-covariate relationships. We propose a contextual bandit algorithm based on a sparse additive reward model that addresses both challenges.

Fonte: arXiv stat.ML

Vision • Score 93

Intelligent Fault Detection in the Electrical Power System of Nanosatellites

This paper presents a novel fault detection method for the electrical power system of nanosatellites without an Attitude Determination and Control Subsystem (ADCS) in LEO orbit. The fault-free system is simulated using a neural network, with solar radiation and solar panel surface temperature as input data.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

FreeText: Training-Free Text Rendering in Diffusion Transformers via Attention Localization and Spectral Glyph Injection

arXiv:2601.00535v1 Announce Type: new Abstract: Large-scale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically require costly retraining or rigid external layout constraints, which can degrade aesthetics and limit flexibility. We propose FreeText, a training-free, plug-and-play framework that improves text rendering by exploiting intrinsic mechanisms of Diffusion Transformer (DiT) models. FreeText decomposes the problem into "where to write" and "what to write". For "where to write", we localize writing regions by reading token-wise spatial attribution from endogenous image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For "what to write", we introduce Spectral-Modulated Glyph Injection (SGMI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with modest inference overhead.

Fonte: arXiv cs.CV

Vision • Score 95

Robust Assembly Progress Estimation via Deep Metric Learning

arXiv:2601.00422v1 Announce Type: new Abstract: In recent years, the advancement of AI technologies has accelerated the development of smart factories. In particular, the automatic monitoring of product assembly progress is crucial for improving operational efficiency, minimizing the cost of discarded parts, and maximizing factory productivity. However, in cases where assembly tasks are performed manually over multiple days, implementing smart factory systems remains a challenge. Previous work has proposed Anomaly Triplet-Net, which estimates assembly progress by applying deep metric learning to the visual features of products. Nevertheless, when visual changes between consecutive tasks are subtle, misclassification often occurs. To address this issue, this paper proposes a robust system for estimating assembly progress, even in cases of occlusion or minimal visual change, using a small-scale dataset. Our method leverages a Quadruplet Loss-based learning approach for anomaly images and introduces a custom data loader that strategically selects training samples to enhance estimation accuracy. We evaluated our approach using an image dataset captured during desktop PC assembly. The proposed Anomaly Quadruplet-Net outperformed existing methods on the dataset. Specifically, it improved estimation accuracy by 1.3% and reduced misclassification between adjacent tasks by 1.9% on the desktop PC dataset, demonstrating the effectiveness of the proposed method.
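
For context, the standard quadruplet loss that this line of work builds on looks like the following sketch; the margins, batch shapes, and sampling are illustrative, not the paper's settings or its anomaly-specific variant.

```python
import torch
import torch.nn.functional as F

def quadruplet_loss(anchor, positive, negative1, negative2,
                    margin1=1.0, margin2=0.5):
    """Standard quadruplet loss, as a generic sketch of the objective family.

    Each argument is a (B, D) batch of embeddings; negative2 comes from a
    class different from both the anchor's and negative1's.
    """
    d_ap = F.pairwise_distance(anchor, positive)      # pull same progress step
    d_an = F.pairwise_distance(anchor, negative1)     # push other steps away
    d_nn = F.pairwise_distance(negative1, negative2)  # enlarge inter-class gaps
    return (F.relu(d_ap - d_an + margin1) +
            F.relu(d_ap - d_nn + margin2)).mean()

e = lambda: torch.randn(8, 128)
print(float(quadruplet_loss(e(), e(), e(), e())))
```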

Fonte: arXiv cs.CV

Vision • Score 95

A Cascaded Information Interaction Network for Precise Image Segmentation

arXiv:2601.00562v1 Announce Type: new Abstract: Visual perception plays a pivotal role in enabling autonomous behavior, offering a cost-effective and efficient alternative to complex multi-sensor systems. However, robust segmentation remains a challenge in complex scenarios. To address this, this paper proposes a cascaded convolutional neural network integrated with a novel Global Information Guidance Module. This module is designed to effectively fuse low-level texture details with high-level semantic features across multiple layers, thereby overcoming the inherent limitations of single-scale feature extraction. This architectural innovation significantly enhances segmentation accuracy, particularly in visually cluttered or blurred environments where traditional methods often fail. Experimental evaluations on benchmark image segmentation datasets demonstrate that the proposed framework achieves superior precision, outperforming existing state-of-the-art methods. The results highlight the effectiveness of the approach and its promising potential for deployment in practical robotic applications.

Fonte: arXiv cs.CV

NLP/LLMs • Score 92

Attention to Detail: Global-Local Attention for High-Resolution AI-Generated Image Detection

arXiv:2601.00141v1 Announce Type: new Abstract: The rapid development of generative AI has made AI-generated images increasingly realistic and high-resolution. Most AI-generated image detection architectures typically downsample images before inputting them into models, risking the loss of fine-grained details. This paper presents GLASS (Global-Local Attention with Stratified Sampling), an architecture that combines a globally resized view with multiple randomly sampled local crops. These crops are original-resolution regions efficiently selected through spatially stratified sampling and aggregated using attention-based scoring. GLASS can be integrated into vision models to leverage both global and local information in images of any size. Vision Transformer, ResNet, and ConvNeXt models are used as backbones, and experiments show that GLASS outperforms standard transfer learning by achieving higher predictive performance within feasible computational constraints.
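
A minimal sketch of the stratified crop-sampling idea, assuming a uniform grid of strata and one crop per stratum; GLASS's actual crop counts, sizes, and attention-based scoring are not reproduced here.

```python
import numpy as np

def stratified_crops(img, grid=(3, 3), crop=224, rng=None):
    """Spatially stratified sampling of original-resolution crops.

    img: (H, W, C) array. The image is divided into a grid of strata and one
    random crop is drawn inside each stratum, so no region is ever ignored.
    Crops may be truncated at image borders when strata are small.
    """
    rng = rng or np.random.default_rng()
    H, W, _ = img.shape
    crops = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            y0, y1 = i * H // grid[0], (i + 1) * H // grid[0]
            x0, x1 = j * W // grid[1], (j + 1) * W // grid[1]
            y = rng.integers(y0, max(y0 + 1, y1 - crop))
            x = rng.integers(x0, max(x0 + 1, x1 - crop))
            crops.append(img[y:y + crop, x:x + crop])
    return crops

img = np.zeros((1024, 1536, 3), dtype=np.uint8)
crops = stratified_crops(img)
print(len(crops), crops[0].shape)  # 9 (224, 224, 3)
```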

Fonte: arXiv cs.CV

Vision • Score 95

OmniVaT: Single Domain Generalization for Multimodal Visual-Tactile Learning

arXiv:2601.00352v1 Announce Type: new Abstract: Visual-tactile learning (VTL) enables embodied agents to perceive the physical world by integrating visual (VIS) and tactile (TAC) sensors. However, VTL still suffers from modality discrepancies between VIS and TAC images, as well as domain gaps caused by non-standardized tactile sensors and inconsistent data collection procedures. We formulate these challenges as a new task, termed single domain generalization for multimodal VTL (SDG-VTL). In this paper, we propose an OmniVaT framework that, for the first time, successfully addresses this task. On the one hand, OmniVaT integrates a multimodal fractional Fourier adapter (MFFA) to map VIS and TAC embeddings into a unified embedding-frequency space, thereby effectively mitigating the modality gap without multi-domain training data or careful cross-modal fusion strategies. On the other hand, it also incorporates a discrete tree generation (DTG) module that obtains diverse and reliable multimodal fractional representations through a hierarchical tree structure, thereby enhancing its adaptivity to fluctuating domain shifts in unseen domains. Extensive experiments demonstrate the superior cross-domain generalization performance of OmniVaT on the SDG-VTL task.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

Trajectory Guard -- A Lightweight, Sequence-Aware Model for Real-Time Anomaly Detection in Agentic AI

Autonomous LLM agents generate multi-step action plans that can fail due to contextual misalignment or structural incoherence. Existing anomaly detection methods are not suited to this challenge. We present Trajectory Guard, a Siamese Recurrent Autoencoder that learns task-trajectory alignment, enabling unified detection of incorrect plans and malformed plan structures.

Fonte: arXiv cs.LG

Vision • Score 95

A Comprehensive Dataset for Human vs. AI Generated Image Detection

arXiv:2601.00553v1 Announce Type: new Abstract: Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI-generated image detection consisting of 96,000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.

Fonte: arXiv cs.CV

Vision • Score 92

Depth-Synergized Mamba Meets Memory Experts for All-Day Image Reflection Separation

arXiv:2601.00322v1 Announce Type: new Abstract: Image reflection separation aims to disentangle the transmission layer and the reflection layer from a blended image. Existing methods rely on limited information from a single image, tending to confuse the two layers when their contrasts are similar, a challenge more severe at night. To address this issue, we propose the Depth-Memory Decoupling Network (DMDNet). It employs the Depth-Aware Scanning (DAScan) to guide Mamba toward salient structures, promoting information flow along semantic coherence to construct stable states. Working in synergy with DAScan, the Depth-Synergized State-Space Model (DS-SSM) modulates the sensitivity of state activations by depth, suppressing the spread of ambiguous features that interfere with layer disentanglement. Furthermore, we introduce the Memory Expert Compensation Module (MECM), leveraging cross-image historical knowledge to guide experts in providing layer-specific compensation. To address the lack of datasets for nighttime reflection separation, we construct the Nighttime Image Reflection Separation (NightIRS) dataset. Extensive experiments demonstrate that DMDNet outperforms state-of-the-art methods in both daytime and nighttime.

Fonte: arXiv cs.CV

Vision • Score 90

Entropy Production in Machine Learning under Fokker-Planck Probability Flow

Machine learning models deployed in non-stationary environments suffer performance degradation due to data drift. While many drift detection heuristics exist, most lack a principled dynamical interpretation and provide limited guidance on balancing retraining frequency against operational cost. In this work, we propose an entropy-based retraining framework grounded in non-equilibrium stochastic dynamics.

Fonte: arXiv cs.LG

Vision • Score 95

GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval

arXiv:2601.00584v1 Announce Type: new Abstract: Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video using a natural language query without relying on task-specific training data. The primary challenge in this setting lies in the mismatch in semantic granularity between textual queries and visual content. Previous studies in ZVMR have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. However, these approaches failed to balance the semantic granularity between the pre-trained knowledge provided by each modality for a given scene. As a result, despite the high quality of each modality's representations, the mismatch in granularity led to inaccurate retrieval. In this paper, we propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges this gap between coarse and fine semantic representations. Our approach introduces two complementary techniques: granularity-based query rewriting to generate varied semantic granularities, and query-aware caption generation to embed query intent into video content. By pairing multi-level queries with both query-agnostic and query-aware captions, we effectively resolve semantic mismatches. As a result, our method sets a new state-of-the-art across all three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), with a notable 3.23% mAP@avg improvement on the challenging QVHighlights dataset.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Do LLMs Judge Distantly Supervised Named Entity Labels Well? Constructing the JudgeWEL Dataset

arXiv:2601.00411v1 Announce Type: new Abstract: We present judgeWEL, a dataset for named entity recognition (NER) in Luxembourgish, automatically labelled and subsequently verified using large language models (LLM) in a novel pipeline. Building datasets for under-represented languages remains one of the major bottlenecks in natural language processing, where the scarcity of resources and linguistic particularities make large-scale annotation costly and potentially inconsistent. To address these challenges, we propose and evaluate a novel approach that leverages Wikipedia and Wikidata as structured sources of weak supervision. By exploiting internal links within Wikipedia articles, we infer entity types based on their corresponding Wikidata entries, thereby generating initial annotations with minimal human intervention. Because such links are not uniformly reliable, we mitigate noise by employing and comparing several LLMs to identify and retain only high-quality labelled sentences. The resulting corpus is approximately five times larger than the currently available Luxembourgish NER dataset and offers broader and more balanced coverage across entity categories, providing a substantial new resource for multilingual and low-resource NER research.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Detecting Unobserved Confounders: A Kernelized Regression Approach

Detecting unobserved confounders is crucial for reliable causal inference in observational studies. Existing methods require linearity assumptions or multiple heterogeneous environments, limiting their applicability in nonlinear, single-environment settings. We propose Kernelized Regression Confounder Detection (KRCD), a novel method that uses reproducing kernel Hilbert spaces to model complex dependencies.

Fonte: arXiv stat.ML

Vision • Score 95

Spectral Density Estimation of Functional Time Series on Large Domains Using Deep Learning

We derive an estimator of the spectral density of a functional time series that is the output of a multilayer perceptron neural network. The estimator is motivated by the difficulty of computing existing spectral density estimators for time series of functions defined on very large grids, as in climate models and medical scans.

Fonte: arXiv stat.ML

Vision • Score 93

Insider Threat Detection Using GCN and Bi-LSTM with Explicit and Implicit Graph Representations

Insider threat detection (ITD) is challenging due to the subtle and covert nature of malicious activities carried out by trusted users. This paper proposes a post-hoc ITD framework that integrates explicit and implicit graph representations with temporal modeling to capture complex patterns of user behavior.

Fonte: arXiv cs.AI

Vision • Score 96

Grad: Guided Relation Diffusion Generation for Graph Augmentation in Graph Fraud Detection

Graph Fraud Detection (GFD) in financial scenarios has become an urgent research topic for protecting the security of online payments. As fraudsters' camouflage strategies evolve, we propose the Grad model, which uses a supervised contrastive learning module to sharpen the separation between fraudsters and benign users by generating auxiliary homophilic relations.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

ChronoDreamer: An Action-Conditioned World Model as an Online Simulator for Robotic Planning

We present ChronoDreamer, an action-conditioned world model for contact-rich robotic manipulation. Given a history of egocentric RGB frames, contact maps, actions, and joint states, ChronoDreamer predicts future video frames, contact distributions, and joint angles through a spatiotemporal transformer trained with MaskGIT-style masked prediction.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

LLM-based Few-Shot Early Rumor Detection with Imitation Agent

arXiv:2512.18352v1 Announce Type: new Abstract: Early Rumor Detection (EARD) aims to identify the earliest point at which a claim can be accurately classified based on a sequence of social media posts. This is especially challenging in data-scarce settings. While Large Language Models (LLMs) perform well in few-shot NLP tasks, they are not well-suited for time-series data and are computationally expensive for both training and inference. In this work, we propose a novel EARD framework that combines an autonomous agent and an LLM-based detection model, where the agent acts as a reliable decision-maker for early time point determination, while the LLM serves as a powerful rumor detector. This approach offers the first solution for few-shot EARD, necessitating only the training of a lightweight agent and allowing the LLM to remain training-free. Extensive experiments on four real-world datasets show our approach boosts performance across LLMs and surpasses existing EARD methods in accuracy and earliness.

Fonte: arXiv cs.CL

Vision • Score 92

Normalized Mutual Information Is a Biased Measure for Classification and Community Detection

Normalized mutual information is widely used as a similarity measure for evaluating the performance of clustering and classification algorithms. In this paper, we argue that the results returned by normalized mutual information are biased for two reasons: it ignores the information content of the contingency table, and its symmetric normalization introduces a spurious dependence on the algorithm's output. We present a modified version of mutual information that corrects these flaws.
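
For reference, the commonly used symmetric normalization under discussion can be computed directly from the contingency table. A small sketch (natural-log entropies; the geometric-mean normalization shown here is one of several variants in use):

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """Plain symmetric NMI from the contingency table: I(X;Y) / sqrt(H(X)H(Y)).
    This is the widely used form whose normalization the paper argues is biased."""
    a, b = np.asarray(labels_true), np.asarray(labels_pred)
    n = len(a)
    cont = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(cont, (a, b), 1)                    # joint counts
    pxy = cont / n
    px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
    nz = pxy > 0
    mi = (pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum()
    h = lambda p: -(p[p > 0] * np.log(p[p > 0])).sum()
    return mi / np.sqrt(h(px.ravel()) * h(py.ravel()))

truth = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# a prediction that needlessly splits one community still scores fairly high
print(round(nmi(truth, np.array([0, 0, 0, 1, 1, 1, 2, 2, 3])), 3))
```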

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

GeoSense-AI: Fast Location Inference from Crisis Microblogs

arXiv:2512.18225v1 Announce Type: new Abstract: This paper presents an applied AI pipeline for real-time geolocation from noisy microblog streams, unifying statistical hashtag segmentation, part-of-speech-driven proper-noun detection, dependency parsing around disaster lexicons, lightweight named-entity recognition, and gazetteer-grounded disambiguation to infer locations directly from text rather than sparse geotags. The approach operationalizes information extraction under streaming constraints, emphasizing low-latency NLP components and efficient validation against geographic knowledge bases to support situational awareness during emergencies. In head-to-head comparisons with widely used NER toolkits, the system attains strong F1 while being engineered for orders-of-magnitude faster throughput, enabling deployment in live crisis informatics settings. A production map interface demonstrates end-to-end AI functionality (ingest, inference, and visualization), surfacing locational signals at scale for floods, outbreaks, and other fast-moving events. By prioritizing robustness to informal text and streaming efficiency, GeoSense-AI illustrates how domain-tuned NLP and knowledge grounding can elevate emergency response beyond conventional geotag reliance.

Fonte: arXiv cs.CL

Vision • Score 93

The Dead Salmon of AI Interpretability

In a neuroscience study, the authors placed a dead salmon in an MRI scanner and showed it images of humans in social situations, revealing 'dead salmon' artifacts in AI analyses. This work proposes a statistical-causal reinterpretation, treating explanations of computational systems as parameters of a statistical model, emphasizing the importance of testing alternative hypotheses.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

NEURO-GUARD: Neuro-Symbolic Generalization and Unbiased Adaptive Routing for Diagnostics -- Explainable Medical AI

Accurate and interpretable image-based diagnosis remains a central challenge in medical AI, especially in settings with limited data and critical clinical decisions. We present NEURO-GUARD, a novel knowledge-guided framework that integrates Vision Transformers (ViTs) with language-driven reasoning, improving performance and robustness across domains.

Fonte: arXiv cs.AI

Vision • Score 96

A Multimodal Bayesian Network for Robust Casualty Assessment in Autonomous Triage

Mass casualty incidents can overwhelm emergency medical systems, and delays or errors in casualty assessment can result in preventable deaths. We present a decision-support framework that combines the outputs of multiple computer vision models, estimating signs of severe hemorrhage, respiratory distress, physical alertness, or visible trauma, in a Bayesian network built entirely from expert-defined rules.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

FC-MIR: A Mobile Screen Awareness Framework for Intent-Aware Recommendation Based on Frame-Compressed Multimodal Trajectory Reasoning

Identifying user intent from mobile interface operation trajectories is crucial for advancing UI understanding and enabling task automation agents. We propose the FC-MIR framework, which uses keyframe sampling and adaptive concatenation to reduce visual redundancy and increase inference efficiency, integrating state-of-the-art MLLMs for trajectory summarization and intent prediction.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Programmatic Rule Generation for Document Forgery Detection Using Large Language Models

Document forgery poses a growing threat to legal, economic, and governmental processes, demanding increasingly sophisticated verification mechanisms. This work investigates how large language models (LLMs) can be adapted to generate rule-based plausibility checks for forgery detection using limited hardware resources.

Fonte: arXiv cs.AI

Vision • Score 96

A Dataset and Benchmarks for Atrial Fibrillation Detection from Electrocardiograms of Intensive Care Unit Patients

Objective: Atrial fibrillation (AF) is the most common cardiac arrhythmia among intensive care unit (ICU) patients and can cause adverse health effects. In this study, we publish a labeled ICU dataset and benchmarks for AF detection, comparing machine learning models across three data-driven artificial intelligence (AI) approaches.

Fonte: arXiv cs.LG

Vision • Score 96

FedOAED: Federated On-Device Autoencoder Denoiser for Heterogeneous Data under Limited Client Availability

In recent years, machine learning (ML) and deep learning (DL) solutions have shown their potential in various applications, but strict data-sharing regulations such as GDPR and HIPAA have hindered the deployment of data-driven applications. Federated Learning (FL) emerges as a solution but still faces challenges related to heterogeneity. In this work, we present FedOAED, a novel algorithm that aims to mitigate client drift and the variance induced by partial client participation.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models

arXiv:2512.19004v1 Announce Type: new Abstract: Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization. We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.
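
A toy sketch of the confidence-based remasking idea: inject the auxiliary model's tokens, keep only the most confident positions, and leave the rest masked for the diffusion model to refine. The mask id, keep ratio, and top-k selection rule are assumptions, not the paper's specification.

```python
import torch

MASK_ID = 0  # assumed mask-token id for this sketch

def warm_start_with_remask(prior_logits, keep_ratio=0.5):
    """Context-aware initialization with prior skepticism (illustrative only).

    prior_logits: (T, V) token logits from a lightweight auxiliary model.
    We inject the prior's argmax tokens, then remask all but the most
    confident `keep_ratio` positions so the diffusion model can revise them.
    """
    probs = prior_logits.softmax(-1)
    conf, tokens = probs.max(-1)                    # per-position confidence
    k = int(keep_ratio * len(tokens))
    if k > 0:
        keep = conf >= conf.topk(k).values[-1]      # keep the top-k positions
    else:
        keep = torch.zeros_like(conf).bool()
    # partially unmasked start, shortening the denoising trajectory
    return torch.where(keep, tokens, torch.full_like(tokens, MASK_ID))

print(warm_start_with_remask(torch.randn(12, 50)))
```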

Fonte: arXiv cs.CL

Vision • Score 96

EIA-SEC: An Improved Actor-Critic Framework for Collaborative Multi-UAV Control in Smart Agriculture

The widespread application of wireless communication technology has promoted the development of smart agriculture, where unmanned aerial vehicles (UAVs) play a multifunctional role. In this work, we model a Markov decision process to solve the multi-UAV trajectory planning problem and propose the novel Elite Imitation Actor-Shared Ensemble Critic (EIA-SEC) framework. Experimental results show that EIA-SEC outperforms state-of-the-art baselines in reward performance, training stability, and convergence speed.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset

arXiv:2512.18533v1 Announce Type: new Abstract: The proliferation of linguistically subtle political disinformation poses a significant challenge to automated fact-checking systems. Despite increasing emphasis on complex neural architectures, the empirical limits of text-only linguistic modeling remain underexplored. We present a systematic diagnostic evaluation of nine machine learning algorithms on the LIAR benchmark. By isolating lexical features (Bag-of-Words, TF-IDF) and semantic embeddings (GloVe), we uncover a hard "Performance Ceiling", with fine-grained classification not exceeding a Weighted F1-score of 0.32 across models. Crucially, a simple linear SVM (Accuracy: 0.624) matches the performance of pre-trained Transformers such as RoBERTa (Accuracy: 0.620), suggesting that model capacity is not the primary bottleneck. We further diagnose a massive "Generalization Gap" in tree-based ensembles, which achieve more than 99% training accuracy but collapse to approximately 25% on test data, indicating reliance on lexical memorization rather than semantic inference. Synthetic data augmentation via SMOTE yields no meaningful gains, confirming that the limitation is semantic (feature ambiguity) rather than distributional. These findings indicate that for political fact-checking, increasing model complexity without incorporating external knowledge yields diminishing returns.
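
The strong simple baseline in this study, TF-IDF features with a linear SVM, is easy to reproduce in outline. The toy texts below are stand-ins for the LIAR TSV files and its six veracity labels; loading and preprocessing details are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# placeholder statements standing in for the LIAR train/test splits
train_texts = ["the senator voted against the bill", "crime is at a record high"]
train_labels = ["true", "false"]    # toy stand-ins for the 6 LIAR classes
test_texts, test_labels = ["crime hit a record high"], ["false"]

# TF-IDF unigrams+bigrams feeding a linear SVM, as in the paper's baseline
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1), LinearSVC())
clf.fit(train_texts, train_labels)
pred = clf.predict(test_texts)
print(pred, f1_score(test_labels, pred, average="weighted"))
```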

Fonte: arXiv cs.CL

Vision • Score 96

NOVA: Discovering Well-Conditioned Winograd Transformations through Numerical Optimization of Vandermonde Arithmetic

Winograd convolution is the standard algorithm for efficient inference, reducing arithmetic complexity by 2.25x for 3x3 kernels. However, it faces a critical barrier in the modern era of low-precision computing: numerical instability. We present NOVA, a discovery framework that treats the selection of Winograd points as a continuous optimization problem.
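
For reference, the quoted 2.25x figure follows from the multiplication count of the standard F(2x2, 3x3) Winograd tile:

```latex
% An m x m output tile computed from an r x r kernel needs
% (m + r - 1)^2 element-wise multiplications instead of m^2 r^2:
\[
  \frac{m^2 r^2}{(m+r-1)^2}
  \;=\;
  \frac{2^2 \cdot 3^2}{(2+3-1)^2}
  \;=\;
  \frac{36}{16}
  \;=\; 2.25
\]
% The transform matrices are built from Vandermonde structure over the
% chosen interpolation points, which is why point selection governs
% numerical conditioning.
```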

Fonte: arXiv cs.LG

Vision • Score 96

Self-Organizing Maps for Water Quality Assessment in Reservoirs and Lakes: A Systematic Literature Review

Sustainable water quality is fundamental to ecological balance and water security. This review examines the application of the Self-Organizing Map (SOM), an unsupervised AI technique, to water quality assessment, covering parameter selection, sampling strategies, and clustering approaches.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

On the Universality of Transformer Architectures: How Much Attention Is Enough?

Transformers are crucial in many fields of AI, such as large language models, computer vision, and reinforcement learning. This work examines universality in Transformers, reviews recent progress, and identifies key directions for future theoretical research.

Fonte: arXiv cs.LG

Vision • Score 95

Data-driven particle dynamics: Structure-preserving coarse-graining for emergent behavior in non-equilibrium systems

arXiv:2508.12569v2 Announce Type: replace-cross Abstract: Multiscale systems are ubiquitous in science and technology, but are notoriously challenging to simulate as short spatiotemporal scales must be appropriately linked to emergent bulk physics. When expensive high-dimensional dynamical systems are coarse-grained into low-dimensional models, the entropic loss of information leads to emergent physics which are dissipative, history-dependent, and stochastic. To machine learn coarse-grained dynamics from time-series observations of particle trajectories, we propose a framework using the metriplectic bracket formalism that preserves these properties by construction; most notably, the framework guarantees discrete notions of the first and second laws of thermodynamics, conservation of momentum, and a discrete fluctuation-dissipation balance crucial for capturing non-equilibrium statistics. We introduce the mathematical framework abstractly before specializing to a particle discretization. As labels are generally unavailable for entropic state variables, we introduce a novel self-supervised learning strategy to identify emergent structural variables. We validate the method on benchmark systems and demonstrate its utility on two challenging examples: (1) coarse-graining star polymers at challenging levels of coarse-graining while preserving non-equilibrium statistics, and (2) learning models from high-speed video of colloidal suspensions that capture coupling between local rearrangement events and emergent stochastic dynamics. We provide open-source implementations in both PyTorch and LAMMPS, enabling large-scale inference and extensibility to diverse particle-based systems.

Fonte: arXiv stat.ML

Vision • Score 96

Out-of-Distribution Detection in Molecular Complexes via Diffusion Models for Irregular Graphs

Predictive machine learning models generally excel on in-distribution data, but their performance degrades on out-of-distribution (OOD) inputs. We present a probabilistic OOD detection framework for complex 3D graph data, built on a diffusion model that learns the density of the training distribution in a fully unsupervised manner.

Fonte: arXiv cs.LG

Vision • Score 93

Trajectory Planning for UAV-Based Smart Agriculture Using Triple Deep Q-Learning with Imitation

Unmanned aerial vehicles (UAVs) have emerged as a promising auxiliary platform for smart agriculture, simultaneously performing weed detection, recognition, and wireless sensor data collection. However, trajectory planning is challenging due to high environmental uncertainty and the limited battery capacity of UAVs. We propose a novel triple deep Q-learning algorithm with imitation (ITDQN) to solve these problems.

Fonte: arXiv cs.LG

Vision • Score 96

Conflict-Driven Clause Learning with VSIDS Heuristics for Discrete Facility Layout

This paper studies the use of Conflict-Driven Clause Learning (CDCL) with VSIDS heuristics as a computational engine for discrete facility layout problems. The layout problem is modeled as a combinatorial assignment problem with a dense logical structure arising from adjacency, separation, and slot-availability constraints.

Fonte: arXiv cs.AI

Vision • Score 93

Few-Shot Learning of a Graph-Based Neural Network Model Without Backpropagation

Proposing a structural-graph approach to classifying contour images in a few-shot regime without backpropagation, this work presents a model in which structure is the carrier of explanations. The image is encoded as an attributed graph, and generalization is achieved through the formation of concept attractors.

Fonte: arXiv cs.AI

Vision • Score 96

Agent-Based Output Drift Detection for Breast Cancer Response Prediction in a Multi-Site Clinical Decision Support System

Modern clinical decision support systems can simultaneously serve multiple independent medical imaging institutions, but their predictive performance can degrade across sites due to variations in patient populations, imaging hardware, and acquisition protocols. We propose an agent-based framework for detecting drift and assessing its severity in multi-site clinical AI systems.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

CORE: Concept-Oriented Reinforcement to Close the Definition-Application Gap in Mathematical Reasoning

Large language models (LLMs) often solve challenging mathematical exercises yet fail to apply the underlying concept when a problem demands genuine understanding. We present CORE (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

Developers often struggle to specify correct training labels and rewards. We propose recontextualization, which reduces how often language models game training signals by performing undesired behaviors that those signals mistakenly reinforce.

Fonte: arXiv cs.AI

Vision • Score 90

Conditioning Accept-Desirability Models in the Context of AGM-Like Belief Change

We discuss conditioning for Accept-Desirability models in an abstract decision-making framework where uncertain rewards live in a general linear space. This setting makes it possible to unify classical and quantum probabilities, extending them to an imprecise-probability context. We introduce a new conditioning rule and associate a belief revision operator with it.

Fonte: arXiv cs.AI

Vision • Score 95

Statistical laws and linguistics inform meaning in naturalistic and fictional conversation

arXiv:2512.18072v1 Announce Type: new Abstract: Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type with some portion generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps' law, which holds that vocabulary size scales with document length. Little work on Heaps' law has looked at conversation and considered how language features impact scaling. We measure Heaps' law for conversations recorded in two distinct mediums: 1. Strangers brought together on video chat and 2. Fictional characters in movies. We find that scaling of vocabulary size differs by parts of speech. We discuss these findings through behavioral and linguistic frameworks.
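
Measuring Heaps' law on a transcript amounts to tracking vocabulary growth and fitting a power law V(n) ~ K * n^beta in log-log space. A minimal sketch, with whitespace tokenization standing in for the paper's preprocessing:

```python
import numpy as np

def heaps_curve(tokens):
    """Vocabulary size V(n) after each of the first n tokens."""
    seen, sizes = set(), []
    for tok in tokens:
        seen.add(tok)
        sizes.append(len(seen))
    return np.array(sizes)

text = "well I mean I think the movie was the best movie I have seen".split()
v = heaps_curve(text)
n = np.arange(1, len(v) + 1)
beta, log_k = np.polyfit(np.log(n), np.log(v), 1)  # slope ~ Heaps exponent
print(v[-1], round(beta, 2))
```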

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

On Finding Inconsistencies in Documents

arXiv:2512.18601v1 Announce Type: new Abstract: Professionals in academia, law, and finance audit their documents because inconsistencies can result in monetary, reputational, and scientific costs. Language models (LMs) have the potential to dramatically speed up this auditing process. To understand their abilities, we introduce a benchmark, FIND (Finding INconsistencies in Documents), where each example is a document with an inconsistency inserted manually by a domain expert. Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies. Surprisingly, gpt-5 also found undiscovered inconsistencies present in the original documents. For example, on 50 arXiv papers, we judged 136 out of 196 of the model's suggestions to be legitimate inconsistencies missed by the original authors. However, despite these findings, even the best models miss almost half of the inconsistencies in FIND, demonstrating that inconsistency detection is still a challenging task.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Keep It Light! Simplifying Image Clustering via Text-Free Adapters

In the context of pre-trained models, effective classification can be achieved with lightweight readout layers. This work demonstrates that, in deep clustering, performance competitive with more complex methods can be obtained using a highly simplified, text-free training pipeline. Our approach, Simple Clustering via Pre-trained models (SCP), uses feature representations from pre-trained vision models and positive data pairs.

Fonte: arXiv stat.ML

NLP/LLMs • Score 96

Parameter-Efficient Fine-Tuning for HAR: Integrating LoRA and QLoRA into Transformer Models

Human activity recognition (HAR) is a fundamental task in pervasive computing. This work investigates parameter-efficient fine-tuning techniques, specifically Low-Rank Adaptation (LoRA) and Quantized LoRA, as scalable alternatives to full model fine-tuning for HAR, demonstrating competitive performance with fewer trainable parameters and lower memory usage.
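
A from-scratch sketch of the LoRA idea applied to a single linear layer; the rank, scaling, and target modules here are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA adapter around a frozen linear layer.

    The frozen weight W is augmented as W + (alpha / r) * B @ A, where only
    the low-rank factors A (r x in) and B (out x r) are trained.
    """
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the backbone
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # start as no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable, "trainable parameters vs", 64 * 64 + 64, "in the base layer")
```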

Fonte: arXiv cs.LG

Vision • Score 95

Testing for Latent Structure via the Wilcoxon--Wigner Random Matrix of Normalized Rank Statistics

This paper considers the problem of testing for latent structure in large symmetric data matrices. The goal is to develop a statistically principled methodology that is flexible in its applicability, computationally efficient, and insensitive to extreme variation in the data, thereby overcoming the limitations of existing approaches.

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

Training LLMs with LogicReward for Faithful and Rigorous Reasoning

arXiv:2512.18196v1 Announce Type: new Abstract: Although LLMs exhibit strong reasoning capabilities, existing training methods largely depend on outcome-based feedback, which can produce correct answers with flawed reasoning. Prior work introduces supervision on intermediate steps but still lacks guarantees of logical soundness, which is crucial in high-stakes scenarios where logical consistency is paramount. To address this, we propose LogicReward, a novel reward system that guides model training by enforcing step-level logical correctness with a theorem prover. We further introduce Autoformalization with Soft Unification, which reduces natural language ambiguity and improves formalization quality, enabling more effective use of the theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks with simple training procedures. Further analysis shows that LogicReward enhances reasoning faithfulness, improves generalizability to unseen tasks such as math and commonsense reasoning, and provides a reliable reward signal even without ground-truth labels. We will release all data and code at https://llm-symbol.github.io/LogicReward.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model

arXiv:2512.17313v1 Announce Type: new Abstract: Despite the impressive zero-shot capabilities of Vision-Language Models (VLMs), they often struggle in downstream tasks with distribution shifts from the pre-training data. Few-Shot Adaptation (FSA-VLM) has emerged as a key solution, typically using Parameter-Efficient Fine-Tuning (PEFT) to adapt models with minimal data. However, these PEFT methods are constrained by their reliance on fixed, handcrafted prompts, which are often insufficient to understand the semantics of classes. While some studies have proposed leveraging image-induced prompts to provide additional clues for classification, they introduce prohibitive computational overhead at inference. Therefore, we introduce Auxiliary Descriptive Knowledge (ADK), a novel framework that efficiently enriches text representations without compromising efficiency. ADK first leverages a Large Language Model to generate a rich set of descriptive prompts for each class offline. These pre-computed features are then deployed in two ways: (1) as Compositional Knowledge, an averaged representation that provides rich semantics, especially beneficial when class names are ambiguous or unfamiliar to the VLM; and (2) as Instance-Specific Knowledge, where a lightweight, non-parametric attention mechanism dynamically selects the most relevant descriptions for a given image. This approach provides two additional types of knowledge alongside the handcrafted prompt, thereby facilitating category distinction across various domains. Also, ADK acts as a parameter-free, plug-and-play component that enhances existing PEFT methods. Extensive experiments demonstrate that ADK consistently boosts the performance of multiple PEFT baselines, setting a new state-of-the-art across various scenarios.
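
A minimal sketch of how the two knowledge sources described above could be fused at inference time. The function name `adk_logits`, the fusion weights `w`, and the softmax temperature `tau` are illustrative assumptions, not the paper's exact formulation; embeddings are assumed to be CLIP-style features.

```python
import torch
import torch.nn.functional as F

def adk_logits(img, prompt_emb, desc_embs, tau=0.01, w=(1.0, 1.0, 1.0)):
    """
    img:        (d,) image feature
    prompt_emb: (C, d) handcrafted-prompt text features, one per class
    desc_embs:  list of C tensors (k_c, d): LLM-generated class descriptions,
                pre-computed offline
    Returns (C,) logits fusing prompt, compositional, and instance-specific knowledge.
    """
    img = F.normalize(img, dim=-1)
    logits = []
    for c in range(prompt_emb.shape[0]):
        d = F.normalize(desc_embs[c], dim=-1)                   # (k_c, d)
        comp = F.normalize(d.mean(0), dim=-1)                   # compositional knowledge
        attn = F.softmax(d @ img / tau, dim=0)                  # non-parametric attention
        inst = F.normalize((attn[:, None] * d).sum(0), dim=-1)  # instance-specific knowledge
        s = w[0] * prompt_emb[c] @ img + w[1] * comp @ img + w[2] * inst @ img
        logits.append(s)
    return torch.stack(logits)
```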

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Rotterdam artery-vein segmentation (RAV) dataset

arXiv:2512.17322v1 Announce Type: new Abstract: Purpose: To provide a diverse, high-quality dataset of color fundus images (CFIs) with detailed artery-vein (A/V) segmentation annotations, supporting the development and evaluation of machine learning algorithms for vascular analysis in ophthalmology. Methods: CFIs were sampled from the longitudinal Rotterdam Study (RS), encompassing a wide range of ages, devices, and capture conditions. Images were annotated using a custom interface that allowed graders to label arteries, veins, and unknown vessels on separate layers, starting from an initial vessel segmentation mask. Connectivity was explicitly verified and corrected using connected component visualization tools. Results: The dataset includes 1024x1024-pixel PNG images in three modalities: original RGB fundus images, contrast-enhanced versions, and RGB-encoded A/V masks. Image quality varied widely, including challenging samples typically excluded by automated quality assessment systems, but judged to contain valuable vascular information. Conclusion: This dataset offers a rich and heterogeneous source of CFIs with high-quality segmentations. It supports robust benchmarking and training of machine learning models under real-world variability in image quality and acquisition settings. Translational Relevance: By including connectivity-validated A/V masks and diverse image conditions, this dataset enables the development of clinically applicable, generalizable machine learning tools for retinal vascular analysis, potentially improving automated screening and diagnosis of systemic and ocular diseases.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

CheXPO-v2: Preference Optimization for Chest X-ray VLMs with Knowledge Graph Consistency

arXiv:2512.17213v1 Announce Type: new Abstract: Medical Vision-Language Models (VLMs) are prone to hallucinations, compromising clinical reliability. While reinforcement learning methods like Group Relative Policy Optimization (GRPO) offer a low-cost alignment solution, their reliance on sparse, outcome-based rewards inadvertently encourages models to "overthink" -- generating verbose, convoluted, and unverifiable Chain-of-Thought reasoning to justify answers. This focus on outcomes obscures factual errors and poses significant safety risks. To address this, we propose CheXPO-v2, a novel alignment framework that shifts from outcome to process supervision. Our core innovation is a Knowledge Graph Consistency Reward mechanism driven by Entity-Relation Matching. By explicitly parsing reasoning steps into structured "Disease, Relation, Anatomy" triplets, we provide fine-grained supervision that penalizes incoherent logic and hallucinations at the atomic level. Integrating this with a hard-example mining strategy, our approach significantly outperforms GRPO and state-of-the-art models on benchmarks like MIMIC-CXR-VQA. Crucially, CheXPO-v2 achieves new state-of-the-art accuracy using only 5k samples, demonstrating exceptional data efficiency while producing clinically sound and verifiable reasoning. The project source code is publicly available at: https://github.com/ecoxial2007/CheX-Phi4MM.
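
A toy sketch of a triplet-level consistency reward in the spirit described above: reasoning steps are parsed into (disease, relation, anatomy) triplets and checked against a reference knowledge graph. The parsing step and the penalty-for-unsupported-triplets shape below are assumptions, not CheXPO-v2's exact reward.

```python
def kg_consistency_reward(pred_triplets, ref_triplets):
    """Fraction of predicted (disease, relation, anatomy) triplets supported
    by the reference KG, minus a penalty for unsupported (hallucinated) ones."""
    pred, ref = set(pred_triplets), set(ref_triplets)
    if not pred:
        return 0.0
    supported = len(pred & ref)
    return (supported - (len(pred) - supported)) / len(pred)

r = kg_consistency_reward(
    [("effusion", "located_in", "left pleura"), ("edema", "located_in", "lung")],
    [("effusion", "located_in", "left pleura")],
)
print(r)  # 0.0: one supported and one hallucinated triplet cancel out
```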

Fonte: arXiv cs.CV

Vision • Score 95

WDFFU-Mamba: A Wavelet-guided Dual-attention Feature Fusion Mamba for Breast Tumor Segmentation in Ultrasound Images

arXiv:2512.17278v1 Announce Type: new Abstract: Breast ultrasound (BUS) image segmentation plays a vital role in assisting clinical diagnosis and early tumor screening. However, challenges such as speckle noise, imaging artifacts, irregular lesion morphology, and blurred boundaries severely hinder accurate segmentation. To address these challenges, this work aims to design a robust and efficient model capable of automatically segmenting breast tumors in BUS images. We propose a novel segmentation network named WDFFU-Mamba, which integrates wavelet-guided enhancement and dual-attention feature fusion within a U-shaped Mamba architecture. A Wavelet-denoised High-Frequency-guided Feature (WHF) module is employed to enhance low-level representations through noise-suppressed high-frequency cues. A Dual Attention Feature Fusion (DAFF) module is also introduced to effectively merge skip-connected and semantic features, improving contextual consistency. Extensive experiments on two public BUS datasets demonstrate that WDFFU-Mamba achieves superior segmentation accuracy, significantly outperforming existing methods in terms of Dice coefficient and 95th percentile Hausdorff Distance (HD95). The combination of wavelet-domain enhancement and attention-based fusion greatly improves both the accuracy and robustness of BUS image segmentation, while maintaining computational efficiency. The proposed WDFFU-Mamba model not only delivers strong segmentation performance but also exhibits desirable generalization ability across datasets, making it a promising solution for real-world clinical applications in breast tumor ultrasound analysis.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

DAVE: A VLM Vision Encoder for Document Understanding and Web Agents

arXiv:2512.17221v1 Announce Type: new Abstract: While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Towards Safer Chatbots: Automated Policy Compliance Evaluation of Custom GPTs

arXiv:2502.01436v3 Announce Type: replace Abstract: User-configured chatbots built on top of large language models are increasingly available through centralized marketplaces such as OpenAI's GPT Store. While these platforms enforce usage policies intended to prevent harmful or inappropriate behavior, the scale and opacity of customized chatbots make systematic policy enforcement challenging. As a result, policy-violating chatbots continue to remain publicly accessible despite existing review processes. This paper presents a fully automated method for evaluating the compliance of Custom GPTs with its marketplace usage policy using black-box interaction. The method combines large-scale GPT discovery, policy-driven red-teaming prompts, and automated compliance assessment using an LLM-as-a-judge. We focus on three policy-relevant domains explicitly addressed in OpenAI's usage policies: Romantic, Cybersecurity, and Academic GPTs. We validate our compliance assessment component against a human-annotated ground-truth dataset, achieving an F1 score of 0.975 for binary policy violation detection. We then apply the method in a large-scale empirical study of 782 Custom GPTs retrieved from the GPT Store. The results show that 58.7% of the evaluated GPTs exhibit at least one policy-violating response, with substantial variation across policy domains. A comparison with the base models (GPT-4 and GPT-4o) indicates that most violations originate from model-level behavior, while customization tends to amplify these tendencies rather than create new failure modes. Our findings reveal limitations in current review mechanisms for user-configured chatbots and demonstrate the feasibility of scalable, behavior-based policy compliance evaluation.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering

arXiv:2512.17396v1 Announce Type: cross Abstract: In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.
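
Since the benchmark is hosted on the Hugging Face Hub, it should be loadable with the standard `datasets` API; the split and field names are not given in the abstract, so inspect the object before indexing into it.

```python
from datasets import load_dataset

# Repo id taken from the URL above; splits/columns are unspecified here.
ds = load_dataset("raidium/RadImageNet-VQA")
print(ds)
```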

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

PAACE: An Automated, Planning-Aware Context Engineering Framework

Large Language Models (LLMs) are increasingly used in complex workflows involving planning, tool use, reflection, and interaction with external knowledge systems. This work presents PAACE, a unified framework for optimizing the evolving state of LLM agents through next-task relevance modeling, planning-structure analysis, instruction co-refinement, and function-preserving compression.

Fonte: arXiv cs.AI

NLP/LLMs • Score 92

Linear Personality Probing and Steering in LLMs: A Big Five Study

arXiv:2512.17639v1 Announce Type: new Abstract: Large language models (LLMs) exhibit distinct and consistent personalities that greatly impact trust and engagement. While this means that personality frameworks would be highly valuable tools to characterize and control LLMs' behavior, current approaches remain either costly (post-training) or brittle (prompt engineering). Probing and steering via linear directions has recently emerged as a cheap and efficient alternative. In this paper, we investigate whether linear directions aligned with the Big Five personality traits can be used for probing and steering model behavior. Using Llama 3.3 70B, we generate descriptions of 406 fictional characters and their Big Five trait scores. We then prompt the model with these descriptions and questions from the Alpaca questionnaire, allowing us to sample hidden activations that vary along personality traits in known, quantifiable ways. Using linear regression, we learn a set of per-layer directions in activation space, and test their effectiveness for probing and steering model behavior. Our results suggest that linear directions aligned with trait-scores are effective probes for personality detection, while their steering capabilities strongly depend on context, producing reliable effects in forced-choice tasks but limited influence in open-ended generation or when additional context is present in the prompt.
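
A compact sketch of the probing-and-steering recipe (names and the steering scale `alpha` are illustrative; the paper fits one direction per layer and per Big Five trait): fit a linear direction from (activation, trait-score) pairs, read a score off the projection, and steer by adding a scaled unit direction.

```python
import numpy as np

def fit_trait_direction(acts, scores):
    """Least-squares fit of score ~ acts @ w + b for one layer.
    acts: (N, d) hidden activations; scores: (N,) trait scores."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    w, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return w[:-1]                       # drop the bias term

def probe(act, direction):
    return float(act @ direction)       # projection ~ trait score (up to an affine map)

def steer(act, direction, alpha=3.0):
    d = direction / np.linalg.norm(direction)
    return act + alpha * d              # push the activation along the trait axis
```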

Fonte: arXiv cs.CL

Vision • Score 95

Stakeholder Suite: A Unified AI Framework for Mapping Actors, Topics and Arguments in Public Debates

arXiv:2512.17347v1 Announce Type: new Abstract: Public debates surrounding infrastructure and energy projects involve complex networks of stakeholders, arguments, and evolving narratives. Understanding these dynamics is crucial for anticipating controversies and informing engagement strategies, yet existing tools in media intelligence largely rely on descriptive analytics with limited transparency. This paper presents Stakeholder Suite, a framework deployed in operational contexts for mapping actors, topics, and arguments within public debates. The system combines actor detection, topic modeling, argument extraction and stance classification in a unified pipeline. Tested on multiple energy infrastructure projects as a case study, the approach delivers fine-grained, source-grounded insights while remaining adaptable to diverse domains. The framework achieves strong retrieval precision and stance accuracy, producing arguments judged relevant in 75% of pilot use cases. Beyond quantitative metrics, the tool has proven effective for operational use: helping project teams visualize networks of influence, identify emerging controversies, and support evidence-based decision-making.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?

arXiv:2512.17394v1 Announce Type: new Abstract: Theory of Mind (ToM) -- the ability to attribute beliefs, desires, and emotions to others -- is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, where human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse theory of mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Confidence-Credibility Aware Weighted Ensembles of Small LLMs Outperform Large LLMs in Emotion Detection

arXiv:2512.17630v1 Announce Type: new Abstract: This paper introduces a confidence-weighted, credibility-aware ensemble framework for text-based emotion detection, inspired by Condorcet's Jury Theorem (CJT). Unlike conventional ensembles that often rely on homogeneous architectures, our approach combines architecturally diverse small transformer-based large language models (sLLMs) - BERT, RoBERTa, DistilBERT, DeBERTa, and ELECTRA, each fully fine-tuned for emotion classification. To preserve error diversity, we minimize parameter convergence while taking advantage of the unique biases of each model. A dual-weighted voting mechanism integrates both global credibility (validation F1 score) and local confidence (instance-level probability) to dynamically weight model contributions. Experiments on the DAIR-AI dataset demonstrate that our credibility-confidence ensemble achieves a macro F1 score of 93.5 percent, surpassing state-of-the-art benchmarks and significantly outperforming large-scale LLMs, including Falcon, Mistral, Qwen, and Phi, even after task-specific Low-Rank Adaptation (LoRA). With only 595M parameters in total, our small LLMs ensemble proves more parameter-efficient and robust than models up to 7B parameters, establishing that carefully designed ensembles of small, fine-tuned models can outperform much larger LLMs in specialized natural language processing (NLP) tasks such as emotion detection.
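
A plausible instantiation of the dual-weighted vote, assuming per-model class probabilities and validation macro-F1 scores are available; the paper's Bayesian weight updates and adversarial-robustness machinery are not modeled in this sketch.

```python
import numpy as np

def ensemble_predict(probas, f1_scores):
    """
    probas:    (M, N, C) class probabilities from M models on N samples
    f1_scores: (M,) validation macro-F1 per model (global credibility)
    Each vote is scaled by credibility * instance-level confidence.
    """
    probas = np.asarray(probas)
    cred = np.asarray(f1_scores)[:, None, None]   # (M, 1, 1) global credibility
    conf = probas.max(axis=2, keepdims=True)      # (M, N, 1) local confidence
    weighted = cred * conf * probas               # (M, N, C)
    return weighted.sum(axis=0).argmax(axis=1)    # (N,) predicted labels
```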

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation

arXiv:2512.17083v1 Announce Type: cross Abstract: Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of prior work, evaluation practice in dialogue topic segmentation remains dominated by strict boundary matching and F1-based metrics, even as modern LLM-based conversational systems increasingly rely on segmentation to manage conversation history beyond the model's fixed context window, where unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation objective for dialogue topic segmentation that treats boundary density and segment coherence as primary criteria, alongside window-tolerant F1 (W-F1). Through extensive cross-dataset empirical evaluation, we show that reported performance differences across dialogue segmentation benchmarks are driven not by model quality, but by annotation granularity mismatches and sparse boundary labels. This indicates that many reported improvements arise from evaluation artifacts rather than improved boundary detection. We evaluated multiple, structurally distinct dialogue segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Across these settings, we observe high segment coherence combined with extreme oversegmentation relative to sparse labels, producing misleadingly low exact-match F1 scores. We show that topic segmentation is best understood as selecting an appropriate granularity rather than predicting a single correct boundary set. We operationalize this view by explicitly separating boundary scoring from boundary selection.
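
For concreteness, one common greedy formulation of window-tolerant F1 over boundary indices; the paper's exact matching policy and window units are assumptions here.

```python
def window_tolerant_f1(pred, ref, window=2):
    """A predicted boundary within `window` positions of a still-unmatched
    reference boundary counts as a true positive (greedy 1-to-1 matching)."""
    ref_free = sorted(ref)
    tp = 0
    for p in sorted(pred):
        match = next((r for r in ref_free if abs(p - r) <= window), None)
        if match is not None:
            ref_free.remove(match)
            tp += 1
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(ref) if ref else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print(window_tolerant_f1([3, 9, 15], [4, 10, 20], window=2))  # ~0.667
```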

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs

arXiv:2512.17189v1 Announce Type: new Abstract: Medical Vision-Language Models (MedVLMs) show immense promise in clinical applicability. However, their reliability is hindered by hallucinations, where models often fail to derive answers from visual evidence, instead relying on learned textual priors. Existing mitigation strategies for MedVLMs have distinct limitations: training-based methods rely on costly expert annotations, limiting scalability, while training-free interventions like contrastive decoding, though data-efficient, apply a global, untargeted correction whose effects in complex real-world clinical settings can be unreliable. To address these challenges, we introduce Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our module leverages an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model's focus onto specified regions, reinforcing anatomical understanding and suppressing factually incorrect outputs. Extensive experiments across diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, demonstrate our method's effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv:2512.16978v1 Announce Type: new Abstract: Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both, and while some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and a graded rubric for interpretable, traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: https://github.com/mbzuai-oryx/longshot.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

AVM: Towards Structure-Preserving Neural Response Modeling in the Visual Cortex Across Stimuli and Individuals

arXiv:2512.16948v1 Announce Type: new Abstract: While deep learning models have shown strong performance in simulating neural responses, they often fail to clearly separate stable visual encoding from condition-specific adaptation, which limits their ability to generalize across stimuli and individuals. We introduce the Adaptive Visual Model (AVM), a structure-preserving framework that enables condition-aware adaptation through modular subnetworks, without modifying the core representation. AVM keeps a Vision Transformer-based encoder frozen to capture consistent visual features, while independently trained modulation paths account for neural response variations driven by stimulus content and subject identity. We evaluate AVM in three experimental settings, including stimulus-level variation, cross-subject generalization, and cross-dataset adaptation, all of which involve structured changes in inputs and individuals. Across two large-scale mouse V1 datasets, AVM outperforms the state-of-the-art V1T model by approximately 2% in predictive correlation, demonstrating robust generalization, interpretable condition-wise modulation, and high architectural efficiency. Specifically, AVM achieves a 9.1% improvement in explained variance (FEVE) under the cross-dataset adaptation setting. These results suggest that AVM provides a unified framework for adaptive neural modeling across biological and experimental conditions, offering a scalable solution under structural constraints. Its design may inform future approaches to cortical modeling in both neuroscience and biologically inspired AI systems.

Fonte: arXiv cs.CV

Vision • Score 95

Diagnostic Performance of Universal-Learning Ultrasound AI Across Multiple Organs and Tasks: the UUSIC25 Challenge

arXiv:2512.17279v1 Announce Type: new Abstract: IMPORTANCE: Current ultrasound AI remains fragmented into single-task tools, limiting clinical utility compared to versatile modern ultrasound systems. OBJECTIVE: To evaluate the diagnostic accuracy and efficiency of single general-purpose deep learning models for multi-organ classification and segmentation. DESIGN: The Universal UltraSound Image Challenge 2025 (UUSIC25) involved developing algorithms on 11,644 images (public/private). Evaluation used an independent, multi-center test set of 2,479 images, including data from a center completely unseen during training to assess generalization. OUTCOMES: Diagnostic performance (Dice Similarity Coefficient [DSC]; Area Under the Receiver Operating Characteristic Curve [AUC]) and computational efficiency (inference time, GPU memory). RESULTS: Of 15 valid algorithms, the top model (SMART) achieved a macro-averaged DSC of 0.854 across 5 segmentation tasks and AUC of 0.766 for binary classification. Models showed high capability in segmentation (e.g., fetal head DSC: 0.942) but variability in complex tasks subject to domain shift. Notably, in breast cancer molecular subtyping, the top model's performance dropped from AUC 0.571 (internal) to 0.508 (unseen external center), highlighting generalization challenges. CONCLUSIONS: General-purpose AI models achieve high accuracy and efficiency across multiple tasks using a single architecture. However, performance degradation on unseen data suggests domain generalization is critical for future clinical deployment.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning

arXiv:2512.17227v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) demonstrate significant potential but remain brittle in complex, long-chain visual reasoning tasks. A critical failure mode is "visual forgetting", where models progressively lose visual grounding as reasoning extends, a phenomenon aptly described as "think longer, see less". We posit this failure stems from current training paradigms prematurely entangling two distinct cognitive skills: (1) abstract logical reasoning ("how-to-think") and (2) strategic visual perception ("when-to-look"). This creates a foundational cold-start deficiency -- weakening abstract reasoning -- and a strategic perception deficit, as models lack a policy for when to perceive. In this paper, we propose a novel curriculum-based framework to disentangle these skills. First, we introduce a disentangled Supervised Fine-Tuning (SFT) curriculum that builds a robust abstract reasoning backbone on text-only data before anchoring it to vision with a novel Perception-Grounded Chain-of-Thought (PG-CoT) paradigm. Second, we resolve the strategic perception deficit by formulating timing as a reinforcement learning problem. We design a Pivotal Perception Reward that teaches the model when to look by coupling perceptual actions to linguistic markers of cognitive uncertainty (e.g., "wait", "verify"), thereby learning an autonomous grounding policy. Our contributions include the formalization of these two deficiencies and the development of a principled, two-stage framework to address them, transforming the model from a heuristic-driven observer to a strategic, grounded reasoner. Code: https://github.com/gaozilve-max/learning-when-to-look
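
A toy sketch of the timing idea behind the Pivotal Perception Reward: perception actions that follow linguistic markers of uncertainty are rewarded, uncoupled ones mildly penalized. The marker list, bonus, and penalty are illustrative assumptions, not the paper's reward.

```python
UNCERTAINTY_MARKERS = ("wait", "verify", "let me check")

def pivotal_perception_reward(steps, looked, bonus=1.0, penalty=0.2):
    """steps: reasoning strings; looked: whether a perception action
    followed each step. Reward looks coupled to uncertainty markers."""
    reward = 0.0
    for text, did_look in zip(steps, looked):
        uncertain = any(m in text.lower() for m in UNCERTAINTY_MARKERS)
        if did_look:
            reward += bonus if uncertain else -penalty
    return reward
```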

Fonte: arXiv cs.CV

Vision • Score 96

Bots Don't Stand Still: A Longitudinal Study of Bot Behavior Change, Temporal Drift, and Feature-Structure Evolution

Social bots are deeply embedded in online platforms for promotion, persuasion, and manipulation. Most bot-detection systems still treat behavioral features as static, implicitly assuming that bots behave in a stationary way over time. This study analyzes change in individual behavioral signals and their interrelations, using 2,615 promotional bot accounts and 2.8 million tweets.

Fonte: arXiv cs.AI

Vision • Score 96

Dialectics for Artificial Intelligence

This article investigates whether artificial intelligence can discover concepts from raw experience, without human supervision. It proposes a definition of 'concept' that goes beyond a dictionary label, considering its structural relation to an agent's total experience. Reversibility is a central aspect, allowing the existence of concepts to be a verifiable structural claim.

Fonte: arXiv cs.AI

Vision • Score 96

Another Fit Falls: Conformal Prediction as a Calibration Standard for Machine Learning in High-Energy Physics

Machine learning techniques are essential in modern collider research, but their probabilistic outputs often lack calibrated uncertainty estimates. Conformal prediction (CP) offers a simple, distribution-free framework for calibrating arbitrary predictive models, enabling rigorous uncertainty quantification. In this work, we investigate CP as a unifying calibration layer for machine learning applications in high-energy physics.

Fonte: arXiv cs.AI

Vision • Score 95

Infinite Homography as Robust Conditioning for Camera-Controlled Video Generation

Recent progress in video diffusion models has sparked growing interest in camera-controlled novel-view video generation for dynamic scenes. A key challenge is ensuring fidelity to the specified camera pose while maintaining view consistency and reasoning about occluded geometry from limited observations. We present InfCam, a camera-controlled video-to-video generation framework with high pose fidelity.

Fonte: arXiv cs.CV

Vision • Score 95

ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching

arXiv:2512.17178v1 Announce Type: new Abstract: Contrastive Language-Image Pretraining (CLIP) has achieved remarkable performance in various multimodal tasks. However, it still struggles with compositional image-text matching, particularly in accurately associating objects with their corresponding attributes, because its inherent global representation often overlooks fine-grained semantics for attribute binding. Existing methods often require additional training or extensive hard negative sampling, yet they frequently show limited generalization to novel compositional concepts and fail to fundamentally address the drawbacks of global representations. In this paper, we propose ABE-CLIP, a novel training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. Specifically, we employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in the text, thereby mitigating attribute confusion and improving semantic precision. We further introduce a Local Token-Patch Alignment strategy that computes similarity scores between refined textual tokens and their most relevant image patches. By aggregating localized similarity scores, ABE-CLIP computes the final image-text similarity. Experiments on multiple datasets demonstrate that ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training.
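
A minimal sketch of the Local Token-Patch Alignment step: each refined text token is matched to its most similar image patch, and the localized maxima are aggregated into the image-text score. Mean aggregation over tokens is an assumption.

```python
import torch
import torch.nn.functional as F

def local_token_patch_score(text_tokens, patches):
    """
    text_tokens: (T, d) refined token embeddings for one caption
    patches:     (P, d) image patch embeddings
    """
    t = F.normalize(text_tokens, dim=-1)
    v = F.normalize(patches, dim=-1)
    sim = t @ v.T                                # (T, P) token-patch cosine similarities
    return sim.max(dim=1).values.mean().item()   # best patch per token, then average
```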

Fonte: arXiv cs.CV

Vision • Score 95

Multi-level distortion-aware deformable network for omnidirectional image super-resolution

With the growing popularity of augmented- and virtual-reality applications, image processing for Omnidirectional Images (ODIs) has attracted increasing attention. Omnidirectional Image Super-Resolution (ODISR) is a promising technique for improving the visual quality of ODIs. We propose a novel Multi-level Distortion-aware Deformable Network (MDDN) for ODISR, designed to expand the sampling range and the receptive field.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Mitty: Diffusion-Based Human-to-Robot Video Generation

Learning directly from human demonstration videos is an important milestone for scalable, generalizable robot learning. We present Mitty, a Diffusion Transformer that enables video In-Context Learning for Human2Robot video generation, building on a pre-trained video diffusion model without requiring action labels.

Fonte: arXiv cs.CV

Vision • Score 95

EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories

arXiv:2512.17320v1 Announce Type: new Abstract: The widespread adoption of text-to-image (T2I) generation has raised concerns about privacy, bias, and copyright violations. Concept erasure techniques offer a promising solution by selectively removing undesired concepts from pre-trained models without requiring full retraining. However, these methods are often evaluated on a limited set of concepts, relying on overly simplistic and direct prompts. To test the boundaries of concept erasure techniques, and assess whether they truly remove targeted concepts from model representations, we introduce EMMA, a benchmark that evaluates five key dimensions of concept erasure over 12 metrics. EMMA goes beyond standard metrics like image quality and time efficiency, testing robustness under challenging conditions, including indirect descriptions, visually similar non-target concepts, and potential gender and ethnicity bias, providing a socially aware analysis of method behavior. Using EMMA, we analyze five concept erasure methods across five domains (objects, celebrities, art styles, NSFW, and copyright). Our results show that existing methods struggle with implicit prompts (i.e., generating the erased concept when it is indirectly referenced) and visually similar non-target concepts (i.e., failing to generate non-targeted concepts resembling the erased one), while some amplify gender and ethnicity bias compared to the original model.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

V-Agent: An Interactive Video Search System Using Vision-Language Models

We present V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversation. By fine-tuning a vision-language model (VLM) on a small video-preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios.

Fonte: arXiv cs.AI

Vision • Score 95

It's Not Always Greener on the Other Side: Perceptions of Green Space Across Demographics and Personalities in Multiple Cities

Quantifying and assessing urban vegetation is crucial for planning and development, reflecting the continued importance of green spaces for multiple climate and well-being dimensions in cities. This work measures the differences between subjective perceptions and objective measures of greenery, using data from a comprehensive survey of 1,000 people across five countries.

Fonte: arXiv cs.CV

Vision • Score 95

Endo-SemiS: Towards Robust Semi-Supervised Image Segmentation for Endoscopic Video

In this paper, we present Endo-SemiS, a semi-supervised segmentation framework that provides reliable segmentation of endoscopic video frames with limited annotation. Endo-SemiS uses four strategies to improve performance, effectively exploiting all available data, especially unlabeled data.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs

arXiv:2512.17319v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) demonstrate strong perception and reasoning performance on existing remote sensing (RS) benchmarks. However, most prior benchmarks rely on low-resolution imagery, and some high-resolution benchmarks suffer from flawed reasoning-task designs. We show that text-only LLMs can perform competitively with multimodal vision-language models on RS reasoning tasks without access to images, revealing a critical mismatch between current benchmarks and the intended evaluation of visual understanding. To enable faithful assessment, we introduce RSHR-Bench, a super-high-resolution benchmark for RS visual understanding and reasoning. RSHR-Bench contains 5,329 full-scene images with a long side of at least 4,000 pixels, with up to about 3 x 10^8 pixels per image, sourced from widely used RS corpora and UAV collections. We design four task families: multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation. These tasks cover nine perception categories and four reasoning types, supporting multi-turn and multi-image dialog. To reduce reliance on language priors, we apply adversarial filtering with strong LLMs followed by rigorous human verification. Overall, we construct 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or verified single-image evaluation VQA pairs. Evaluations across open-source, closed-source, and RS-specific VLMs reveal persistent performance gaps in super-high-resolution scenarios. Code: https://github.com/Yunkaidang/RSHR

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Can Synthetic Images Serve as Effective and Efficient Class Prototypes?

Vision-Language Models (VLMs) have demonstrated strong performance on zero-shot image classification tasks. However, existing methods such as Contrastive Language-Image Pre-training (CLIP) depend on annotated text-image pairs, which raises the cost and precision requirements of preparing high-quality datasets. We present the LGCLIP framework, which uses a Large Language Model (LLM) to generate class-specific prompts, enabling lightweight and efficient synthesis of reference images.

Fonte: arXiv cs.CV

Vision • Score 95

Pro-Pose: Unpaired Full-Body Portrait Synthesis via Canonical UV Maps

arXiv:2512.17143v1 Announce Type: new Abstract: Photographs of people taken by professional photographers typically present the person in beautiful lighting, with an interesting pose, and flattering quality. This is unlike common photos people can take of themselves. In this paper, we explore how to create a "professional" version of a person's photograph, i.e., in a chosen pose, in a simple environment, with good lighting, and standard black top/bottom clothing. A key challenge is to preserve the person's unique identity, face and body features while transforming the photo. If there existed a large paired dataset of the same person photographed both "in the wild" and by a professional photographer, the problem would potentially be easier to solve. However, such data does not exist, especially for a large variety of identities. To that end, we propose two key insights: 1) Our method transforms the input photo and person's face to a canonical UV space, which is further coupled with reposing methodology to model occlusions and novel view synthesis. Operating in UV space allows us to leverage existing unpaired datasets. 2) We personalize the output photo via multi-image finetuning. Our approach yields high-quality, reposed portraits and achieves strong qualitative and quantitative performance on real-world imagery.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

InfoTok: Adaptive Discrete Video Tokenization via Information-Theoretic Compression

Accurate and efficient discrete video tokenization is essential for processing long video sequences. This paper presents InfoTok, a principled framework for adaptive video tokenization, proving that existing training methods are suboptimal and introducing a new ELBO-based algorithm that approaches the theoretical optimum.

Fonte: arXiv cs.AI

Vision • Score 95

Towards Pixel-Wise Anomaly Location for High-Resolution PCBA via Self-Supervised Image Reconstruction

arXiv:2512.17296v1 Announce Type: new Abstract: Automated defect inspection of assembled Printed Circuit Board Assemblies (PCBA) is quite challenging due to insufficient labeled data and micro-defects spanning just a few pixels in visually complex, high-resolution images. To address these challenges, we present HiSIR-Net, a High resolution, Self-supervised Reconstruction framework for pixel-wise PCBA localization. Our design combines two lightweight modules that make this practical on real 4K-resolution boards: (i) a Selective Input-Reconstruction Gate (SIR-Gate) that lets the model decide where to trust reconstruction versus the original input, thereby reducing irrelevant reconstruction artifacts and false alarms; and (ii) a Region-level Optimized Patch Selection (ROPS) scheme with positional cues to select overlapping patch reconstructions coherently across arbitrary resolutions. Organically integrating these mechanisms yields clean, high-resolution anomaly maps with a low false-positive (FP) rate. To bridge the gap in high-resolution PCBA datasets, we further contribute a self-collected dataset named SIPCBA-500 of 500 images. We conduct extensive experiments on our SIPCBA-500 as well as public benchmarks, demonstrating the superior localization performance of our method while running at practical speed. Full code and dataset will be made available upon acceptance.

Fonte: arXiv cs.CV

Vision • Score 95

DESSERT: Diffusion-based Event-driven Single-frame Synthesis via Residual Training

arXiv:2512.17323v1 Announce Type: new Abstract: Video frame prediction extrapolates future frames from previous frames, but suffers from prediction errors in dynamic scenes due to the lack of information about the next frame. Event cameras address this limitation by capturing per-pixel brightness changes asynchronously with high temporal resolution. Prior research on event-based video frame prediction has leveraged motion information from event data, often by predicting event-based optical flow and reconstructing frames via pixel warping. However, such approaches introduce holes and blurring when pixel displacement is inaccurate. To overcome this limitation, we propose DESSERT, a diffusion-based event-driven single-frame synthesis framework via residual training. Leveraging a pre-trained Stable Diffusion model, our method is trained on inter-frame residuals to ensure temporal consistency. The training pipeline consists of two stages: (1) an Event-to-Residual Alignment Variational Autoencoder (ER-VAE) that aligns the event frame between anchor and target frames with the corresponding residual, and (2) a diffusion model that denoises the residual latent conditioned on event data. Furthermore, we introduce Diverse-Length Temporal (DLT) augmentation, which improves robustness by training on frame segments of varying temporal lengths. Experimental results demonstrate that our method outperforms existing event-based reconstruction, image-based video frame prediction, event-based video frame prediction, and one-sided event-based video frame interpolation methods, producing sharper and more temporally consistent frame synthesis.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Vision-Language Model Guided Image Restoration

arXiv:2512.17292v1 Announce Type: new Abstract: Many image restoration (IR) tasks require both pixel-level fidelity and high-level semantic understanding to recover realistic photos with fine-grained details. However, previous approaches often struggle to effectively leverage both the visual and linguistic knowledge. Recent efforts have attempted to incorporate Vision-language models (VLMs), which excel at aligning visual and textual features, into universal IR. Nevertheless, these methods fail to utilize the linguistic priors to ensure semantic coherence during the restoration process. To address this issue, in this paper, we propose the Vision-Language Model Guided Image Restoration (VLMIR) framework, which leverages the rich vision-language priors of VLMs, such as CLIP, to enhance IR performance through improved visual perception and semantic understanding. Our approach consists of two stages: VLM-based feature extraction and diffusion-based image restoration. In the first stage, we extract complementary visual and linguistic representations of input images by condensing the visual perception and high-level semantic priors through VLMs. Specifically, we align the embeddings of captions from low-quality and high-quality images using a cosine similarity loss with LoRA fine-tuning, and employ a degradation predictor to decompose degradation and clean image content embeddings. These complementary visual and textual embeddings are then integrated into a diffusion-based model via cross-attention mechanisms for enhanced restoration. Extensive experiments and ablation studies demonstrate that VLMIR achieves superior performance across both universal and degradation-specific IR tasks, underscoring the critical role of integrated visual and linguistic knowledge from VLMs in advancing image restoration capabilities.

Fonte: arXiv cs.CV

Vision • Score 95

Interpretable Similarity of Synthetic Image Utility

arXiv:2512.17080v1 Announce Type: new Abstract: Synthetic medical image data can unlock the potential of deep learning (DL)-based clinical decision support (CDS) systems through the creation of large scale, privacy-preserving, training sets. Despite the significant progress in this field, there is still a largely unanswered research question: "How can we quantitatively assess the similarity of a synthetically generated set of images with a set of real images in a given application domain?". Today, answers to this question are mainly provided via user evaluation studies, inception-based measures, and the classification performance achieved on synthetic images. This paper proposes a novel measure to assess the similarity between synthetically generated and real sets of images, in terms of their utility for the development of DL-based CDS systems. Inspired by generalized neural additive models, and unlike inception-based measures, the proposed measure is interpretable (Interpretable Utility Similarity, IUS), explaining why a synthetic dataset could be more useful than another one in the context of a CDS system based on clinically relevant image features. The experimental results on publicly available datasets from various color medical imaging modalities including endoscopic, dermoscopic and fundus imaging, indicate that selecting synthetic images of high utility similarity using IUS can result in relative improvements of up to 54.6% in terms of classification performance. The generality of IUS for synthetic data assessment is demonstrated also for greyscale X-ray and ultrasound imaging modalities. IUS implementation is available at https://github.com/innoisys/ius

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

Lights, Camera, Consistency: A Multi-Stage Pipeline for AI Video Stories with Stable Characters

Generating long, cohesive video stories with consistent characters is a significant challenge for today's text-to-video AI. We present a method that approaches video generation much like a filmmaker would, using a pipeline that involves writing a detailed script and generating consistent visuals for each character.

Fonte: arXiv cs.AI

Vision • Score 95

PhysFire-WM: A Physics-Informed World Model for Emulating Fire Spread Dynamics

arXiv:2512.17152v1 Announce Type: new Abstract: Fine-grained fire prediction plays a crucial role in emergency response. Infrared images and fire masks provide complementary thermal and boundary information, yet current methods are predominantly limited to binary mask modeling with inherent signal sparsity, failing to capture the complex dynamics of fire. While world models show promise in video generation, their physical inconsistencies pose significant challenges for fire forecasting. This paper introduces PhysFire-WM, a Physics-informed World Model for emulating Fire spread dynamics. Our approach internalizes combustion dynamics by encoding structured priors from a Physical Simulator to rectify physical discrepancies, coupled with a Cross-task Collaborative Training strategy (CC-Train) that alleviates the issue of limited information in mask-based modeling. Through parameter sharing and gradient coordination, CC-Train effectively integrates thermal radiation dynamics and spatial boundary delineation, enhancing both physical realism and geometric accuracy. Extensive experiments on a fine-grained multimodal fire dataset demonstrate the superior accuracy of PhysFire-WM in fire spread prediction. Validation underscores the importance of physical priors and cross-task collaboration, providing new insights for applying physics-informed world models to disaster prediction.

Fonte: arXiv cs.CV

Vision • Score 95

Disentangled representations via score-based variational autoencoders

arXiv:2512.17127v1 Announce Type: new Abstract: We present the Score-based Autoencoder for Multiscale Inference (SAMI), a method for unsupervised representation learning that combines the theoretical frameworks of diffusion models and VAEs. By unifying their respective evidence lower bounds, SAMI formulates a principled objective that learns representations through score-based guidance of the underlying diffusion process. The resulting representations automatically capture meaningful structure in the data: they recover ground-truth generative factors in our synthetic dataset, learn factorized, semantic latent dimensions from complex natural images, and encode video sequences into latent trajectories that are straighter than those of alternative encoders, despite training exclusively on static images. Furthermore, SAMI can extract useful representations from pre-trained diffusion models with minimal additional training. Finally, the explicitly probabilistic formulation provides new ways to identify semantically meaningful axes in the absence of supervised labels, and its mathematical exactness allows us to make formal statements about the nature of the learned representation. Overall, these results indicate that implicit structural information in diffusion models can be made explicit and interpretable through synergistic combination with a variational autoencoder.

Fonte: arXiv stat.ML

NLP/LLMs • Score 92

Perturb Your Data: Paraphrase-Guided Training Data Watermarking

arXiv:2512.17075v1 Announce Type: new Abstract: Training data detection is critical for enforcing copyright and data licensing, as Large Language Models (LLM) are trained on massive text corpora scraped from the internet. We present SPECTRA, a watermarking approach that makes training data reliably detectable even when it comprises less than 0.001% of the training corpus. SPECTRA works by paraphrasing text using an LLM and assigning a score based on how likely each paraphrase is, according to a separate scoring model. A paraphrase is chosen so that its score closely matches that of the original text, to avoid introducing any distribution shifts. To test whether a suspect model has been trained on the watermarked data, we compare its token probabilities against those of the scoring model. We demonstrate that SPECTRA achieves a consistent p-value gap of over nine orders of magnitude when detecting data used for training versus data not used for training, which is greater than all baselines tested. SPECTRA equips data owners with a scalable, deploy-before-release watermark that survives even large-scale LLM training.
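
A sketch of the selection rule implied by the abstract: among candidate paraphrases, keep the one whose scoring-model score best matches the original's, so that watermarking does not shift the text distribution. `score_fn` stands in for the separate scoring model's (log-)likelihood and is an assumption.

```python
def pick_paraphrase(original, paraphrases, score_fn):
    """Keep the paraphrase whose score is closest to the original's,
    minimizing any induced distribution shift."""
    s0 = score_fn(original)
    return min(paraphrases, key=lambda p: abs(score_fn(p) - s0))
```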

Fonte: arXiv cs.CL

Vision • Score 93

DeepShare: Sharing ReLU Across Channels and Layers for Efficient Private Inference

arXiv:2512.17398v1 Announce Type: new Abstract: Private Inference (PI) uses cryptographic primitives to perform privacy-preserving machine learning. In this setting, the owner of the network runs inference on the data of the client without learning anything about the data and without revealing any information about the model. It has been observed that a major computational bottleneck of PI is the calculation of the gate (i.e., ReLU), so a considerable amount of effort has been devoted to reducing the number of ReLUs in a given network. We focus on the DReLU, which is the non-linear step function of the ReLU, and show that one DReLU can serve many ReLU operations. We suggest a new activation module where the DReLU operation is only performed on a subset of the channels (prototype channels), while the rest of the channels (replicate channels) replicate the DReLU of each of their neurons from the corresponding neurons in one of the prototype channels. We then extend this idea to work across different layers. We show that this formulation can drastically reduce the number of DReLU operations in ResNet-type networks. Furthermore, our theoretical analysis shows that this new formulation can solve an extended version of the XOR problem, using just one non-linearity and two neurons, something that traditional formulations and some PI-specific methods cannot achieve. We achieve new SOTA results on several classification setups and SOTA results on image segmentation.
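
A PyTorch sketch of the prototype/replicate gating described above: the DReLU (0/1 step) is evaluated only on prototype channels, and every replicate channel reuses the gate of its assigned prototype. How channels are assigned to prototypes is part of the paper's design and is abstracted away here.

```python
import torch

def shared_drelu(x, prototypes, assign):
    """
    x:          (B, C, H, W) pre-activations
    prototypes: list of channel indices whose DReLU is actually computed
    assign:     length-C list mapping every channel to its prototype channel
    out[:, c] = x[:, c] * (x[:, assign[c]] > 0), so prototype channels get
    an exact ReLU and replicate channels reuse a prototype's gate.
    """
    gate = (x[:, prototypes] > 0).float()                        # (B, P, H, W)
    pos = {p: i for i, p in enumerate(prototypes)}
    idx = torch.tensor([pos[assign[c]] for c in range(x.shape[1])])
    return x * gate[:, idx]                                      # broadcast gates to all C channels
```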

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Adversarially Robust Detection of Harmful Online Content: A Computational Design Science Approach

arXiv:2512.17367v1 Announce Type: new Abstract: Social media platforms are plagued by harmful content such as hate speech, misinformation, and extremist rhetoric. Machine learning (ML) models are widely adopted to detect such content; however, they remain highly vulnerable to adversarial attacks, wherein malicious users subtly modify text to evade detection. Enhancing adversarial robustness is therefore essential, requiring detectors that can defend against diverse attacks (generalizability) while maintaining high overall accuracy. However, simultaneously achieving both optimal generalizability and accuracy is challenging. Following the computational design science paradigm, this study takes a sequential approach that first proposes a novel framework (Large Language Model-based Sample Generation and Aggregation, LLM-SGA) by identifying the key invariances of textual adversarial attacks and leveraging them to ensure that a detector instantiated within the framework has strong generalizability. Second, we instantiate our detector (Adversarially Robust Harmful Online Content Detector, ARHOCD) with three novel design components to improve detection accuracy: (1) an ensemble of multiple base detectors that exploits their complementary strengths; (2) a novel weight assignment method that dynamically adjusts weights based on each sample's predictability and each base detector's capability, with weights initialized using domain knowledge and updated via Bayesian inference; and (3) a novel adversarial training strategy that iteratively optimizes both the base detectors and the weight assignor. We addressed several limitations of existing adversarial robustness enhancement research and empirically evaluated ARHOCD across three datasets spanning hate speech, rumor, and extremist content. Results show that ARHOCD offers strong generalizability and improves detection accuracy under adversarial conditions.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

The Effect of Negation on CLIP in Medical Imaging: Limitations of Contrastive Language-Image Pretraining

arXiv:2512.17121v1 Announce Type: new Abstract: Large vision-language models like CLIP are increasingly used in medical imaging tasks due to their ability to align images and text without the need for extensive labeled data. This makes them particularly useful for applications like image retrieval, report generation, and classification in clinical settings. A potential issue with this approach is that CLIP-based models often underperform when interpreting negated phrases, which is especially problematic in the context of medical diagnosis. In this study, we evaluate the Stanford AIMI CheXagent model on its ability to correctly retrieve chest X-ray images using prompts with and without negation. The goal of this project is to understand where this model fails and then use it as a base model to improve its retrieval accuracy via fine-tuning methods outlined in previous work. Results show improved handling of negation in the CLIP model, with a slight decrease in accuracy on positive-prompt evaluation. Alongside retrieval accuracy, we examined internal model behavior through token attribution, t-SNE projection, and attention-head ablation to better characterize how each fine-tuning approach reshaped the text encoder's representation of negated clinical language. Through this work, we hope to better understand the internal behavior of CLIP and improve its handling of negation using clinically relevant language, improving its reliability in medical AI devices.
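A sketch of the retrieval protocol with and without negation, using a generic open-source CLIP via `open_clip` as a stand-in for the CheXagent checkpoint; the prompts and the random stand-in image batch are illustrative:

```python
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

prompts = ["chest x-ray with pleural effusion",
           "chest x-ray with no pleural effusion"]  # negated variant

with torch.no_grad():
    t = model.encode_text(tokenizer(prompts))
    t = t / t.norm(dim=-1, keepdim=True)
    images = torch.randn(8, 3, 224, 224)            # stand-in for a preprocessed batch
    v = model.encode_image(images)
    v = v / v.norm(dim=-1, keepdim=True)
    sims = v @ t.T                                  # (N, 2) image-prompt similarities
    top_pos = sims[:, 0].topk(3).indices            # retrieval for the positive prompt
    top_neg = sims[:, 1].topk(3).indices            # retrieval for the negated prompt
```

Comparing `top_pos` and `top_neg` against ground-truth findings exposes the failure mode the paper studies: a model that ignores negation returns nearly the same ranking for both prompts.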

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

arXiv:2512.17131v1 Announce Type: cross Abstract: We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method in its primal averaging formulation that addresses key limitations of recent averaging-based optimizers such as single-worker DiLoCo and Schedule-Free (SF) in the non-distributed setting. These two recent algorithmic approaches improve the performance of base optimizers, such as AdamW, through different iterate averaging strategies. Schedule-Free explicitly maintains a uniform average of past weights, while single-worker DiLoCo performs implicit averaging by periodically aggregating trajectories, called pseudo-gradients, to update the model parameters. However, single-worker DiLoCo's periodic averaging introduces a two-loop structure, increasing its memory requirements and number of hyperparameters. GPA overcomes these limitations by decoupling the interpolation constant in the primal averaging formulation of Nesterov. This decoupling enables GPA to smoothly average iterates at every step, generalizing and improving upon single-worker DiLoCo. Empirically, GPA consistently outperforms single-worker DiLoCo while removing the two-loop structure, simplifying hyperparameter tuning, and reducing its memory overhead to a single additional buffer. On the Llama-160M model, GPA provides a 24.22% speedup in terms of steps to reach the baseline (AdamW's) validation loss. Likewise, GPA achieves speedups of 12% and 27% on small and large batch setups, respectively, to attain AdamW's validation accuracy on the ImageNet ViT workload. Furthermore, we prove that for any base optimizer with regret bounded by $O(\sqrt{T})$, where $T$ is the number of iterations, GPA can match or exceed the convergence guarantee of the original optimizer, depending on the choice of interpolation constants.
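To make the averaging mechanics concrete, here is a toy sketch of primal-averaging-style smoothing around a base AdamW optimizer on a quadratic. The decoupled constants `beta` (gradient-evaluation point) and `mu` (averaging rate) are illustrative stand-ins for GPA's parameterization, not the paper's exact update; note the single extra buffer `x`:

```python
import torch

torch.manual_seed(0)
z = torch.tensor([5.0, -3.0], requires_grad=True)  # fast iterate (base optimizer)
x = z.detach().clone()                             # smoothed iterate (the average)
opt = torch.optim.AdamW([z], lr=0.1)
beta, mu = 0.9, 0.05                               # decoupled interpolation constants

def loss_at(p):
    # simple ill-conditioned quadratic stands in for the training loss
    return 0.5 * (p[0] ** 2 + 10.0 * p[1] ** 2)

for _ in range(200):
    y = (1 - beta) * z + beta * x   # gradient is evaluated at an interpolation
    loss = loss_at(y)
    opt.zero_grad()
    loss.backward()                 # d(loss)/dz flows through y
    opt.step()
    with torch.no_grad():
        x.mul_(1 - mu).add_(mu * z) # smooth averaging at every step, no outer loop

print(loss_at(x).item())            # deploy/evaluate the averaged iterate
```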

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Atom: Efficient On-Device Video-Language Pipelines Through Modular Reuse

arXiv:2512.17108v1 Announce Type: new Abstract: Recent advances in video-language models have enabled powerful applications like video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging due to redundant model loads and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and reuses them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency without sacrificing performance. On commodity smartphones, Atom achieves 27--33% faster execution compared to non-reuse baselines, with only marginal performance drop ($\leq$ 2.3 Recall@1 in retrieval, $\leq$ 1.5 CIDEr in captioning). These results position Atom as a practical, scalable approach for efficient video-language understanding on edge devices.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

FORMSpoT: A Decade of Tree-Level, Country-Scale Forest Monitoring

arXiv:2512.17021v1 Announce Type: new Abstract: The recent decline of the European forest carbon sink highlights the need for spatially explicit and frequently updated forest monitoring tools. Yet, existing satellite-based disturbance products remain too coarse to detect changes at the scale of individual trees, typically below 100 m$^{2}$. Here, we introduce FORMSpoT (Forest Mapping with SPOT Time series), a decade-long (2014-2024) nationwide mapping of forest canopy height at 1.5 m resolution, together with annual disturbance polygons (FORMSpoT-$\Delta$) covering mainland France. Canopy heights were derived from annual SPOT-6/7 composites using a hierarchical transformer model (PVTv2) trained on high-resolution airborne laser scanning (ALS) data. To enable robust change detection across heterogeneous acquisitions, we developed a dedicated post-processing pipeline combining co-registration and spatio-temporal total variation denoising. Validation against ALS revisits across 19 sites and 5,087 National Forest Inventory plots shows that FORMSpoT-$\Delta$ substantially outperforms existing disturbance products. In mountainous forests, where disturbances are small and spatially fragmented, FORMSpoT-$\Delta$ achieves an F1-score of 0.44, an order of magnitude above existing benchmarks. By enabling tree-level monitoring of forest dynamics at national scale, FORMSpoT-$\Delta$ provides a unique tool to analyze management practices, detect early signals of forest decline, and better quantify carbon losses from subtle disturbances such as thinning or selective logging. These results underscore the critical importance of sustaining very high-resolution satellite missions like SPOT and open-data initiatives such as DINAMIS for monitoring forests under climate change.
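As a rough illustration of the spatio-temporal total-variation step (co-registration omitted), the annual height maps can be treated as a 3-D volume and denoised jointly over time and space with an off-the-shelf solver; the TV weight and the disturbance threshold below are illustrative, not the paper's calibrated values:

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

# Treat the decade of annual canopy-height maps as a (T, H, W) volume and apply
# total-variation denoising jointly over time and space: spurious year-to-year
# jumps get suppressed while persistent height drops (disturbances) survive.
heights = np.random.rand(11, 256, 256).astype(np.float32)  # 2014-2024 stack
smoothed = denoise_tv_chambolle(heights, weight=0.1)       # 3-D TV denoising

# candidate disturbances: large height loss between consecutive smoothed years
loss = smoothed[:-1] - smoothed[1:]
disturbed = loss > 0.3  # threshold in (normalized) height units, illustrative
```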

Fonte: arXiv cs.CV

Vision • Score 96

Lightweight, Physics-Informed Machine Learning for Aviation Visibility Forecasting Across Diverse Climate Regimes

Short-term forecasting (nowcasting) of low-visibility and precipitation events is crucial for aviation safety and operational efficiency. This study presents a lightweight gradient-boosting framework (XGBoost) trained exclusively on surface observation data (METAR) and enhanced by feature engineering guided by thermodynamic principles.
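A minimal sketch of thermodynamically guided feature engineering on synthetic METAR-like columns; the column names, the dewpoint-depression feature, the toy label, and the hyperparameters are all illustrative assumptions, not the paper's configuration:

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

df = pd.DataFrame({
    "temp_c":       np.random.uniform(-5, 30, 1000),
    "dewpoint_c":   np.random.uniform(-10, 25, 1000),
    "wind_kt":      np.random.uniform(0, 30, 1000),
    "pressure_hpa": np.random.uniform(990, 1030, 1000),
})
# dewpoint depression measures saturation proximity, a classic fog/low-visibility cue
df["dewpoint_depression"] = df["temp_c"] - df["dewpoint_c"]
y = (df["dewpoint_depression"] < 2).astype(int)  # toy low-visibility label

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.05)
model.fit(df, y)
```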

Fonte: arXiv cs.LG

Vision • Score 96

BIONIX: A Low-Cost Wireless Prosthetic Arm with Dual-Signal EEG and EMG Control

Affordable upper-limb prostheses often lack intuitive control systems, limiting functionality and accessibility for amputees in low-income settings. This project presents a low-cost neuromuscular control system that integrates electroencephalography (EEG) and electromyography (EMG) to enable real-time control of a prosthetic arm.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

PILAR: Personalizing Augmented Reality Interactions with Trustworthy, Human-Centric LLM-Based Explanations for Everyday Use Cases

Augmented reality (AR) systems powered by artificial intelligence (AI) are becoming increasingly embedded in everyday life, raising the need for real-time explanations. PILAR is a novel framework that uses a pretrained large language model (LLM) to generate personalized, contextual explanations, improving the user experience in AI-driven AR systems.

Fonte: arXiv cs.AI

Vision • Score 95

SynergyWarpNet: Attention-Guided Cooperative Warping for Neural Portrait Animation

arXiv:2512.17331v1 Announce Type: new Abstract: Recent advances in neural portrait animation have demonstrated remarkable potential for applications in virtual avatars, telepresence, and digital content creation. However, traditional explicit warping approaches often struggle with accurate motion transfer or recovering missing regions, while recent attention-based warping methods, though effective, frequently suffer from high complexity and weak geometric grounding. To address these issues, we propose SynergyWarpNet, an attention-guided cooperative warping framework designed for high-fidelity talking head synthesis. Given a source portrait, a driving image, and a set of reference images, our model progressively refines the animation in three stages. First, an explicit warping module performs coarse spatial alignment between the source and driving image using 3D dense optical flow. Next, a reference-augmented correction module leverages cross-attention across 3D keypoints and texture features from multiple reference images to semantically complete occluded or distorted regions. Finally, a confidence-guided fusion module integrates the warped outputs via spatially adaptive fusion, using a learned confidence map to balance structural alignment and visual consistency. Comprehensive evaluations on benchmark datasets demonstrate state-of-the-art performance.
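A minimal sketch of the confidence-guided fusion stage, assuming the two inputs are the explicitly warped frame and the reference-corrected frame; the tiny convolutional confidence head is an illustrative stand-in for the paper's module:

```python
import torch
import torch.nn as nn

class ConfidenceFusion(nn.Module):
    """Blend two candidate frames with a learned per-pixel confidence map."""

    def __init__(self, channels: int = 3):
        super().__init__()
        self.conf_head = nn.Sequential(
            nn.Conv2d(2 * channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, warped, corrected):
        # confidence is predicted from both candidates jointly
        c = self.conf_head(torch.cat([warped, corrected], dim=1))  # (B, 1, H, W)
        return c * warped + (1 - c) * corrected

fuse = ConfidenceFusion()
out = fuse(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```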

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

Adversarial VR: An Open-Source Testbed for Evaluating Adversarial Robustness in VR Cybersickness Detection and Mitigation

Automated cybersickness detection methods based on deep learning (DL) can improve user comfort and interaction. However, these systems are susceptible to adversarial attacks, which can degrade model performance and disrupt the immersive experience. This paper presents Adversarial-VR, a real-time testbed for evaluating cybersickness detection and mitigation strategies under adversarial conditions.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs

arXiv:2512.17206v1 Announce Type: new Abstract: Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection of diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable, controllable modulation of the (vision-) language model's strategic behavior, thereby achieving consistent performance gains over standard RL methods.
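A minimal sketch of the latent-to-prefix conditioning: a latent drawn from the prior is decoded into a few soft prefix embeddings and prepended to the prompt embeddings. The linear decoder, dimensions, and prior sampling are illustrative (the paper infers the latent with a VAE over question-answer embeddings):

```python
import torch
import torch.nn as nn

d_model, latent_dim, n_prefix = 512, 32, 8

prefix_decoder = nn.Linear(latent_dim, n_prefix * d_model)

def modulated_inputs(prompt_embeds: torch.Tensor) -> torch.Tensor:
    z = torch.randn(prompt_embeds.size(0), latent_dim)  # sample a "palette"
    prefix = prefix_decoder(z).view(-1, n_prefix, d_model)
    return torch.cat([prefix, prompt_embeds], dim=1)    # prepend soft prefix tokens

inputs = modulated_inputs(torch.randn(2, 40, d_model))  # -> (2, 48, 512)
```

Each fresh sample of `z` yields a different prefix, so repeated generations explore distinct reasoning styles rather than redundant paths.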

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Democratizing Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling

arXiv:2512.17326v1 Announce Type: new Abstract: Vision-language models (VLMs) have the potential to become co-pilots for pathologists. However, most VLMs either focus on small regions of interest within whole-slide images, provide only static slide-level outputs, or rely on data that is not publicly available, limiting reproducibility. Furthermore, training data containing WSIs paired with detailed clinical reports is scarce, restricting progress toward transparent and generalisable VLMs. We address these limitations with three main contributions. First, we introduce Polysome, a standardised tool for synthetic instruction generation. Second, we apply Polysome to the public HISTAI dataset, generating HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, we use HISTAI-Instruct to train ANTONI-α, a VLM capable of visual-question answering (VQA). We show that ANTONI-α outperforms MedGemma on WSI-level VQA tasks of tissue identification, neoplasm detection, and differential diagnosis. We also compare the performance of multiple incarnations of ANTONI-α trained with different amounts of data. All methods, data, and code are publicly available.

Fonte: arXiv cs.CV

Vision • Score 95

Any-Optical-Model: A Universal Foundation Model for Optical Remote Sensing

Optical satellites, with their diverse band configurations and ground sampling distances, provide indispensable evidence for tasks ranging from ecosystem surveillance to emergency response. However, significant discrepancies in band composition and spatial resolution across different optical sensors pose major challenges for existing Remote Sensing Foundation Models (RSFMs).

Fonte: arXiv cs.CV

Vision • Score 95

AnyCXR: Human Anatomy Segmentation in Chest Radiographs at Any Acquisition Position Using Multi-Stage Domain-Randomized Synthetic Data with Imperfect Annotations and Conditional Joint Annotation Regularization Learning

Robust anatomical segmentation of chest radiographs (CXRs) remains challenging due to the scarcity of comprehensive annotations and the substantial variability of real-world acquisition conditions. We propose AnyCXR, a unified framework that enables generalizable multi-organ segmentation at arbitrary CXR projection angles using only synthetic supervision.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning

arXiv:2512.17312v1 Announce Type: new Abstract: Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a reward for balanced and adaptive tool calling, which trades exploration off against efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism of executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models.
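A toy sketch of a balanced tool-call reward, assuming a soft call budget with a linear overuse penalty; the functional form and constants are illustrative, not the paper's reward:

```python
def tool_call_reward(correct: bool, n_calls: int,
                     budget: int = 4, lam: float = 0.1) -> float:
    """Reward correctness, but discount trajectories that spam tools
    beyond a soft budget (illustrative constants)."""
    base = 1.0 if correct else 0.0
    overuse_penalty = lam * max(0, n_calls - budget)
    return base - overuse_penalty

print(tool_call_reward(correct=True, n_calls=2))  # 1.0: within budget
print(tool_call_reward(correct=True, n_calls=9))  # 0.5: correct but tool-heavy
```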

Fonte: arXiv cs.CV

Vision • Score 93

Fose: Fusing One-Step Diffusion and an End-to-End Network for Pansharpening

Pansharpening is a significant image-fusion task that combines low-resolution multispectral images (LRMSI) and high-resolution panchromatic images (PAN) to obtain high-resolution multispectral images (HRMSI). This work proposes a novel four-stage training strategy to obtain a lightweight network called Fose, which fuses a one-step diffusion model with an E2E model, significantly improving the efficiency of the process.

Fonte: arXiv cs.AI

Vision • Score 92

Comparison of deep learning models: CNN and VGG-16 in identifying pornographic content

arXiv:2512.16947v1 Announce Type: new Abstract: In 2020, a total of 59,741 websites were blocked by the Indonesian government due to containing negative content, including pornography, with 14,266 websites falling into this category. However, these blocked websites could still be accessed by the public using virtual private networks (VPNs). This motivated research into quickly identifying pornographic content. This study aims to develop a system capable of identifying websites suspected of containing pornographic image content, using a deep learning approach with convolutional neural network (CNN) and Visual Geometry Group 16 (VGG-16) models. The two models were then explored comprehensively and holistically to determine which was most effective at detecting pornographic content quickly. Based on the comparison between the CNN and VGG-16 models, the best test result was obtained in the eighth experiment using the CNN model, with 50 epochs and a learning rate of 0.001, reaching an accuracy of 0.9487 (94.87%). This indicates that the CNN model is more effective at detecting pornographic content quickly and accurately than the VGG-16 model.

Fonte: arXiv cs.CV

Vision • Score 92

MatLat: Material Latent Space for PBR Texture Generation

arXiv:2512.17302v1 Announce Type: new Abstract: We propose a generative framework for producing high-quality PBR textures on a given 3D mesh. As large-scale PBR texture datasets are scarce, our approach focuses on effectively leveraging the embedding space and diffusion priors of pretrained latent image generative models while learning a material latent space, MatLat, through targeted fine-tuning. Unlike prior methods that freeze the embedding network and thus lead to distribution shifts when encoding additional PBR channels and hinder subsequent diffusion training, we fine-tune the pretrained VAE so that new material channels can be incorporated with minimal latent distribution deviation. We further show that correspondence-aware attention alone is insufficient for cross-view consistency unless the latent-to-image mapping preserves locality. To enforce this locality, we introduce a regularization in the VAE fine-tuning that crops latent patches, decodes them, and aligns the corresponding image regions to maintain strong pixel-latent spatial correspondence. Ablation studies and comparison with previous baselines demonstrate that our framework improves PBR texture fidelity and that each component is critical for achieving state-of-the-art performance.
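A minimal, executable sketch of the locality regularization: crop a latent patch, decode it, and align it with the corresponding image crop so the latent grid stays pixel-aligned. `ToyVAE`, the 8x scale, and the crop indices are illustrative stand-ins for the paper's fine-tuned VAE:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    """Stand-in autoencoder with an 8x-downsampled latent grid, used only to
    make the locality-regularizer sketch executable."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, 4, kernel_size=8, stride=8)
        self.dec = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)
    def encode(self, x): return self.enc(x)
    def decode(self, z): return self.dec(z)

def locality_loss(vae, image: torch.Tensor, scale: int = 8) -> torch.Tensor:
    # decode a cropped latent patch and align it with the matching image crop,
    # penalizing any latent-to-image mapping that is not spatially local
    z = vae.encode(image)                        # (B, C, h, w) latent grid
    t, l, s = 2, 3, 4                            # crop top/left/size, latent units
    recon = vae.decode(z[:, :, t:t + s, l:l + s])
    img_crop = image[:, :, t * scale:(t + s) * scale,
                        l * scale:(l + s) * scale]
    return F.mse_loss(recon, img_crop)

loss = locality_loss(ToyVAE(), torch.randn(1, 3, 128, 128))
```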

Fonte: arXiv cs.CV

Vision • Score 95

Robust Scene Coordinate Regression via Geometrically-Consistent Global Descriptors

arXiv:2512.17226v1 Announce Type: new Abstract: Recent learning-based visual localization methods use global descriptors to disambiguate visually similar places, but existing approaches often derive these descriptors from geometric cues alone (e.g., covisibility graphs), limiting their discriminative power and reducing robustness in the presence of noisy geometric constraints. We propose an aggregator module that learns global descriptors consistent with both geometrical structure and visual similarity, ensuring that images are close in descriptor space only when they are visually similar and spatially connected. This corrects erroneous associations caused by unreliable overlap scores. Using a batch-mining strategy based solely on the overlap scores and a modified contrastive loss, our method trains without manual place labels and generalizes across diverse environments. Experiments on challenging benchmarks show substantial localization gains in large-scale environments while preserving computational and memory efficiency. Code is available at https://github.com/sontung/robust_scr.
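A minimal sketch of label-free contrastive training driven by pairwise overlap scores, assuming a simple margin loss; the paper's loss additionally gates positives by visual similarity and uses a dedicated batch-mining strategy, so the threshold and margin below are illustrative:

```python
import torch
import torch.nn.functional as F

def overlap_contrastive_loss(desc: torch.Tensor, overlap: torch.Tensor,
                             pos_thr: float = 0.3, margin: float = 0.5):
    """Pull together pairs with high covisibility overlap, push the rest
    beyond a margin; no manual place labels are required."""
    d = F.normalize(desc, dim=-1)
    dist = torch.cdist(d, d)                  # pairwise descriptor distances
    eye = torch.eye(len(d), device=d.device)
    pos = (overlap > pos_thr).float() * (1 - eye)   # positives from overlap only
    neg = (1 - pos) * (1 - eye)
    loss_pos = (pos * dist.pow(2)).sum() / pos.sum().clamp(min=1)
    loss_neg = (neg * F.relu(margin - dist).pow(2)).sum() / neg.sum().clamp(min=1)
    return loss_pos + loss_neg

desc = torch.randn(16, 128, requires_grad=True)  # global descriptors for a batch
overlap = torch.rand(16, 16)                     # covisibility overlap scores
loss = overlap_contrastive_loss(desc, overlap)
```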

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos

arXiv:2512.17229v1 Announce Type: new Abstract: Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which can also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending the model's context length, they may miss useful information or require considerable computation. In fact, answering a given question requires only a small amount of crucial information. Therefore, we propose an efficient question-aware memory mechanism that enables MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies this task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy is employed by introducing a few special memory tokens to achieve purposeful compression. This allows models to effectively seek critical clues while reducing visual tokens. Then, because the history context can have a significant impact, we recurrently aggregate and store these memory tokens to update the history context, which is reused for subsequent sub-segments. Furthermore, to more effectively measure a model's long-video understanding ability, we introduce GLVC (Grounding Long Video Clues), a long-video question-answering dataset that features grounding critical and concrete clues scattered throughout entire videos. Experimental results demonstrate that our method enables MLLMs with a limited context length of 32K to efficiently process 100K tokens (3,600 frames; an hour-long video sampled at 1 fps), requiring only 2 minutes and 37 GB of GPU memory. Evaluations across multiple long-video benchmarks show that our method more effectively seeks critical clues from massive amounts of information.
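A minimal sketch of the recurrent, question-aware compression loop, assuming a single cross-attention layer and a handful of learnable memory tokens; dimensions and the aggregation scheme are illustrative simplifications of the paper's mechanism:

```python
import torch
import torch.nn as nn

class QuestionAwareMemory(nn.Module):
    """A few memory tokens cross-attend to each sub-segment's visual tokens
    (conditioned on the question) and are carried over to the next one."""

    def __init__(self, d_model: int = 256, n_memory: int = 16):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(1, n_memory, d_model))
        self.attn = nn.MultiheadAttention(d_model, 4, batch_first=True)

    def forward(self, segments, question):
        # segments: list of (B, S, D) visual-token chunks; question: (B, Q, D)
        mem = self.memory.expand(question.size(0), -1, -1)
        for seg in segments:
            ctx = torch.cat([question, seg, mem], dim=1)  # question-aware context
            mem, _ = self.attn(mem, ctx, ctx)             # compress into memory
        return mem                                        # final critical clues

model = QuestionAwareMemory()
segs = [torch.randn(2, 128, 256) for _ in range(4)]       # 4 video sub-segments
clues = model(segs, torch.randn(2, 12, 256))
```

Because only the fixed-size memory is carried between sub-segments, the context consumed at any step stays bounded regardless of total video length.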

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images

arXiv:2512.17306v1 Announce Type: new Abstract: Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

arXiv:2512.17012v1 Announce Type: new Abstract: Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.
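A minimal sketch of feature-level distillation from a frozen expert, assuming a linear projector and an MSE objective; the stand-in expert, dimensions, and loss choice are illustrative, not the paper's P4D recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_student, d_expert = 768, 512
projector = nn.Linear(d_student, d_expert)   # maps student space to expert space

expert = nn.Linear(32, d_expert)             # stand-in for the frozen 4D expert
for p in expert.parameters():
    p.requires_grad_(False)                  # the expert is never updated

video_raw = torch.randn(2, 64, 32)           # toy per-token inputs to the expert
student_feats = torch.randn(2, 64, d_student, requires_grad=True)

with torch.no_grad():
    target = expert(video_raw)               # frozen expert's 4D representations
loss = F.mse_loss(projector(student_feats), target)
loss.backward()                              # gradients reach only student/projector
```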

Fonte: arXiv cs.CV