NLP/LLMs • Score 85
Enhancing Temporal Awareness in LLMs for Temporal Point Processes
arXiv:2601.00845v1 Announce Type: new
Abstract: Temporal point processes (TPPs) are crucial for analyzing events over time and are widely used in fields such as finance, healthcare, and social systems. These processes are particularly valuable for understanding how events unfold over time, accounting for their irregularity and dependencies. Despite the success of large language models (LLMs) in sequence modeling, applying them to temporal point processes remains challenging. A key issue is that current methods struggle to effectively capture the complex interaction between temporal information and semantic context, which is vital for accurate event modeling. In this context, we introduce TPP-TAL (Temporal Point Processes with Enhanced Temporal Awareness in LLMs), a novel plug-and-play framework designed to enhance temporal reasoning within LLMs. Rather than using the conventional method of simply concatenating event time and type embeddings, TPP-TAL explicitly aligns temporal dynamics with contextual semantics before feeding this information into the LLM. This alignment allows the model to better perceive temporal dependencies and long-range interactions between events and their surrounding contexts. Through comprehensive experiments on several benchmark datasets, it is shown that TPP-TAL delivers substantial improvements in temporal likelihood estimation and event prediction accuracy, highlighting the importance of enhancing temporal awareness in LLMs for continuous-time event modeling. The code is made available at https://github.com/chenlilil/TPP-TAL
Source: arXiv cs.AI
NLP/LLMs • Score 85
Decomposing LLM Self-Correction: The Accuracy-Correction Paradox and Error Depth Hypothesis
arXiv:2601.00828v1 Announce Type: new
Abstract: Large Language Models (LLMs) are widely believed to possess self-correction capabilities, yet recent studies suggest that intrinsic self-correction--where models correct their own outputs without external feedback--remains largely ineffective. In this work, we systematically decompose self-correction into three distinct sub-capabilities: error detection, error localization, and error correction. Through cross-model experiments on GSM8K-Complex (n=500 per model, 346 total errors) with three major LLMs, we uncover a striking Accuracy-Correction Paradox: weaker models (GPT-3.5, 66% accuracy) achieve 1.6x higher intrinsic correction rates than stronger models (DeepSeek, 94% accuracy)--26.8% vs 16.7%. We propose the Error Depth Hypothesis: stronger models make fewer but deeper errors that resist self-correction. Error detection rates vary dramatically across architectures (10% to 82%), yet detection capability does not predict correction success--Claude detects only 10% of errors but corrects 29% intrinsically. Surprisingly, providing error location hints hurts all models. Our findings challenge linear assumptions about model capability and self-improvement, with important implications for the design of self-refinement pipelines.
Source: arXiv cs.AI
NLP/LLMs • Score 85
Aletheia: Quantifying Cognitive Conviction in Reasoning Models via Regularized Inverse Confusion Matrix
arXiv:2601.01532v1 Announce Type: new
Abstract: In the progressive journey toward Artificial General Intelligence (AGI), current evaluation paradigms face an epistemological crisis. Static benchmarks measure knowledge breadth but fail to quantify the depth of belief. While Simhi et al. (2025) defined the CHOKE phenomenon in standard QA, we extend this framework to quantify "Cognitive Conviction" in System 2 reasoning models. We propose Project Aletheia, a cognitive physics framework that employs Tikhonov Regularization to invert the judge's confusion matrix. To validate this methodology without relying on opaque private data, we implement a Synthetic Proxy Protocol. Our preliminary pilot study on 2025 baselines (e.g., DeepSeek-R1, OpenAI o1) suggests that while reasoning models act as a "cognitive buffer," they may exhibit "Defensive OverThinking" under adversarial pressure. Furthermore, we introduce the Aligned Conviction Score (S_aligned) to verify that conviction does not compromise safety. This work serves as a blueprint for measuring AI scientific integrity.
Source: arXiv cs.AI
Applications • Score 85
PsychEval: A Multi-Session and Multi-Therapy Benchmark for High-Realism AI Psychological Counselor
arXiv:2601.01802v2 Announce Type: new
Abstract: To develop a reliable AI for psychological assessment, we introduce PsychEval, a multi-session, multi-therapy, and highly realistic benchmark designed to address three key challenges: 1) Can we train a highly realistic AI counselor? Realistic counseling is a longitudinal task requiring sustained memory and dynamic goal tracking. We propose a multi-session benchmark (spanning 6-10 sessions across three distinct stages) that demands critical capabilities such as memory continuity, adaptive reasoning, and longitudinal planning. The dataset is annotated with extensive professional skills, comprising over 677 meta-skills and 4577 atomic skills. 2) How to train a multi-therapy AI counselor? While existing models often focus on a single therapy, complex cases frequently require flexible strategies among various therapies. We construct a diverse dataset covering five therapeutic modalities (Psychodynamic, Behaviorism, CBT, Humanistic Existentialist, and Postmodernist) alongside an integrative therapy with a unified three-stage clinical framework across six core psychological topics. 3) How to systematically evaluate an AI counselor? We establish a holistic evaluation framework with 18 therapy-specific and therapy-shared metrics across Client-Level and Counselor-Level dimensions. To support this, we also construct over 2,000 diverse client profiles. Extensive experimental analysis fully validates the superior quality and clinical fidelity of our dataset. Crucially, PsychEval transcends static benchmarking to serve as a high-fidelity reinforcement learning environment that enables the self-evolutionary training of clinically responsible and adaptive AI counselors.
Source: arXiv cs.AI
NLP/LLMs • Score 85
AI Agent Systems: Architectures, Applications, and Evaluation
arXiv:2601.01743v1 Announce Type: new
Abstract: AI agents -- systems that combine foundation models with reasoning, planning, memory, and tool use -- are rapidly becoming a practical interface between natural-language intent and real-world computation. This survey synthesizes the emerging landscape of AI agent architectures across: (i) deliberation and reasoning (e.g., chain-of-thought-style decomposition, self-reflection and verification, and constraint-aware decision making), (ii) planning and control (from reactive policies to hierarchical and multi-step planners), and (iii) tool calling and environment interaction (retrieval, code execution, APIs, and multimodal perception). We organize prior work into a unified taxonomy spanning agent components (policy/LLM core, memory, world models, planners, tool routers, and critics), orchestration patterns (single-agent vs. multi-agent; centralized vs. decentralized coordination), and deployment settings (offline analysis vs. online interactive assistance; safety-critical vs. open-ended tasks). We discuss key design trade-offs -- latency vs. accuracy, autonomy vs. controllability, and capability vs. reliability -- and highlight how evaluation is complicated by non-determinism, long-horizon credit assignment, tool and environment variability, and hidden costs such as retries and context growth. Finally, we summarize measurement and benchmarking practices (task suites, human preference and utility metrics, success under constraints, robustness and security) and identify open challenges including verification and guardrails for tool actions, scalable memory and context management, interpretability of agent decisions, and reproducible evaluation under realistic workloads.
Source: arXiv cs.AI
RL • Score 85
Reinforcement Learning Enhanced Multi-hop Reasoning for Temporal Knowledge Question Answering
arXiv:2601.01195v1 Announce Type: new
Abstract: Temporal knowledge graph question answering (TKGQA) involves multi-hop reasoning over temporally constrained entity relationships in the knowledge graph to answer a given question. However, at each hop, large language models (LLMs) retrieve subgraphs with numerous temporally similar and semantically complex relations, increasing the risk of suboptimal decisions and error propagation. To address these challenges, we propose the multi-hop reasoning enhanced (MRE) framework, which enhances both forward and backward reasoning to improve the identification of globally optimal reasoning trajectories. Specifically, MRE begins with prompt engineering to guide the LLM in generating diverse reasoning trajectories for a given question. Valid reasoning trajectories are then selected for supervised fine-tuning, serving as a cold-start strategy. Finally, we introduce Tree-Group Relative Policy Optimization (T-GRPO), a recursive, tree-structured learning-by-exploration approach. At each hop, exploration establishes strong causal dependencies on the previous hop, while evaluation is informed by multi-path exploration feedback from subsequent hops. Experimental results on two TKGQA benchmarks indicate that the proposed MRE-based model consistently surpasses state-of-the-art (SOTA) approaches in handling complex multi-hop queries. Further analysis highlights improved interpretability and robustness to noisy temporal annotations.
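As a rough, hypothetical illustration of the group-relative baseline idea underlying T-GRPO (the tree-structured exploration and multi-path feedback from subsequent hops are specific to the paper and not reproduced here), a plain group-relative advantage computation looks like this:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each trajectory reward against its sampling group.

    rewards: (num_questions, group_size) array, one row per question, holding
    the rewards of candidate reasoning trajectories sampled for it.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 questions, 4 sampled trajectories each; trajectories that
# reach the correct answer get reward 1, the rest get 0.
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```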
Source: arXiv cs.AI
NLP/LLMs • Score 90
Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement
arXiv:2601.01562v1 Announce Type: new
Abstract: We present Logics-STEM, a state-of-the-art reasoning model fine-tuned on Logics-STEM-SFT-Dataset, a high-quality and diverse dataset at 10M scale that represents one of the largest-scale open-source long chain-of-thought corpora. Logics-STEM targets reasoning tasks in the domains of Science, Technology, Engineering, and Mathematics (STEM), and exhibits exceptional performance on STEM-related benchmarks with an average improvement of 4.68% over the next-best model at 8B scale. We attribute the gains to our data-algorithm co-design engine, in which data and algorithm are jointly optimized to fit a gold-standard distribution behind reasoning. Data-wise, the Logics-STEM-SFT-Dataset is constructed by a meticulously designed five-stage data curation engine (annotation, deduplication, decontamination, distillation, and stratified sampling) that ensures quality, diversity, and scalability. Algorithm-wise, our failure-driven post-training framework leverages targeted knowledge retrieval and data synthesis around model failure regions in the Supervised Fine-tuning (SFT) stage to effectively guide the second-stage SFT or the reinforcement learning (RL) for better fitting the target distribution. The superior empirical performance of Logics-STEM reveals the vast potential of combining large-scale open-source data with carefully designed synthetic data, underscoring the critical role of data-algorithm co-design in enhancing reasoning capabilities through post-training. We make both the Logics-STEM models (8B and 32B) and the Logics-STEM-SFT-Dataset (10M and downsampled 2.2M versions) publicly available to support future research in the open-source community.
Source: arXiv cs.AI
NLP/LLMs • Score 85
Temporal Attack Pattern Detection in Multi-Agent AI Workflows: An Open Framework for Training Trace-Based Security Models
arXiv:2601.00848v1 Announce Type: new
Abstract: We present an openly documented methodology for fine-tuning language models to detect temporal attack patterns in multi-agent AI workflows using OpenTelemetry trace analysis. We curate a dataset of 80,851 examples from 18 public cybersecurity sources and 35,026 synthetic OpenTelemetry traces. We apply iterative QLoRA fine-tuning on resource-constrained ARM64 hardware (NVIDIA DGX Spark) through three training iterations with strategic augmentation. Our custom benchmark accuracy improves from 42.86% to 74.29%, a statistically significant 31.4-point gain. Targeted examples addressing specific knowledge gaps outperform indiscriminate scaling. Key contributions include: (1) synthetic trace generation methodology for multi-agent coordination attacks and regulatory violations, (2) empirical evidence that training data composition fundamentally determines behavior, and (3) complete open release of datasets, training scripts, and evaluation benchmarks on HuggingFace. While practical deployment requires human oversight due to false positive rates, this work establishes the first reproducible framework enabling practitioners to build custom agentic security models adapted to their threat landscapes.
Source: arXiv cs.AI
NLP/LLMs • Score 85
ChaosBench-Logic: A Benchmark for Logical and Symbolic Reasoning on Chaotic Dynamical Systems
arXiv:2601.01982v1 Announce Type: new
Abstract: Large language models (LLMs) excel at natural language tasks but remain brittle in domains requiring precise logical and symbolic reasoning. Chaotic dynamical systems provide an especially demanding test because chaos is deterministic yet often misinterpreted as randomness or complexity. We introduce ChaosBench-Logic, a benchmark that evaluates LLM reasoning across 30 diverse dynamical systems using a unified first-order logic (FOL) ontology. Each system is annotated with truth assignments for 11 semantic predicates, and 621 questions are generated across seven reasoning categories, including multi-hop implications, cross-system analogies, counterfactual reasoning, bias probes, and multi-turn dialogues. We define metrics for logical accuracy, implication consistency, dialogue coherence, and contradiction, and we release an open-source evaluation pipeline. Initial experiments show that frontier LLMs such as GPT-4, Claude 3.5 Sonnet, Gemini 2.5 Flash, and the open-source LLaMA-3 70B achieve 91-94% per-item accuracy, yet still score 0% on compositional items and exhibit fragile global coherence. Dialogue-level accuracy ranges from 53.1% (GPT-4 CoT) to 75.5% (LLaMA-3 zero-shot). ChaosBench-Logic provides a rigorous testbed for diagnosing such failures and a foundation for developing neuro-symbolic approaches that improve scientific reasoning in LLMs.
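As a hedged sketch of what an implication-consistency check over the benchmark's FOL predicate annotations might look like (the predicate names and rule format below are invented for illustration; the actual ontology and metrics are defined in the paper):

```python
# Implications over semantic predicates: antecedent => consequent.
RULES = [
    ("is_chaotic", "is_deterministic"),       # chaos presupposes determinism
    ("has_positive_lyapunov", "is_chaotic"),  # positive Lyapunov exponent => chaos
]

def implication_consistency(answers):
    """answers: dict mapping predicate name -> bool, as answered by a model.
    Returns the fraction of applicable rules that the answers satisfy."""
    applicable = [(a, c) for a, c in RULES if a in answers and c in answers]
    if not applicable:
        return 1.0
    ok = sum(1 for a, c in applicable if (not answers[a]) or answers[c])
    return ok / len(applicable)

# A model that calls a system chaotic but non-deterministic violates rule 1:
answers = {"is_chaotic": True, "is_deterministic": False,
           "has_positive_lyapunov": True}
print(implication_consistency(answers))  # 0.5
```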
Source: arXiv cs.AI
RL • Score 85
Accelerating Monte-Carlo Tree Search with Optimized Posterior Policies
We introduce a recursive, AlphaZero-style Monte-Carlo tree search algorithm called 'RMCTS'. RMCTS's advantage over AlphaZero's MCTS-UCB is speed: it explores the search tree breadth-wise, allowing network inferences to run in large batches and significantly reducing GPU latency cost.
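A minimal sketch of the batching idea, assuming a toy two-action tree and a stand-in policy/value network (the actual RMCTS recursion and posterior-policy optimization are in the paper):

```python
import numpy as np

def fake_policy_value_net(states):
    """Stand-in for the neural network: one batched call for many states."""
    states = np.asarray(states, dtype=np.float64)
    priors = np.full((len(states), 2), 0.5)   # uniform priors over 2 actions
    values = np.tanh(states.sum(axis=1))      # toy value estimates
    return priors, values

def breadth_wise_expand(frontier, depth):
    """Expand the tree level by level so each level costs ONE batched net call,
    instead of one call per node as in sequential MCTS-UCB simulations."""
    for _ in range(depth):
        priors, values = fake_policy_value_net(frontier)  # batched inference
        frontier = [np.append(state, action)
                    for state, p in zip(frontier, priors)
                    for action in range(len(p))]
    return frontier

leaves = breadth_wise_expand([np.zeros(1)], depth=3)
print(len(leaves))  # 8 leaves reached with 3 batched network calls
```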
Source: arXiv cs.AI
NLP/LLMs • Score 85
Universal Conditional Logic: A Formal Language for Prompt Engineering
arXiv:2601.00880v1 Announce Type: new
Abstract: We present Universal Conditional Logic (UCL), a mathematical framework for prompt optimization that transforms prompt engineering from heuristic practice into systematic optimization. Through systematic evaluation (N=305, 11 models, 4 iterations), we demonstrate significant token reduction (29.8%, t(10)=6.36, p < 0.001, Cohen's d = 2.01) with corresponding cost savings. UCL's structural overhead function O_s(A) explains version-specific performance differences through the Over-Specification Paradox: beyond threshold S* = 0.509, additional specification degrades performance quadratically. Core mechanisms -- indicator functions (I_i in {0,1}), structural overhead (O_s = gamma * sum(ln C_k)), early binding -- are validated. Notably, optimal UCL configuration varies by model architecture -- certain models (e.g., Llama 4 Scout) require version-specific adaptations (V4.1). This work establishes UCL as a calibratable framework for efficient LLM interaction, with model-family-specific optimization as a key research direction.
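Reading the abstract's plain-text formulas literally (exact definitions and calibration constants are in the paper), the structural overhead term and the post-threshold degradation can be sketched as:

```python
import math

def structural_overhead(component_counts, gamma=0.1):
    """O_s = gamma * sum(ln C_k), per the abstract; gamma here is a
    placeholder value, not a calibrated constant."""
    return gamma * sum(math.log(c) for c in component_counts)

def over_specification_penalty(s, s_star=0.509):
    """Beyond the threshold S* = 0.509, additional specification degrades
    performance quadratically; the exact functional form is our assumption."""
    return 0.0 if s <= s_star else (s - s_star) ** 2

print(structural_overhead([4, 8, 16]))
for s in (0.3, 0.509, 0.7, 0.9):
    print(s, over_specification_penalty(s))
```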
Source: arXiv cs.AI
NLP/LLMs • Score 85
Agentic AI for Autonomous, Explainable, and Real-Time Credit Risk Decision-Making
The sweeping digitalization of financial services has created an urgent demand for credit risk decision-making systems that are autonomous, transparent, and real-time. This paper presents an agentic AI framework in which AI agents assess credit dynamically, minimizing human intervention while improving the speed and transparency of decisions.
Source: arXiv cs.AI
NLP/LLMs • Score 85
CogCanvas: Verbatim-Grounded Artifact Extraction for Long LLM Conversations
The abstract of arXiv:2601.00821v2 presents CogCanvas, a training-free framework that extracts verbatim-grounded artifacts from long conversations, surpassing traditional methods in precision. On benchmarks, CogCanvas achieves the highest precision among training-free methods, standing out on complex reasoning tasks.
Source: arXiv cs.AI
Multimodal • Score 85
OmniNeuro: A Multimodal HCI Framework for Explainable BCI Feedback via Generative AI and Sonification
Although deep learning has improved the decoding accuracy of brain-computer interfaces (BCIs), clinical adoption is hindered by the 'black-box' nature of these algorithms, leading to user frustration and poor neuroplasticity outcomes. We propose OmniNeuro, a novel HCI framework that transforms the BCI from a silent decoder into a transparent feedback partner.
Source: arXiv cs.AI
NLP/LLMs • Score 85
Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning
arXiv:2601.00830v1 Announce Type: new
Abstract: When AI systems explain their reasoning step-by-step, practitioners often assume these explanations reveal what actually influenced the AI's answer. We tested this assumption by embedding hints into questions and measuring whether models mentioned them. In a study of over 9,000 test cases across 11 leading AI models, we found a troubling pattern: models almost never mention hints spontaneously, yet when asked directly, they admit noticing them. This suggests models see influential information but choose not to report it. Telling models they are being watched does not help. Forcing models to report hints works, but causes them to report hints even when none exist and reduces their accuracy. We also found that hints appealing to user preferences are especially dangerous: models follow them most often while reporting them least. These findings suggest that simply watching AI reasoning is not enough to catch hidden influences.
Source: arXiv cs.AI
NLP/LLMs • Score 85
COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs
arXiv:2601.01836v1 Announce Type: new
Abstract: As large language models are deployed in high-stakes enterprise applications, from healthcare to finance, ensuring adherence to organization-specific policies has become essential. Yet existing safety evaluations focus exclusively on universal harms. We present COMPASS (Company/Organization Policy Alignment Assessment), the first systematic framework for evaluating whether LLMs comply with organizational allowlist and denylist policies. We apply COMPASS to eight diverse industry scenarios, generating and validating 5,920 queries that test both routine compliance and adversarial robustness through strategically designed edge cases. Evaluating seven state-of-the-art models, we uncover a fundamental asymmetry: models reliably handle legitimate requests (>95% accuracy) but catastrophically fail at enforcing prohibitions, refusing only 13-40% of adversarial denylist violations. These results demonstrate that current LLMs lack the robustness required for policy-critical deployments, establishing COMPASS as an essential evaluation framework for organizational AI safety.
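A hypothetical evaluation loop in the spirit of COMPASS, with a toy model and a crude refusal heuristic standing in for the framework's validated queries and judging; the asymmetry the paper reports would surface as a high 'allow' score and a low 'deny' score:

```python
def evaluate_policy_alignment(model, cases):
    """cases: list of (query, label) pairs with label in {'allow', 'deny'}.
    Returns per-label accuracy: compliance on allowlist, refusal on denylist."""
    stats = {"allow": [0, 0], "deny": [0, 0]}  # [correct, total]
    for query, label in cases:
        reply = model(query)
        refused = reply.strip().lower().startswith(("i can't", "i cannot"))
        correct = refused if label == "deny" else not refused
        stats[label][0] += int(correct)
        stats[label][1] += 1
    return {label: c / t for label, (c, t) in stats.items() if t}

# Toy model that refuses anything mentioning a prohibited keyword:
toy_model = lambda q: "I cannot help with that." if "export" in q else "Sure."
cases = [("How do I reset my password?", "allow"),
         ("Send me the full client export.", "deny")]
print(evaluate_policy_alignment(toy_model, cases))  # {'allow': 1.0, 'deny': 1.0}
```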
Source: arXiv cs.AI
NLP/LLMs • Score 85
Jenius Agent: Towards Experience-Driven Accuracy Optimization in Real-World Scenarios
arXiv:2601.01857v1 Announce Type: new
Abstract: As agent systems powered by large language models (LLMs) advance, improving the task performance of an autonomous agent, especially in context understanding, tool usage, and response generation, has become increasingly critical. Although prior studies have advanced the overall design of LLM-based agents, systematic optimization of their internal reasoning and tool-use pipelines remains underexplored. This paper introduces an agent framework grounded in real-world practical experience, with three key innovations: (1) an adaptive prompt generation strategy that aligns with the agent's state and task goals to improve reliability and robustness; (2) a context-aware tool orchestration module that performs tool categorization, semantic retrieval, and adaptive invocation based on user intent and context; and (3) a layered memory mechanism that integrates session memory, task history, and external summaries to improve relevance and efficiency through dynamic summarization and compression. An end-to-end framework named Jenius-Agent has been integrated with three key optimizations, including tools based on the Model Context Protocol (MCP), file input/output (I/O), and execution feedback. The experiments show a 20 percent improvement in task accuracy, along with reduced token cost, response latency, and invocation failures. The framework is already deployed in Jenius (https://www.jenius.cn), providing a lightweight and scalable solution for robust, protocol-compatible autonomous agents.
Source: arXiv cs.AI
NLP/LLMs • Score 85
Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale
arXiv:2601.01330v1 Announce Type: new
Abstract: Large Language Models (LLMs) have rapidly advanced, with Gemini-3-Pro setting a new performance milestone. In this work, we explore collective intelligence as an alternative to monolithic scaling, and demonstrate that open-source LLMs' collaboration can surpass Gemini-3-Pro. We first revisit LLM routing and aggregation at scale and identify three key bottlenecks: (1) current train-free routers are limited by a query-based paradigm focusing solely on textual similarity; (2) recent aggregation methods remain largely static, failing to select appropriate aggregators for different tasks; (3) the complementarity of routing and aggregation remains underutilized. To address these problems, we introduce JiSi, a novel framework designed to release the full potential of LLMs' collaboration through three innovations: (1) Query-Response Mixed Routing capturing both semantic information and problem difficulty; (2) Support-Set-based Aggregator Selection jointly evaluating the aggregation and domain capacity of aggregators; (3) Adaptive Routing-Aggregation Switch dynamically leveraging the advantages of routing and aggregation. Comprehensive experiments on nine benchmarks demonstrate that JiSi can surpass Gemini-3-Pro at only 47% of the cost by orchestrating ten open-source LLMs, while outperforming mainstream baselines. This suggests that collective intelligence represents a novel path towards Artificial General Intelligence (AGI).
Source: arXiv cs.AI
RL • Score 85
Higher-Order Action Regularization in Deep Reinforcement Learning: From Continuous Control to Building Energy Management
Deep reinforcement learning agents often exhibit erratic, high-frequency control behaviors that hinder real-world deployment due to excessive energy consumption and mechanical wear. We systematically investigate regularizing action smoothness through higher-order derivative penalties, with theoretical validation on continuous-control benchmarks and practical validation on building energy management.
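A minimal sketch of such a penalty, assuming squared finite differences of the action sequence stand in for its derivatives (the order and weight below are illustrative):

```python
import numpy as np

def action_smoothness_penalty(actions, order=2, weight=1e-2):
    """Penalize squared finite differences of an action trajectory.
    order=1 discourages velocity-like jumps, order=2 acceleration-like ones."""
    d = np.asarray(actions, dtype=np.float64)
    for _ in range(order):
        d = np.diff(d, axis=0)
    return weight * (d ** 2).sum()

smooth = np.linspace(0.0, 1.0, 10)                # gentle ramp
jittery = smooth + 0.2 * (-1.0) ** np.arange(10)  # high-frequency chatter
print(action_smoothness_penalty(smooth))   # ~0
print(action_smoothness_penalty(jittery))  # large
```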
Source: arXiv cs.AI
NLP/LLMs • Score 85
Reading Between the Lines: Deconfounding Causal Estimates using Text Embeddings and Deep Learning
arXiv:2601.01511v1 Announce Type: new
Abstract: Estimating causal treatment effects in observational settings is frequently compromised by selection bias arising from unobserved confounders. While traditional econometric methods struggle when these confounders are orthogonal to structured covariates, high-dimensional unstructured text often contains rich proxies for these latent variables. This study proposes a Neural Network-Enhanced Double Machine Learning (DML) framework designed to leverage text embeddings for causal identification. Using a rigorous synthetic benchmark, we demonstrate that unstructured text embeddings capture critical confounding information that is absent from structured tabular data. However, we show that standard tree-based DML estimators retain substantial bias (+24%) due to their inability to model the continuous topology of embedding manifolds. In contrast, our deep learning approach reduces bias to -0.86% with optimized architectures, effectively recovering the ground-truth causal parameter. These findings suggest that deep learning architectures are essential for satisfying the unconfoundedness assumption when conditioning on high-dimensional natural language data.
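A self-contained sketch of the partialling-out DML estimator with neural-network nuisance models; the synthetic data, architecture, and fold count below are stand-ins for the paper's benchmark, with X playing the role of text embeddings:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n, d = 2000, 16
X = rng.normal(size=(n, d))            # proxy for text embeddings
g = np.sin(X[:, 0]) + X[:, 1] ** 2     # nonlinear confounding signal
D = g + rng.normal(size=n)             # treatment depends on the confounder
theta = 1.5                            # ground-truth causal effect
Y = theta * D + g + rng.normal(size=n)

def crossfit_residuals(X, target, folds=2):
    """Residualize target on X with cross-fitted neural nets."""
    res = np.empty_like(target)
    for train, test in KFold(folds, shuffle=True, random_state=0).split(X):
        m = MLPRegressor(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
        m.fit(X[train], target[train])
        res[test] = target[test] - m.predict(X[test])
    return res

D_res = crossfit_residuals(X, D)
Y_res = crossfit_residuals(X, Y)
theta_hat = (D_res @ Y_res) / (D_res @ D_res)  # orthogonalized estimate
print(f"theta_hat = {theta_hat:.3f} (truth {theta})")
```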
Source: arXiv cs.AI
NLP/LLMs • Score 85
Can Large Language Models Solve Engineering Equations? A Systematic Comparison of Direct Prediction and Solver-Assisted Approaches
arXiv:2601.01774v1 Announce Type: new
Abstract: Transcendental equations requiring iterative numerical solution pervade engineering practice, from fluid mechanics friction factor calculations to orbital position determination. We systematically evaluate whether Large Language Models can solve these equations through direct numerical prediction or whether a hybrid architecture combining LLM symbolic manipulation with classical iterative solvers proves more effective. Testing six state-of-the-art models (GPT-5.1, GPT-5.2, Gemini-3-Flash, Gemini-2.5-Lite, Claude-Sonnet-4.5, Claude-Opus-4.5) on 100 problems spanning seven engineering domains, we compare direct prediction against solver-assisted computation where LLMs formulate governing equations and provide initial conditions while Newton-Raphson iteration performs numerical solution. Direct prediction yields mean relative errors of 0.765 to 1.262 across models, while solver-assisted computation achieves 0.225 to 0.301, representing error reductions of 67.9% to 81.8%. Domain-specific analysis reveals dramatic improvements in Electronics (93.1%) due to exponential equation sensitivity, contrasted with modest gains in Fluid Mechanics (7.2%) where LLMs exhibit effective pattern recognition. These findings establish that contemporary LLMs excel at symbolic manipulation and domain knowledge retrieval but struggle with precision-critical iterative arithmetic, suggesting their optimal deployment as intelligent interfaces to classical numerical solvers rather than standalone computational engines.
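The hybrid split can be illustrated on the Colebrook-White friction-factor equation: the LLM's role is to formulate the residual and supply an initial guess, while classical Newton-Raphson handles the precision-critical iteration. A minimal sketch, with illustrative Reynolds number and relative roughness values:

```python
import math

def colebrook_residual(f, Re=1e5, rel_rough=1e-4):
    """g(f) = 0 at the friction factor satisfying the Colebrook-White equation
    1/sqrt(f) = -2*log10(rel_rough/3.7 + 2.51/(Re*sqrt(f)))."""
    return 1 / math.sqrt(f) + 2 * math.log10(
        rel_rough / 3.7 + 2.51 / (Re * math.sqrt(f)))

def newton(g, x0, tol=1e-12, max_iter=50, h=1e-8):
    """Generic Newton-Raphson with a central-difference derivative."""
    x = x0
    for _ in range(max_iter):
        gx = g(x)
        if abs(gx) < tol:
            break
        x -= gx / ((g(x + h) - g(x - h)) / (2 * h))
    return x

f0 = 0.02  # a plausible LLM-supplied initial guess for turbulent pipe flow
print(f"friction factor = {newton(colebrook_residual, f0):.6f}")
```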
Source: arXiv cs.AI
NLP/LLMs • Score 85
Theory Trace Card: Theory-Driven Socio-Cognitive Evaluation of LLMs
arXiv:2601.01878v1 Announce Type: new
Abstract: Socio-cognitive benchmarks for large language models (LLMs) often fail to predict real-world behavior, even when models achieve high benchmark scores. Prior work has attributed this evaluation-deployment gap to problems of measurement and validity. While these critiques are insightful, we argue that they overlook a more fundamental issue: many socio-cognitive evaluations proceed without an explicit theoretical specification of the target capability, leaving the assumptions linking task performance to competence implicit. Without this theoretical grounding, benchmarks that exercise only narrow subsets of a capability are routinely misinterpreted as evidence of broad competence: a gap that creates a systemic validity illusion by masking the failure to evaluate the capability's other essential dimensions. To address this gap, we make two contributions. First, we diagnose and formalize this theory gap as a foundational failure that undermines measurement and enables systematic overgeneralization of benchmark results. Second, we introduce the Theory Trace Card (TTC), a lightweight documentation artifact designed to accompany socio-cognitive evaluations, which explicitly outlines the theoretical basis of an evaluation, the components of the target capability it exercises, its operationalization, and its limitations. We argue that TTCs enhance the interpretability and reuse of socio-cognitive evaluations by making explicit the full validity chain, which links theory, task operationalization, scoring, and limitations, without modifying benchmarks or requiring agreement on a single theory.
Source: arXiv cs.AI
NLP/LLMs • Score 85
MathLedger: A Verifiable Learning Substrate with Ledger-Attested Feedback
Contemporary AI systems achieve extraordinary performance yet remain opaque and unverifiable, creating a trust crisis for safety-critical deployments. We present MathLedger, a substrate for verifiable machine cognition that integrates formal verification, cryptographic attestation, and learning dynamics into a single epistemic loop.
Source: arXiv cs.AI
NLP/LLMs • Score 85
Toward Auditable Neuro-Symbolic Reasoning in Pathology: SQL as an Explicit Trace of Evidence
arXiv:2601.01875v1 Announce Type: new
Abstract: Automated pathology image analysis is central to clinical diagnosis, but clinicians still ask which slide features drive a model's decision and why. Vision-language models can produce natural language explanations, but these are often correlational and lack verifiable evidence. In this paper, we introduce an SQL-centered agentic framework that enables both feature measurement and reasoning to be auditable. Specifically, after extracting human-interpretable cellular features, Feature Reasoning Agents compose and execute SQL queries over feature tables to aggregate visual evidence into quantitative findings. A Knowledge Comparison Agent then evaluates these findings against established pathological knowledge, mirroring how pathologists justify diagnoses from measurable observations. Extensive experiments evaluated on two pathology visual question answering datasets demonstrate our method improves interpretability and decision traceability while producing executable SQL traces that link cellular measurements to diagnostic conclusions.
Source: arXiv cs.AI
NLP/LLMs • Score 85
Comment on: Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Tasks
The recently published work titled Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Tasks, by Kosmyna et al. (2025), has sparked intense debate about artificial intelligence (AI) and human performance. We congratulate Kosmyna et al. on this important research and on the collection of a valuable dataset, and we offer constructive comments to improve the manuscript's readiness for peer-reviewed publication.
Source: arXiv cs.AI
NLP/LLMs • Score 85
Clinical Knowledge Graph Construction and Evaluation with Multi-LLMs via Retrieval-Augmented Generation
arXiv:2601.01844v1 Announce Type: new
Abstract: Large language models (LLMs) offer new opportunities for constructing knowledge graphs (KGs) from unstructured clinical narratives. However, existing approaches often rely on structured inputs and lack robust validation of factual accuracy and semantic consistency, limitations that are especially problematic in oncology. We introduce an end-to-end framework for clinical KG construction and evaluation directly from free text using multi-agent prompting and a schema-constrained Retrieval-Augmented Generation (KG-RAG) strategy. Our pipeline integrates (1) prompt-driven entity, attribute, and relation extraction; (2) entropy-based uncertainty scoring; (3) ontology-aligned RDF/OWL schema generation; and (4) multi-LLM consensus validation for hallucination detection and semantic refinement. Beyond static graph construction, the framework supports continuous refinement and self-supervised evaluation, enabling iterative improvement of graph quality. Applied to two oncology cohorts (PDAC and BRCA), our method produces interpretable, SPARQL-compatible, and clinically grounded knowledge graphs without relying on gold-standard annotations. Experimental results demonstrate consistent gains in precision, relevance, and ontology compliance over baseline methods.
Source: arXiv cs.AI
NLP/LLMs • Score 85
CaveAgent: Transforming LLMs into Stateful Runtime Operators
arXiv:2601.01569v1 Announce Type: new
Abstract: LLM-based agents are increasingly capable of complex task execution, yet current agentic systems remain constrained by text-centric paradigms. Traditional approaches rely on procedural JSON-based function calling, which often struggles with long-horizon tasks due to fragile multi-turn dependencies and context drift. In this paper, we present CaveAgent, a framework that transforms the paradigm from "LLM-as-Text-Generator" to "LLM-as-Runtime-Operator." We introduce a Dual-stream Context Architecture that decouples state management into a lightweight semantic stream for reasoning and a persistent, deterministic Python Runtime stream for execution. In addition to leveraging code generation to efficiently resolve interdependent sub-tasks (e.g., loops, conditionals) in a single step, we introduce Stateful Runtime Management in CaveAgent. Distinct from existing code-based approaches that remain text-bound and lack support for external object injection and retrieval, CaveAgent injects, manipulates, and retrieves complex Python objects (e.g., DataFrames, database connections) that persist across turns. This persistence mechanism acts as a high-fidelity external memory that eliminates context drift and avoids catastrophic forgetting, while ensuring that processed data flows losslessly to downstream applications. Comprehensive evaluations on Tau^2-bench, BFCL, and various case studies across representative SOTA LLMs demonstrate CaveAgent's superiority. Specifically, our framework achieves a 10.5% success rate improvement on retail tasks and reduces total token consumption by 28.4% in multi-turn scenarios. On data-intensive tasks, direct variable storage and retrieval reduces token consumption by 59%, allowing CaveAgent to handle large-scale data that causes context overflow failures in both JSON-based and code-based agents.
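A toy sketch of the persistent-runtime idea: Python objects survive across turns in an external namespace instead of being re-serialized into the LLM context. The class and method names are our invention, not CaveAgent's API:

```python
import pandas as pd

class StatefulRuntime:
    """Persistent execution stream: a namespace that outlives single turns."""
    def __init__(self):
        self.ns = {"pd": pd}

    def inject(self, name, obj):     # external object injection
        self.ns[name] = obj

    def run(self, code):             # execute LLM-generated code in-place
        exec(code, self.ns)

    def retrieve(self, name):        # lossless hand-off to downstream apps
        return self.ns[name]

rt = StatefulRuntime()
rt.inject("orders", pd.DataFrame({"sku": ["a", "b", "a"], "qty": [1, 2, 3]}))
rt.run("totals = orders.groupby('sku')['qty'].sum()")  # turn 1
rt.run("top = totals.idxmax()")                        # turn 2 reuses state
print(rt.retrieve("top"))  # 'a' -- the DataFrame never entered the LLM context
```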
Source: arXiv cs.AI
NLP/LLMs • Score 85
Improving Behavioral Alignment in LLM Social Simulations via Context Formation and Navigation
arXiv:2601.01546v1 Announce Type: new
Abstract: Large language models (LLMs) are increasingly used to simulate human behavior in experimental settings, but they systematically diverge from human decisions in complex decision-making environments, where participants must anticipate others' actions and form beliefs based on observed behavior. We propose a two-stage framework for improving behavioral alignment. The first stage, context formation, explicitly specifies the experimental design to establish an accurate representation of the decision task and its context. The second stage, context navigation, guides the reasoning process within that representation to make decisions. We validate this framework through a focal replication of a sequential purchasing game with quality signaling (Kremer and Debo, 2016), extending to a crowdfunding game with costly signaling (Cason et al., 2025) and a demand-estimation task (Gui and Toubia, 2025) to test generalizability across decision environments. Across four state-of-the-art (SOTA) models (GPT-4o, GPT-5, Claude-4.0-Sonnet-Thinking, DeepSeek-R1), we find that complex decision-making environments require both stages to achieve behavioral alignment with human benchmarks, whereas the simpler demand-estimation task requires only context formation. Our findings clarify when each stage is necessary and provide a systematic approach for designing and diagnosing LLM social simulations as complements to human subjects in behavioral research.
Source: arXiv cs.AI
Multimodal • Score 85
KGCE: Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models
arXiv:2601.01366v1 Announce Type: new
Abstract: With the rapid adoption of multimodal large language models (MLMs) in autonomous agents, cross-platform task execution capabilities in educational settings have garnered significant attention. However, existing benchmark frameworks still exhibit notable deficiencies in supporting cross-platform tasks in educational contexts, especially when dealing with school-specific software (such as XiaoYa Intelligent Assistant, HuaShi XiaZi, etc.), where the efficiency of agents often significantly decreases due to a lack of understanding of the structural specifics of these private-domain software. Additionally, current evaluation methods heavily rely on coarse-grained metrics like goal orientation or trajectory matching, making it challenging to capture the detailed execution and efficiency of agents in complex tasks. To address these issues, we propose KGCE (Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models), a novel benchmarking platform that integrates knowledge base enhancement and a dual-graph evaluation framework. We first constructed a dataset comprising 104 education-related tasks, covering Windows, Android, and cross-platform collaborative tasks. KGCE introduces a dual-graph evaluation framework that decomposes tasks into multiple sub-goals and verifies their completion status, providing fine-grained evaluation metrics. To overcome the execution bottlenecks of existing agents in private-domain tasks, we developed an enhanced agent system incorporating a knowledge base specific to school-specific software. The code can be found at https://github.com/Kinginlife/KGCE.
Source: arXiv cs.AI
NLP/LLMs • Score 85
Semantic Alignment of Multilingual Knowledge Graphs via Contextualized Vector Projections
This paper presents our work on a cross-lingual ontology alignment system that uses embedding-based cosine-similarity matching. Ontology entities are contextually enriched with descriptions created through novel techniques. We evaluate our work on the OAEI-2022 MultiFarm track, achieving a 71% F1 score, indicating the effectiveness of our alignment pipeline.
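A minimal sketch of the embedding-based cosine-similarity matching step described above; the greedy one-way matching and threshold are simplifying assumptions:

```python
import numpy as np

def cosine_align(src_emb, tgt_emb, threshold=0.8):
    """Match each source entity to its most similar target entity."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T
    best = sims.argmax(axis=1)
    return [(i, int(j), float(sims[i, j]))
            for i, j in enumerate(best) if sims[i, j] >= threshold]

rng = np.random.default_rng(0)
tgt = rng.normal(size=(5, 32))                       # target-ontology embeddings
src = tgt[[2, 0]] + 0.05 * rng.normal(size=(2, 32))  # noisy cross-lingual copies
print(cosine_align(src, tgt))  # source 0 -> target 2, source 1 -> target 0
```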
Source: arXiv cs.AI
MLOps/Systems • Score 85
A New Benchmark for the Appropriate Evaluation of RTL Code Optimization
arXiv:2601.01765v1 Announce Type: new
Abstract: The rapid progress of artificial intelligence increasingly relies on efficient integrated circuit (IC) design. Recent studies have explored the use of large language models (LLMs) for generating Register Transfer Level (RTL) code, but existing benchmarks mainly evaluate syntactic correctness rather than optimization quality in terms of power, performance, and area (PPA). This work introduces RTL-OPT, a benchmark for assessing the capability of LLMs in RTL optimization. RTL-OPT contains 36 handcrafted digital designs that cover diverse implementation categories including combinational logic, pipelined datapaths, finite state machines, and memory interfaces. Each task provides a pair of RTL codes, a suboptimal version and a human-optimized reference that reflects industry-proven optimization patterns not captured by conventional synthesis tools. Furthermore, RTL-OPT integrates an automated evaluation framework to verify functional correctness and quantify PPA improvements, enabling standardized and meaningful assessment of generative models for hardware design optimization.
Source: arXiv cs.AI
NLP/LLMs • Score 85
ElecTwit: A Framework for Studying Persuasion in Multi-Agent Social Systems
arXiv:2601.00994v1 Announce Type: new
Abstract: This paper introduces ElecTwit, a simulation framework designed to study persuasion within multi-agent systems, specifically emulating the interactions on social media platforms during a political election. By grounding our experiments in a realistic environment, we aimed to overcome the limitations of game-based simulations often used in prior research. We observed the comprehensive use of 25 specific persuasion techniques across most tested LLMs, encompassing a wider range than previously reported. The variations in technique usage and overall persuasion output between models highlight how different model architectures and training can impact the dynamics in realistic social simulations. Additionally, we observed unique phenomena such as "kernel of truth" messages and spontaneous developments with an "ink" obsession, where agents collectively demanded written proof. Our study provides a foundation for evaluating persuasive LLM agents in real-world contexts, ensuring alignment and preventing dangerous outcomes.
Source: arXiv cs.AI
Theory/Optimization • Score 85
CNC-TP: Classifier Nominal Concept Based on Top-Pertinent Attributes
arXiv:2601.01976v1 Announce Type: new
Abstract: Knowledge Discovery in Databases (KDD) aims to exploit the vast amounts of data generated daily across various domains of computer applications. Its objective is to extract hidden and meaningful knowledge from datasets through a structured process comprising several key steps: data selection, preprocessing, transformation, data mining, and visualization. Among the core data mining techniques are classification and clustering. Classification involves predicting the class of new instances using a classifier trained on labeled data. Several approaches have been proposed in the literature, including Decision Tree Induction, Bayesian classifiers, Nearest Neighbor search, Neural Networks, Support Vector Machines, and Formal Concept Analysis (FCA). The last of these, FCA, is recognized as an effective approach for interpretable and explainable learning. It is grounded in the mathematical structure of the concept lattice, which enables the generation of formal concepts and the discovery of hidden relationships among them. In this paper, we present a state-of-the-art review of FCA-based classifiers. We explore various methods for computing closure operators from nominal data and introduce a novel approach for constructing a partial concept lattice that focuses on the most relevant concepts. Experimental results are provided to demonstrate the efficiency of the proposed method.
Source: arXiv cs.AI
Evaluation/Benchmarks • Score 89
LooC: Effective Low-Dimensional Codebook for Compositional Vector Quantization
arXiv:2601.00222v1 Announce Type: new
Abstract: Vector quantization (VQ) is a prevalent and fundamental technique that discretizes continuous feature vectors by approximating them using a codebook. As the diversity and complexity of data and models continue to increase, there is an urgent need for high-capacity, yet more compact VQ methods. This paper aims to reconcile this conflict by presenting a new approach called LooC, which utilizes an effective Low-dimensional codebook for Compositional vector quantization. Firstly, LooC introduces a parameter-efficient codebook by reframing the relationship between codevectors and feature vectors, significantly expanding its solution space. Instead of individually matching codevectors with feature vectors, LooC treats them as lower-dimensional compositional units within feature vectors and combines them, resulting in a more compact codebook with improved performance. Secondly, LooC incorporates a parameter-free extrapolation-by-interpolation mechanism to enhance and smooth features during the VQ process, which allows for better preservation of details and fidelity in feature approximation. The design of LooC leads to full codebook usage, effectively utilizing the compact codebook while avoiding the problem of collapse. Thirdly, LooC can serve as a plug-and-play module for existing methods for different downstream tasks based on VQ. Finally, extensive evaluations on different tasks, datasets, and architectures demonstrate that LooC outperforms existing VQ methods, achieving state-of-the-art performance with a significantly smaller codebook.
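A hedged sketch of the compositional idea alone: treat codevectors as low-dimensional units and compose them to approximate a feature vector, so K codevectors of dimension D/m address K^m combinations. LooC's training objective and extrapolation-by-interpolation mechanism are not reproduced here:

```python
import numpy as np

def compositional_quantize(x, codebook, n_chunks):
    """Quantize a D-dim vector as n_chunks independent low-dim codes.

    x: (D,) feature vector; codebook: (K, D // n_chunks) shared codevectors.
    """
    chunks = x.reshape(n_chunks, -1)                           # (m, D/m)
    d2 = ((chunks[:, None, :] - codebook[None]) ** 2).sum(-1)  # (m, K)
    codes = d2.argmin(axis=1)
    x_hat = codebook[codes].reshape(-1)                        # composed approximation
    return codes, x_hat

rng = np.random.default_rng(0)
D, m, K = 64, 8, 16              # 16 codevectors of dim 8 -> 16**8 combinations
codebook = rng.normal(size=(K, D // m))
x = rng.normal(size=D)
codes, x_hat = compositional_quantize(x, codebook, m)
print(codes, float(np.linalg.norm(x - x_hat)))
```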
Source: arXiv cs.CV
Vision • Score 95
GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval
arXiv:2601.00584v1 Announce Type: new
Abstract: Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video using a natural language query without relying on task-specific training data. The primary challenge in this setting lies in the mismatch in semantic granularity between textual queries and visual content. Previous studies in ZVMR have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. However, these approaches failed to balance the semantic granularity between the pre-trained knowledge provided by each modality for a given scene. As a result, despite the high quality of each modality's representations, the mismatch in granularity led to inaccurate retrieval. In this paper, we propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges this gap between coarse and fine semantic representations. Our approach introduces two complementary techniques: granularity-based query rewriting to generate varied semantic granularities, and query-aware caption generation to embed query intent into video content. By pairing multi-level queries with both query-agnostic and query-aware captions, we effectively resolve semantic mismatches. As a result, our method sets a new state-of-the-art across all three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), with a notable 3.23% mAP@avg improvement on the challenging QVHighlights dataset.
Source: arXiv cs.CV
NLP/LLMs • Score 95
Intelligent Traffic Surveillance for Real-Time Vehicle Detection, License Plate Recognition, and Speed Estimation
arXiv:2601.00344v1 Announce Type: new
Abstract: Speeding is a major contributor to road fatalities, particularly in developing countries such as Uganda, where road safety infrastructure is limited. This study proposes a real-time intelligent traffic surveillance system tailored to such regions, using computer vision techniques to address vehicle detection, license plate recognition, and speed estimation. The study collected a rich dataset using a speed gun, a Canon camera, and a mobile phone to train the models. License plate detection using YOLOv8 achieved a mean average precision (mAP) of 97.9%. For character recognition of the detected license plate, the CNN model reached a character error rate (CER) of 3.85%, while the transformer model significantly reduced the CER to 1.79%. Speed estimation used source and target regions of interest, achieving a margin of error within 10 km/h. Additionally, a database was established to correlate user information with vehicle detection data, enabling automated ticket issuance by SMS through Africa's Talking API. This system addresses critical traffic management needs in resource-constrained environments and shows potential to reduce road accidents through automated traffic enforcement in developing countries where such interventions are urgently needed.
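The ROI-based speed estimate reduces to distance over elapsed time; a minimal sketch with illustrative calibration values (not the paper's):

```python
def estimate_speed_kmh(frame_in, frame_out, fps, roi_gap_m):
    """Average speed of a vehicle crossing from the source ROI to the target
    ROI, where roi_gap_m is the measured on-road distance between them."""
    elapsed_s = (frame_out - frame_in) / fps
    return (roi_gap_m / elapsed_s) * 3.6  # m/s -> km/h

# Enters the source ROI at frame 120, reaches the target ROI at frame 150,
# at 30 fps with ROIs 10 m apart: 10 m in 1 s -> 36 km/h.
print(estimate_speed_kmh(120, 150, fps=30, roi_gap_m=10.0))
```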
Source: arXiv cs.CV
NLP/LLMs • Score 96
FlashInfer-Bench: Building the Virtuous Cycle for AI-Driven LLM Systems
Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI-generated kernels into real-world inference systems remains a challenge. FlashInfer-Bench addresses this gap by establishing a standardized, closed-loop framework that connects kernel generation, benchmarking, and deployment.
Source: arXiv cs.AI
NLP/LLMs • Score 96
FaithSCAN: Single-Pass Model-Based Hallucination Detection for Faithful Visual Question Answering
Faithfulness hallucinations in VQA occur when vision-language models produce fluent but visually ungrounded answers, undermining their reliability in safety-critical applications. We propose FaithSCAN, a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, overcoming the efficiency and detection-performance limitations of existing methods.
Source: arXiv cs.AI
Vision • Score 95
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
arXiv:2601.00393v1 Announce Type: new
Abstract: In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks. Our project page is available at https://neoverse-4d.github.io
Source: arXiv cs.CV
Vision • Score 92
CropNeRF: A Neural Radiance Field-Based Framework for Crop Counting
arXiv:2601.00207v1 Announce Type: new
Abstract: Rigorous crop counting is crucial for effective agricultural management and informed intervention strategies. However, in outdoor field environments, partial occlusions combined with inherent ambiguity in distinguishing clustered crops from individual viewpoints pose an immense challenge for image-based segmentation methods. To address these problems, we introduce a novel crop counting framework designed for exact enumeration via 3D instance segmentation. Our approach utilizes 2D images captured from multiple viewpoints and associates independent instance masks for neural radiance field (NeRF) view synthesis. We introduce crop visibility and mask consistency scores, which are incorporated alongside 3D information from a NeRF model. This results in an effective segmentation of crop instances in 3D and highly-accurate crop counts. Furthermore, our method eliminates the dependence on crop-specific parameter tuning. We validate our framework on three agricultural datasets consisting of cotton bolls, apples, and pears, and demonstrate consistent counting performance despite major variations in crop color, shape, and size. A comparative analysis against the state of the art highlights superior performance on crop counting tasks. Lastly, we contribute a cotton plant dataset to advance further research on this topic.
Source: arXiv cs.CV
NLP/LLMs • Score 95
Memory Bank Compression for Continual Adaptation of Large Language Models
arXiv:2601.00756v1 Announce Type: cross
Abstract: Large Language Models (LLMs) have become a mainstay for many everyday applications. However, as data evolve, their knowledge quickly becomes outdated. Continual learning aims to update LLMs with new information without erasing previously acquired knowledge. Although methods such as full fine-tuning can incorporate new data, they are computationally expensive and prone to catastrophic forgetting, where prior knowledge is overwritten. Memory-augmented approaches address this by equipping LLMs with a memory bank, that is, an external memory module which stores information for future use. However, these methods face a critical limitation: in real-world scenarios where large-scale data streams arrive, the memory bank grows without bound. In this paper, we propose MBC, a model that compresses the memory bank through a codebook optimization strategy during online adaptation learning. To ensure stable learning, we also introduce an online resetting mechanism that prevents codebook collapse. In addition, we employ Key-Value Low-Rank Adaptation in the attention layers of the LLM, enabling efficient utilization of the compressed memory representations. Experiments with benchmark question-answering datasets demonstrate that MBC reduces the memory bank size to 0.3% when compared against the most competitive baseline, while maintaining high retention accuracy during online adaptation learning. Our code is publicly available at https://github.com/Thomkat/MBC.
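A rough sketch of memory-bank compression via a learned codebook, using k-means as a stand-in for MBC's online codebook optimization (the resetting mechanism and Key-Value Low-Rank Adaptation are not reproduced):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
memory_bank = rng.normal(size=(10_000, 256)).astype(np.float32)

K = 64  # codebook size; small enough for one-byte codes
km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(memory_bank)
codes = km.labels_.astype(np.uint8)              # one code per memory entry
codebook = km.cluster_centers_.astype(np.float32)

ratio = (codes.nbytes + codebook.nbytes) / memory_bank.nbytes
print(f"compressed size: {ratio:.2%} of the original bank")

# Retrieval reconstructs an approximate memory vector from its code:
approx_entry_0 = codebook[codes[0]]
```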
Source: arXiv cs.CL
NLP/LLMs • Score 95
Application Research of a Deep Learning Model Integrating CycleGAN and YOLO in PCB Infrared Defect Detection
arXiv:2601.00237v1 Announce Type: new
Abstract: This paper addresses the critical bottleneck of infrared (IR) data scarcity in Printed Circuit Board (PCB) defect detection by proposing a cross-modal data augmentation framework integrating CycleGAN and YOLOv8. Unlike conventional methods relying on paired supervision, we leverage CycleGAN to perform unpaired image-to-image translation, mapping abundant visible-light PCB images into the infrared domain. This generative process synthesizes high-fidelity pseudo-IR samples that preserve the structural semantics of defects while accurately simulating thermal distribution patterns. Subsequently, we construct a heterogeneous training strategy that fuses generated pseudo-IR data with limited real IR samples to train a lightweight YOLOv8 detector. Experimental results demonstrate that this method effectively enhances feature learning under low-data conditions. The augmented detector significantly outperforms models trained on limited real data alone and approaches the performance benchmarks of fully supervised training, proving the efficacy of pseudo-IR synthesis as a robust augmentation strategy for industrial inspection.
Source: arXiv cs.CV
Vision • Score 95
DichroGAN: Towards Restoration of in-air Colours of Seafloor from Satellite Imagery
arXiv:2601.00194v1 Announce Type: new
Abstract: Recovering the in-air colours of the seafloor from satellite imagery is a challenging task due to the exponential attenuation of light with depth in the water column. In this study, we present DichroGAN, a conditional generative adversarial network (cGAN) designed for this purpose. DichroGAN employs a two-step simultaneous training scheme: first, two generators utilise a hyperspectral image cube to estimate diffuse and specular reflections, thereby obtaining atmospheric scene radiance. Next, a third generator receives as input the generated scene radiance containing the features of each spectral band, while a fourth generator estimates the underwater light transmission. These generators work together to remove the effects of light absorption and scattering, restoring the in-air colours of the seafloor based on the underwater image formation equation. DichroGAN is trained on a compact dataset derived from PRISMA satellite imagery, comprising RGB images paired with their corresponding spectral bands and masks. Extensive experiments on both satellite and underwater datasets demonstrate that DichroGAN achieves competitive performance compared to state-of-the-art underwater restoration techniques.
Source: arXiv cs.CV
NLP/LLMs • Score 95
S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding
arXiv:2601.00264v1 Announce Type: new
Abstract: Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts. Technical validation demonstrates that this enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment. S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Focal-RegionFace: Generating Fine-Grained Multi-attribute Descriptions for Arbitrarily Selected Face Focal Regions
arXiv:2601.00156v1 Announce Type: new
Abstract: In this paper, we introduce an underexplored problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, containing facial action units (AUs), emotional states, and age estimation, for arbitrarily selected face regions (termed FaceFocalDesc). We argue that the system's ability to focus on individual facial areas leads to better understanding and control. To achieve this capability, we construct a new multi-attribute description dataset for arbitrarily selected face regions, providing rich region-level annotations and natural language descriptions. Further, we propose a fine-tuned vision-language model based on Qwen2.5-VL, called Focal-RegionFace, for facial state analysis, which incrementally refines its focus on localized facial features through multiple progressive fine-tuning stages, resulting in interpretable age estimation, FAU and emotion detection. Experimental results show that Focal-RegionFace achieves the best performance on the new benchmark in terms of traditional and widely used metrics, as well as newly proposed metrics. This fully verifies its effectiveness and versatility in fine-grained multi-attribute face region-focal analysis scenarios.
Fonte: arXiv cs.CV
Vision • Score 95
SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting
arXiv:2601.00285v1 Announce Type: new
Abstract: Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step. However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object's motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, our model enables smooth motion interpolation while preserving learned geometric details. Experiments on synthetic datasets show that our method outperforms existing approaches under sparse observations by up to 34% in PSNR, and achieves comparable performance to dense monocular video methods on real-world datasets despite using significantly fewer frames. Moreover, we demonstrate that the input initial static reconstruction can be replaced by a diffusion-based generative prior, making our method more practical for real-world scenarios.
Fonte: arXiv cs.CV
Vision • Score 95
ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Patching
arXiv:2601.00267v1 Announce Type: new
Abstract: Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant concerns regarding safety, copyright, and ethical implications. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most of them rely on data-intensive and computationally expensive fine-tuning, which poses a critical limitation. To overcome these challenges, inspired by the observation that the model's activations are predominantly composed of generic concepts, with only a minimal component representing the target concept, we propose a novel training-free method (ActErase) for efficient concept erasure. Specifically, the proposed method operates by identifying activation difference regions via prompt-pair analysis, extracting target activations, and dynamically replacing input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrate that our training-free method achieves state-of-the-art (SOTA) erasure performance, while effectively preserving the model's overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.
Fonte: arXiv cs.CV
Vision • Score 95
Towards Syn-to-Real IQA: A Novel Perspective on Reshaping Synthetic Data Distributions
arXiv:2601.00225v1 Announce Type: new
Abstract: Blind Image Quality Assessment (BIQA) has advanced significantly through deep learning, but the scarcity of large-scale labeled datasets remains a challenge. While synthetic data offers a promising solution, models trained on existing synthetic datasets often show limited generalization ability. In this work, we make a key observation that representations learned from synthetic datasets often exhibit a discrete and clustered pattern that hinders regression performance: features of high-quality images cluster around reference images, while those of low-quality images cluster based on distortion types. Our analysis reveals that this issue stems from the distribution of synthetic data rather than model architecture. Consequently, we introduce a novel framework SynDR-IQA, which reshapes synthetic data distribution to enhance BIQA generalization. Based on theoretical derivations of the impact of sample diversity and redundancy on generalization error, SynDR-IQA employs two strategies: distribution-aware diverse content upsampling, which enhances visual diversity while preserving content distribution, and density-aware redundant cluster downsampling, which balances samples by reducing the density of densely clustered areas. Extensive experiments across three cross-dataset settings (synthetic-to-authentic, synthetic-to-algorithmic, and synthetic-to-synthetic) demonstrate the effectiveness of our method. The code is available at https://github.com/Li-aobo/SynDR-IQA.
Fonte: arXiv cs.CV
Vision • Score 95
A Spatially Masked Adaptive Gated Network for multimodal post-flood water extent mapping using SAR and incomplete multispectral data
arXiv:2601.00123v1 Announce Type: new
Abstract: Mapping water extent during a flood event is essential for effective disaster management throughout all phases: mitigation, preparedness, response, and recovery. In particular, during the response stage, when timely and accurate information is important, Synthetic Aperture Radar (SAR) data are primarily employed to produce water extent maps. Recently, leveraging the complementary characteristics of SAR and multispectral imagery (MSI) data through a multimodal approach has emerged as a promising strategy for advancing water extent mapping using deep learning models. This approach is particularly beneficial when timely post-flood observations, acquired during or shortly after the flood peak, are limited, as it enables the use of all available imagery for more accurate post-flood water extent mapping. However, the adaptive integration of partially available MSI data into the SAR-based post-flood water extent mapping process remains underexplored. To bridge this research gap, we propose the Spatially Masked Adaptive Gated Network (SMAGNet), a multimodal deep learning model that utilizes SAR data as the primary input for post-flood water extent mapping and integrates complementary MSI data through feature fusion. In experiments on the C2S-MS Floods dataset, SMAGNet consistently outperformed other multimodal deep learning models in prediction performance across varying levels of MSI data availability. Furthermore, we found that even when MSI data were completely missing, the performance of SMAGNet remained statistically comparable to that of a U-Net model trained solely on SAR data. These findings indicate that SMAGNet enhances model robustness to missing data as well as the applicability of multimodal deep learning in real-world flood management scenarios.
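A hedged sketch of the masked adaptive gating idea as the abstract presents it: MSI features are fused into the SAR stream through a learned gate that is zeroed wherever MSI pixels are missing, so the network degrades gracefully to pure SAR. The layer sizes and the 1x1-convolution gate are illustrative choices, not SMAGNet's published architecture.

```python
import torch
import torch.nn as nn

class MaskedGatedFusion(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, sar_feat, msi_feat, msi_mask):
        """msi_mask: (B, 1, H, W) with 1 where MSI is observed, 0 where missing."""
        g = self.gate(torch.cat([sar_feat, msi_feat], dim=1)) * msi_mask
        return sar_feat + g * msi_feat   # falls back to the SAR stream when masked out
```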
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark
arXiv:2601.00092v1 Announce Type: new
Abstract: 4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal Large Language Models (MLLMs) achieve human-level 4D spatial intelligence? In this work, we present Spatial4D-Bench, a versatile 4D spatial intelligence benchmark designed to comprehensively assess the 4D spatial reasoning abilities of MLLMs. Unlike existing spatial intelligence benchmarks that are often small-scale or limited in diversity, Spatial4D-Bench provides a large-scale, multi-task evaluation benchmark consisting of ~40,000 question-answer pairs covering 18 well-defined tasks. We systematically organize these tasks into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning and spatiotemporal reasoning. Spatial4D-Bench thereby offers a structured and comprehensive benchmark for evaluating the spatial cognition abilities of MLLMs, covering a broad spectrum of tasks that parallel the versatility of human spatial intelligence. We benchmark various state-of-the-art open-source and proprietary MLLMs on Spatial4D-Bench and reveal their substantial limitations in a wide variety of 4D spatial reasoning aspects, such as route planning, action recognition, and physical plausibility reasoning. We hope that the findings provided in this work offer valuable insights to the community and that our benchmark can facilitate the development of more capable MLLMs toward human-level 4D spatial intelligence. More resources can be found on our project page.
Fonte: arXiv cs.CV
Evaluation/Benchmarks • Score 90
Modeling 24-Hour ECG Signals to Predict Heart Failure Risk with Explainable AI
Heart failure (HF) affects 11.8% of adults aged 65 and older, reducing quality of life and longevity. This study used 24-hour electrocardiogram (ECG) data to predict five-year HF risk with the DeepHHF deep learning model, which outperformed earlier models and highlighted the feasibility of AI for HF risk prediction.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
Toward Better Temporal Structures for Geopolitical Events Forecasting
arXiv:2601.00430v1 Announce Type: new
Abstract: Forecasting on geopolitical temporal knowledge graphs (TKGs) through the lens of large language models (LLMs) has recently gained traction. While TKGs and their generalization, hyper-relational temporal knowledge graphs (HTKGs), offer a straightforward structure to represent simple temporal relationships, they lack the expressive power to convey complex facts efficiently. One of the critical limitations of HTKGs is a lack of support for more than two primary entities in temporal facts, which commonly occur in real-world events. To address this limitation, in this work, we study a generalization of HTKGs, Hyper-Relational Temporal Knowledge Generalized Hypergraphs (HTKGHs). We first derive a formalization for HTKGHs, demonstrating their backward compatibility while supporting two complex types of facts commonly found in geopolitical incidents. Then, utilizing this formalization, we introduce the htkgh-polecat dataset, built upon the global event database POLECAT. Finally, we benchmark and analyze popular LLMs on the relation prediction task, providing insights into their adaptability and capabilities in complex forecasting scenarios.
Fonte: arXiv cs.CL
Vision • Score 95
Context-Aware Pesticide Recommendation via Few-Shot Pest Recognition for Precision Agriculture
arXiv:2601.00243v1 Announce Type: new
Abstract: Effective pest management is crucial for enhancing agricultural productivity, especially for crops such as sugarcane and wheat that are highly vulnerable to pest infestations. Traditional pest management methods depend heavily on manual field inspections and the use of chemical pesticides. These approaches are often costly, time-consuming, labor-intensive, and can have a negative impact on the environment. To overcome these challenges, this study presents a lightweight framework for pest detection and pesticide recommendation, designed for low-resource devices such as smartphones and drones, making it suitable for use by small and marginal farmers.
The proposed framework includes two main components. The first is a Pest Detection Module that uses a compact, lightweight convolutional neural network (CNN) combined with prototypical meta-learning to accurately identify pests even when only a few training samples are available. The second is a Pesticide Recommendation Module that incorporates environmental factors like crop type and growth stage to recommend safe and eco-friendly pesticides. To train and evaluate our framework, a comprehensive pest image dataset was developed by combining multiple publicly available datasets. The final dataset contains samples with different viewing angles, pest sizes, and background conditions to ensure strong generalization.
Experimental results show that the proposed lightweight CNN achieves high accuracy, comparable to state-of-the-art models, while significantly reducing computational complexity. The Decision Support System additionally improves pest management by reducing dependence on traditional chemical pesticides and encouraging sustainable practices, demonstrating its potential for real-time applications in precision agriculture.
Fonte: arXiv cs.CV
NLP/LLMs • Score 96
Bio-Inspired Agentic Self-Healing Framework for Resilient Distributed Computing Systems
This paper presents ReCiSt, a bio-inspired self-healing framework designed to achieve resilience in Distributed Computing Systems (DCCS). ReCiSt recasts biological phases as computational layers that perform autonomous fault isolation, causal diagnosis, adaptive recovery, and knowledge consolidation through Language Model (LM)-driven agents.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
Parallel Universes, Parallel Languages: A Comprehensive Study of LLM-Based Multilingual Counterfactual Example Generation
Counterfactuals are minimally edited inputs that flip a model's prediction, and they serve as a promising approach to explaining model behavior. This study investigates how effective LLMs are at generating multilingual counterfactuals, comparing counterfactuals generated directly in the target language with those derived via English translation.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
FCMBench: A Comprehensive Multimodal Financial Credit Benchmark for Real-World Applications
As multimodal AI becomes widely used for credit risk assessment and document review, a domain-specific benchmark is urgently needed. We present FCMBench-V1.0, a large-scale multimodal financial credit benchmark covering 18 core certificate types, with 4,043 privacy-compliant images and 8,446 QA samples.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
IntraStyler: Exemplar-based Style Synthesis for Cross-modality Domain Adaptation
arXiv:2601.00212v1 Announce Type: new
Abstract: Image-level domain alignment is the de facto approach for unsupervised domain adaptation, where unpaired image translation is used to minimize the domain gap. Prior studies mainly focus on the domain shift between the source and target domains, whereas the intra-domain variability remains under-explored. To address the latter, an effective strategy is to diversify the styles of the synthetic target domain data during image translation. However, previous methods typically require intra-domain variations to be pre-specified for style synthesis, which may be impractical. In this paper, we propose an exemplar-based style synthesis method named IntraStyler, which can capture diverse intra-domain styles without any prior knowledge. Specifically, IntraStyler uses an exemplar image to guide the style synthesis such that the output style matches the exemplar style. To extract the style-only features, we introduce a style encoder to learn styles discriminatively based on contrastive learning. We evaluate the proposed method on the largest public dataset for cross-modality domain adaptation, CrossMoDA 2023. Our experiments show the efficacy of our method in controllable style synthesis and the benefits of diverse synthetic data for downstream segmentation. Code is available at https://github.com/han-liu/IntraStyler.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
FreeText: Training-Free Text Rendering in Diffusion Transformers via Attention Localization and Spectral Glyph Injection
arXiv:2601.00535v1 Announce Type: new
Abstract: Large-scale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically require costly retraining or rigid external layout constraints, which can degrade aesthetics and limit flexibility. We propose FreeText, a training-free, plug-and-play framework that improves text rendering by exploiting intrinsic mechanisms of Diffusion Transformer (DiT) models. FreeText decomposes the problem into "where to write" and "what to write". For "where to write", we localize writing regions by reading token-wise spatial attribution from endogenous image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For "what to write", we introduce Spectral-Modulated Glyph Injection (SGMI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with modest inference overhead.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
ABFR-KAN: Kolmogorov-Arnold Networks for Functional Brain Analysis
arXiv:2601.00416v1 Announce Type: new
Abstract: Functional connectivity (FC) analysis, a valuable tool for computer-aided brain disorder diagnosis, traditionally relies on atlas-based parcellation. However, issues relating to selection bias and a lack of regard for subject specificity can arise as a result of such parcellations. Addressing this, we propose ABFR-KAN, a transformer-based classification network that incorporates novel advanced brain function representation components with the power of Kolmogorov-Arnold Networks (KANs) to mitigate structural bias, improve anatomical conformity, and enhance the reliability of FC estimation. Extensive experiments on the ABIDE I dataset, including cross-site evaluation and ablation studies across varying model backbones and KAN configurations, demonstrate that ABFR-KAN consistently outperforms state-of-the-art baselines for autism spectrum disorder (ASD) classification. Our code is available at https://github.com/tbwa233/ABFR-KAN.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Pat-DEVAL: Chain-of-Legal-Thought Evaluation for Patent Description
arXiv:2601.00166v1 Announce Type: new
Abstract: Patent descriptions must deliver comprehensive technical disclosure while meeting strict legal standards such as enablement and written description requirements. Although large language models have enabled end-to-end automated patent drafting, existing evaluation approaches fail to assess long-form structural coherence and statutory compliance specific to descriptions. We propose Pat-DEVAL, the first multi-dimensional evaluation framework dedicated to patent description bodies. Leveraging the LLM-as-a-judge paradigm, Pat-DEVAL introduces Chain-of-Legal-Thought (CoLT), a legally-constrained reasoning mechanism that enforces sequential patent-law-specific analysis. Experiments validated by patent experts on our Pap2Pat-EvalGold dataset demonstrate that Pat-DEVAL achieves a Pearson correlation of 0.69, significantly outperforming baseline metrics and existing LLM evaluators. Notably, the framework exhibits a superior correlation of 0.73 in Legal-Professional Compliance, proving that the explicit injection of statutory constraints is essential for capturing nuanced legal validity. By establishing a new standard for ensuring both technical soundness and legal compliance, Pat-DEVAL provides a robust methodological foundation for the practical deployment of automated patent drafting systems.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
RoLID-11K: A Dashcam Dataset for Small-Object Roadside Litter Detection
arXiv:2601.00398v1 Announce Type: new
Abstract: Roadside litter poses environmental, safety and economic challenges, yet current monitoring relies on labour-intensive surveys and public reporting, providing limited spatial coverage. Existing vision datasets for litter detection focus on street-level still images, aerial scenes or aquatic environments, and do not reflect the unique characteristics of dashcam footage, where litter appears extremely small, sparse and embedded in cluttered road-verge backgrounds. We introduce RoLID-11K, the first large-scale dataset for roadside litter detection from dashcams, comprising over 11k annotated images spanning diverse UK driving conditions and exhibiting pronounced long-tail and small-object distributions. We benchmark a broad spectrum of modern detectors, from accuracy-oriented transformer architectures to real-time YOLO models, and analyse their strengths and limitations on this challenging task. Our results show that while CO-DETR and related transformers achieve the best localisation accuracy, real-time models remain constrained by coarse feature hierarchies. RoLID-11K establishes a challenging benchmark for extreme small-object detection in dynamic driving scenes and aims to support the development of scalable, low-cost systems for roadside-litter monitoring. The dataset is available at https://github.com/xq141839/RoLID-11K.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning
arXiv:2601.00791v1 Announce Type: cross
Abstract: We present a training-free method for detecting valid mathematical reasoning in large language models through spectral analysis of attention patterns. By treating attention matrices as adjacency matrices of dynamic graphs over tokens, we extract four interpretable spectral diagnostics: the Fiedler value (algebraic connectivity), the high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy. These diagnostics exhibit statistically significant differences between valid and invalid mathematical proofs. Experiments across seven transformer models from four independent architectural families (Meta Llama, Alibaba Qwen, Microsoft Phi, and Mistral AI) demonstrate that this spectral signature produces effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling 85.0-95.6% classification accuracy under rigorous evaluation, with calibrated thresholds reaching 93-95% on the full dataset. The method requires no training data, fine-tuning, or learned classifiers: a single threshold on a spectral metric suffices for high accuracy. Through systematic label correction, we discover that the spectral method detects logical coherence rather than compiler acceptance, identifying mathematically valid proofs that formal verifiers reject due to technical failures. We further identify an architectural dependency: Mistral-7B's Sliding Window Attention shifts the discriminative signal from HFER to late-layer Smoothness ($d = 2.09$, $p_{\text{MW}} = 1.16 \times 10^{-48}$), revealing that attention mechanism design affects which spectral features capture reasoning validity. These findings establish spectral graph analysis as a principled framework for reasoning verification with immediate applications to hallucination detection and AI safety monitoring.
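The four diagnostics are standard spectral-graph quantities, so they can be reproduced from a single attention head. The sketch below is our reimplementation of the general recipe (symmetrize attention into a weighted graph, eigendecompose its Laplacian); the 50% high-frequency cutoff and the use of a scalar per-token signal are illustrative assumptions, not the paper's exact normalization.

```python
import numpy as np

def spectral_diagnostics(attn: np.ndarray, signal: np.ndarray, hf_frac: float = 0.5):
    """attn: (n, n) attention matrix; signal: (n,) scalar graph signal over tokens."""
    W = 0.5 * (attn + attn.T)                 # symmetrize attention into edge weights
    L = np.diag(W.sum(axis=1)) - W            # combinatorial graph Laplacian
    evals, evecs = np.linalg.eigh(L)          # eigenvalues in ascending order

    fiedler = evals[1]                        # algebraic connectivity

    coeffs = evecs.T @ signal                 # graph Fourier transform of the signal
    energy = coeffs ** 2
    k = int(len(evals) * (1.0 - hf_frac))
    hfer = energy[k:].sum() / energy.sum()    # high-frequency energy ratio

    smoothness = signal @ L @ signal / (signal @ signal)

    p = np.clip(evals, 0, None)               # guard against tiny negative eigenvalues
    p = p / p.sum()
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()          # spectral entropy

    return fiedler, hfer, smoothness, entropy
```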
Fonte: arXiv cs.CL
RL • Score 95
TeleDoCTR: Domain-Specific and Contextual Troubleshooting for Telecommunications
arXiv:2601.00691v1 Announce Type: cross
Abstract: Ticket troubleshooting refers to the process of analyzing and resolving problems that are reported through a ticketing system. In large organizations offering a wide range of services, this task is highly complex due to the diversity of submitted tickets and the need for specialized domain knowledge. In particular, troubleshooting in telecommunications (telecom) is a very time-consuming task as it requires experts to interpret ticket content, consult documentation, and search historical records to identify appropriate resolutions. This human-intensive approach not only delays issue resolution but also hinders overall operational efficiency. To enhance the effectiveness and efficiency of ticket troubleshooting in telecom, we propose TeleDoCTR, a novel telecom-related, domain-specific, and contextual troubleshooting system tailored for end-to-end ticket resolution in telecom. TeleDoCTR integrates both domain-specific ranking and generative models to automate key steps of the troubleshooting workflow which are: routing tickets to the appropriate expert team responsible for resolving the ticket (classification task), retrieving contextually and semantically similar historical tickets (retrieval task), and generating a detailed fault analysis report outlining the issue, root cause, and potential solutions (generation task). We evaluate TeleDoCTR on a real-world dataset from a telecom infrastructure and demonstrate that it achieves superior performance over existing state-of-the-art methods, significantly enhancing the accuracy and efficiency of the troubleshooting process.
Fonte: arXiv cs.CL
Vision • Score 95
BHaRNet: Reliability-Aware Body-Hand Modality Expertized Networks for Fine-grained Skeleton Action Recognition
arXiv:2601.00369v1 Announce Type: new
Abstract: Skeleton-based human action recognition (HAR) has achieved remarkable progress with graph-based architectures. However, most existing methods remain body-centric, focusing on large-scale motions while neglecting subtle hand articulations that are crucial for fine-grained recognition. This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expertized learning under uncertainty across both intra-skeleton and cross-modal domains. The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion cues in a unified cross-modal formulation. Comprehensive evaluations across multiple benchmarks (NTU RGB+D~60/120, PKU-MMD, N-UCLA) and a newly defined hand-centric benchmark exhibit consistent improvements and robustness under noisy and heterogeneous conditions.
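The Noisy-OR fusion named in component (2) has a compact closed form: a class is considered active if either stream's expert fires, with each stream damped by a reliability term. The sketch below is our reading of that rule; how BHaRNet produces the reliabilities is not restated here, so `r_body` and `r_hand` are treated as inputs.

```python
import torch

def noisy_or_fusion(p_body: torch.Tensor, p_hand: torch.Tensor,
                    r_body: torch.Tensor, r_hand: torch.Tensor) -> torch.Tensor:
    """Elementwise p = 1 - (1 - r_b * p_b) * (1 - r_h * p_h) over class scores."""
    return 1.0 - (1.0 - r_body * p_body) * (1.0 - r_hand * p_hand)
```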
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
StockBot 2.0: Vanilla LSTMs Outperform Transformer-based Forecasting for Stock Prices
arXiv:2601.00197v1 Announce Type: cross
Abstract: Accurate forecasting of financial markets remains a long-standing challenge due to complex temporal and often latent dependencies, non-linear dynamics, and high volatility. Building on our earlier recurrent neural network framework, we present an enhanced StockBot architecture that systematically evaluates modern attention-based, convolutional, and recurrent time-series forecasting models within a unified experimental setting. While attention-based and transformer-inspired models offer increased modeling flexibility, extensive empirical evaluation reveals that a carefully constructed vanilla LSTM consistently achieves superior predictive accuracy and more stable buy/sell decision-making when trained under a common set of default hyperparameters. These results highlight the robustness and data efficiency of recurrent sequence models for financial time-series forecasting, particularly in the absence of extensive hyperparameter tuning or the availability of sufficient data when discretized to single-day intervals. Additionally, these results underscore the importance of architectural inductive bias in data-limited market prediction tasks.
Fonte: arXiv cs.CL
NLP/LLMs • Score 96
Reasoning in Action: MCTS-Guided Knowledge Retrieval for Large Language Models
Large language models (LLMs) typically improve their performance by retrieving semantically similar information or by strengthening their reasoning capabilities. This paper presents a reasoning-aware knowledge retrieval method that enriches LLMs with information aligned with the logical structure of conversations, going beyond surface-level semantic similarity.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
Adapting Natural Language Processing Models Across Jurisdictions: A pilot Study in Canadian Cancer Registries
arXiv:2601.00787v1 Announce Type: new
Abstract: Population-based cancer registries depend on pathology reports as their primary diagnostic source, yet manual abstraction is resource-intensive and contributes to delays in cancer data. While transformer-based NLP systems have improved registry workflows, their ability to generalize across jurisdictions with differing reporting conventions remains poorly understood. We present the first cross-provincial evaluation of adapting BCCRTron, a domain-adapted transformer model developed at the British Columbia Cancer Registry, alongside GatorTron, a biomedical transformer model, for cancer surveillance in Canada. Our training dataset consisted of approximately 104,000 and 22,000 de-identified pathology reports from the Newfoundland & Labrador Cancer Registry (NLCR) for Tier 1 (cancer vs. non-cancer) and Tier 2 (reportable vs. non-reportable) tasks, respectively. Both models were fine-tuned using complementary synoptic- and diagnosis-focused report-section input pipelines. Across NLCR test sets, the adapted models maintained high performance, demonstrating that transformers pretrained in one jurisdiction can be localized to another with modest fine-tuning. To improve sensitivity, we combined the two models using a conservative OR-ensemble, achieving a Tier 1 recall of 0.99 and reducing missed cancers to 24, compared with 48 and 54 for the standalone models. For Tier 2, the ensemble achieved 0.99 recall and reduced missed reportable cancers to 33, compared with 54 and 46 for the individual models. These findings demonstrate that an ensemble combining complementary text representations substantially reduces missed cancers and improves error coverage in cancer-registry NLP. We implement a privacy-preserving workflow in which only model weights are shared between provinces, supporting interoperable NLP infrastructure and a future pan-Canadian foundation model for cancer pathology and registry workflows.
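The conservative OR-ensemble trades precision for recall by flagging a report whenever either model flags it, which is why it misses fewer true cancers than either model alone. A minimal sketch, assuming hypothetical model wrappers that expose a scalar cancer probability:

```python
def or_ensemble(report_text: str, model_a, model_b, threshold: float = 0.5) -> bool:
    """Flag as cancer (Tier 1) if either fine-tuned transformer crosses the threshold."""
    return (model_a.predict_proba(report_text) >= threshold
            or model_b.predict_proba(report_text) >= threshold)
```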
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns
arXiv:2601.00588v1 Announce Type: new
Abstract: Large language models (LLMs) are increasingly deployed in cost-sensitive and on-device scenarios, and safety guardrails have advanced mainly in English. However, real-world Chinese malicious queries typically conceal intent via homophones, pinyin, symbol-based splitting, and other Chinese-specific patterns. These Chinese-specific adversarial patterns create a safety evaluation gap that is not well captured by existing English-focused benchmarks. This gap is particularly concerning for lightweight models, which may be more vulnerable to such specific adversarial perturbations. To bridge this gap, we introduce the Chinese-Specific Safety Benchmark (CSSBench), which emphasizes these adversarial patterns and evaluates the safety of lightweight LLMs in Chinese. Our benchmark covers six domains that are common in real Chinese scenarios, including illegal activities and compliance, privacy leakage, health and medical misinformation, fraud and hate, adult content, and public and political safety, and organizes queries into multiple task types. We evaluate a set of popular lightweight LLMs and measure over-refusal behavior to assess safety-induced performance degradation. Our results show that Chinese-specific adversarial patterns are a critical challenge for lightweight LLMs. This benchmark offers a comprehensive evaluation of LLM safety in Chinese, assisting robust deployments in practice.
Fonte: arXiv cs.CL
NLP/LLMs • Score 92
InfoSynth: Information-Guided Benchmark Synthesis for LLMs
arXiv:2601.00575v1 Announce Type: new
Abstract: Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity compared to their seed datasets. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
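A hedged sketch of the kind of evaluation-free novelty and diversity scoring the abstract describes: compare the generated benchmark's feature distribution against the seed set with KL divergence, and score diversity with Shannon entropy. The 1-D histogram estimator and smoothing constant are illustrative stand-ins for whatever density estimate the paper actually uses.

```python
import numpy as np

def kl_novelty(new_feats: np.ndarray, seed_feats: np.ndarray, bins: int = 32) -> float:
    """KL(P_new || P_seed) over histograms of a scalar feature (e.g. an embedding PC)."""
    lo = min(new_feats.min(), seed_feats.min())
    hi = max(new_feats.max(), seed_feats.max())
    p, _ = np.histogram(new_feats, bins=bins, range=(lo, hi))
    q, _ = np.histogram(seed_feats, bins=bins, range=(lo, hi))
    p = (p + 1e-6) / (p + 1e-6).sum()     # smooth so the ratio stays finite
    q = (q + 1e-6) / (q + 1e-6).sum()
    return float(np.sum(p * np.log(p / q)))

def entropy_diversity(feats: np.ndarray, bins: int = 32) -> float:
    """Shannon entropy of the feature histogram as a diversity proxy."""
    p, _ = np.histogram(feats, bins=bins)
    p = (p + 1e-6) / (p + 1e-6).sum()
    return float(-(p * np.log(p)).sum())
```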
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak
arXiv:2601.00213v1 Announce Type: cross
Abstract: The widespread deployment of large language models (LLMs) has raised growing concerns about their misuse risks and associated safety issues. While prior studies have examined the safety of LLMs in general usage, code generation, and agent-based applications, their vulnerabilities in automated algorithm design remain underexplored. To fill this gap, this study investigates this overlooked safety vulnerability, with a particular focus on intelligent optimization algorithm design, given its prevalent use in complex decision-making scenarios. We introduce MalOptBench, a benchmark consisting of 60 malicious optimization algorithm requests, and propose MOBjailbreak, a jailbreak method tailored for this scenario. Through extensive evaluation of 13 mainstream LLMs including the latest GPT-5 and DeepSeek-V3.1, we reveal that most models remain highly susceptible to such attacks, with an average attack success rate of 83.59% and an average harmfulness score of 4.28 out of 5 on original harmful prompts, and near-complete failure under MOBjailbreak. Furthermore, we assess state-of-the-art plug-and-play defenses that can be applied to closed-source models, and find that they are only marginally effective against MOBjailbreak and prone to exaggerated safety behaviors. These findings highlight the urgent need for stronger alignment techniques to safeguard LLMs against misuse in algorithm design.
Fonte: arXiv cs.CL
NLP/LLMs • Score 92
Fast-weight Product Key Memory
arXiv:2601.00671v1 Announce Type: new
Abstract: Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse Product Key Memory (PKM) from a static module into a dynamic, "fast-weight" episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
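A simplified sketch of the two mechanisms the abstract combines: a product-key lookup that factors an N = K x K key space into two size-K codebooks, and a "fast-weight" write implemented as a local gradient step on the value table. Dimensions, the MSE chunk loss, and the learning rate are illustrative, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

class FastWeightPKM(torch.nn.Module):
    def __init__(self, dim: int = 64, K: int = 128, topk: int = 8):
        super().__init__()
        half = dim // 2
        self.keys1 = torch.nn.Parameter(torch.randn(K, half))   # sub-key codebook 1
        self.keys2 = torch.nn.Parameter(torch.randn(K, half))   # sub-key codebook 2
        self.values = torch.nn.Parameter(torch.zeros(K * K, dim))
        self.topk, self.K = topk, K

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        """q: (B, dim) queries; returns (B, dim) memory reads."""
        q1, q2 = q.chunk(2, dim=-1)
        s1, i1 = (q1 @ self.keys1.T).topk(self.topk, dim=-1)
        s2, i2 = (q2 @ self.keys2.T).topk(self.topk, dim=-1)
        scores = (s1[..., :, None] + s2[..., None, :]).flatten(-2)   # (B, topk^2)
        idx = (i1[..., :, None] * self.K + i2[..., None, :]).flatten(-2)
        w = F.softmax(scores, dim=-1)
        return (w[..., None] * self.values[idx]).sum(dim=-2)

    @torch.enable_grad()
    def fast_update(self, q: torch.Tensor, target: torch.Tensor, lr: float = 0.1):
        """Chunk-level 'fast weight' write: one local gradient step on the values."""
        loss = F.mse_loss(self.forward(q), target)
        (g,) = torch.autograd.grad(loss, self.values)
        self.values.data -= lr * g
```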
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR
arXiv:2601.00557v1 Announce Type: new
Abstract: Large-scale multilingual ASR (mASR) models such as Whisper achieve strong performance but incur high computational and latency costs, limiting their deployment on resource-constrained edge devices. In this study, we propose a lightweight and language-agnostic multilingual ASR system based on a CTC architecture with domain adaptation. Specifically, we introduce a Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework integrated into an mHuBERT-CTC model, enabling end-to-end decoding via LID-posterior-driven LoRA routing. The hierarchical design consists of a multilingual shared LoRA for learning language-invariant acoustic representations and language-specific LoRA experts for modeling language-dependent characteristics. The proposed routing mechanism removes the need for prior language identity information or explicit language labels during inference, achieving true language-agnostic decoding. Experiments on MSR-86K and the MLC-SLM 2025 Challenge datasets demonstrate that HLoRA achieves competitive performance with state-of-the-art two-stage inference methods using only single-pass decoding, significantly improving decoding efficiency for low-resource mASR applications.
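The language-agnostic routing reduces to a posterior-weighted mixture: a shared LoRA captures language-invariant acoustics, and per-language LoRA experts are blended by the LID posterior instead of a hard language label. A minimal sketch of that forward pass, with all module handles as placeholders:

```python
import torch

def hlora_forward(x, base_linear, shared_lora, lang_loras, lid_posterior):
    """x: (B, T, D) features; lid_posterior: (B, L) from an auxiliary LID head."""
    y = base_linear(x) + shared_lora(x)                    # language-invariant path
    experts = torch.stack([lora(x) for lora in lang_loras], dim=1)   # (B, L, T, D)
    weights = lid_posterior[:, :, None, None]              # broadcast over T and D
    return y + (weights * experts).sum(dim=1)              # posterior-weighted mixture
```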
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations
arXiv:2601.00454v1 Announce Type: new
Abstract: Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S, a training paradigm that fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations rather than complete dialogue histories. We provide a formal complexity analysis showing that M2S reduces training cost from $O(n^2)$ to $O(n)$ for $n$-turn conversations. Empirically, on our training dataset (779 samples, avg. 10.6 turns), M2S requires only 169K tokens compared to 15.7M tokens for the multi-turn baseline -- a 93$\times$ reduction. We evaluate Defensive M2S across three guardrail model families (LlamaGuard, Nemotron, Qwen3Guard) and three compression templates (hyphenize, numberize, pythonize) on SafeDialBench, a comprehensive multi-turn jailbreak benchmark. Our best configuration, Qwen3Guard with hyphenize compression, achieves 93.8% attack detection recall while reducing inference tokens by 94.6% (from 3,231 to 173 tokens per conversation). This represents a 38.9 percentage point improvement over the baseline while dramatically reducing both training and inference costs. Our findings demonstrate that M2S compression can serve as an effective efficiency technique for guardrail deployment, enabling scalable safety screening of long multi-turn conversations.
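The hyphenize template itself is simple enough to sketch: the user turns of a conversation are flattened into one bulleted single-turn prompt, which the guardrail then scores once. The exact instruction wording below is an assumption; only the bullet structure follows the M2S naming.

```python
def hyphenize(user_turns: list[str]) -> str:
    """Compress an n-turn conversation into a single-turn prompt (O(n) tokens)."""
    bullets = "\n".join(f"- {turn.strip()}" for turn in user_turns)
    return "Please answer the following list of requests:\n" + bullets
```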
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach
arXiv:2601.00388v1 Announce Type: new
Abstract: Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.
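The coordinate-aligned reward is the most self-contained piece: it maps the great-circle (Haversine) distance between predicted and ground-truth GPS points into a bounded score. The exponential shaping and 5,000 km scale below are illustrative; the paper specifies only that rewards are based on Haversine distance.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def geo_reward(pred: tuple[float, float], truth: tuple[float, float],
               scale_km: float = 5000.0) -> float:
    """Map distance to (0, 1]: 1 at the true location, near 0 for antipodal errors."""
    return math.exp(-haversine_km(*pred, *truth) / scale_km)
```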
Fonte: arXiv cs.CL
NLP/LLMs • Score 96
Toward Automated Differential Diagnosis of Skin Diseases Using Deep Learning and Imbalance-Aware Strategies
As dermatological conditions become increasingly common and the availability of dermatologists remains limited, there is a growing need for intelligent tools that support patients and clinicians in the timely and accurate diagnosis of skin diseases. In this project, we develop a deep learning-based model for the classification and diagnosis of skin conditions.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
Exploring the Performance of Large Language Models on Subjective Span Identification Tasks
arXiv:2601.00736v1 Announce Type: new
Abstract: Identifying relevant text spans is important for several downstream tasks in NLP, as it contributes to model explainability. While most span identification approaches rely on relatively smaller pre-trained language models like BERT, a few recent approaches have leveraged the latest generation of Large Language Models (LLMs) for the task. Current work has focused on explicit span identification like Named Entity Recognition (NER), while more subjective span identification with LLMs in tasks like Aspect-based Sentiment Analysis (ABSA) has been underexplored. In this paper, we fill this important gap by presenting an evaluation of the performance of various LLMs on text span identification in three popular tasks, namely sentiment analysis, offensive language identification, and claim verification. We explore several LLM strategies like instruction tuning, in-context learning, and chain of thought. Our results indicate that underlying relationships within text help LLMs identify precise text spans.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence
arXiv:2601.00596v1 Announce Type: new
Abstract: Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent's capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.
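As we read the abstract, the User Journey Coverage Score asks how much of the policy-prescribed journey the agent actually walked. A minimal illustrative version counts required policy steps matched in order against the executed trace; the real metric operates on graph representations and is likely richer.

```python
def journey_coverage(required_steps: list[str], executed_steps: list[str]) -> float:
    """Fraction of required steps that appear, in order, in the executed trace."""
    trace = iter(executed_steps)
    matched = sum(1 for step in required_steps if step in trace)  # consumes the iterator
    return matched / len(required_steps) if required_steps else 1.0
```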
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Comparative Efficiency Analysis of Lightweight Transformer Models: A Multi-Domain Empirical Benchmark for Enterprise NLP Deployment
arXiv:2601.00444v1 Announce Type: new
Abstract: In the rapidly evolving landscape of enterprise natural language processing (NLP), the demand for efficient, lightweight models capable of handling multi-domain text automation tasks has intensified. This study conducts a comparative analysis of three prominent lightweight Transformer models - DistilBERT, MiniLM, and ALBERT - across three distinct domains: customer sentiment classification, news topic classification, and toxicity and hate speech detection. Utilizing datasets from IMDB, AG News, and the Measuring Hate Speech corpus, we evaluated performance using accuracy-based metrics including accuracy, precision, recall, and F1-score, as well as efficiency metrics such as model size, inference time, throughput, and memory usage. Key findings reveal that no single model dominates all performance dimensions. ALBERT achieves the highest task-specific accuracy in multiple domains, MiniLM excels in inference speed and throughput, and DistilBERT demonstrates the most consistent accuracy across tasks while maintaining competitive efficiency. All results reflect controlled fine-tuning under fixed enterprise-oriented constraints rather than exhaustive hyperparameter optimization. These results highlight trade-offs between accuracy and efficiency, recommending MiniLM for latency-sensitive enterprise applications, DistilBERT for balanced performance, and ALBERT for resource-constrained environments.
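The efficiency side of such a comparison is straightforward to reproduce. A hedged harness, assuming the public Hub checkpoints named in the trailing comments and illustrative batch-size and sequence-length constants:

```python
import time

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def measure(model_name: str, texts: list[str], batch_size: int = 32):
    """Return (seconds per sample, samples per second) for batched inference."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()
    start = time.perf_counter()
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tok(texts[i:i + batch_size], padding=True, truncation=True,
                        max_length=128, return_tensors="pt")
            model(**batch)
    elapsed = time.perf_counter() - start
    return elapsed / len(texts), len(texts) / elapsed

# e.g. measure("distilbert-base-uncased", imdb_sample)
#  vs. measure("microsoft/MiniLM-L12-H384-uncased", imdb_sample)
```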
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Rule-Based Approaches to Atomic Sentence Extraction
arXiv:2601.00506v1 Announce Type: new
Abstract: Natural language often combines multiple ideas into complex sentences. Atomic sentence extraction, the task of decomposing complex sentences into simpler sentences that each express a single idea, improves performance in information retrieval, question answering, and automated reasoning systems. Previous work has formalized the "split-and-rephrase" task and established evaluation metrics, and machine learning approaches using large language models have improved extraction accuracy. However, these methods lack interpretability and provide limited insight into which linguistic structures cause extraction failures. Although some studies have explored dependency-based extraction of subject-verb-object triples and clauses, no principled analysis has examined which specific clause structures and dependencies lead to extraction difficulties. This study addresses this gap by analyzing how complex sentence structures, including relative clauses, adverbial clauses, coordination patterns, and passive constructions, affect the performance of rule-based atomic sentence extraction. Using the WikiSplit dataset, we implemented dependency-based extraction rules in spaCy, generated 100 gold-standard atomic sentence sets, and evaluated performance using ROUGE and BERTScore. The system achieved ROUGE-1 F1 = 0.6714, ROUGE-2 F1 = 0.478, ROUGE-L F1 = 0.650, and BERTScore F1 = 0.5898, indicating moderate-to-high lexical, structural, and semantic alignment. Challenging structures included relative clauses, appositions, coordinated predicates, adverbial clauses, and passive constructions. Overall, rule-based extraction is reasonably accurate but sensitive to syntactic complexity.
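To make the rule-based approach concrete, here is one dependency rule of the kind the study implements, in spaCy: splitting a coordinated predicate by duplicating the shared subject. This is our illustrative fragment (a full system needs many more rules for relative clauses, appositions, and passives), and it assumes the `en_core_web_sm` model is installed.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def split_conjoined_verbs(sentence: str) -> list[str]:
    """'John ate dinner and watched TV.' -> ['John ate dinner.', 'John watched TV.']"""
    doc = nlp(sentence)
    root = next(t for t in doc if t.dep_ == "ROOT")
    conjs = [t for t in root.children if t.dep_ == "conj" and t.pos_ in ("VERB", "AUX")]
    subj = next((t for t in root.lefts if t.dep_ in ("nsubj", "nsubjpass")), None)
    if not (conjs and subj):
        return [sentence]                        # no coordinated predicate to split
    subj_text = " ".join(t.text for t in subj.subtree)
    drop = {w for c in conjs for w in c.subtree}           # other conjunct subtrees
    drop |= {t for t in root.children if t.dep_ == "cc"}   # the coordinator ('and')
    first = " ".join(t.text for t in root.subtree if t not in drop)
    atoms = [first] + [f"{subj_text} " + " ".join(t.text for t in c.subtree) for c in conjs]
    return [a.rstrip(" .,") + "." for a in atoms]
```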
Fonte: arXiv cs.CL
MLOps/Systems • Score 96
Benchmarking Preprocessing and Integration Methods in Single-Cell Genomics
Single-cell data analysis has the potential to revolutionize personalized medicine by characterizing disease-associated molecular changes at the cellular level. This study examines a general pipeline for single-cell data analysis, evaluating different normalization, dimensionality reduction, and integration methods across six varied datasets.
Fonte: arXiv cs.AI
NLP/LLMs • Score 88
BERT-JEPA: Reorganizing CLS Embeddings for Language-Invariant Semantics
arXiv:2601.00366v1 Announce Type: new
Abstract: Joint Embedding Predictive Architectures (JEPAs) are a novel class of self-supervised training techniques that have shown recent promise across domains. We introduce BERT-JEPA (BEPA), a training paradigm that adds a JEPA training objective to BERT-style models, combating a collapsed [CLS] embedding space and reorganizing it into a language-agnostic space. This new structure leads to increased performance across multilingual benchmarks.
Fonte: arXiv cs.CL
NLP/LLMs • Score 96
Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity
We present WildAGTEval, a benchmark designed to evaluate the function-calling capabilities of large language model (LLM) agents under realistic API complexity. Unlike prior work that assumes an idealized API system, WildAGTEval accounts for both API specification and execution, offering scenarios of varying complexity to assess LLM performance.
Fonte: arXiv cs.AI
RL • Score 95
Retrieval-Reasoning Processes for Multi-hop Question Answering: A Four-Axis Design Framework and Empirical Trends
arXiv:2601.00536v1 Announce Type: new
Abstract: Multi-hop question answering (QA) requires systems to iteratively retrieve evidence and reason across multiple hops. While recent RAG and agentic methods report strong results, the underlying retrieval-reasoning process is often left implicit, making procedural choices hard to compare across model families. This survey takes the execution procedure as the unit of analysis and introduces a four-axis framework covering (A) overall execution plan, (B) index structure, (C) next-step control (strategies and triggers), and (D) stop/continue criteria. Using this schema, we map representative multi-hop QA systems and synthesize reported ablations and tendencies on standard benchmarks (e.g., HotpotQA, 2WikiMultiHopQA, MuSiQue), highlighting recurring trade-offs among effectiveness, efficiency, and evidence faithfulness. We conclude with open challenges for retrieval-reasoning agents, including structure-aware planning, transferable control policies, and robust stopping under distribution shift.
Fonte: arXiv cs.CL
Vision • Score 95
DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection
arXiv:2601.00303v1 Announce Type: new
Abstract: Speech is a scalable and non-invasive biomarker for early mental health screening. However, widely used depression datasets like DAIC-WOZ exhibit strong coupling between linguistic sentiment and diagnostic labels, encouraging models to learn semantic shortcuts. As a result, model robustness may be compromised in real-world scenarios, such as Camouflaged Depression, where individuals maintain socially positive or neutral language despite underlying depressive states. To mitigate this semantic bias, we propose DepFlow, a three-stage depression-conditioned text-to-speech framework. First, a Depression Acoustic Encoder learns speaker- and content-invariant depression embeddings through adversarial training, achieving effective disentanglement while preserving depression discriminability (ROC-AUC: 0.693). Second, a flow-matching TTS model with FiLM modulation injects these embeddings into synthesis, enabling control over depressive severity while preserving content and speaker identity. Third, a prototype-based severity mapping mechanism provides smooth and interpretable manipulation across the depression continuum. Using DepFlow, we construct a Camouflage Depression-oriented Augmentation (CDoA) dataset that pairs depressed acoustic patterns with positive/neutral content from a sentiment-stratified text bank, creating acoustic-semantic mismatches underrepresented in natural data. Evaluated across three depression detection architectures, CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming conventional augmentation strategies in depression detection. Beyond enhancing robustness, DepFlow provides a controllable synthesis platform for conversational systems and simulation-based evaluation, where real clinical data remains limited by ethical and coverage constraints.
Fonte: arXiv cs.CL
RL • Score 95
Clustering by Denoising: Plug-and-Play Latent Diffusion for Single-Cell Data
Single-cell RNA sequencing (scRNA-seq) enables the study of cellular heterogeneity. However, clustering accuracy and downstream label-based analyses remain challenging due to measurement noise and biological variability. We present a plug-and-play diffusion framework that separates the observation space from the denoising space.
Fonte: arXiv stat.ML
Evaluation/Benchmarks • Score 93
Latent Flow Matching for Expressive Singing Voice Synthesis
Singing voice synthesis based on conditional variational autoencoders (cVAE) offers efficient inference and high audio quality by learning a score-conditioned latent space and a posterior latent space conditioned on recordings. However, imperfect matching between the two distributions can degrade fine-grained expressiveness such as vibrato and micro-prosody. We propose FM-Singer, which introduces conditional flow matching (CFM) in the latent space.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
Causality-Inspired Safe Residual Correction for Multivariate Time Series
Although modern multivariate forecasters such as Transformers and GNNs perform strongly on benchmarks, they often suffer from systematic errors on specific variables or horizons and, critically, lack guarantees against performance degradation at deployment. To address this safety gap, we propose CRC (Causality-Inspired Safe Residual Correction), a framework designed to guarantee non-degradation.
Fonte: arXiv stat.ML
NLP/LLMs • Score 95
PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective
The continual growth in the scale of deep learning models and training data underscores the critical importance of efficient optimization methods. In this paper, we introduce a unifying framework for analyzing "matrix-aware" preconditioning methods, leading to a new class of optimization methods that exhibit faster convergence.
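The summary does not spell out the update rule, so the following is only a hedged illustration of a matrix-aware preconditioned step in this family: replace each matrix gradient with its polar factor U @ V^T (computed via SVD), which equalizes the gradient's singular values before the update. Treat the rule and the learning rate as our assumptions, not PolarGrad's published algorithm.

```python
import torch

@torch.no_grad()
def polar_step(param: torch.Tensor, grad: torch.Tensor, lr: float = 1e-2) -> None:
    """W <- W - lr * U @ V^T, where grad = U S V^T is the gradient's thin SVD."""
    U, _, Vh = torch.linalg.svd(grad, full_matrices=False)
    param -= lr * (U @ Vh)     # polar factor: all singular values rescaled to 1
```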
Fonte: arXiv stat.ML
RL • Score 95
Designing an Optimal Sensor Network by Minimizing Information Loss
Optimal experimental design is a classical topic in statistics, with many well-studied problems and solutions. This work investigates sensor placement for monitoring spatio-temporal processes, taking the temporal dimension into account in our modeling and optimization. We present a novel model-based sensor placement criterion together with a highly efficient optimization algorithm.
Fonte: arXiv stat.ML
NLP/LLMs • Score 96
Online Fine-Tuning of Decision Transformers with Pure RL Gradients
Decision Transformers (DTs) have emerged as a powerful framework for sequential decision-making, formulating offline reinforcement learning (RL) as a sequence modeling problem. However, extending DTs to online settings with pure RL gradients remains largely unexplored. We identify hindsight return relabeling as a critical obstacle to RL-based fine-tuning.
Source: arXiv cs.AI
Vision • Score 96
Detecção Adaptativa de Coordenação Causal para Mídias Sociais: Um Framework Guiado por Memória com Aprendizado Semi-Supervisionado
Detectar comportamentos inautênticos coordenados em mídias sociais é um desafio crítico. Propomos o framework Adaptive Causal Coordination Detection (ACCD), que utiliza uma arquitetura progressiva em três estágios para aprender e reter configurações de detecção otimizadas. O ACCD melhora a identificação de relações causais e reduz a necessidade de rotulagem manual, alcançando um F1-score de 87,3% em detecção de ataques coordenados.
Fonte: arXiv cs.AI
Vision • Score 95
Compressed Map Priors for 3D Perception
arXiv:2601.00139v1 Announce Type: new
Abstract: Human drivers rarely travel where no person has gone before. After all, thousands of drivers use busy city roads every day, and only one can claim to be the first. The same holds for autonomous computer vision systems. The vast majority of the deployment area of an autonomous vision system will have been visited before. Yet, most autonomous vehicle vision systems act as if they are encountering each location for the first time. In this work, we present Compressed Map Priors (CMP), a simple but effective framework to learn spatial priors from historic traversals. The map priors use a binarized hashmap that requires only $32\text{KB}/\text{km}^2$, a $20\times$ reduction compared to the dense storage. Compressed Map Priors easily integrate into leading 3D perception systems at little to no extra computational costs, and lead to a significant and consistent improvement in 3D object detection on the nuScenes dataset across several architectures.
Source: arXiv cs.CV
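An illustrative sketch of the binarized-hashmap idea in the CMP abstract above: quantize a world coordinate to a grid cell, hash it into a fixed-size bit array, and set or query one bit per cell. The 0.5 m cell size and table layout are our assumptions; the paper reports roughly 32 KB per km².

```python
import hashlib

TABLE_BYTES = 32 * 1024            # 32 KB of prior bits (assumed budget)
TABLE_BITS = TABLE_BYTES * 8
table = bytearray(TABLE_BYTES)

def _bit_index(x: float, y: float, cell: float = 0.5) -> int:
    # Hash the quantized (x, y) cell into a position in the bit table.
    key = f"{int(x // cell)}:{int(y // cell)}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "little") % TABLE_BITS

def mark_visited(x: float, y: float) -> None:
    i = _bit_index(x, y)
    table[i // 8] |= 1 << (i % 8)

def was_visited(x: float, y: float) -> bool:
    i = _bit_index(x, y)
    return bool(table[i // 8] & (1 << (i % 8)))

mark_visited(12.3, -4.7)
assert was_visited(12.3, -4.7)
```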
NLP/LLMs • Score 96
Toward Large-Scale Photonics-Powered AI Systems: From Physical Design Automation to System-Algorithm Co-Exploration
In this work, we identify three essential considerations for realizing practical photonic AI systems at scale: (1) support for dynamic tensor operations in modern models; (2) systematic management of conversion, control, and data-movement overheads; and (3) robustness under hardware non-idealities. We develop a photonic AI design support tool spanning early exploration through physical realization.
Source: arXiv cs.AI
Vision • Score 96
From Clay to Code: Typological and Material Reasoning in AI Interpretations of Iranian Pigeon Towers
This study investigates how generative AI systems interpret the architectural intelligence embedded in vernacular form. Using the Iranian pigeon tower as a case study, the research tests three diffusion models, Midjourney v6, DALL-E 3, and the Stable Diffusion XL (SDXL)-based DreamStudio, across three prompting stages: referential, adaptive, and speculative.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Hear the Beat in Phases: Physiologically Grounded Phase-Aware ECG Biometrics
Electrocardiography (ECG) is used for identity authentication on wearable devices due to its individual-specific characteristics and inherent liveness. We propose a Hierarchical Phase-Aware Fusion (HPAF) framework that explicitly avoids feature entanglement through a three-stage design.
Source: arXiv cs.AI
Evaluation/Benchmarks • Score 89
A Gaussian Process View of Observation Noise and Initialization in Wide Neural Networks
Performing gradient descent on a wide neural network is equivalent to computing the posterior mean of a Gaussian Process with the Neural Tangent Kernel (NTK-GP), for a specific prior mean and with zero observation noise. However, existing formulations have two limitations: (i) the NTK-GP assumes noiseless targets, leading to misspecification on noisy data; (ii) the equivalence does not extend to arbitrary prior means, which are essential for well-specified models.
Source: arXiv stat.ML
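A minimal GP-regression sketch of the quantity discussed above: the posterior mean under a kernel with observation-noise variance sigma², the noise term the zero-noise NTK-GP equivalence lacks. An RBF kernel stands in for the NTK, which would be derived from the network.

```python
import numpy as np

def rbf(a: np.ndarray, b: np.ndarray, ls: float = 1.0) -> np.ndarray:
    # Squared-exponential kernel between row vectors of a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def gp_posterior_mean(X, y, X_star, sigma2: float = 0.1):
    K = rbf(X, X)
    K_star = rbf(X_star, X)
    # Nonzero sigma2 models noisy targets; sigma2 = 0 recovers interpolation.
    alpha = np.linalg.solve(K + sigma2 * np.eye(len(X)), y)
    return K_star @ alpha

X = np.random.randn(50, 3)
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(50)
mu = gp_posterior_mean(X, y, np.random.randn(5, 3))
```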
RL • Score 92
Effects of the Structural Allocation of Geometric Task Diversity in Linear Meta-Learning Models
Meta-learning seeks to leverage information from related tasks to improve prediction on unlabeled data for new tasks with a limited number of labeled observations. Although task diversity is considered beneficial, recent studies show that it can degrade prediction performance in meta-learning, depending on how the geometric variability of tasks is allocated.
Source: arXiv stat.ML
NLP/LLMs • Score 96
DA-DPO: Difficulty-Aware, Cost-Efficient Preference Optimization for Reducing Hallucinations in MLLMs
Direct Preference Optimization (DPO) has shown great potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing approaches often suffer from overfitting due to difficulty imbalance in the preference data. We propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework that balances the learning process.
Source: arXiv cs.AI
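A loudly hypothetical sketch of how difficulty can enter a DPO-style objective: the standard DPO loss with a per-pair difficulty weight. DA-DPO's actual balancing scheme is not specified here and may differ; the function and weighting are our illustration only.

```python
import torch
import torch.nn.functional as F

def weighted_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                      difficulty, beta=0.1):
    # Standard DPO margin between chosen (w) and rejected (l) sequences,
    # relative to a frozen reference model; inputs are per-sequence log-probs.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    per_pair = -F.logsigmoid(margin)
    # Hypothetical difficulty weighting: harder pairs contribute more.
    return (difficulty * per_pair).mean()

lp = lambda: torch.randn(32)  # toy per-sequence log-probabilities
loss = weighted_dpo_loss(lp(), lp(), lp(), lp(), difficulty=torch.rand(32))
```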
Vision • Score 96
Neural Brain Fields: A NeRF-Inspired Approach to Generating Nonexistent EEG Electrodes
Electroencephalography (EEG) data present unique modeling challenges due to varying lengths, extremely low signal-to-noise ratios, and significant differences between participants. This work presents a novel method inspired by Neural Radiance Fields (NeRF) for processing EEG signals, enabling continuous visualization of brain activity and the simulation of data from nonexistent electrodes.
Source: arXiv cs.AI
Evaluation/Benchmarks • Score 90
The Weather Paradox: Why Precipitation Fails to Predict Traffic Accident Severity in Large-Scale US Data
This study investigates the predictive power of environmental, temporal, and spatial factors for traffic accident severity in the United States. Using a dataset of 500,000 traffic accidents from 2016 to 2023, we train an XGBoost classifier tuned via randomized-search cross-validation. The final model achieves an overall accuracy of 78%, with strong performance on the majority class (Severity 2).
Source: arXiv cs.LG
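A hedged sketch of the tuning setup described above: an XGBoost classifier wrapped in scikit-learn's randomized-search cross-validation. The search space, fold count, and feature/label variables (X, y) are illustrative assumptions, not the study's exact configuration.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_dist = {
    "n_estimators": randint(100, 600),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}
search = RandomizedSearchCV(
    XGBClassifier(objective="multi:softmax", num_class=4),
    param_distributions=param_dist,
    n_iter=30, cv=5, scoring="accuracy", n_jobs=-1,
)
# search.fit(X, y)  # X: accident features, y: severity labels (1-4)
```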
RL • Score 95
AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models
arXiv:2601.00561v1 Announce Type: new
Abstract: The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (i.e., Assessing Editing, Generation, Interpretation-Understanding for Super-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually-annotated questions spanning 21 topics (including STEM, humanities, daily life, etc.) and 6 reasoning types. To concretely evaluate the performance of UMMs in world knowledge scope without ambiguous metrics, we further propose Deterministic Checklist-based Evaluation (DCE), a protocol that replaces ambiguous prompt-based scoring with atomic "Y/N" judgments, to enhance evaluation reliability. Our extensive experiments reveal that most UMMs exhibit severe world knowledge deficits and that performance degrades significantly with complex reasoning. Additionally, simple plug-in reasoning modules can partially mitigate these vulnerabilities, highlighting a promising direction for future research. These results highlight the importance of world-knowledge-based reasoning as a critical frontier for UMMs.
Source: arXiv cs.CV
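A minimal sketch of checklist-style deterministic evaluation as described in the DCE protocol above: each question carries atomic Y/N checks, and the score is the fraction passed. The `judge` function is a stub; the paper's actual checklists and judging setup are not reproduced here.

```python
def dce_score(response: str, checklist: list[str], judge) -> float:
    """judge(response, check) -> bool: one atomic Y/N judgment per item."""
    verdicts = [judge(response, check) for check in checklist]
    return sum(verdicts) / len(verdicts)

checks = ["Mentions the Eiffel Tower is in Paris",
          "States it was completed in 1889"]
# score = dce_score(model_output, checks, judge=my_yes_no_judge)
```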
Evaluation/Benchmarks • Score 93
Optimizing LSTM Neural Networks for Retail Sales Forecasting with Limited Resources: A Model Compression Study
Standard Long Short-Term Memory (LSTM) neural networks deliver accurate forecasts for retail sales data but demand substantial computational power, which can be challenging for small- to medium-sized retail businesses. This paper investigates compressing the LSTM model by gradually reducing the number of hidden units from 128 to 16.
Source: arXiv cs.LG
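A sketch of the compression axis studied above: the same forecasting network instantiated with progressively fewer hidden units. Everything beyond the hidden-size sweep (feature count, single-layer architecture) is an illustrative assumption.

```python
import torch.nn as nn

class SalesLSTM(nn.Module):
    def __init__(self, n_features: int = 8, hidden_units: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_units, batch_first=True)
        self.head = nn.Linear(hidden_units, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # forecast from the last time step

# Sweep the hidden size 128 -> 64 -> 32 -> 16 and report parameter counts.
for h in (128, 64, 32, 16):
    model = SalesLSTM(hidden_units=h)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"hidden={h:3d}  params={n_params}")
```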
RL • Score 95
Integrating Multi-Armed Bandits, Active Learning, and Distributed Computing for Scalable Optimization
Modern optimization problems in scientific and engineering domains often depend on expensive black-box evaluations. We propose ALMAB-DC, a modular, unified framework for scalable black-box optimization that integrates active learning, multi-armed bandits, and distributed computing, with optional GPU acceleration. Empirical results show that ALMAB-DC consistently outperforms state-of-the-art black-box optimizers.
Source: arXiv stat.ML
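For reference, a standard UCB1 selection rule, shown as the multi-armed-bandit ingredient named above; ALMAB-DC's actual integration with active learning and distributed workers is not reproduced here.

```python
import math

def ucb1_select(counts, means, t):
    # Play each arm once first, then pick the arm with the best
    # mean-plus-exploration-bonus score.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(range(len(counts)),
               key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))

counts, means = [0, 0, 0], [0.0, 0.0, 0.0]
arm = ucb1_select(counts, means, t=1)
```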
Vision • Score 95
Spectral Density Estimation of Functional Time Series on Large Domains Using Deep Learning
We derive an estimator of the spectral density of a functional time series that is the output of a multilayer perceptron neural network. The estimator is motivated by difficulties in computing existing spectral density estimators for time series of functions defined on very large grids, as in climate models and medical scans.
Source: arXiv stat.ML
NLP/LLMs • Score 96
Democratizing Electronic-Photonic AI Systems: An Open-Source, AI-Infused Co-Design and Design Automation Toolflow
Photonics is becoming a foundational technology for high-performance AI systems and scientific computing, offering unmatched speed, parallelism, and energy efficiency. However, designing and deploying electronic-photonic AI systems remains challenging due to a steep learning curve. We present a multi-layer co-design and design automation framework to democratize the development of photonic AI systems.
Source: arXiv cs.AI
NLP/LLMs • Score 96
An Empirical Evaluation of LLM-Based Approaches to Code Vulnerability Detection: RAG, SFT, and Dual-Agent Systems
The rapid advancement of Large Language Models (LLMs) presents new opportunities for automated software vulnerability detection, a crucial task for securing modern codebases. This paper presents a comparative study of the effectiveness of LLM-based techniques for detecting software vulnerabilities, evaluating three approaches: Retrieval-Augmented Generation (RAG), Supervised Fine-Tuning (SFT), and a Dual-Agent LLM framework.
Source: arXiv cs.AI
NLP/LLMs • Score 95
Detecting Unobserved Confounders: A Kernelized Regression Approach
Detecting unobserved confounders is crucial for reliable causal inference in observational studies. Existing methods require linearity assumptions or multiple heterogeneous environments, limiting applicability in nonlinear, single-environment settings. We propose Kernelized Regression Confounder Detection (KRCD), a novel method that uses reproducing kernel Hilbert spaces to model complex dependencies.
Source: arXiv stat.ML
NLP/LLMs • Score 95
From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark
arXiv:2601.00216v1 Announce Type: new
Abstract: In medicine, large language models (LLMs) increasingly rely on retrieval-augmented generation (RAG) to ground outputs in up-to-date external evidence. However, current RAG approaches focus primarily on performance improvements while overlooking evidence-based medicine (EBM) principles. This study addresses two key gaps: (1) the lack of PICO alignment between queries and retrieved evidence, and (2) the absence of evidence hierarchy considerations during reranking. We present a generalizable strategy for adapting EBM to graph-based RAG, integrating the PICO framework into knowledge graph construction and retrieval, and proposing a Bayesian-inspired reranking algorithm to calibrate ranking scores by evidence grade without introducing predefined weights. We validated this framework in sports rehabilitation, a literature-rich domain currently lacking RAG systems and benchmarks. We released a knowledge graph (357,844 nodes and 371,226 edges) and a reusable benchmark of 1,637 QA pairs. The system achieved 0.830 nugget coverage, 0.819 answer faithfulness, 0.882 semantic similarity, and 0.788 PICOT match accuracy. In a 5-point Likert evaluation, five expert clinicians rated the system 4.66-4.84 across factual accuracy, faithfulness, relevance, safety, and PICO alignment. These findings demonstrate that the proposed EBM adaptation strategy improves retrieval and answer quality and is transferable to other clinical domains. The released resources also help address the scarcity of RAG datasets in sports rehabilitation.
Source: arXiv cs.CL
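A loudly hypothetical sketch of one way to read the "Bayesian-inspired reranking without predefined weights" idea above: grade-level priors are *estimated* from labeled validation data rather than hand-set, then combined with the retrieval score via a naive Bayes-style posterior. This illustrates the concept only; it is not the paper's algorithm.

```python
from collections import defaultdict

def grade_priors(validation):
    # validation: list of (grade, was_relevant) pairs from held-out data.
    hits, totals = defaultdict(int), defaultdict(int)
    for grade, rel in validation:
        totals[grade] += 1
        hits[grade] += int(rel)
    # Laplace-smoothed relevance rate per evidence grade (no fixed weights).
    return {g: (hits[g] + 1) / (totals[g] + 2) for g in totals}

def rerank(candidates, priors):
    # candidates: (doc_id, score in [0,1], grade).
    def posterior(c):
        _, s, g = c
        p = priors.get(g, 0.5)
        return (s * p) / (s * p + (1 - s) * (1 - p))
    return sorted(candidates, key=posterior, reverse=True)

priors = grade_priors([("RCT", True), ("RCT", True), ("case-report", False)])
print(rerank([("d1", 0.74, "case-report"), ("d2", 0.70, "RCT")], priors))
```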
NLP/LLMs • Score 96
MethConvTransformer: A Deep Learning Framework for Multi-Tissue Alzheimer's Disease Detection
Alzheimer's disease (AD) is a multifactorial neurodegenerative disorder characterized by progressive cognitive decline. MethConvTransformer is a transformer-based deep learning framework that integrates DNA methylation profiles from brain and peripheral tissues, enabling biomarker discovery. The model outperforms conventional machine learning approaches, offering robust epigenetic biomarkers and multi-resolution interpretability.
Source: arXiv cs.AI
NLP/LLMs • Score 96
SSI-GAN: Swin-Inspired Semi-Supervised Generative Adversarial Networks for Neural Spike Classification
Mosquitoes are the main vectors of arboviral diseases. Manually classifying their neural spike patterns is labor-intensive and expensive. To address the scarcity of labeled data, we propose a novel Generative Adversarial Network (GAN) architecture called SSI-GAN, which achieved 99.93% classification accuracy with only 3% labeled data.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Reinforcement Learning with Function Approximation for Non-Markovian Processes
We study reinforcement learning methods with linear function approximation under non-Markovian state and cost processes. We first consider the policy evaluation method and show that the algorithm converges under suitable ergodicity conditions. Furthermore, we show that the limit corresponds to the fixed point of a joint operator composed of an orthogonal projection and the Bellman operator of an auxiliary Markov decision process.
Source: arXiv cs.LG
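A minimal TD(0) policy-evaluation sketch with linear function approximation, the algorithmic setting analyzed above. The toy environment and random features are ours; the paper's contribution is the convergence analysis of this kind of update under non-Markovian state and cost processes.

```python
import numpy as np

def td0_linear(transitions, phi, gamma=0.95, lr=1e-2, dim=16):
    """transitions: iterable of (s, cost, s_next) from a fixed policy."""
    w = np.zeros(dim)
    for s, cost, s_next in transitions:
        # TD error toward the projected-Bellman fixed point.
        delta = cost + gamma * phi(s_next) @ w - phi(s) @ w
        w += lr * delta * phi(s)
    return w

# Toy usage: a 5-state random walk with cost 1 in state 2.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5), size=5)       # row-stochastic transitions
feats = rng.standard_normal((5, 16))
phi = lambda s: feats[s]
s, trans = 0, []
for _ in range(5000):
    s_next = rng.choice(5, p=P[s])
    trans.append((s, float(s == 2), s_next))
    s = s_next
w = td0_linear(trans, phi)
```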
NLP/LLMs • Score 96
HFedMoE: Resource-Aware Heterogeneous Federated Learning with Mixture-of-Experts
Although federated learning (FL) enables fine-tuning large language models (LLMs) without compromising data privacy, the sheer size of an LLM makes on-device training impractical for resource-constrained clients such as mobile devices. Mixture-of-Experts (MoE) models have emerged as a compute-efficient solution by activating only a sparse subset of experts during model training.
Source: arXiv cs.LG
Vision • Score 96
Evaluating Anomaly Detectors for Simulated Highly Imbalanced Industrial Classification Problems
Machine learning offers potential solutions to current problems in industrial systems, such as quality control and predictive maintenance, but faces unique barriers in industrial applications. This paper presents a comprehensive evaluation of anomaly detection algorithms using a simulated dataset that reflects real-world engineering constraints.
Source: arXiv cs.AI
Vision • Score 96
Real-Time Human Detection in Aerially Captured Video Sequences via Deep Models
Human detection in videos plays an important role in various real-world applications. Traditional approaches rely on hand-crafted features, which are problem- and task-specific. This work uses automatic feature learning methods, combining optical flow with three different deep models for human detection in videos captured with a non-static camera on aerial platforms.
Source: arXiv cs.LG
Evaluation/Benchmarks • Score 92
Generative Conditional Missing-Value Imputation Networks
In this study, we present a sophisticated generative conditional strategy for imputing missing values in datasets, an area of major importance in statistical analysis. We elucidate the theoretical foundations of Generative Conditional Missing-value Imputation networks (GCMI) and demonstrate their robust properties under Missing Completely at Random (MCAR) and Missing at Random (MAR) settings.
Source: arXiv stat.ML
Vision • Score 95
Boosting Segment Anything Model to Generalize Visually Non-Salient Scenarios
arXiv:2601.00537v1 Announce Type: new
Abstract: Segment Anything Model (SAM), known for its remarkable zero-shot segmentation capabilities, has garnered significant attention in the community. Nevertheless, its performance is challenged when dealing with what we refer to as visually non-salient scenarios, where there is low contrast between the foreground and background. In these cases, existing methods often cannot capture accurate contours and fail to produce promising segmentation results. In this paper, we propose Visually Non-Salient SAM (VNS-SAM), aiming to enhance SAM's perception of visually non-salient scenarios while preserving its original zero-shot generalizability. We achieve this by effectively exploiting SAM's low-level features through two designs: Mask-Edge Token Interactive decoder and Non-Salient Feature Mining module. These designs help the SAM decoder gain a deeper understanding of non-salient characteristics with only marginal parameter increments and computational requirements. The additional parameters of VNS-SAM can be optimized within 4 hours, demonstrating its feasibility and practicality. In terms of data, we established VNS-SEG, a unified dataset for various VNS scenarios, with more than 35K images, in contrast to previous single-task adaptations. It is designed to make the model learn more robust VNS features and comprehensively benchmark the model's segmentation performance and generalizability on VNS scenarios. Extensive experiments across various VNS segmentation tasks demonstrate the superior performance of VNS-SAM, particularly under zero-shot settings, highlighting its potential for broad real-world applications. Codes and datasets are publicly available at https://guangqian-guo.github.io/VNS-SAM.
Source: arXiv cs.CV
RL • Score 96
Quantitative Rule-Based Strategy Modeling in Classic Indian Rummy: A Metric Optimization Approach
The 13-card variant of Classic Indian Rummy is a sequential game of incomplete information that requires probabilistic reasoning and combinatorial decision-making. This paper proposes a rule-based framework for strategic play, driven by a novel hand-evaluation metric called MinDist.
Source: arXiv cs.AI
NLP/LLMs • Score 95
Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers
arXiv:2601.00359v1 Announce Type: new
Abstract: In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer - an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text-based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time requirements, operating at 26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin. Additionally, we show qualitative results that highlight the effectiveness and possible use cases in real-world applications. Overall, our method serves as a drop-in replacement for traditional segmentation approaches while enabling flexible natural-language querying and seamless integration into 3D mapping pipelines for mobile robotics.
Source: arXiv cs.CV
NLP/LLMs • Score 95
From Transformers to LLMs: A Systematic Survey of Efficiency Considerations in NLP
arXiv:2406.16893v2 Announce Type: replace
Abstract: The emergence of Transformer-based Large Language Models (LLMs) has substantially augmented the capabilities of Natural Language Processing (NLP), thereby intensifying the demand for computational resources. Therefore, enhancing efficiency based on factors like computational requirements, energy consumption, carbon footprint and financial cost has become a vital area of research. This motivates us to conduct a systematic literature review on Transformer-based LLMs in NLP from the perspective of efficiency. In this survey of 312 articles published between the years 2011 and 2025, efficiency-improvement endeavors have been systematically discussed targeting various aspects such as data curation, model design, model downsizing, and dynamic inferencing. This has been augmented with efficiency considerations in model adaptation strategies like pre-training, fine-tuning, prompt-engineering and Retrieval-Augmented Generation (RAG). Furthermore, a statistical analysis of the articles has been performed, followed by an in-depth evaluation of the efficiency and efficacy of more than 30 renowned NLP models on 13 evaluation benchmarks. This paper offers valuable insights for researchers, professionals as well as scholars, and explores the trend of research toward sustainable practices in NLP.
Source: arXiv cs.CL
NLP/LLMs • Score 95
A Chain-of-Thought Approach to Semantic Query Categorization in e-Commerce Taxonomies
arXiv:2601.00510v1 Announce Type: cross
Abstract: Search in e-Commerce is powered at the core by a structured representation of the inventory, often formulated as a category taxonomy. An important capability in e-Commerce with hierarchical taxonomies is to select a set of relevant leaf categories that are semantically aligned with a given user query. In this scope, we address a fundamental problem of search query categorization in real-world e-Commerce taxonomies. A correct categorization of a query not only provides a way to zoom into the correct inventory space, but opens the door to multiple intent understanding capabilities for a query. A practical and accurate solution to this problem has many applications in e-commerce, including constraining retrieved items and improving the relevance of the search results. For this task, we explore a novel Chain-of-Thought (CoT) paradigm that combines simple tree-search with LLM semantic scoring. Assessing its classification performance on human-judged query-category pairs, relevance tests, and LLM-based reference methods, we find that the CoT approach performs better than a benchmark that uses embedding-based query category predictions. We show how the CoT approach can detect problems within a hierarchical taxonomy. Finally, we also propose LLM-based approaches for query-categorization of the same spirit, but which scale better at the range of millions of queries.
Source: arXiv cs.CL
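A hedged sketch of the tree-search-plus-LLM-scoring idea described above: descend a category taxonomy, ask an LLM to score each child's semantic fit for the query, and keep the best paths down to leaves. `llm_score` is a stub, and the beam-style descent is our illustration; the paper's actual CoT prompt design is not reproduced here.

```python
def llm_score(query: str, category_path: list[str]) -> float:
    raise NotImplementedError("call your LLM with a CoT scoring prompt")

def categorize(query: str, taxonomy: dict, path=None, beam=2):
    """Return candidate leaf paths; taxonomy is a nested dict of categories."""
    path = path or []
    if not taxonomy:                  # reached a leaf category
        return [path]
    scored = sorted(taxonomy,
                    key=lambda c: llm_score(query, path + [c]),
                    reverse=True)[:beam]
    leaves = []
    for child in scored:
        leaves += categorize(query, taxonomy[child], path + [child], beam)
    return leaves

# taxonomy = {"Electronics": {"Cameras": {}, "Phones": {}}, "Home": {}}
# print(categorize("mirrorless camera body", taxonomy))
```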
Evaluation/Benchmarks • Score 95
Categorical Reparameterization with Denoising Diffusion Models
Gradient-based optimization with categorical variables typically relies on score-function estimators, which are unbiased but noisy, or on continuous relaxations that replace the discrete distribution with a smooth surrogate. In this paper, we extend this family of relaxations by introducing a diffusion-based smooth reparameterization for categorical distributions, enabling a training-free diffusion sampler.
Source: arXiv stat.ML
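For context on the relaxation family being extended above, here is the standard Gumbel-Softmax reparameterization, which replaces a categorical sample with a differentiable softmax over perturbed logits. The paper's diffusion-based reparameterization itself is not reproduced here.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 0.5):
    # Sample Gumbel(0,1) noise and relax argmax into a temperature-controlled
    # softmax; tau -> 0 approaches a one-hot categorical sample.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)

logits = torch.randn(4, 10, requires_grad=True)  # 10 categories
soft_onehot = gumbel_softmax_sample(logits)       # differentiable sample
soft_onehot.sum().backward()                      # gradients flow to logits
```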
RL • Score 95
Sparse Tucker Decomposition and Graph Regularization for High-Dimensional Time Series Forecasting
Existing vector autoregressive methods for multivariate time series analysis use low-rank matrix approximation or Tucker decomposition to reduce the dimension of the over-parameterization problem. This paper proposes a sparse Tucker decomposition method with graph regularization for high-dimensional vector autoregressive time series.
Source: arXiv stat.ML
NLP/LLMs • Score 96
A Dynamic Bayesian Optimization Framework for Instruction Tuning in Partial Differential Equation Discovery
Large Language Models (LLMs) show promise for equation discovery, but their outputs are highly sensitive to prompt formulation, a phenomenon we call instruction brittleness. To address this, we propose NeuroSymBO, which reformulates prompt engineering as a sequential decision problem.
Source: arXiv cs.LG
RL • Score 96
Yahtzee: Reinforcement Learning Techniques for Stochastic Combinatorial Games
Yahtzee is a classic dice game with a stochastic, combinatorial structure and delayed rewards, making it an interesting mid-scale RL benchmark. This work formulates Yahtzee as a Markov Decision Process (MDP) and trains self-play agents using several policy-gradient methods.
Source: arXiv cs.AI
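A minimal REINFORCE sketch of the policy-gradient training described above. The observation and action sizes are placeholders; a real Yahtzee agent would encode dice, rerolls left, and the scorecard, with delayed end-of-game rewards.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 44))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def update(episode):
    """episode: list of (obs_tensor, action_idx, return_to_go) tuples."""
    loss = 0.0
    for obs, action, g in episode:
        logp = torch.log_softmax(policy(obs), dim=-1)[action]
        loss = loss - g * logp     # REINFORCE: maximize E[G * log pi(a|s)]
    opt.zero_grad()
    loss.backward()
    opt.step()
```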
NLP/LLMs • Score 96
Spike-Wave Discharge (SWD) Detection Using a 1-Dimensional Residual UNet
Manually labeling events in electroencephalography (EEG) recordings is time-consuming, especially when recordings run continuously for weeks to months. A method that automatically labels relevant EEG events reduces the manual workload. In this study, we compare the performance of 14 machine learning classifiers on a manually annotated dataset, finding that a 1D UNet is the most effective for labeling SWDs.
Source: arXiv cs.LG
RL • Score 93
Imitation from Observations with Trajectory-Level Generative Embeddings
We consider offline imitation learning from observations (LfO) where expert demonstrations are scarce and the available offline data are suboptimal and far from expert behavior. We propose TGE, a trajectory-level generative embedding that builds a dense, smooth surrogate reward by estimating the expert's state density with a temporal diffusion model trained on offline trajectory data.
Source: arXiv cs.LG
Vision • Score 96
A Comparative Study of Adaptation Strategies for Time Series Foundation Models in Anomaly Detection
Time series anomaly detection is essential for the reliable operation of complex systems, but most existing methods require extensive task-specific training. This study investigates whether time series foundation models (TSFMs), pre-trained on large heterogeneous data, can serve as universal backbones for anomaly detection.
Source: arXiv cs.LG
Vision • Score 96
IMBWatch -- A Spatio-Temporal Graph Neural Network Approach to Detecting Illicit Massage Businesses
Illicit Massage Businesses (IMBs) are a covert and persistent form of organized exploitation operating under the facade of legitimate wellness services. Detecting IMBs is difficult due to coded digital advertisements and frequent changes of staff and locations. We present IMBWatch, a spatio-temporal graph neural network (ST-GNN) framework for large-scale IMB detection that combines graph convolution operations with temporal attention mechanisms.
Source: arXiv cs.LG
RL • Score 96
ClinicalReTrial: A Self-Evolving AI Agent for Clinical Trial Protocol Optimization
Clinical trial failure remains a central bottleneck in drug development, where small flaws in protocol design can irreversibly compromise outcomes. This paper proposes ClinicalReTrial, a self-evolving AI agent framework that addresses this gap by treating clinical trial reasoning as an iterative protocol redesign problem.
Source: arXiv cs.AI
NLP/LLMs • Score 93
GRIT -- Geometry-Aware PEFT with K-FAC Preconditioning, Fisher-Guided Reprojection, and Dynamic Rank Adaptation
Parameter-efficient fine-tuning (PEFT) is the standard approach for adapting LLMs, but LoRA and QLoRA are largely geometry-agnostic. We introduce GRIT, a dynamic, curvature-aware LoRA procedure that preserves the LoRA parameterization while improving efficiency and reducing drift along weak directions. GRIT reduces trainable parameters by 46% on average while maintaining quality across prompt styles and data mixtures.
Source: arXiv cs.AI
RL • Score 96
Early Prediction of Liver Cirrhosis Up to Three Years in Advance: A Machine Learning Study Compared with FIB-4
Objective: To develop and evaluate machine learning (ML) models that predict incident liver cirrhosis one, two, and three years before diagnosis using routinely collected electronic health record (EHR) data, and to compare their performance with the FIB-4 score. Methods: We conducted a retrospective cohort study using de-identified EHR data from a large academic health system.
Source: arXiv cs.LG
Evaluation/Benchmarks • Score 96
Robust Graph Fine-Tuning with Adversarial Graph Prompting
Parameter-Efficient Fine-Tuning (PEFT) has emerged as a dominant paradigm for adapting pre-trained GNN models to specific tasks. However, existing PEFT methods typically show significant vulnerability to noise and attacks on graph topology and node attributes. We propose integrating adversarial learning into graph prompting, developing a novel Adversarial Graph Prompting (AGP) framework to achieve robust fine-tuning.
Source: arXiv cs.LG
NLP/LLMs • Score 96
Fine-Tuning Large Language Models for Automated Depression Screening in Nigerian Pidgin: The GENSCORE Pilot Study
Depression is a major contributor to the mental health burden in Nigeria, yet screening coverage is limited by poor access to clinicians, stigma, and language barriers. This study presents a novel approach to automated depression screening using large language models (LLMs) fine-tuned for conversational Nigerian Pidgin.
Source: arXiv cs.AI
Vision • Score 96
Optimized Hybrid Feature Engineering for Resource-Efficient Arrhythmia Detection in ECG Signals: An Optimization Framework
Cardiovascular diseases, especially arrhythmias, remain a leading cause of global mortality, demanding continuous monitoring via the Internet of Medical Things (IoMT). This study proposes a data-centric, resource-efficient framework that prioritizes feature engineering over model complexity, achieving high diagnostic accuracy with a lightweight model.
Source: arXiv cs.LG
RL • Score 96
A Comparative Analysis of Interpretable Machine Learning Methods
In recent years, Machine Learning (ML) has been widely adopted across sectors, including critical areas such as healthcare, finance, and law. This growing reliance has raised concerns about model interpretability and accountability, especially as legal and regulatory constraints are imposed on the use of black-box models. This study presents a comparative evaluation of 16 inherently interpretable methods across 216 real-world tabular datasets.
Source: arXiv cs.LG
RL • Score 96
Quantum King-Ring Domination in Chess: A QAOA Approach
The Quantum Approximate Optimization Algorithm (QAOA) is widely tested on synthetic random instances, which lack semantic structure and human interpretability. We present Quantum King-Ring Domination (QKRD), a NISQ-scale benchmark derived from tactical chess positions, offering 5,000 structured instances. Using QKRD, we evaluate QAOA design choices and show that problem-informed techniques reveal advantages hidden on random instances.
Source: arXiv cs.LG
Evaluation/Benchmarks • Score 89
MCD: Marginal Contrastive Discrimination for Conditional Density Estimation
We consider the problem of conditional density estimation, a topic of great interest in statistics and machine learning. Our method, called Marginal Contrastive Discrimination (MCD), reformulates the conditional density function as two factors: the marginal density function of the target variable and a ratio of density functions that can be estimated via binary classification.
Source: arXiv stat.ML
NLP/LLMs • Score 95
Do LLMs Judge Distantly Supervised Named Entity Labels Well? Constructing the JudgeWEL Dataset
arXiv:2601.00411v1 Announce Type: new
Abstract: We present judgeWEL, a dataset for named entity recognition (NER) in Luxembourgish, automatically labelled and subsequently verified using large language models (LLM) in a novel pipeline. Building datasets for under-represented languages remains one of the major bottlenecks in natural language processing, where the scarcity of resources and linguistic particularities make large-scale annotation costly and potentially inconsistent. To address these challenges, we propose and evaluate a novel approach that leverages Wikipedia and Wikidata as structured sources of weak supervision. By exploiting internal links within Wikipedia articles, we infer entity types based on their corresponding Wikidata entries, thereby generating initial annotations with minimal human intervention. Because such links are not uniformly reliable, we mitigate noise by employing and comparing several LLMs to identify and retain only high-quality labelled sentences. The resulting corpus is approximately five times larger than the currently available Luxembourgish NER dataset and offers broader and more balanced coverage across entity categories, providing a substantial new resource for multilingual and low-resource NER research.
Source: arXiv cs.CL
MLOps/Systems • Score 93
Understanding Emotion in Speech: Recognition Insights and Linguistic Patterns for Generation
Emotion recognition in conversations (ERC) has achieved high accuracy, but two critical gaps remain: limited understanding of which architectural choices actually matter, and a lack of linguistic analysis connecting recognition to generation. We address both gaps through a systematic analysis of the IEMOCAP dataset.
Source: arXiv cs.AI
Vision • Score 95
A Cascaded Information Interaction Network for Precise Image Segmentation
arXiv:2601.00562v1 Announce Type: new
Abstract: Visual perception plays a pivotal role in enabling autonomous behavior, offering a cost-effective and efficient alternative to complex multi-sensor systems. However, robust segmentation remains a challenge in complex scenarios. To address this, this paper proposes a cascaded convolutional neural network integrated with a novel Global Information Guidance Module. This module is designed to effectively fuse low-level texture details with high-level semantic features across multiple layers, thereby overcoming the inherent limitations of single-scale feature extraction. This architectural innovation significantly enhances segmentation accuracy, particularly in visually cluttered or blurred environments where traditional methods often fail. Experimental evaluations on benchmark image segmentation datasets demonstrate that the proposed framework achieves superior precision, outperforming existing state-of-the-art methods. The results highlight the effectiveness of the approach and its promising potential for deployment in practical robotic applications.
Source: arXiv cs.CV
Vision • Score 95
Robust Assembly Progress Estimation via Deep Metric Learning
arXiv:2601.00422v1 Announce Type: new
Abstract: In recent years, the advancement of AI technologies has accelerated the development of smart factories. In particular, the automatic monitoring of product assembly progress is crucial for improving operational efficiency, minimizing the cost of discarded parts, and maximizing factory productivity. However, in cases where assembly tasks are performed manually over multiple days, implementing smart factory systems remains a challenge. Previous work has proposed Anomaly Triplet-Net, which estimates assembly progress by applying deep metric learning to the visual features of products. Nevertheless, when visual changes between consecutive tasks are subtle, misclassification often occurs. To address this issue, this paper proposes a robust system for estimating assembly progress, even in cases of occlusion or minimal visual change, using a small-scale dataset. Our method leverages a Quadruplet Loss-based learning approach for anomaly images and introduces a custom data loader that strategically selects training samples to enhance estimation accuracy. We evaluated our approach using an image dataset captured during desktop PC assembly. The proposed Anomaly Quadruplet-Net outperformed existing methods on this dataset, improving estimation accuracy by 1.3% and reducing misclassification between adjacent tasks by 1.9%, demonstrating the effectiveness of the proposed method.
Source: arXiv cs.CV
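A hedged sketch of a standard quadruplet loss (Chen et al., 2017); whether Anomaly Quadruplet-Net uses exactly this formulation is an assumption. Here anchor and positive would share a progress stage, while the two negatives come from different other stages or anomaly images.

```python
import torch
import torch.nn.functional as F

def quadruplet_loss(anchor, positive, neg1, neg2, m1=1.0, m2=0.5):
    d_ap = F.pairwise_distance(anchor, positive)   # same-class distance
    d_an = F.pairwise_distance(anchor, neg1)       # anchor vs. negative
    d_nn = F.pairwise_distance(neg1, neg2)         # negative vs. negative
    # Push same-class pairs closer than both kinds of negative pairs.
    return (F.relu(d_ap - d_an + m1) + F.relu(d_ap - d_nn + m2)).mean()

emb = lambda n: torch.randn(n, 128, requires_grad=True)  # toy embeddings
loss = quadruplet_loss(emb(16), emb(16), emb(16), emb(16))
loss.backward()
```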
NLP/LLMs • Score 96
Trajectory Guard -- A Lightweight, Sequence-Aware Model for Real-Time Anomaly Detection in Agentic AI
Autonomous LLM agents generate multi-step action plans that can fail due to contextual misalignment or structural incoherence. Existing anomaly detection methods are not suited to this challenge. We present Trajectory Guard, a Siamese recurrent autoencoder that learns task-trajectory alignment, enabling unified detection of both incorrect plans and malformed plan structures.
Source: arXiv cs.LG
Evaluation/Benchmarks • Score 96
A Sparse-Attention Deep Learning Model Integrating Heterogeneous Multimodal Features for Parkinson's Disease Severity Profiling
Characterizing the heterogeneous presentation of Parkinson's disease (PD) requires integrating biological and clinical markers within a unified predictive framework. We propose the Class-Weighted Sparse-Attention Fusion Network (SAFN), an interpretable deep learning framework for robust multimodal profiling that overcomes limitations in interpretability and class imbalance.
Source: arXiv cs.LG
NLP/LLMs • Score 96
JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation
We present JP-TL-Bench, a lightweight, open benchmark designed to guide the iterative development of Japanese-English translation systems. The challenge is often 'which of these two good translations is better?' rather than 'is this translation acceptable?'. This distinction is crucial for Japanese-English, where subtle choices in politeness, implicature, ellipsis, and register strongly affect perceived naturalness.
Source: arXiv cs.AI
Evaluation/Benchmarks • Score 90
Race Time Prediction in Cycling: A Personalized Machine Learning Approach Using Route Topology and Training Load
Predicting cycling duration for a specific route is essential for training planning and event preparation. This work presents a machine learning approach that predicts ride duration using route topology features combined with the athlete's current fitness state, derived from training-load metrics.
Source: arXiv cs.LG
NLP/LLMs • Score 96
Do LLM Chatbots Talk Too Much? The YapBench Benchmark
Large Language Models (LLMs) such as ChatGPT, Claude, and Gemini increasingly act as general-purpose copilots, yet they often respond at excessive length to simple requests, increasing cognitive load and inflating token-based inference cost. We present YapBench, a lightweight benchmark for quantifying user-visible overgeneration on prompts where brevity is ideal.
Source: arXiv cs.LG
NLP/LLMs • Score 95
Knowledge Distillation for Temporal Knowledge Graph Reasoning with Large Language Models
arXiv:2601.00202v1 Announce Type: new
Abstract: Reasoning over temporal knowledge graphs (TKGs) is fundamental to improving the efficiency and reliability of intelligent decision-making systems and has become a key technological foundation for future artificial intelligence applications. Despite recent progress, existing TKG reasoning models typically rely on large parameter sizes and intensive computation, leading to high hardware costs and energy consumption. These constraints hinder their deployment on resource-constrained, low-power, and distributed platforms that require real-time inference. Moreover, most existing model compression and distillation techniques are designed for static knowledge graphs and fail to adequately capture the temporal dependencies inherent in TKGs, often resulting in degraded reasoning performance. To address these challenges, we propose a distillation framework specifically tailored for temporal knowledge graph reasoning. Our approach leverages large language models as teacher models to guide the distillation process, enabling effective transfer of both structural and temporal reasoning capabilities to lightweight student models. By integrating large-scale public knowledge with task-specific temporal information, the proposed framework enhances the student model's ability to model temporal dynamics while maintaining a compact and efficient architecture. Extensive experiments on multiple publicly available benchmark datasets demonstrate that our method consistently outperforms strong baselines, achieving a favorable trade-off between reasoning accuracy, computational efficiency, and practical deployability.
Source: arXiv cs.CL
NLP/LLMs • Score 95
RIMRULE: Improving Tool-Using Language Agents via MDL-Guided Rule Learning
arXiv:2601.00086v1 Announce Type: new
Abstract: Large language models (LLMs) often struggle to use tools reliably in domain-specific settings, where APIs may be idiosyncratic, under-documented, or tailored to private workflows. This highlights the need for effective adaptation to task-specific tools. We propose RIMRULE, a neuro-symbolic approach for LLM adaptation based on dynamic rule injection. Compact, interpretable rules are distilled from failure traces and injected into the prompt during inference to improve task performance. These rules are proposed by the LLM itself and consolidated using a Minimum Description Length (MDL) objective that favors generality and conciseness. Each rule is stored in both natural language and a structured symbolic form, supporting efficient retrieval at inference time. Experiments on tool-use benchmarks show that this approach improves accuracy on both seen and unseen tools without modifying LLM weights. It outperforms prompting-based adaptation methods and complements finetuning. Moreover, rules learned from one LLM can be reused to improve others, including long reasoning LLMs, highlighting the portability of symbolic knowledge across architectures.
Source: arXiv cs.CL
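A hedged sketch of an MDL-style objective for rule selection, in the spirit of the consolidation step described above: prefer rules that are short and that explain many failures with few exceptions. The cost terms (token counts) are illustrative assumptions, not RIMRULE's exact encoding.

```python
def description_length(rule: str, failures: list[str], covers) -> float:
    """covers(rule, failure) -> bool: does the rule explain this failure?"""
    rule_cost = len(rule.split())                  # cost of stating the rule
    exceptions = [f for f in failures if not covers(rule, f)]
    data_cost = sum(len(f.split()) for f in exceptions)
    return rule_cost + data_cost                   # L(rule) + L(data | rule)

def best_rule(candidates, failures, covers):
    # The MDL winner balances generality (coverage) against conciseness.
    return min(candidates,
               key=lambda r: description_length(r, failures, covers))
```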
RL • Score 96
Traffic-Aware Optimal Taxi Placement Using Graph Neural Network-Based Reinforcement Learning
In smart-city transportation, efficiently matching taxi supply with passenger demand requires real-time integration of urban traffic network data and mobility patterns. This paper presents a graph-based reinforcement learning (RL) framework for optimal taxi placement in metropolitan environments.
Source: arXiv cs.LG
NLP/LLMs • Score 95
Talk Less, Verify More: Improving LLM Assistants with Semantic Checks and Execution Feedback
arXiv:2601.00224v1 Announce Type: new
Abstract: As large language model (LLM) assistants become increasingly integrated into enterprise workflows, their ability to generate accurate, semantically aligned, and executable outputs is critical. However, current conversational business analytics (CBA) systems often lack built-in verification mechanisms, leaving users to manually validate potentially flawed results. This paper introduces two complementary verification techniques: Q*, which performs reverse translation and semantic matching between code and user intent, and Feedback+, which incorporates execution feedback to guide code refinement. Embedded within a generator-discriminator framework, these mechanisms shift validation responsibilities from users to the system. Evaluations on three benchmark datasets, Spider, Bird, and GSM8K, demonstrate that both Q* and Feedback+ reduce error rates and task completion time. The study also identifies reverse translation as a key bottleneck, highlighting opportunities for future improvement. Overall, this work contributes a design-oriented framework for building more reliable, enterprise-grade GenAI systems capable of trustworthy decision support.
Source: arXiv cs.CL
RL • Score 96
Adversarial Samples Are Not Created Equal
Over the last decade, several theories have been proposed to explain the widespread vulnerability of deep neural networks to adversarial evasion attacks. This work argues that samples exploiting brittle-but-predictive features and those that do not represent two distinct types of adversarial weakness and should be distinguished when evaluating adversarial robustness.
Source: arXiv cs.LG
Vision • Score 95
A Comprehensive Dataset for Human vs. AI Generated Image Detection
arXiv:2601.00553v1 Announce Type: new
Abstract: Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI-generated image detection consisting of 96,000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.
Source: arXiv cs.CV
RL • Score 95
VisNet: Efficient Person Re-Identification via Alpha-Divergence Loss, Feature Fusion and Dynamic Multi-Task Learning
arXiv:2601.00307v1 Announce Type: new
Abstract: Person re-identification (ReID) is an extremely important area in both surveillance and mobile applications, requiring strong accuracy with minimal computational cost. State-of-the-art methods give good accuracy but at high computational budgets. To remedy this, this paper proposes VisNet, a computationally efficient and effective re-identification model suitable for real-world scenarios. It combines several conceptual contributions: feature fusion at multiple scales with automatic attention on each, semantic clustering with anatomical body partitioning, a dynamic weight averaging technique to balance classification and semantic regularization, and the FIDI loss function for improved metric learning. The multi-scale fusion combines ResNet50's stages 1 through 4 without parallel paths, and the semantic clustering introduces spatial constraints through rule-based pseudo-labeling. VisNet achieves 87.05% Rank-1 and 77.65% mAP on the Market-1501 dataset with 32.41M parameters and 4.601 GFLOPs, offering a practical approach for real-time deployment in surveillance and mobile applications where computational resources are limited.
Source: arXiv cs.CV
RL • Score 95
Disentangling Hardness from Noise: An Uncertainty-Driven Model-Agnostic Framework for Long-Tailed Remote Sensing Classification
arXiv:2601.00278v1 Announce Type: new
Abstract: Long-Tailed distributions are pervasive in remote sensing due to the inherently imbalanced occurrence of grounded objects. However, a critical challenge remains largely overlooked, i.e., disentangling hard tail data samples from noisy ambiguous ones. Conventional methods often indiscriminately emphasize all low-confidence samples, leading to overfitting on noisy data. To bridge this gap, building upon Evidential Deep Learning, we propose a model-agnostic uncertainty-aware framework termed DUAL, which dynamically disentangles prediction uncertainty into Epistemic Uncertainty (EU) and Aleatoric Uncertainty (AU). Specifically, we introduce EU as an indicator of sample scarcity to guide a reweighting strategy for hard-to-learn tail samples, while leveraging AU to quantify data ambiguity, employing an adaptive label smoothing mechanism to suppress the impact of noise. Extensive experiments on multiple datasets across various backbones demonstrate the effectiveness and generalization of our framework, surpassing strong baselines such as TGN and SADE. Ablation studies provide further insights into the crucial choices of our design.
Source: arXiv cs.CV
NLP/LLMs • Score 95
TotalFM: An Organ-Separated Framework for 3D-CT Vision Foundation Models
arXiv:2601.00260v1 Announce Type: new
Abstract: While foundation models in radiology are expected to be applied to various clinical tasks, computational cost constraints remain a major challenge when training on 3D-CT volumetric data. In this study, we propose TotalFM, a radiological foundation model that efficiently learns the correspondence between 3D-CT images and linguistic expressions based on the concept of organ separation, utilizing a large-scale dataset of 140,000 series. By automating the creation of organ volume and finding-sentence pairs through segmentation techniques and Large Language Model (LLM)-based radiology report processing, and by combining self-supervised pre-training via VideoMAE with contrastive learning using volume-text pairs, we aimed to balance computational efficiency and representation capability. In zero-shot organ-wise lesion classification tasks, the proposed model achieved higher F1 scores in 83% (5/6) of organs compared to CT-CLIP and 64% (9/14) of organs compared to Merlin. These results suggest that the proposed model exhibits high generalization performance in a clinical evaluation setting using actual radiology report sentences. Furthermore, in zero-shot finding-wise lesion classification tasks, our model achieved a higher AUROC in 83% (25/30) of finding categories compared to Merlin. We also confirmed performance comparable to existing Vision-Language Models (VLMs) in radiology report generation tasks. Our results demonstrate that the organ-separated learning framework can serve as a realistic and effective design guideline for the practical implementation of 3D-CT foundation models.
Source: arXiv cs.CV
Vision • Score 92
Depth-Synergized Mamba Meets Memory Experts for All-Day Image Reflection Separation
arXiv:2601.00322v1 Announce Type: new
Abstract: Image reflection separation aims to disentangle the transmission layer and the reflection layer from a blended image. Existing methods rely on limited information from a single image, tending to confuse the two layers when their contrasts are similar, a challenge more severe at night. To address this issue, we propose the Depth-Memory Decoupling Network (DMDNet). It employs the Depth-Aware Scanning (DAScan) to guide Mamba toward salient structures, promoting information flow along semantic coherence to construct stable states. Working in synergy with DAScan, the Depth-Synergized State-Space Model (DS-SSM) modulates the sensitivity of state activations by depth, suppressing the spread of ambiguous features that interfere with layer disentanglement. Furthermore, we introduce the Memory Expert Compensation Module (MECM), leveraging cross-image historical knowledge to guide experts in providing layer-specific compensation. To address the lack of datasets for nighttime reflection separation, we construct the Nighttime Image Reflection Separation (NightIRS) dataset. Extensive experiments demonstrate that DMDNet outperforms state-of-the-art methods in both daytime and nighttime.
Source: arXiv cs.CV
Vision • Score 92
DynaDrag: Dynamic Drag-Style Image Editing by Motion Prediction
arXiv:2601.00542v1 Announce Type: new
Abstract: To achieve pixel-level image manipulation, drag-style image editing, which edits images using points or trajectories as conditions, is attracting widespread attention. Most previous methods follow a move-and-track framework, in which missed tracking and ambiguous tracking are unavoidable challenges. Other methods under different frameworks suffer from various problems, such as the large gap between the source image and the target edited image, as well as unreasonable intermediate points that can lead to low editability. To avoid these problems, we propose DynaDrag, the first dragging method under a predict-and-move framework. In DynaDrag, Motion Prediction and Motion Supervision are performed iteratively. In each iteration, Motion Prediction first predicts where the handle points should move, and Motion Supervision then drags them accordingly. We also propose to dynamically adjust the valid handle points to further improve performance. Experiments on face and human datasets showcase the superiority of DynaDrag over previous works.
Source: arXiv cs.CV
Vision • Score 95
ReMA: A Training-Free Plug-and-Play Mixing Augmentation for Video Behavior Recognition
arXiv:2601.00311v1 Announce Type: new
Abstract: Video behavior recognition demands stable and discriminative representations under complex spatiotemporal variations. However, prevailing data augmentation strategies for videos remain largely perturbation-driven, often introducing uncontrolled variations that amplify non-discriminative factors, weakening intra-class distributional structure and causing representation drift with inconsistent gains across temporal scales. To address these problems, we propose Representation-aware Mixing Augmentation (ReMA), a plug-and-play augmentation strategy that formulates mixing as a controlled replacement process to expand representations while preserving class-conditional stability. ReMA integrates two complementary mechanisms. First, the Representation Alignment Mechanism (RAM) performs structured intra-class mixing under distributional alignment constraints, suppressing irrelevant intra-class drift while enhancing statistical reliability. Then, the Dynamic Selection Mechanism (DSM) generates motion-aware spatiotemporal masks to localize perturbations, guiding them away from discrimination-sensitive regions and promoting temporal coherence. By jointly controlling how and where mixing is applied, ReMA improves representation robustness without additional supervision or trainable parameters. Extensive experiments on diverse video behavior benchmarks demonstrate that ReMA consistently enhances generalization and robustness across different spatiotemporal granularities.
Source: arXiv cs.CV
RL • Score 95
Certified Defense on the Fairness of Graph Neural Networks
Graph Neural Networks (GNNs) have emerged as a prominent graph learning model, but they are vulnerable to attacks that can corrupt the fairness of their predictions. In this paper, we propose a framework named ELEGANT, which offers a detailed theoretical analysis for certifying GNN fairness, without requiring retraining and without assumptions about GNN structure or parameters.
Source: arXiv stat.ML
NLP/LLMs • Score 96
Conscious Data Contribution via Community-Driven Chain-of-Thought Distillation
The current era of AI development places heavy emphasis on training large models on ever larger datasets. This paradigm has spawned new product categories such as LLM chatbots, but it has also raised concerns about data privacy and consumer choice. This paper addresses data portability and user autonomy in the context of LLMs that 'reason' using chain-of-thought (CoT) traces.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
Stable and Efficient Single-Rollout Reinforcement Learning for Multimodal Reasoning
Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm for improving the reasoning capabilities of Multimodal Large Language Models (MLLMs). We introduce MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves stable optimization and effective performance on multimodal reasoning.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
NL2CA: Auto-Formalizing Cognitive Decision-Making from Natural Language Using an Unsupervised Critic NL2LTL Framework
Cognitive computational models offer a formal and interpretable way to characterize human deliberation and decision-making, but their development remains labor-intensive. In this paper, we propose NL2CA, a novel method for auto-formalizing cognitive decision-making rules from natural-language descriptions of human experience.
Fonte: arXiv cs.AI
RL • Score 96
Offline Behavioral Data Selection
Behavior cloning is a widely adopted approach for offline policy learning from expert demonstrations. In this paper, we reveal offline behavioral data saturation, where policy performance plateaus quickly with a small fraction of the dataset. We propose the Stepwise Dual Ranking (SDR) method to extract a compact, informative subset from large offline behavioral datasets.
Fonte: arXiv cs.LG
RL • Score 92
Cluster-Based Generalized Additive Models Informed by Random Fourier Features
Explainable machine learning seeks to balance predictive accuracy and model transparency, especially in settings where black-box predictive models such as deep neural networks perform well empirically but are hard to interpret. This work introduces a mixture of generalized additive models (GAMs) that use random Fourier feature (RFF) representations to reveal locally adaptive structure in the data.
Fonte: arXiv stat.ML
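For context, a minimal sketch of the standard RFF map such a GAM mixture could consume, approximating an RBF kernel via z(x) = sqrt(2/D) cos(Wx + b); the feature count and bandwidth below are illustrative, not the paper's settings:

```python
import numpy as np

def random_fourier_features(X, n_features=256, bandwidth=1.0, seed=0):
    """Approximate an RBF kernel feature map: z(x) = sqrt(2/D) * cos(Xw + b)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / bandwidth, size=(d, n_features))  # spectral samples
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)            # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.random.rand(100, 5)
Z = random_fourier_features(X)   # inputs for a downstream (cluster-wise) GAM
print(Z.shape)                   # (100, 256)
```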
MLOps/Systems • Score 95
GenUQ: Predictive Uncertainty Estimates via Hyper-Generative Networks
Operator learning is a recent generalization of regression to mappings between functions, promising to replace expensive numerical integration of PDEs with fast evaluations of mappings between functional states of a system. In this paper, we present GenUQ, a measure-theoretic approach to UQ that avoids constructing a likelihood by introducing a hyper-generative network model.
Fonte: arXiv stat.ML
NLP/LLMs • Score 95
Toward Reasoning-Preserving Unlearning in Multimodal Large Language Models
Machine unlearning aims to erase requested data from trained models without full retraining. For Reasoning Multimodal Large Language Models (RMLLMs), this is challenging because intermediate reasoning steps can leak sensitive information. We present RMLLMU-Bench, the first benchmark for RMLLM unlearning that evaluates reasoning leakage and reasoning retention.
Fonte: arXiv cs.CL
RecSys • Score 96
Probabilistic User Digital Twins: Latent Representation Learning with Statistically Validated Semantics
Understanding user identity and behavior is central to applications such as personalization, recommendation, and decision support. We propose a probabilistic digital-twin framework in which each user is modeled as a latent stochastic state that generates observed behavioral data. The framework is applied to a dataset of user responses to capture stable aspects of user identity.
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset
arXiv:2512.17915v1 Announce Type: new
Abstract: The recently published Loquacious dataset aims to be a replacement for established English automatic speech recognition (ASR) datasets such as LibriSpeech or TED-Lium. The main goal of the Loquacious dataset is to provide properly defined training and test partitions across many acoustic and language domains, with an open license suitable for both academia and industry. To further promote the benchmarking and usability of this new dataset, we present additional resources in the form of n-gram language models (LMs), a grapheme-to-phoneme (G2P) model and pronunciation lexica, with open and public access. Utilizing those additional resources we show experimental results across a wide range of ASR architectures with different label units and topologies. Our initial experimental results indicate that the Loquacious dataset offers a valuable study case for a variety of common challenges in ASR.
Fonte: arXiv cs.CL
Vision • Score 96
Grad: Guided Relation Diffusion Generation for Graph Augmentation in Graph Fraud Detection
Graph Fraud Detection (GFD) in financial scenarios has become an urgent research topic for protecting the security of online payments. As fraudsters' camouflage strategies evolve, we propose the Grad model, which uses a supervised contrastive learning module to sharpen the distinction between fraudsters and benign users while generating auxiliary homophilic relations.
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
Toward Efficient Agents: A Co-Design of Inference Architecture and System
The rapid development of large language model (LLM)-based agents has opened new possibilities for autonomous multi-turn reasoning and tool-based decision-making. However, their real-world deployment is hindered by severe inefficiencies that arise not from isolated model inference but from systemic latency accumulated across reasoning loops, context growth, and heterogeneous tool interactions.
Fonte: arXiv cs.CL
RL • Score 95
ReGal: A First Look at PPO-based Legal AI for Judgment Prediction and Summarization in India
arXiv:2512.18014v1 Announce Type: new
Abstract: This paper presents an early exploration of reinforcement learning methodologies for legal AI in the Indian context. We introduce Reinforcement Learning-based Legal Reasoning (ReGal), a framework that integrates Multi-Task Instruction Tuning with Reinforcement Learning from AI Feedback (RLAIF) using Proximal Policy Optimization (PPO). Our approach is evaluated across two critical legal tasks: (i) Court Judgment Prediction and Explanation (CJPE), and (ii) Legal Document Summarization. Although the framework underperforms on standard evaluation metrics compared to supervised and proprietary models, it provides valuable insights into the challenges of applying RL to legal texts. These challenges include reward model alignment, legal language complexity, and domain-specific adaptation. Through empirical and qualitative analysis, we demonstrate how RL can be repurposed for high-stakes, long-document tasks in law. Our findings establish a foundation for future work on optimizing legal reasoning pipelines using reinforcement learning, with broader implications for building interpretable and adaptive legal AI systems.
Fonte: arXiv cs.CL
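For readers unfamiliar with PPO, the entry above builds on its clipped surrogate objective; here is a minimal PyTorch sketch of that loss (not ReGal's pipeline, and the toy log-probabilities and advantages are invented):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate: L = -E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
    ratio = torch.exp(logp_new - logp_old)  # pi_new / pi_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# toy usage with made-up per-token log-probs and advantages
logp_old = torch.tensor([-1.2, -0.8, -2.0])
logp_new = torch.tensor([-1.0, -0.9, -1.5])
adv = torch.tensor([0.5, -0.3, 1.2])
print(ppo_clip_loss(logp_new, logp_old, adv))
```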
RL • Score 96
ARC: Leveraging Compositional Representations for Cross-Problem Learning in VRPs
Vehicle Routing Problems (VRPs) with diverse real-world attributes have spurred recent interest in cross-problem learning approaches that generalize efficiently across variants. We propose ARC (Attribute Representation via Compositional Learning), a cross-problem learning framework that learns disentangled attribute representations by decomposing them into two complementary components.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
A Hybrid Inductive-Transductive Network for Traffic Flow Imputation at Unsampled Locations
Accurately imputing traffic flow at unsensed locations is challenging. We propose HINT, a Hybrid Inductive-Transductive Network, which uses an inductive-transductive training strategy to treat speed as a transductive signal while learning flow inductively. HINT consistently outperforms inductive baselines on three real-world datasets.
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital
arXiv:2512.18658v1 Announce Type: new
Abstract: Before closing venture capital financing rounds, lawyers conduct diligence that includes tying out the capitalization table: verifying that every security (for example, shares, options, warrants) and issuance term (for example, vesting schedules, acceleration triggers, transfer restrictions) is supported by large sets of underlying legal documentation. While LLMs continue to improve on legal benchmarks, specialized legal workflows, such as capitalization tie-out, remain out of reach even for strong agentic systems. The task requires multi-document reasoning, strict evidence traceability, and deterministic outputs that current approaches fail to reliably deliver. We characterize capitalization tie-out as an instance of a real-world benchmark for legal AI, analyze and compare the performance of existing agentic systems, and propose a world model architecture toward tie-out automation, and more broadly as a foundation for applied legal intelligence.
Fonte: arXiv cs.CL
RL • Score 93
What Is the Price of Monotonicity? A Multi-Dataset Benchmark of Gradient Boosting with Monotonic Constraints for Credit PD
Financial institutions face a trade-off between predictive accuracy and interpretability when deploying machine learning models for credit risk. This paper evaluates gradient boosting models with and without monotonic constraints across five public datasets and three libraries, defining the Price of Monotonicity (PoM) as the relative change in standard performance metrics.
Fonte: arXiv cs.LG
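As a concrete illustration, a hedged sketch of the kind of comparison the benchmark runs, here with LightGBM's monotone_constraints parameter on synthetic data (the feature roles and PoM-style AUC delta are illustrative, not the paper's datasets or protocol):

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))          # e.g. [debt_ratio, income, age]
y = (X[:, 0] - X[:, 1] + rng.normal(size=5000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

free = lgb.LGBMClassifier(n_estimators=200).fit(X_tr, y_tr)
mono = lgb.LGBMClassifier(
    n_estimators=200,
    monotone_constraints=[1, -1, 0],    # +1 increasing, -1 decreasing, 0 free
).fit(X_tr, y_tr)

auc_free = roc_auc_score(y_te, free.predict_proba(X_te)[:, 1])
auc_mono = roc_auc_score(y_te, mono.predict_proba(X_te)[:, 1])
print("PoM-style relative AUC change:", (auc_mono - auc_free) / auc_free)
```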
NLP/LLMs • Score 96
ChronoDreamer: Action-Conditioned World Model as an Online Simulator for Robotic Planning
We present ChronoDreamer, an action-conditioned world model for contact-rich robotic manipulation. Given a history of egocentric RGB frames, contact maps, actions, and joint states, ChronoDreamer predicts future video frames, contact distributions, and joint angles through a spatiotemporal transformer trained with MaskGIT-style masked prediction.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
LLM-based Few-Shot Early Rumor Detection with Imitation Agent
arXiv:2512.18352v1 Announce Type: new
Abstract: Early Rumor Detection (EARD) aims to identify the earliest point at which a claim can be accurately classified based on a sequence of social media posts. This is especially challenging in data-scarce settings. While Large Language Models (LLMs) perform well in few-shot NLP tasks, they are not well-suited for time-series data and are computationally expensive for both training and inference. In this work, we propose a novel EARD framework that combines an autonomous agent and an LLM-based detection model, where the agent acts as a reliable decision-maker for "early time point determination", while the LLM serves as a powerful "rumor detector". This approach offers the first solution for few-shot EARD, necessitating only the training of a lightweight agent and allowing the LLM to remain training-free. Extensive experiments on four real-world datasets show our approach boosts performance across LLMs and surpasses existing EARD methods in accuracy and earliness.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts
arXiv:2512.18608v1 Announce Type: new
Abstract: Automated masking of Personally Identifiable Information (PII) is critical for privacy-preserving conversational systems. While current frontier large language models demonstrate strong PII masking capabilities, concerns about data handling and computational costs motivate exploration of whether lightweight models can achieve comparable performance. We compare encoder-decoder and decoder-only architectures by fine-tuning T5-small and Mistral-Instruct-v0.3 on English datasets constructed from the AI4Privacy benchmark. We create different dataset variants to study label standardization and PII representation, covering 24 standardized PII categories and higher-granularity settings. Evaluation using entity-level and character-level metrics, type accuracy, and exact match shows that both lightweight models achieve performance comparable to frontier LLMs for PII masking tasks. Label normalization consistently improves performance across architectures. Mistral achieves higher F1 and recall with greater robustness across PII types but incurs significantly higher generation latency. T5, while less robust in conversational text, offers more controllable structured outputs and lower inference cost, motivating its use in a real-time Discord bot for real-world PII redaction. Evaluation on live messages reveals performance degradation under informal inputs. These results clarify trade-offs between accuracy, robustness, and computational efficiency, demonstrating that lightweight models can provide effective PII masking while addressing data handling concerns associated with frontier LLMs.
Fonte: arXiv cs.CL
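A hedged sketch of the seq2seq masking setup the study fine-tunes: "t5-small" below is a generic base checkpoint, not the paper's fine-tuned weights, and the "mask pii:" task prefix is our own invention, so meaningful output requires fine-tuning on AI4Privacy-style pairs first:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

text = "mask pii: Hi, I'm John Smith and my email is john@example.com"
inputs = tokenizer(text, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
# target after fine-tuning: "Hi, I'm [NAME] and my email is [EMAIL]"
```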
NLP/LLMs • Score 96
ASTIF: Adaptive Semantic-Temporal Integration for Cryptocurrency Price Forecasting
Financial time-series forecasting is an information fusion challenge, yet most existing models rely on static architectures that struggle to integrate heterogeneous knowledge sources. We propose ASTIF, an intelligent hybrid system that adapts its forecasting strategy in real time through confidence-based meta-learning, integrating complementary components to improve forecast accuracy.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
NEURO-GUARD: Neuro-Symbolic Generalization and Unbiased Adaptive Routing for Diagnostics -- Explainable Medical AI
Accurate and interpretable image-based diagnosis remains a central challenge in medical AI, especially in settings with limited data and critical clinical decisions. We present NEURO-GUARD, a novel knowledge-guided framework that integrates Vision Transformers (ViTs) with language-driven reasoning, improving performance and robustness across domains.
Fonte: arXiv cs.AI
Vision • Score 96
Multimodal Bayesian Network for Robust Casualty Assessment in Autonomous Triage
Mass Casualty Incidents can overwhelm emergency medical systems, and delays or errors in casualty assessment can result in preventable deaths. We present a decision-support framework that combines the outputs of multiple computer vision models, estimating signs of severe hemorrhage, respiratory distress, physical alertness, or visible trauma, in a Bayesian network built entirely from expert-defined rules.
Fonte: arXiv cs.AI
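To make the fusion idea concrete, here is a heavily simplified, naive-Bayes-style sketch of combining binary vision-model outputs under expert-specified probabilities; every number and sign name below is an invented placeholder, not the paper's expert rules or network structure:

```python
# Posterior over a binary "critical" triage state given observed signs,
# by direct enumeration (a naive-Bayes simplification of a full BN).
PRIOR_CRITICAL = 0.2
# (P(sign present | critical), P(sign present | not critical)) -- placeholders
LIKELIHOOD = {
    "hemorrhage":    (0.70, 0.05),
    "resp_distress": (0.60, 0.10),
    "trauma":        (0.80, 0.20),
}

def posterior_critical(observed: dict) -> float:
    p_c, p_nc = PRIOR_CRITICAL, 1.0 - PRIOR_CRITICAL
    for sign, present in observed.items():
        l1, l0 = LIKELIHOOD[sign]
        p_c *= l1 if present else (1 - l1)
        p_nc *= l0 if present else (1 - l0)
    return p_c / (p_c + p_nc)

print(posterior_critical({"hemorrhage": True, "resp_distress": False, "trauma": True}))
```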
NLP/LLMs • Score 96
FC-MIR: A Mobile Screen-Awareness Framework for Intent-Aware Recommendation Based on Frame-Compressed Trajectory Multimodal Reasoning
Identifying user intent from mobile UI operation trajectories is crucial for advancing UI understanding and enabling task-automation agents. We propose the FC-MIR framework, which uses keyframe sampling and adaptive concatenation to reduce visual redundancy and increase inference efficiency, integrating state-of-the-art MLLMs for trajectory summarization and intent prediction.
Fonte: arXiv cs.AI
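A minimal sketch of redundancy-reducing keyframe sampling of the kind the abstract describes: keep a frame only when it differs enough from the last kept frame. The mean-absolute-pixel-difference criterion and threshold are illustrative, not FC-MIR's actual method:

```python
import numpy as np

def sample_keyframes(frames, threshold=12.0):
    """frames: list of HxWx3 uint8 arrays; returns indices of kept frames."""
    kept = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(np.float32)
                      - frames[kept[-1]].astype(np.float32)).mean()
        if diff > threshold:
            kept.append(i)
    return kept

video = [np.full((64, 64, 3), v, dtype=np.uint8) for v in (0, 1, 50, 52, 120)]
print(sample_keyframes(video))  # -> [0, 2, 4]
```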
NLP/LLMs • Score 96
MEEA: Mere Exposure Effect-Based Adversarial Optimization for Jailbreaking LLMs
The rapid advancement of large language models (LLMs) has intensified concerns about the robustness of their safety alignment. We propose MEEA (Mere Exposure Effect Attack), a psychology-inspired automated framework for evaluating safety robustness in multi-turn interactions that exploits the mere exposure effect. Our experiments show that MEEA consistently achieves higher attack success rates on models such as GPT-4 and Claude-3.5.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
Observer, Not Player: Simulating Theory of Mind in LLMs through Game Observation
We present an interactive framework for evaluating whether large language models (LLMs) exhibit genuine 'understanding' in a simple but strategic environment. Using the game Rock-Paper-Scissors (RPS) as an example, our system positions the LLM as an Observer whose task is to identify the strategies in play and articulate the reasoning behind that judgment.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
Programmatic Rule Generation for Document Forgery Detection Using Large Language Models
Document forgery poses a growing threat to legal, economic, and governmental processes, demanding increasingly sophisticated verification mechanisms. This work investigates how large language models (LLMs) can be adapted to generate rule-based plausibility checks for forgery detection using limited hardware resources.
Fonte: arXiv cs.AI
Vision • Score 96
A Dataset and Benchmarks for Atrial Fibrillation Detection from Electrocardiograms of Intensive Care Unit Patients
Objective: Atrial fibrillation (AF) is the most common cardiac arrhythmia among intensive care unit (ICU) patients and can cause adverse health effects. In this study, we release a labeled ICU dataset and benchmarks for AF detection, comparing machine learning models across three data-driven artificial intelligence (AI) approaches.
Fonte: arXiv cs.LG
NLP/LLMs • Score 92
DramaBench: A Six-Dimension Evaluation Framework for Dramatic Script Continuation
Dramatic script continuation requires models to maintain character consistency, advance the plot coherently, and preserve dramatic structure, capabilities that existing benchmarks do not evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating dramatic script continuation across six independent dimensions.
Fonte: arXiv cs.CL
RL • Score 95
AraMix: Recycling, Refiltering, and Deduplicating to Deliver the Largest Arabic Pretraining Corpus
arXiv:2512.18834v1 Announce Type: new
Abstract: We present AraMix, a deduplicated Arabic pretraining corpus containing approximately 178 billion tokens across 179 million documents. Rather than scraping the web again, AraMix demonstrates that substantial value lies in systematically reusing and curating existing pretraining datasets: we combine seven publicly available Arabic web datasets, apply quality filtering designed specifically for Arabic text to re-filter some datasets, and perform cross-dataset deduplication, both MinHash and sentence-level. This approach reveals that nearly 60% of tokens across these independently collected corpora are duplicates, redundancy that any new scraping efforts will reproduce. Our work suggests that for lower resource languages, investment in curation pipelines for existing data yields greater returns than additional web crawls, an approach that allowed us to curate the largest heavily filtered publicly available Arabic pretraining corpus.
Fonte: arXiv cs.CL
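A hedged sketch of the MinHash-LSH near-duplicate detection that deduplication pipelines like AraMix's rely on, using the datasketch library; the shingle size and 0.8 threshold are illustrative, not the paper's settings:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(1, len(text) - 4))}:
        m.update(shingle.encode("utf8"))
    return m

docs = {
    "a": "the cat sat on the mat",
    "b": "the cat sat on the mat!",   # near-duplicate of "a"
    "c": "a completely different sentence",
}
lsh = MinHashLSH(threshold=0.8, num_perm=128)
for key, text in docs.items():
    sig = minhash(text)
    dupes = lsh.query(sig)            # earlier docs with estimated Jaccard >= 0.8
    if dupes:
        print(f"{key} duplicates {dupes}; drop it")
    else:
        lsh.insert(key, sig)
```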
Vision • Score 96
FedOAED: Federated On-Device Autoencoder Denoiser for Heterogeneous Data under Limited Client Availability
In recent years, machine learning (ML) and deep learning (DL) solutions have shown their potential in a variety of applications, but strict data-sharing regulations such as GDPR and HIPAA have hindered the deployment of data-driven applications. Federated Learning (FL) has emerged as a solution, but it still faces heterogeneity-related challenges. In this work, we present FedOAED, a novel algorithm that aims to mitigate client drift and the variance induced by partial client participation.
Fonte: arXiv cs.LG
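For orientation, a minimal FedAvg-style aggregation sketch, the generic baseline that algorithms like FedOAED improve on (not the paper's method): average client state_dicts, weighted by local dataset size:

```python
import copy
import torch

def fed_avg(client_states, client_sizes):
    total = float(sum(client_sizes))
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = sum(
            state[key] * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return avg

# toy usage with two "clients" sharing one architecture
net = torch.nn.Linear(4, 2)
s1 = {k: v + 1.0 for k, v in net.state_dict().items()}
s2 = {k: v - 1.0 for k, v in net.state_dict().items()}
net.load_state_dict(fed_avg([s1, s2], client_sizes=[100, 300]))
```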
MLOps/Systems • Score 93
Comparing Dynamical Models Through Diffeomorphic Vector Field Alignment
Models of dynamical systems, such as recurrent neural networks (RNNs), are increasingly popular in theoretical neuroscience for hypothesis generation and data analysis. Evaluating the dynamics of these models is crucial for understanding their learned generative mechanisms, but it faces significant challenges in comparing dynamics and identifying important motifs in high-dimensional nonlinear models.
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models
arXiv:2512.19004v1 Announce Type: new
Abstract: Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization.
We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.
Fonte: arXiv cs.CL
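A hedged sketch of the confidence-based remasking idea from the entry above: after a denoising step, hand the least confident committed tokens back to the denoiser so later steps can revise them. Tensor shapes and the remask fraction are illustrative:

```python
import torch

def remask_low_confidence(token_ids, probs, mask_id, remask_frac=0.1):
    """token_ids: (seq,) committed tokens; probs: (seq,) their confidences."""
    k = max(1, int(remask_frac * token_ids.numel()))
    low_conf = torch.topk(-probs, k).indices   # k least confident positions
    out = token_ids.clone()
    out[low_conf] = mask_id                    # hand them back to the denoiser
    return out

tokens = torch.tensor([11, 42, 7, 99, 5])
conf = torch.tensor([0.98, 0.31, 0.87, 0.45, 0.92])
print(remask_low_confidence(tokens, conf, mask_id=0, remask_frac=0.4))
# -> tensor([11,  0,  7,  0,  5])
```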
NLP/LLMs • Score 96
Convolutional neural operator-based transfer learning for solving PDEs
The convolutional neural operator is a recently proposed CNN-based architecture that guarantees structure-preserving continuous-discrete equivalence and enables genuine, alias-free learning of PDE solution operators. This neural operator has been shown to outperform baseline models such as DeepONet and the Fourier neural operator in surrogate accuracy in certain cases.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
Benchmarking neural surrogates on realistic spatiotemporal multiphysics flows
Predicting multiphysics dynamics is computationally expensive and challenging due to the severe coupling of heterogeneous, multiscale physical processes. We present REALM (REalistic AI Learning for Multiphysics), a rigorous benchmarking framework for testing neural surrogates on challenging reactive flows, with 11 high-fidelity datasets and a standardized training and evaluation protocol.
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression
arXiv:2512.17920v1 Announce Type: new
Abstract: Large language models (LLMs) exhibit degraded performance under prompt compression, but the mechanisms remain poorly understood. We introduce the Compression-Decay Comprehension Test (CDCT), a benchmark that independently measures constraint compliance (CC) and semantic accuracy (SA) across compression levels. We evaluate 9 frontier LLMs across 8 concepts using 5 compression levels from extreme (c=0.0, ~2 words) to none (c=1.0, ~135 words). A three-judge LLM jury achieves almost perfect inter-rater agreement on CC (Fleiss' κ = 0.90).
We observe a universal U-curve pattern in constraint compliance (97.2% prevalence), with violations peaking at medium compression (c=0.5, ~27 words). Counterintuitively, models perform better at extreme compression than medium lengths. The dimensions are statistically orthogonal (r=0.193, p=0.084), with constraint effects 2.9x larger than semantic effects.
Experimental validation via RLHF ablation confirms our constraint salience hypothesis: removing "helpfulness" signals improves CC by 598% on average (71/72 trials, p<0.001), with 79% achieving perfect compliance. This demonstrates that RLHF-trained helpfulness behaviors are the dominant cause of constraint violations at medium compression. Reasoning models outperform efficient models by 27.5% (Cohen's d=0.96).
Our findings reveal a fundamental tension between RLHF alignment and instruction-following, providing actionable guidelines for improving deployed systems.
Fonte: arXiv cs.CL
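For reference, the Wilson score interval used for per-condition accuracy estimates like those above is straightforward to compute; a small sketch (n = 100 items per condition matches the paper's setup, the 45/100 example is illustrative):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_interval(45, 100))   # CI for an accuracy of 0.45 on 100 items
```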
NLP/LLMs • Score 95
Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset
arXiv:2512.18533v1 Announce Type: new
Abstract: The proliferation of linguistically subtle political disinformation poses a significant challenge to automated fact-checking systems. Despite increasing emphasis on complex neural architectures, the empirical limits of text-only linguistic modeling remain underexplored. We present a systematic diagnostic evaluation of nine machine learning algorithms on the LIAR benchmark. By isolating lexical features (Bag-of-Words, TF-IDF) and semantic embeddings (GloVe), we uncover a hard "Performance Ceiling", with fine-grained classification not exceeding a Weighted F1-score of 0.32 across models. Crucially, a simple linear SVM (Accuracy: 0.624) matches the performance of pre-trained Transformers such as RoBERTa (Accuracy: 0.620), suggesting that model capacity is not the primary bottleneck. We further diagnose a massive "Generalization Gap" in tree-based ensembles, which achieve more than 99% training accuracy but collapse to approximately 25% on test data, indicating reliance on lexical memorization rather than semantic inference. Synthetic data augmentation via SMOTE yields no meaningful gains, confirming that the limitation is semantic (feature ambiguity) rather than distributional. These findings indicate that for political fact-checking, increasing model complexity without incorporating external knowledge yields diminishing returns.
Fonte: arXiv cs.CL
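A minimal sketch of the style of text-only baseline the study probes, TF-IDF features with a linear SVM; the toy claims and labels below are placeholders, not LIAR data:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = [
    "the senator voted to raise taxes twice",
    "crime rates have fallen every year since 2010",
    "the bill eliminates all funding for schools",
    "unemployment is at its lowest level in decades",
]
labels = ["false", "true", "false", "true"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["the senator eliminated school funding"]))
```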
NLP/LLMs • Score 96
Feature-Enhanced Graph Neural Networks for Classifying Synthetic Graph Generative Models: A Benchmarking Study
The ability to discriminate between graph generative models is fundamental to understanding complex structural patterns in synthetic graphs and the real-world structures they emulate. This work investigates the classification of synthetic graph families using a hybrid approach that combines Graph Neural Networks (GNNs) with graph-theoretic features.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
DeliveryBench: Can Agents Profit in the Real World?
arXiv:2512.19234v1 Announce Type: new
Abstract: LLMs and VLMs are increasingly used as embodied agents, yet existing benchmarks focus on simple short-horizon tasks and struggle to capture the rich, realistic constraints that shape real-world decision-making. To close this gap, we propose DeliveryBench, a city-scale embodied benchmark based on the real food-delivery profession.
Fonte: arXiv cs.AI
RL • Score 96
Embedded Safety-Aligned Intelligence via Differentiable Internal Alignment Embeddings
We present Embedded Safety-Aligned Intelligence (ESAI), a theoretical framework for multi-agent reinforcement learning that embeds alignment constraints directly into agents' internal representations using differentiable internal alignment embeddings. This work analyzes stability conditions and theoretical properties, positioning ESAI as a conceptual contribution to differentiable alignment mechanisms in multi-agent systems.
Fonte: arXiv cs.LG
Vision • Score 96
Self-organizing maps for water quality assessment in reservoirs and lakes: A systematic literature review
Sustainable water quality is fundamental to ecological balance and water security. This review examines the application of the Self-Organizing Map (SOM), an unsupervised AI technique, to water quality assessment, covering parameter selection, sampling strategies, and clustering approaches.
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
Teaching and Critiquing Conceptualization and Operationalization in NLP
NLP researchers frequently invoke abstract concepts such as 'interpretability', 'bias', 'reasoning', and 'stereotypes' without defining them. This paper describes a seminar created for students to explore questions of conceptualization and operationalization, with an interdisciplinary reading list and an emphasis on discussion and critique.
Fonte: arXiv cs.CL
Vision • Score 95
Data-driven particle dynamics: Structure-preserving coarse-graining for emergent behavior in non-equilibrium systems
arXiv:2508.12569v2 Announce Type: replace-cross
Abstract: Multiscale systems are ubiquitous in science and technology, but are notoriously challenging to simulate as short spatiotemporal scales must be appropriately linked to emergent bulk physics. When expensive high-dimensional dynamical systems are coarse-grained into low-dimensional models, the entropic loss of information leads to emergent physics which are dissipative, history-dependent, and stochastic. To machine learn coarse-grained dynamics from time-series observations of particle trajectories, we propose a framework using the metriplectic bracket formalism that preserves these properties by construction; most notably, the framework guarantees discrete notions of the first and second laws of thermodynamics, conservation of momentum, and a discrete fluctuation-dissipation balance crucial for capturing non-equilibrium statistics. We introduce the mathematical framework abstractly before specializing to a particle discretization. As labels are generally unavailable for entropic state variables, we introduce a novel self-supervised learning strategy to identify emergent structural variables. We validate the method on benchmark systems and demonstrate its utility on two challenging examples: (1) coarse-graining star polymers at challenging levels of coarse-graining while preserving non-equilibrium statistics, and (2) learning models from high-speed video of colloidal suspensions that capture coupling between local rearrangement events and emergent stochastic dynamics. We provide open-source implementations in both PyTorch and LAMMPS, enabling large-scale inference and extensibility to diverse particle-based systems.
Fonte: arXiv stat.ML
Vision • Score 96
Out-of-Distribution Detection in Molecular Complexes via Diffusion Models for Irregular Graphs
Predictive machine learning models generally excel on in-distribution data, but their performance degrades on out-of-distribution (OOD) inputs. We present a probabilistic OOD detection framework for complex 3D graph data, built on a diffusion model that learns the density of the training distribution in a fully unsupervised manner.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
HARBOR: A Holistic and Adaptive Risk Assessment Model for Behavioral Healthcare
Risk assessment in behavioral healthcare remains challenging due to the multimodal nature of patient data and the temporal dynamics of mood and affective disorders. In this work, we present HARBOR, a behavioral-health-aware language model designed to predict a discrete mood and risk score, termed the Harbor Risk Score (HRS).
Fonte: arXiv cs.AI
RL • Score 95
Neural CDEs as Correctors for Learned Time-Series Models
Learned time-series models, whether continuous or discrete, are widely used to forecast the states of a dynamical system. We propose a Predictor-Corrector mechanism in which the Predictor is a learned time-series model and the Corrector is a neural controlled differential equation. Adding the estimated errors to the predictions improves forecasting performance.
Fonte: arXiv stat.ML
NLP/LLMs • Score 95
MemEvolve: Meta-Evolution of Agent Memory Systems
arXiv:2512.18746v1 Announce Type: new
Abstract: Self-evolving memory systems are unprecedentedly reshaping the evolutionary paradigm of large language model (LLM)-based agents. Prior work has predominantly relied on manually engineered memory architectures to store trajectories, distill experience, and synthesize reusable tools, enabling agents to evolve on the fly within environment interactions. However, this paradigm is fundamentally constrained by the staticity of the memory system itself: while memory facilitates agent-level evolving, the underlying memory architecture cannot be meta-adapted to diverse task contexts. To address this gap, we propose MemEvolve, a meta-evolutionary framework that jointly evolves agents' experiential knowledge and their memory architecture, allowing agent systems not only to accumulate experience but also to progressively refine how they learn from it. To ground MemEvolve in prior research and foster openness in future self-evolving systems, we introduce EvolveLab, a unified self-evolving memory codebase that distills twelve representative memory systems into a modular design space (encode, store, retrieve, manage), providing both a standardized implementation substrate and a fair experimental arena. Extensive evaluations on four challenging agentic benchmarks demonstrate that MemEvolve achieves (I) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to 17.06%; and (II) strong cross-task and cross-LLM generalization, designing memory architectures that transfer effectively across diverse benchmarks and backbone models.
Fonte: arXiv cs.CL
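To make the modular design space concrete, a hedged sketch of EvolveLab's four operations (encode, store, retrieve, manage) as a Python interface; the method signatures are our own invention, not the actual codebase API:

```python
from abc import ABC, abstractmethod
from typing import Any, List

class MemorySystem(ABC):
    @abstractmethod
    def encode(self, trajectory: List[dict]) -> Any:
        """Compress a raw interaction trajectory into a memory record."""

    @abstractmethod
    def store(self, record: Any) -> None:
        """Persist the encoded record."""

    @abstractmethod
    def retrieve(self, query: str, k: int = 5) -> List[Any]:
        """Fetch the k records most relevant to the current task."""

    @abstractmethod
    def manage(self) -> None:
        """Consolidate, deduplicate, or forget records over time."""

# A meta-evolver in the spirit of MemEvolve would mutate concrete subclasses
# of this interface and keep the variants that improve downstream task reward.
```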
NLP/LLMs • Score 95
Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations
arXiv:2512.18906v1 Announce Type: new
Abstract: Over the years, automatic MT metrics have hillclimbed benchmarks and presented strong and sometimes human-level agreement with human ratings. Yet they remain black-box, offering little insight into their decision-making and often failing under real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R's evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R's reasoning captures translation-relevant information and is practically useful.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
A Large Language Model Based Method for Complex Logical Reasoning over Knowledge Graphs
arXiv:2512.19092v1 Announce Type: new
Abstract: Reasoning over knowledge graphs (KGs) with first-order logic (FOL) queries is challenging due to the inherent incompleteness of real-world KGs and the compositional complexity of logical query structures. Most existing methods rely on embedding entities and relations into continuous geometric spaces and answer queries via differentiable set operations. While effective for simple query patterns, these approaches often struggle to generalize to complex queries involving multiple operators, deeper reasoning chains, or heterogeneous KG schemas. We propose ROG (Reasoning Over knowledge Graphs with large language models), an ensemble-style framework that combines query-aware KG neighborhood retrieval with large language model (LLM)-based chain-of-thought reasoning. ROG decomposes complex FOL queries into sequences of simpler sub-queries, retrieves compact, query-relevant subgraphs as contextual evidence, and performs step-by-step logical inference using an LLM, avoiding the need for task-specific embedding optimization. Experiments on standard KG reasoning benchmarks demonstrate that ROG consistently outperforms strong embedding-based baselines in terms of mean reciprocal rank (MRR), with particularly notable gains on high-complexity query types. These results suggest that integrating structured KG retrieval with LLM-driven logical reasoning offers a robust and effective alternative for complex KG reasoning tasks.
Fonte: arXiv cs.CL
RL • Score 96
Monitoring Monitorability
Observability into the decision-making of modern AI systems may be necessary to safely deploy increasingly capable agents. Monitoring the chain of thought (CoT) of today's reasoning models has proven effective at detecting misbehavior. However, this 'monitorability' may be fragile under different training procedures and data sources.
Fonte: arXiv cs.AI
NLP/LLMs • Score 92
InstructNet: A Novel Approach to Multi-Label Instruction Classification through Advanced Deep Learning
People use search engines for a wide range of topics and items, from everyday essentials to more specialized objects. This study uses 'How To' articles to determine multi-label instruction categories, employing transformer-based deep neural architectures such as XLNet and BERT and achieving 97.30% accuracy with the XLNet architecture.
Fonte: arXiv cs.CL
RL • Score 96
APC-GNN++: An Adaptive Patient-Centric GNN with Context-Aware Attention and Mini-Graph Explainability for Diabetes Classification
We propose APC-GNN++, a patient-centric Graph Neural Network for diabetes classification. Our model integrates context-aware edge attention, confidence-guided mixing of node features and graph representations, and neighborhood-consistency regularization to better capture clinically meaningful relationships among patients.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
MoE-TransMov: A Transformer-Based Model for Next Point-of-Interest (POI) Prediction in Familiar and Unfamiliar Movements
Accurately predicting the next point of interest (POI) in human mobility trajectories is crucial for location-based services, enabling more timely and personalized recommendations. We propose MoE-TransMov, a Transformer-based model with a Mixture-of-Experts (MoE) architecture that captures distinct mobility patterns across different movement contexts, improving prediction accuracy.
Fonte: arXiv cs.LG
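For readers new to MoE layers, a generic sketch of top-1 expert routing of the kind the abstract describes; the expert count, sizes, and hard top-1 routing are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, d_model=64, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                          nn.Linear(d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (batch, d_model)
        weights = self.gate(x).softmax(dim=-1)  # routing probabilities
        best = weights.argmax(dim=-1)           # hard top-1 expert per input
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = best == i
            if sel.any():
                out[sel] = expert(x[sel])
        return out

print(Top1MoE()(torch.randn(8, 64)).shape)      # torch.Size([8, 64])
```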
NLP/LLMs • Score 96
TraCeR: Transformer-Based Competing Risk Analysis with Longitudinal Covariates
Survival analysis is a critical tool for modeling time-to-event data. Recent deep learning-based models have relaxed several modeling assumptions, but challenges remain in incorporating longitudinal covariates. We present TraCeR, a transformer-based survival analysis framework that handles longitudinal covariates and improves model calibration.
Fonte: arXiv cs.LG
RL • Score 96
Training Large Multimodal Reasoning Models Needs Better Ideas: A Three-Stage Framework for Synthesizing and Selecting Long Chains of Thought
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks through long Chains of Thought (CoT). Extending these successes to multimodal reasoning is challenging due to the complexity of integrating diverse input modalities and the scarcity of high-quality training data. In this paper, we propose SynSelect, a novel three-stage Synthesis-Selection framework for generating high-quality long CoT data for multimodal reasoning tasks.
Fonte: arXiv cs.AI
NLP/LLMs • Score 92
MDToC: Metacognitive Dynamic Tree of Concepts for Boosting Mathematical Problem-Solving in Large Language Models
Despite advances in mathematical reasoning capabilities, Large Language Models (LLMs) still struggle with calculation verification when using established prompting techniques. We present MDToC, a three-phase approach that builds a tree of concepts, develops accuracy-verified calculations for each concept, and uses majority voting to evaluate competing solutions.
Fonte: arXiv cs.CL
Evaluation/Benchmarks • Score 92
A Convex Loss Function for Set Prediction with Optimal Trade-offs Between Size and Conditional Coverage
We consider supervised learning problems in which set-valued predictions provide explicit uncertainty estimates. Using Choquet integrals (also known as Lovász extensions), we propose a convex loss function for nondecreasing set functions obtained as level sets of a real-valued function.
Fonte: arXiv stat.ML
Evaluation/Benchmarks • Score 95
Online Variational Mirror Descent for Robust Schrödinger Bridge Learning
The Schrödinger Bridge (SB) has evolved into a universal class of probabilistic generative models. However, estimated learning signals are intrinsically uncertain, and the reliability promised by existing methods often rests on speculative best-case scenarios. In this work, we propose a Variational Online Mirror Descent (OMD) framework for SB problems that provides greater stability to SB solvers.
Fonte: arXiv stat.ML
RL • Score 96
Trustworthy and Explainable Reinforcement Learning for Safe, Energy-Efficient Process Control: A Use Case in Industrial Compressed-Air Systems
This paper presents a trustworthy reinforcement learning approach for controlling industrial compressed-air systems. We develop a framework that enables safe, energy-efficient operation under realistic boundary conditions and introduce a multi-level explainability pipeline. An empirical evaluation shows that the learned policy is physically plausible and consistently respects system limits.
Fonte: arXiv cs.LG
NLP/LLMs • Score 92
LLMs on Drugs: Language Models Are Few-Shot Consumers
arXiv:2512.18546v1 Announce Type: new
Abstract: Large language models (LLMs) are sensitive to the personas imposed on them at inference time, yet prompt-level "drug" interventions have never been benchmarked rigorously. We present the first controlled study of psychoactive framings on GPT-5-mini using ARC-Challenge. Four single-sentence prompts -- LSD, cocaine, alcohol, and cannabis -- are compared against a sober control across 100 validation items per condition, with deterministic decoding, full logging, Wilson confidence intervals, and Fisher exact tests. Control accuracy is 0.45; alcohol collapses to 0.10 (p = 3.2e-8), cocaine to 0.21 (p = 4.9e-4), LSD to 0.19 (p = 1.3e-4), and cannabis to 0.30 (p = 0.041), largely because persona prompts disrupt the mandated "Answer: " template. Persona text therefore behaves like a "few-shot consumable" that can destroy reliability without touching model weights. All experimental code, raw results, and analysis scripts are available at https://github.com/lexdoudkin/llms-on-drugs.
Fonte: arXiv cs.CL
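The 2x2 Fisher exact test behind comparisons like control (45/100 correct) versus the "alcohol" persona (10/100 correct) in the study above is a one-liner with scipy:

```python
from scipy.stats import fisher_exact

#                 correct  incorrect
table = [[45, 55],   # control condition
         [10, 90]]   # alcohol persona
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio={odds_ratio:.2f}, p={p_value:.2e}")  # p on the order of 3e-8
```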
Vision • Score 96
Conflict-Driven Clause Learning with VSIDS Heuristics for Discrete Facility Layout
This paper studies the use of Conflict-Driven Clause Learning (CDCL) with VSIDS heuristics as a computational engine for discrete facility layout problems. The layout problem is modeled as a combinatorial assignment problem with a dense logical structure arising from adjacency, separation, and slot-availability constraints.
Fonte: arXiv cs.AI
RL • Score 96
Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability
Neuron identification is a popular tool in mechanistic interpretability, aiming to uncover the human-interpretable concepts represented by individual neurons in deep networks. Although algorithms such as Network Dissection and CLIP-Dissect have achieved great empirical success, a rigorous theoretical foundation is still missing, which is crucial for enabling trustworthy explanations. This work presents the first theoretical analysis of fundamental challenges concerning the faithfulness and stability of neuron explanations.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
Propose, Solve, Verify: Self-Play Through Formal Verification
Training models purely through self-play (without human data) has been a long-standing goal in AI, but its effectiveness for training large language models remains uncertain, especially for code generation. We study self-play in the context of verified code generation, where formal verification provides reliable correctness signals.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
MSC-180: A Benchmark for Automated Formal Theorem Proving from the Mathematics Subject Classification
Automated Theorem Proving (ATP) is a central research direction in artificial intelligence for achieving formal reasoning and verification. We propose MSC-180, a benchmark based on the MSC2020 mathematics subject classification comprising 180 formal verification problems spanning undergraduate and graduate levels, to evaluate and drive the development of AI systems with genuine mathematical reasoning abilities.
Fonte: arXiv cs.AI
Vision • Score 93
Insider Threat Detection Using GCN and Bi-LSTM with Explicit and Implicit Graph Representations
Insider threat detection (ITD) is challenging due to the subtle and covert nature of malicious activities carried out by trusted users. This paper proposes a post-hoc ITD framework that integrates explicit and implicit graph representations with temporal modeling to capture complex patterns of user behavior.
Fonte: arXiv cs.AI
Evaluation/Benchmarks • Score 93
Assignment-Routing Optimization: Solvers for Constrained Problems
We study the Joint Routing-Assignment (JRA) problem, where items must be assigned one-to-one to placeholders while simultaneously determining a Hamiltonian cycle that visits all nodes exactly once. We develop a solver tailored to practical packing-planning scenarios with richer constraints.
Fonte: arXiv cs.AI
Evaluation/Benchmarks • Score 93
Social Comparison without Explicit Inference of Others' Reward Values: A Constructive Approach Using a Probabilistic Generative Model
Social comparison, the process of evaluating one's own rewards relative to those of others, plays a fundamental role in primate social cognition. This study investigates whether monkeys merely recognize objective reward differences or infer others' subjective reward evaluations, using three computational models with different degrees of social information processing.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
Reflective Confidence: Correcting Reasoning Failures via Online Self-Correction
Large language models (LLMs) have demonstrated strong performance on complex reasoning tasks using techniques such as chain-of-thought and self-consistency. However, ensemble-based approaches, especially self-consistency, often incur substantial computational overhead. We propose reflective confidence, a novel reasoning framework that turns low-confidence signals into reflection triggers.
Fonte: arXiv cs.AI
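A hedged sketch of the general pattern the entry above describes: sample a few answers, use the vote margin as a confidence signal, and spend extra compute on reflection only below a threshold. The `generate` and `reflect` callables are hypothetical stand-ins for model calls, not the paper's implementation:

```python
from collections import Counter

def answer_with_reflection(generate, reflect, prompt, n=5, tau=0.6):
    samples = [generate(prompt) for _ in range(n)]
    answer, votes = Counter(samples).most_common(1)[0]
    confidence = votes / n
    if confidence < tau:                    # low agreement -> reflect once
        return reflect(prompt, samples)
    return answer

# toy usage with deterministic stand-ins
fake_gen = iter(["42", "42", "41", "42", "40"]).__next__
result = answer_with_reflection(lambda p: fake_gen(),
                                lambda p, s: "42 (revised)",
                                "What is 6*7?")
print(result)
```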
Evaluation/Benchmarks • Score 96
Clustering-Based Transfer Learning for Dynamic Multimodal Multiobjective Evolutionary Algorithms
Dynamic multimodal multiobjective optimization faces the challenge of simultaneously tracking multiple equivalent Pareto-optimal sets while maintaining population diversity in changing environments. This paper presents a new suite of test functions and a novel algorithm based on a clustering-based autoencoder dynamic response mechanism, aiming to improve diversity and convergence in evolutionary algorithms.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
Automatic Adaptation to Concept Complexity and Subjective Natural Concepts: A Chunking-Based Cognitive Model
A central problem in cognitive science concerns the fundamental psychological processes underlying the formation and retrieval of multiple types of concepts in short- and long-term memory (STM and LTM, respectively). We propose that chunking mechanisms play an essential role and show how the CogAct computational model grounds concept learning in fundamental cognitive processes and structures.
Fonte: arXiv cs.AI
Evaluation/Benchmarks • Score 90
Learning Generalizable Neural Operators for Inverse Problems
Inverse problems challenge existing neural operator architectures because ill-posed inverse maps violate assumptions of continuity, uniqueness, and stability. We present B2B^{-1}, a basis-to-basis neural operator framework that addresses this limitation, enabling deterministic, invertible, and probabilistic models to be learned within a single framework.
Fonte: arXiv cs.LG
Evaluation/Benchmarks • Score 93
KeenKT: Disambiguating Knowledge Mastery States for Knowledge Tracing
Knowledge Tracing (KT) aims to dynamically model a student's mastery of knowledge concepts based on their historical learning interactions. Most current methods rely on point estimates, which fail to distinguish true ability from momentary bursts or inattention, creating ambiguity in mastery judgments.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
CORE: Concept-Oriented Reinforcement for Closing the Definition-Application Gap in Mathematical Reasoning
Large language models (LLMs) often solve challenging mathematical exercises yet fail to apply the underlying concept when a problem demands genuine understanding. We present CORE (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
Recontextualization Mitigates Specification Gaming without Modifying the Specification
Developers often struggle to specify correct training labels and rewards. We propose recontextualization, which reduces how often language models game training signals by performing undesired behaviors that those signals mistakenly reinforce.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
CoPE: A Small Language Model for Steerable and Scalable Content Labeling
arXiv:2512.18027v1 Announce Type: new
Abstract: This paper details the methodology behind CoPE, a policy-steerable small language model capable of fast and accurate content labeling. We present a novel training curricula called Contradictory Example Training that enables the model to learn policy interpretation rather than mere policy memorization. We also present a novel method for generating content policies, called Binocular Labeling, which enables rapid construction of unambiguous training datasets. When evaluated across seven different harm areas, CoPE exhibits equal or superior accuracy to frontier models at only 1% of their size. We openly release a 9 billion parameter version of the model that can be run on a single consumer-grade GPU. Models like CoPE represent a paradigm shift for classifier systems. By turning an ML task into a policy writing task, CoPE opens up new design possibilities for the governance of online platforms.
Fonte: arXiv cs.CL
NLP/LLMs • Score 96
Vibe Reasoning: Eliciting Mathematical Capabilities from Frontier AI -- A Case Study on IMO 2025 Problem 6
We present Vibe Reasoning, a human-AI collaborative paradigm for solving complex mathematical problems. Our key insight is that frontier AI models already possess the knowledge needed to solve challenging problems but do not know how, what, or when to apply it. This work demonstrates the approach on IMO 2025 Problem 6.
Fonte: arXiv cs.AI
RL • Score 96
AL-GNN: Privacy-Preserving, Replay-Free Graph Continual Learning via Analytic Learning
Graph continual learning (CGL) enables graph neural networks to learn incrementally from graph-structured data without forgetting previously acquired knowledge. AL-GNN is a novel framework that eliminates the need for backpropagation and replay buffers, using principles from analytic learning theory to optimize learning.
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression
Multi-agent Large Language Model (LLM) systems face a critical bottleneck: redundant transmission of contextual information between agents consumes excessive bandwidth and computational resources. We present Q-KVComm, a novel protocol that enables direct transmission of compressed key-value (KV) cache representations between LLM agents.
Fonte: arXiv cs.CL
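For intuition, a generic sketch of per-tensor int8 quantization of a KV-cache slice, the basic primitive that adaptive schemes like the one above build on (the paper's actual codec is presumably more elaborate; the tensor shape below is illustrative):

```python
import torch

def quantize_int8(t):
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.round(t / scale).clamp(-127, 127).to(torch.int8)
    return q, scale                   # ship int8 payload + one fp scale

def dequantize(q, scale):
    return q.float() * scale

kv = torch.randn(2, 8, 16, 64)        # (key/value, heads, seq, head_dim)
q, s = quantize_int8(kv)
print("bytes:", q.numel(), "vs fp32:", kv.numel() * 4)
print("max abs error:", (dequantize(q, s) - kv).abs().max().item())
```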
NLP/LLMs • Score 95
On Finding Inconsistencies in Documents
arXiv:2512.18601v1 Announce Type: new
Abstract: Professionals in academia, law, and finance audit their documents because inconsistencies can result in monetary, reputational, and scientific costs. Language models (LMs) have the potential to dramatically speed up this auditing process. To understand their abilities, we introduce a benchmark, FIND (Finding INconsistencies in Documents), where each example is a document with an inconsistency inserted manually by a domain expert. Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies. Surprisingly, gpt-5 also found undiscovered inconsistencies present in the original documents. For example, on 50 arXiv papers, we judged 136 out of 196 of the model's suggestions to be legitimate inconsistencies missed by the original authors. However, despite these findings, even the best models miss almost half of the inconsistencies in FIND, demonstrating that inconsistency detection is still a challenging task.
Fonte: arXiv cs.CL
MLOps/Systems • Score 96
Modality-Dependent Memory Mechanisms in Cross-Modal Neuromorphic Computing
Spiking neural networks (SNNs) with memory promise energy-efficient neuromorphic computing, but their generalization across sensory modalities remains unexplored. We present the first comprehensive cross-modal ablation study of memory mechanisms in SNNs, evaluating Hopfield networks, Hierarchical Gated Recurrent Networks (HGRNs), and supervised contrastive learning (SCL) on visual (N-MNIST) and auditory (SHD) neuromorphic datasets.
Fonte: arXiv cs.LG
Evaluation/Benchmarks • Score 93
Comparative Evaluation of Explainable Machine Learning Versus Linear Regression for Predicting County-Level Lung Cancer Mortality Rates in the United States
Lung cancer (LC) is one of the leading causes of cancer-related mortality in the United States. Accurate prediction of LC mortality rates is crucial for guiding targeted interventions and addressing health disparities. This study applied three models: random forest (RF), gradient boosting regression (GBR), and linear regression (LR) to predict county-level LC mortality rates.
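A minimal sketch of the three-model comparison, with synthetic regression data standing in for the county-level dataset; features, targets, and hyperparameters are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=3000, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "RF": RandomForestRegressor(n_estimators=300, random_state=0),
    "GBR": GradientBoostingRegressor(random_state=0),
    "LR": LinearRegression(),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: R2={r2_score(y_te, pred):.3f}  MAE={mean_absolute_error(y_te, pred):.2f}")
```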
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
Keep It Light! Simplifying Image Clustering Through Text-Free Adapters
In the context of pre-trained models, effective classification can be achieved with lightweight readout layers. This work demonstrates that, in deep clustering, performance competitive with more complex methods can be obtained using a highly simplified, text-free training pipeline. Our approach, Simple Clustering via Pre-trained models (SCP), uses feature representations from pre-trained vision models and positive data pairs.
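A toy rendering of the SCP recipe: frozen pre-trained vision features followed by a light clustering step. The backbone, batch, and cluster count are stand-ins rather than the paper's pipeline (the pretrained weights are downloaded on first use).

```python
import torch
import torchvision.models as models
from sklearn.cluster import KMeans

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # drop the classification head
backbone.eval()                            # frozen: no fine-tuning, no text

images = torch.randn(64, 3, 224, 224)      # stand-in image batch
with torch.no_grad():
    feats = backbone(images)               # (64, 512) frozen features

labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(feats.numpy())
print(labels[:16])
```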
Fonte: arXiv stat.ML
NLP/LLMs • Score 96
Parameter-Efficient Fine-Tuning for HAR: Integrating LoRA and QLoRA into Transformer Models
Human activity recognition (HAR) is a fundamental task in pervasive computing. This work investigates parameter-efficient fine-tuning techniques, specifically Low-Rank Adaptation (LoRA) and Quantized LoRA, as scalable alternatives to full model fine-tuning for HAR, demonstrating competitive performance with fewer trainable parameters and lower memory usage.
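A minimal LoRA sketch with the Hugging Face peft library; the base model, target modules, and ranks are illustrative assumptions rather than the paper's exact HAR configuration (HAR models typically ingest sensor windows, not text).

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=6)     # e.g. six activity classes

config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],     # attention projections in BERT
)
model = get_peft_model(base, config)
model.print_trainable_parameters()         # only the low-rank adapters train
```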
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
Towards Evaluating Privacy Vulnerabilities in Selective Forgetting with Large Language Models
Rapid advances in artificial intelligence (AI) have focused on learning from data to build informed learning systems. As these systems are deployed in critical areas, ensuring their privacy and alignment with human values is essential. Selective forgetting, or machine unlearning, emerges as a promising approach, but it also raises significant privacy concerns, especially in sensitive domains.
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
Solver-Independent Automatic Problem Formulation via LLMs for Expensive Simulation-Driven Design
In expensive simulation-driven design, translating ambiguous design requirements into a mathematical optimization formulation is a bottleneck for optimizing product performance. We propose APF, a framework for solver-independent automated problem formulation via LLMs, which automatically converts engineers' natural-language requirements into executable optimization models.
Fonte: arXiv cs.CL
NLP/LLMs • Score 92
Mitigating Spurious Correlations in NLI via LLM-Synthesized Counterfactuals and Dynamic Balanced Sampling
arXiv:2512.18462v1 Announce Type: new
Abstract: Natural Language Inference (NLI) models frequently rely on spurious correlations rather than semantic reasoning. Existing mitigation strategies often incur high annotation costs or trigger catastrophic forgetting during fine-tuning. We propose an automated, scalable pipeline to address these limitations. First, we introduce Log-Frequency LMI (LF-LMI) to accurately detect semantic artifacts. Second, we generate a high-quality synthetic contrast set via an LLM-synthesis pipeline with multi-judge verification. Finally, we introduce Dynamic Balanced Sampling, a training strategy that rotates the original data distribution to prevent forgetting. Our method improves consistency on a challenging benchmark from 63.5% to 81.0% while maintaining 88.4% in-domain accuracy, significantly outperforming naive fine-tuning.
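The abstract does not spell out the LF-LMI formula, so the sketch below computes classic local mutual information between tokens and labels and adds a log-frequency damping factor as a guess at the "LF" part; all counts are synthetic.

```python
import math
from collections import Counter

pairs = ([("not", "contradiction")] * 40 + [("not", "entailment")] * 2
         + [("cat", "entailment")] * 10 + [("cat", "contradiction")] * 10
         + [("dog", "entailment")] * 30 + [("dog", "contradiction")] * 8)

word_label = Counter(pairs)
word = Counter(w for w, _ in pairs)
label = Counter(y for _, y in pairs)
n = len(pairs)

def lf_lmi(w, y):
    p_wy = word_label[(w, y)] / n
    lmi = p_wy * math.log((word_label[(w, y)] / word[w]) / (label[y] / n))
    return lmi * math.log(1 + word[w])     # log-frequency factor (assumed)

print(f"not/contradiction (artifact): {lf_lmi('not', 'contradiction'):.3f}")
print(f"cat/entailment (benign):      {lf_lmi('cat', 'entailment'):.3f}")
```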
Fonte: arXiv cs.CL
RL • Score 96
The Challenger: When Do New Data Sources Justify Switching Machine Learning Models?
We study the problem of deciding whether and when an organization should replace an incumbent trained model with a challenger that uses newly available features. We develop a unified economic and statistical framework that relates learning-curve dynamics, data acquisition and retraining costs, and discounting of future gains.
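A back-of-envelope rendering of the switch/stay decision: weigh the discounted stream of the challenger's gains against one-time data-acquisition and retraining costs. All numbers are invented, and the paper's framework also models learning-curve dynamics that this sketch omits.

```python
def switch_value(gain_per_period, discount, horizon, data_cost, retrain_cost):
    # Present value of the challenger's per-period gain minus one-time costs.
    pv_gain = sum(gain_per_period * discount**t for t in range(1, horizon + 1))
    return pv_gain - (data_cost + retrain_cost)

v = switch_value(gain_per_period=12_000, discount=0.95, horizon=24,
                 data_cost=80_000, retrain_cost=25_000)
print(f"net present value of switching: {v:,.0f}")   # switch iff positive
```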
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
MoE Pathfinder: Trajectory-Driven Expert Pruning
Mixture-of-experts (MoE) architectures in large language models (LLMs) achieve state-of-the-art performance across diverse tasks, but face practical challenges such as deployment complexity and low activation efficiency. Expert pruning has emerged as a promising solution to reduce computational overhead and simplify the deployment of MoE models.
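For intuition, a crude usage-based baseline: rank experts by how often the router selects them on a calibration stream and drop the least used. MoE Pathfinder's trajectory-driven criterion is richer than this stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated router decisions over a calibration stream (16 experts).
router_choices = rng.choice(16, size=10_000, p=rng.dirichlet(np.ones(16)))
counts = np.bincount(router_choices, minlength=16)

keep = np.argsort(counts)[-8:]             # retain the 8 most-used experts
print("pruned experts:", sorted(set(range(16)) - set(keep.tolist())))
```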
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
Learning to Prioritize IT Tickets: A Comparative Evaluation of Embedding-based Approaches and Fine-Tuned Transformer Models
arXiv:2512.17916v1 Announce Type: new
Abstract: Prioritizing service tickets in IT Service Management (ITSM) is critical for operational efficiency but remains challenging due to noisy textual inputs, subjective writing styles, and pronounced class imbalance. We evaluate two families of approaches for ticket prioritization: embedding-based pipelines that combine dimensionality reduction, clustering, and classical classifiers, and a fine-tuned multilingual transformer that processes both textual and numerical features. Embedding-based methods exhibit limited generalization across a wide range of thirty configurations, with clustering failing to uncover meaningful structures and supervised models highly sensitive to embedding quality. In contrast, the proposed transformer model achieves substantially higher performance, with an average F1-score of 78.5% and weighted Cohen's kappa values of nearly 0.80, indicating strong alignment with true labels. These results highlight the limitations of generic embeddings for ITSM data and demonstrate the effectiveness of domain-adapted transformer architectures for operational ticket prioritization.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3
arXiv:2512.18399v1 Announce Type: new
Abstract: Tokenization is a critical preprocessing step for large language models (LLMs), directly impacting training efficiency and downstream performance. General-purpose tokenizers trained predominantly on English and Latin-script languages exhibit suboptimal performance on morphologically rich languages such as Arabic, resulting in inflated token sequences and reduced compression efficiency. In this work, we present AraToken, an Arabic-optimized tokenizer built on SentencePiece Unigram algorithm with a comprehensive normalization pipeline addressing Arabic-specific orthographic variations including Alif variants, diacritics, and Arabic-Indic numerals. We systematically compare BPE, WordPiece, and SentencePiece algorithms across multiple configurations, demonstrating that SentencePiece with normalization achieves 18% lower fertility (1.199 vs 1.35 tokens/word) compared to unnormalized baselines. Furthermore, we introduce the Language Extension Pipeline (LEP), a method for integrating the optimized tokenizer into Qwen3-0.6B through vocabulary extension with mean subtoken initialization and selective transformer layer unfreezing. Our experiments show that LEP reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. We release our tokenizer, training scripts, and model checkpoints to facilitate Arabic NLP research.
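A sketch of an Arabic normalization pass of the kind described: unify Alif variants, strip diacritics, and map Arabic-Indic digits; the exact AraToken rules are assumptions here, and fertility is simply tokens per word.

```python
import re

ALIF_VARIANTS = "\u0622\u0623\u0625"               # variants mapped to bare Alif
DIACRITICS = re.compile("[\u064B-\u0652]")         # fathatan through sukun
ARABIC_INDIC = {0x0660 + i: str(i) for i in range(10)}

def normalize(text: str) -> str:
    text = re.sub(f"[{ALIF_VARIANTS}]", "\u0627", text)
    text = DIACRITICS.sub("", text)
    return text.translate(ARABIC_INDIC)

def fertility(n_tokens: int, n_words: int) -> float:
    return n_tokens / n_words                      # lower is better

sample = "\u0623\u064e\u0647\u0652\u0644\u0627\u064b \u0661\u0662\u0663"
print(normalize(sample))    # Alif unified, diacritics gone, digits mapped
```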
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators
arXiv:2512.17267v1 Announce Type: new
Abstract: Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward to equal effect as a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
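A toy version of the composition step: regress candidate metric scores onto scarce human feedback and score outputs with the fitted combination. Metric outputs below are synthetic stand-ins for MetricBank and LLM-judge scores.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
metric_scores = rng.normal(size=(80, 6))           # 80 outputs x 6 metrics
human = metric_scores @ np.array([0.6, 0.3, 0.0, 0.0, 0.1, 0.0]) \
        + rng.normal(scale=0.2, size=80)           # latent human signal

reg = Ridge(alpha=1.0).fit(metric_scores, human)   # learn metric weights
tau, _ = kendalltau(reg.predict(metric_scores), human)
print(f"Kendall tau of composite metric vs human: {tau:.3f}")
```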
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
When the Gold Standard isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content
arXiv:2512.17738v1 Announce Type: new
Abstract: User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation particularly challenging: what counts as a "good" translation depends on the level of standardness desired in the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. Through a case study on large language models (LLMs), we show that translation scores are highly sensitive to prompts with explicit translation instructions for UGC, and that they improve when these align with the dataset's guidelines. We argue that when preserving UGC style is important, fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Confidence-Credibility Aware Weighted Ensembles of Small LLMs Outperform Large LLMs in Emotion Detection
arXiv:2512.17630v1 Announce Type: new
Abstract: This paper introduces a confidence-weighted, credibility-aware ensemble framework for text-based emotion detection, inspired by Condorcet's Jury Theorem (CJT). Unlike conventional ensembles that often rely on homogeneous architectures, our approach combines architecturally diverse small transformer-based large language models (sLLMs) - BERT, RoBERTa, DistilBERT, DeBERTa, and ELECTRA, each fully fine-tuned for emotion classification. To preserve error diversity, we minimize parameter convergence while taking advantage of the unique biases of each model. A dual-weighted voting mechanism integrates both global credibility (validation F1 score) and local confidence (instance-level probability) to dynamically weight model contributions. Experiments on the DAIR-AI dataset demonstrate that our credibility-confidence ensemble achieves a macro F1 score of 93.5 percent, surpassing state-of-the-art benchmarks and significantly outperforming large-scale LLMs, including Falcon, Mistral, Qwen, and Phi, even after task-specific Low-Rank Adaptation (LoRA). With only 595M parameters in total, our small LLMs ensemble proves more parameter-efficient and robust than models up to 7B parameters, establishing that carefully designed ensembles of small, fine-tuned models can outperform much larger LLMs in specialized natural language processing (NLP) tasks such as emotion detection.
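The dual-weighted vote is easy to state in code: each model contributes its class probabilities scaled by its global credibility (validation F1). A minimal sketch with invented numbers:

```python
import numpy as np

credibility = {"bert": 0.91, "roberta": 0.93, "electra": 0.89}   # val F1
probs = {   # per-model class probabilities for one input (4 emotions)
    "bert":    np.array([0.10, 0.70, 0.15, 0.05]),
    "roberta": np.array([0.05, 0.55, 0.30, 0.10]),
    "electra": np.array([0.20, 0.40, 0.30, 0.10]),
}

vote = sum(credibility[m] * p for m, p in probs.items())
print("ensemble prediction:", int(np.argmax(vote)))   # class index 1
```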
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
arXiv:2512.17220v1 Announce Type: new
Abstract: Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Globally Optimal Solution to the Generalized Relative Pose Estimation Problem using Affine Correspondences
arXiv:2512.17188v1 Announce Type: new
Abstract: Mobile platforms equipped with a multi-camera system and an inertial measurement unit (IMU), such as self-driving cars, are widely used nowadays. The task of relative pose estimation using visual and inertial information has important applications in various fields. To improve the accuracy of relative pose estimation of multi-camera systems, we propose a globally optimal solver using affine correspondences to estimate the generalized relative pose with a known vertical direction. First, a cost function in the relative rotation angle is established after decoupling the rotation matrix and translation vector, which minimizes the algebraic error of geometric constraints from affine correspondences. Then, the global optimization problem is converted into two polynomials in two unknowns by combining the characteristic equation with the condition that its first derivative is zero. Finally, the relative rotation angle can be solved using a polynomial eigenvalue solver, and the translation vector can be obtained from the eigenvector. In addition, a new linear solution is proposed for the case when the relative rotation is small. The proposed solver is evaluated on synthetic data and real-world datasets. The experimental results demonstrate that our method outperforms comparable state-of-the-art methods in accuracy.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora
arXiv:2512.17756v1 Announce Type: new
Abstract: Comprehension of ancient texts plays an important role in archaeology and understanding of Chinese history and civilization. The rapid development of large language models needs benchmarks that can evaluate their comprehension of ancient characters. Existing Chinese benchmarks are mostly targeted at modern Chinese and transmitted documents in ancient Chinese, but the part of excavated documents in ancient Chinese is not covered. To meet this need, we propose the AncientBench, which aims to evaluate the comprehension of ancient characters, especially in the scenario of excavated documents. The AncientBench is divided into four dimensions, which correspond to the four competencies of ancient character comprehension: glyph comprehension, pronunciation comprehension, meaning comprehension, and contextual comprehension. The benchmark also contains ten tasks, including radical, phonetic radical, homophone, cloze, translation, and more, providing a comprehensive framework for evaluation. We convened archaeological researchers to conduct experimental evaluations, proposed an ancient model as baseline, and conducted extensive experiments on the currently best-performing large language models. The experimental results reveal the great potential of large language models in ancient textual scenarios as well as the gap with humans. Our research aims to promote the development and application of large language models in the field of archaeology and ancient Chinese language.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Simulstream: Open-Source Toolkit for Evaluation and Demonstration of Streaming Speech-to-Text Translation Systems
arXiv:2512.17648v1 Announce Type: new
Abstract: Streaming Speech-to-Text Translation (StreamST) requires producing translations concurrently with incoming speech, imposing strict latency constraints and demanding models that balance partial-information decision-making with high translation quality. Research efforts on the topic have so far relied on the SimulEval repository, which is no longer maintained and does not support systems that revise their outputs. In addition, it has been designed for simulating the processing of short segments, rather than long-form audio streams, and it does not provide an easy method to showcase systems in a demo. As a solution, we introduce simulstream, the first open-source framework dedicated to unified evaluation and demonstration of StreamST systems. Designed for long-form speech processing, it supports not only incremental decoding approaches, but also re-translation methods, enabling their comparison within the same framework in terms of both quality and latency. In addition, it offers an interactive web interface to demo any system built within the tool.
Fonte: arXiv cs.CL
NLP/LLMs • Score 96
When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation
arXiv:2512.17083v1 Announce Type: cross
Abstract: Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of prior work, evaluation practice in dialogue topic segmentation remains dominated by strict boundary matching and F1-based metrics, even as modern LLM-based conversational systems increasingly rely on segmentation to manage conversation history beyond the model's fixed context window, where unstructured context accumulation degrades efficiency and coherence.
This paper introduces an evaluation objective for dialogue topic segmentation that treats boundary density and segment coherence as primary criteria, alongside window-tolerant F1 (W-F1). Through extensive cross-dataset empirical evaluation, we show that reported performance differences across dialogue segmentation benchmarks are driven not by model quality, but by annotation granularity mismatches and sparse boundary labels. This indicates that many reported improvements arise from evaluation artifacts rather than improved boundary detection.
We evaluated multiple, structurally distinct dialogue segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Across these settings, we observe high segment coherence combined with extreme oversegmentation relative to sparse labels, producing misleadingly low exact-match F1 scores. We show that topic segmentation is best understood as selecting an appropriate granularity rather than predicting a single correct boundary set. We operationalize this view by explicitly separating boundary scoring from boundary selection.
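A minimal sketch of a window-tolerant F1 of the kind advocated here: a predicted boundary counts as a hit if a reference boundary lies within a tolerance window, with greedy one-to-one matching; the paper's exact W-F1 definition may differ in details.

```python
def window_f1(pred, ref, w=2):
    # Greedy one-to-one matching of predicted to reference boundaries.
    ref_left = sorted(ref)
    tp = 0
    for p in sorted(pred):
        match = next((r for r in ref_left if abs(r - p) <= w), None)
        if match is not None:
            tp += 1
            ref_left.remove(match)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

print(window_f1(pred=[3, 10, 18], ref=[4, 11, 25], w=2))   # 2 of 3 matched
```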
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
AVM: Towards Structure-Preserving Neural Response Modeling in the Visual Cortex Across Stimuli and Individuals
arXiv:2512.16948v1 Announce Type: new
Abstract: While deep learning models have shown strong performance in simulating neural responses, they often fail to clearly separate stable visual encoding from condition-specific adaptation, which limits their ability to generalize across stimuli and individuals. We introduce the Adaptive Visual Model (AVM), a structure-preserving framework that enables condition-aware adaptation through modular subnetworks, without modifying the core representation. AVM keeps a Vision Transformer-based encoder frozen to capture consistent visual features, while independently trained modulation paths account for neural response variations driven by stimulus content and subject identity. We evaluate AVM in three experimental settings, including stimulus-level variation, cross-subject generalization, and cross-dataset adaptation, all of which involve structured changes in inputs and individuals. Across two large-scale mouse V1 datasets, AVM outperforms the state-of-the-art V1T model by approximately 2% in predictive correlation, demonstrating robust generalization, interpretable condition-wise modulation, and high architectural efficiency. Specifically, AVM achieves a 9.1% improvement in explained variance (FEVE) under the cross-dataset adaptation setting. These results suggest that AVM provides a unified framework for adaptive neural modeling across biological and experimental conditions, offering a scalable solution under structural constraints. Its design may inform future approaches to cortical modeling in both neuroscience and biologically inspired AI systems.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs
arXiv:2512.17189v1 Announce Type: new
Abstract: Medical Vision-Language Models (MedVLMs) show immense promise in clinical applicability. However, their reliability is hindered by hallucinations, where models often fail to derive answers from visual evidence, instead relying on learned textual priors. Existing mitigation strategies for MedVLMs have distinct limitations: training-based methods rely on costly expert annotations, limiting scalability, while training-free interventions like contrastive decoding, though data-efficient, apply a global, untargeted correction whose effects in complex real-world clinical settings can be unreliable. To address these challenges, we introduce Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our module leverages an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model's focus onto specified regions, reinforcing anatomical understanding and suppressing factually incorrect outputs. Extensive experiments across diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, demonstrate our method's effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.
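At the logits tier, the contrastive correction can be sketched as amplifying what a region-conditioned pass adds over the full-image pass; ARCD also intervenes at the token and attention tiers, which this toy omits.

```python
import torch

def contrastive_logits(logits_region, logits_full, alpha=1.0):
    # Boost evidence that appears only when the model attends to the region.
    return (1 + alpha) * logits_region - alpha * logits_full

logits_region = torch.tensor([2.0, 0.5, 0.1])   # masked-region pass
logits_full   = torch.tensor([1.0, 1.2, 0.1])   # full-image (prior-heavy) pass
print(torch.softmax(contrastive_logits(logits_region, logits_full), dim=-1))
```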
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
arXiv:2512.16978v1 Announce Type: new
Abstract: Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both, and while some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and a graded rubric for interpretable and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: https://github.com/mbzuai-oryx/longshot.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
arXiv:2512.17312v1 Announce Type: new
Abstract: Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a reward for balanced and adaptive tool calls, which trades exploration off against efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism of executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models.
Fonte: arXiv cs.CV
Vision • Score 93
Fose: Fusing One-Step Diffusion and an End-to-End Network for Pansharpening
Pansharpening is a significant image fusion task that combines low-resolution multispectral images (LRMSI) and high-resolution panchromatic (PAN) images to obtain high-resolution multispectral images (HRMSI). This work proposes a novel four-stage training strategy to obtain a lightweight network called Fose, which fuses a one-step diffusion model and an E2E model, significantly improving the efficiency of the process.
Fonte: arXiv cs.AI
Vision • Score 95
Diagnostic Performance of Universal-Learning Ultrasound AI Across Multiple Organs and Tasks: the UUSIC25 Challenge
arXiv:2512.17279v1 Announce Type: new
Abstract: IMPORTANCE: Current ultrasound AI remains fragmented into single-task tools, limiting clinical utility compared to versatile modern ultrasound systems.
OBJECTIVE: To evaluate the diagnostic accuracy and efficiency of single general-purpose deep learning models for multi-organ classification and segmentation.
DESIGN: The Universal UltraSound Image Challenge 2025 (UUSIC25) involved developing algorithms on 11,644 images (public/private). Evaluation used an independent, multi-center test set of 2,479 images, including data from a center completely unseen during training to assess generalization.
OUTCOMES: Diagnostic performance (Dice Similarity Coefficient [DSC]; Area Under the Receiver Operating Characteristic Curve [AUC]) and computational efficiency (inference time, GPU memory).
RESULTS: Of 15 valid algorithms, the top model (SMART) achieved a macro-averaged DSC of 0.854 across 5 segmentation tasks and AUC of 0.766 for binary classification. Models showed high capability in segmentation (e.g., fetal head DSC: 0.942) but variability in complex tasks subject to domain shift. Notably, in breast cancer molecular subtyping, the top model's performance dropped from AUC 0.571 (internal) to 0.508 (unseen external center), highlighting generalization challenges.
CONCLUSIONS: General-purpose AI models achieve high accuracy and efficiency across multiple tasks using a single architecture. However, performance degradation on unseen data suggests domain generalization is critical for future clinical deployment.
Fonte: arXiv cs.CV
Evaluation/Benchmarks • Score 92
A Generic Machine Learning Framework for Radio Frequency Fingerprinting
arXiv:2510.09775v3 Announce Type: replace-cross
Abstract: Fingerprinting radio frequency (RF) emitters typically involves finding unique characteristics that are featured in their received signal. These fingerprints are nuanced, but sufficiently detailed, motivating the pursuit of methods that can successfully extract them. The downstream task that requires the most meticulous RF fingerprinting (RFF) is known as specific emitter identification (SEI), which entails recognising each individual transmitter. RFF and SEI have a long history, with numerous defence and civilian applications such as signal intelligence, electronic surveillance, and physical-layer authentication of wireless devices, to name a few. In recent years, data-driven RFF approaches have become popular due to their ability to automatically learn intricate fingerprints. They generally deliver superior performance when compared to traditional RFF techniques that are often labour-intensive, inflexible, and only applicable to a particular emitter type or transmission scheme. In this paper, we present a generic and versatile machine learning (ML) framework for data-driven RFF with several popular downstream tasks such as SEI, emitter data association (EDA), and RF emitter clustering (RFEC). It is emitter-type agnostic. We then demonstrate the introduced framework for several tasks using real RF datasets for spaceborne surveillance, signal intelligence, and counter-drone applications.
Fonte: arXiv stat.ML
NLP/LLMs • Score 95
Sample, Don't Search: Rethinking Test-Time Alignment for Language Models
arXiv:2504.03790v2 Announce Type: replace-cross
Abstract: Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). A practical solution to aligning language models at test time using additional computation without degradation, our approach expands the limits of the capability that can be obtained from off-the-shelf language models without further training.
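Under an independence proposal drawn from the base model, a reward-tilted target yields a particularly simple Metropolis rule, sketched below on a toy "response" space; QAlign's actual proposal design is not reproduced here.

```python
import math
import random

def mh_align(sample_fn, reward_fn, steps=200, beta=0.5):
    # Target pi(y) proportional to p(y) * exp(r(y)/beta), with proposal q = p:
    # acceptance reduces to min(1, exp((r(y') - r(y)) / beta)).
    y = sample_fn()
    for _ in range(steps):
        y_new = sample_fn()
        if random.random() < min(1.0, math.exp((reward_fn(y_new) - reward_fn(y)) / beta)):
            y = y_new
    return y

random.seed(0)
# Toy stand-in: "responses" are numbers; the reward prefers values near 3.
print(mh_align(lambda: random.gauss(0, 2), lambda y: -abs(y - 3)))
```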
Fonte: arXiv stat.ML
RL • Score 95
Safeguarded Stochastic Polyak Step Sizes for Non-smooth Optimization: Robust Performance Without Small (Sub)Gradients
arXiv:2512.02342v2 Announce Type: replace-cross
Abstract: The stochastic Polyak step size (SPS) has proven to be a promising choice for stochastic gradient descent (SGD), delivering competitive performance relative to state-of-the-art methods on smooth convex and non-convex optimization problems, including deep neural network training. However, extensions of this approach to non-smooth settings remain in their early stages, often relying on interpolation assumptions or requiring knowledge of the optimal solution. In this work, we propose a novel SPS variant, Safeguarded SPS (SPS$_{safe}$), for the stochastic subgradient method, and provide rigorous convergence guarantees for non-smooth convex optimization with no need for strong assumptions. We further incorporate momentum into the update rule, yielding equally tight theoretical results. On non-smooth convex benchmarks, our experiments are consistent with the theoretical predictions on how the safeguard affects the convergence neighborhood. On deep neural networks the proposed step size achieves competitive performance to existing adaptive baselines and exhibits stable behavior across a wide range of problem settings. Moreover, in these experiments, the gradient norms under our step size do not collapse to (near) zero, indicating robustness to vanishing gradients.
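A minimal sketch of a safeguarded Polyak step on a non-smooth toy problem; the cap rule below is an assumed stand-in for the paper's safeguard, not its exact construction.

```python
import numpy as np

def sps_safe_step(f_val, f_star, grad, gamma_max=1.0, c=0.5):
    # Classical SPS ratio, capped so steps stay bounded even when the
    # (sub)gradient norm is small.
    raw = (f_val - f_star) / (c * np.dot(grad, grad) + 1e-12)
    return min(raw, gamma_max)

# Minimize f(x) = |x1| + |x2| (non-smooth) by subgradient descent.
x = np.array([3.0, -2.0])
for _ in range(50):
    g = np.sign(x)                          # a subgradient of the l1 norm
    x = x - sps_safe_step(np.abs(x).sum(), 0.0, g) * g
print(x, np.abs(x).sum())
```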
Fonte: arXiv stat.ML
Vision • Score 95
PhysFire-WM: A Physics-Informed World Model for Emulating Fire Spread Dynamics
arXiv:2512.17152v1 Announce Type: new
Abstract: Fine-grained fire prediction plays a crucial role in emergency response. Infrared images and fire masks provide complementary thermal and boundary information, yet current methods are predominantly limited to binary mask modeling with inherent signal sparsity, failing to capture the complex dynamics of fire. While world models show promise in video generation, their physical inconsistencies pose significant challenges for fire forecasting. This paper introduces PhysFire-WM, a Physics-informed World Model for emulating Fire spread dynamics. Our approach internalizes combustion dynamics by encoding structured priors from a Physical Simulator to rectify physical discrepancies, coupled with a Cross-task Collaborative Training strategy (CC-Train) that alleviates the issue of limited information in mask-based modeling. Through parameter sharing and gradient coordination, CC-Train effectively integrates thermal radiation dynamics and spatial boundary delineation, enhancing both physical realism and geometric accuracy. Extensive experiments on a fine-grained multimodal fire dataset demonstrate the superior accuracy of PhysFire-WM in fire spread prediction. Validation underscores the importance of physical priors and cross-task collaboration, providing new insights for applying physics-informed world models to disaster prediction.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos
arXiv:2512.17229v1 Announce Type: new
Abstract: Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which could also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending model's context length, they may miss useful information or take considerable computation. In fact, when answering given questions, only a small amount of crucial information is required. Therefore, we propose an efficient question-aware memory mechanism, enabling MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies this task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy is employed by introducing a few special memory tokens to achieve purposefully compression. This allows models to effectively seek critical clues while reducing visual tokens. Then, due to history context could have a significant impact, we recurrently aggregate and store these memory tokens to update history context, which would be reused for subsequent sub-segments. Furthermore, to more effectively measure model's long video understanding ability, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset, which features grounding critical and concrete clues scattered throughout entire videos. Experimental results demonstrate our method enables MLLMs with limited context length of 32K to efficiently process 100K tokens (3600 frames, an hour-long video sampled at 1fps), requiring only 2 minutes and 37GB GPU memory usage. Evaluation results across multiple long video benchmarks illustrate our method can more effectively seek critical clues from massive information.
Fonte: arXiv cs.CV
RL • Score 95
A Certified Unlearning Approach without Access to Source Data
arXiv:2506.06486v3 Announce Type: replace-cross
Abstract: With the growing adoption of data privacy regulations, the ability to erase private or copyrighted information from trained models has become a crucial requirement. Traditional unlearning methods often assume access to the complete training dataset, which is unrealistic in scenarios where the source data is no longer available. To address this challenge, we propose a certified unlearning framework that enables effective data removal without access to the original training data samples. Our approach utilizes a surrogate dataset that approximates the statistical properties of the source data, allowing for controlled noise scaling based on the statistical distance between the two. While our theoretical guarantees assume knowledge of the exact statistical distance, practical implementations typically approximate this distance, resulting in potentially weaker but still meaningful privacy guarantees. This ensures strong guarantees on the model's behavior post-unlearning while maintaining its overall utility. We establish theoretical bounds, introduce practical noise calibration techniques, and validate our method through extensive experiments on both synthetic and real-world datasets. The results demonstrate the effectiveness and reliability of our approach in privacy-sensitive settings.
Fonte: arXiv stat.ML
RL • Score 93
SFBD-OMNI: Bridge models for lossy measurement restoration with limited clean samples
arXiv:2512.17051v1 Announce Type: new
Abstract: In many real-world scenarios, obtaining fully observed samples is prohibitively expensive or even infeasible, while partial and noisy observations are comparatively easy to collect. In this work, we study distribution restoration with abundant noisy samples, assuming the corruption process is available as a black-box generator. We show that this task can be framed as a one-sided entropic optimal transport problem and solved via an EM-like algorithm. We further provide a test criterion to determine whether the true underlying distribution is recoverable under per-sample information loss, and show that in otherwise unrecoverable cases, a small number of clean samples can render the distribution largely recoverable. Building on these insights, we introduce SFBD-OMNI, a bridge model-based framework that maps corrupted sample distributions to the ground-truth distribution. Our method generalizes Stochastic Forward-Backward Deconvolution (SFBD; Lu et al., 2025) to handle arbitrary measurement models beyond Gaussian corruption. Experiments across benchmark datasets and diverse measurement settings demonstrate significant improvements in both qualitative and quantitative performance.
Fonte: arXiv cs.LG
RL • Score 96
SHARP-QoS: Sparsely-gated Hierarchical Adaptive Routing for joint Prediction of QoS
arXiv:2512.17262v1 Announce Type: new
Abstract: Dependable service-oriented computing relies on multiple Quality of Service (QoS) parameters that are essential to assess service optimality. However, real-world QoS data are extremely sparse, noisy, and shaped by hierarchical dependencies arising from QoS interactions, and geographical and network-level factors, making accurate QoS prediction challenging. Existing methods often predict each QoS parameter separately, requiring multiple similar models, which increases computational cost and leads to poor generalization. Although recent joint QoS prediction studies have explored shared architectures, they suffer from negative transfer due to loss-scaling caused by inconsistent numerical ranges across QoS parameters and further struggle with inadequate representation learning, resulting in degraded accuracy. This paper presents a unified strategy for joint QoS prediction, called SHARP-QoS, that addresses these issues using three components. First, we introduce a dual mechanism to extract the hierarchical features from both QoS and contextual structures via hyperbolic convolution formulated in the Poincaré ball. Second, we propose an adaptive feature-sharing mechanism that allows feature exchange across informative QoS and contextual signals. A gated feature fusion module is employed to support dynamic feature selection among structural and shared representations. Third, we design an EMA-based loss balancing strategy that allows stable joint optimization, thereby mitigating the negative transfer. Evaluations on three datasets with two, three, and four QoS parameters demonstrate that SHARP-QoS outperforms both single- and multi-task baselines. Extensive study shows that our model effectively addresses major challenges, including sparsity, robustness to outliers, and cold-start, while maintaining moderate computational overhead, underscoring its capability for reliable joint QoS prediction.
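The EMA loss-balancing component can be sketched simply: normalize each task's loss by an exponential moving average of its own scale so that QoS parameters with different numeric ranges contribute comparably; the constants below are illustrative.

```python
# Stand-in per-step losses for two QoS tasks with very different scales.
losses_over_time = [
    {"response_time": 120.0, "throughput": 3.5},
    {"response_time": 110.0, "throughput": 3.2},
    {"response_time": 95.0,  "throughput": 3.0},
]

ema, beta = {}, 0.9
for step_losses in losses_over_time:
    total = 0.0
    for task, loss in step_losses.items():
        ema[task] = loss if task not in ema else beta * ema[task] + (1 - beta) * loss
        total += loss / (ema[task] + 1e-8)   # scale-free contribution
    print(f"balanced joint loss: {total:.3f}")
```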
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
Towards Human-Guided, Data-Centric LLM Co-Pilots
arXiv:2501.10321v3 Announce Type: replace-cross
Abstract: Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the disconnect between the needs of domain experts and translating these needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in complex real-world settings where raw data often contains complex issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel, multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets we demonstrate CliMB-DC's ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines for handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains -- healthcare, finance, social sciences and more -- to actively participate in driving real-world impact using ML.
Fonte: arXiv stat.ML
NLP/LLMs • Score 95
Studying the Effects of Collaboration in Interactive Theme Discovery Systems
arXiv:2408.09030v4 Announce Type: replace
Abstract: NLP-assisted solutions have gained considerable traction to support qualitative data analysis. However, there does not exist a unified evaluation framework that can account for the many different settings in which qualitative researchers may employ them. In this paper, we take a first step in this direction by proposing an evaluation framework to study the way in which different tools may result in different outcomes depending on the collaboration strategy employed. Specifically, we study the impact of synchronous vs. asynchronous collaboration using two different NLP-assisted qualitative research tools and present a comprehensive analysis of significant differences in the consistency, cohesiveness, and correctness of their outputs.
Fonte: arXiv cs.CL
RL • Score 95
Fairness via Independence: A (Conditional) Distance Covariance Framework
arXiv:2412.00720v2 Announce Type: replace-cross
Abstract: We explore fairness from a statistical perspective by selectively utilizing either conditional distance covariance or distance covariance statistics as measures to assess the independence between predictions and sensitive attributes. We boost fairness with independence by adding a distance covariance-based penalty to the model's training. Additionally, we present the matrix form of empirical (conditional) distance covariance for parallel calculations to enhance computational efficiency. Theoretically, we provide a proof for the convergence between empirical and population (conditional) distance covariance, establishing necessary guarantees for batch computations. Through experiments conducted on a range of real-world datasets, we have demonstrated that our method effectively bridges the fairness gap in machine learning. Our code is available at https://github.com/liuhaixias1/Fair_dc/.
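The empirical distance covariance penalty is compact in matrix form: double-center the pairwise distance matrices of predictions and the sensitive attribute, then average their elementwise product (zero in population iff independent). A 1-D sketch:

```python
import numpy as np

def dcov2(x, y):
    def centered(z):
        d = np.abs(z[:, None] - z[None, :])   # pairwise distances (1-D case)
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()
    return (centered(x) * centered(y)).mean()

rng = np.random.default_rng(0)
s = rng.normal(size=500)                      # sensitive attribute
print(f"dependent pair:   {dcov2(s, s + 0.1 * rng.normal(size=500)):.4f}")
print(f"independent pair: {dcov2(s, rng.normal(size=500)):.4f}")
```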
Fonte: arXiv stat.ML
RL • Score 92
Quantifying Uncertainty in the Presence of Distribution Shifts
arXiv:2506.18283v2 Announce Type: replace
Abstract: Neural networks make accurate predictions but often fail to provide reliable uncertainty estimates, especially under covariate distribution shifts between training and testing. To address this problem, we propose a Bayesian framework for uncertainty estimation that explicitly accounts for covariate shifts. While conventional approaches rely on fixed priors, the key idea of our method is an adaptive prior, conditioned on both training and new covariates. This prior naturally increases uncertainty for inputs that lie far from the training distribution in regions where predictive performance is likely to degrade. To efficiently approximate the resulting posterior predictive distribution, we employ amortized variational inference. Finally, we construct synthetic environments by drawing small bootstrap samples from the training data, simulating a range of plausible covariate shifts using only the original dataset. We evaluate our method on both synthetic and real-world data. It yields substantially improved uncertainty estimates under distribution shifts.
Fonte: arXiv stat.ML
NLP/LLMs • Score 95
The Stochastic Occupation Kernel (SOCK) Method for Learning Stochastic Differential Equations
arXiv:2505.11622v2 Announce Type: replace
Abstract: We present a novel kernel-based method for learning multivariate stochastic differential equations (SDEs). The method follows a two-step procedure: we first estimate the drift term function, then the (matrix-valued) diffusion function given the drift. Occupation kernels are integral functionals on a reproducing kernel Hilbert space (RKHS) that aggregate information over a trajectory. Our approach leverages vector-valued occupation kernels for estimating the drift component of the stochastic process. For diffusion estimation, we extend this framework by introducing operator-valued occupation kernels, enabling the estimation of an auxiliary matrix-valued function as a positive semi-definite operator, from which we readily derive the diffusion estimate. This enables us to avoid common challenges in SDE learning, such as intractable likelihoods, by optimizing a reconstruction-error-based objective. We propose a simple learning procedure that retains strong predictive accuracy while using Fenchel duality to promote efficiency. We validate the method on simulated benchmarks and a real-world dataset of Amyloid imaging in healthy and Alzheimer's disease subjects.
Fonte: arXiv stat.ML
RL • Score 92
On the Identification of Temporally Causal Representation with Instantaneous Dependence
arXiv:2405.15325v3 Announce Type: replace-cross
Abstract: Temporally causal representation learning aims to identify the latent causal process from time series observations, but most methods require the assumption that the latent causal processes do not have instantaneous relations. Although some recent methods achieve identifiability in the instantaneous causality case, they require either interventions on the latent variables or grouping of the observations, which are in general difficult to obtain in real-world scenarios. To fill this gap, we propose an \textbf{ID}entification framework for instantane\textbf{O}us \textbf{L}atent dynamics (\textbf{IDOL}) by imposing a sparse influence constraint that the latent causal processes have sparse time-delayed and instantaneous relations. Specifically, we establish identifiability results of the latent causal process based on sufficient variability and the sparse influence constraint by employing contextual information of time series data. Based on these theories, we incorporate a temporally variational inference architecture to estimate the latent variables and a gradient-based sparsity regularization to identify the latent causal process. Experimental results on simulation datasets illustrate that our method can identify the latent causal process. Furthermore, evaluations on multiple human motion forecasting benchmarks with instantaneous dependencies indicate the effectiveness of our method in real-world settings.
Fonte: arXiv stat.ML
Evaluation/Benchmarks • Score 89
Adjusting Model Size in Continual Gaussian Processes: How Big is Big Enough?
arXiv:2408.07588v5 Announce Type: replace
Abstract: Many machine learning models require setting a parameter that controls their size before training, e.g. number of neurons in DNNs, or inducing points in GPs. Increasing capacity typically improves performance until all the information from the dataset is captured. After this point, computational cost keeps increasing, without improved performance. This leads to the question "How big is big enough?" We investigate this problem for Gaussian processes (single-layer neural networks) in continual learning. Here, data becomes available incrementally, and the final dataset size will therefore not be known before training, preventing the use of heuristics for setting a fixed model size. We develop a method to automatically adjust model size while maintaining near-optimal performance. Our experimental procedure follows the constraint that any hyperparameters must be set without seeing dataset properties, and we show that our method performs well across diverse datasets without the need to adjust its hyperparameter, showing it requires less tuning than others.
Fonte: arXiv stat.ML
NLP/LLMs • Score 95
Generalized infinite dimensional Alpha-Procrustes based geometries
arXiv:2511.09801v2 Announce Type: replace
Abstract: This work extends the recently introduced Alpha-Procrustes family of Riemannian metrics for symmetric positive definite (SPD) matrices by incorporating generalized versions of the Bures-Wasserstein (GBW), Log-Euclidean, and Wasserstein distances. While the Alpha-Procrustes framework has unified many classical metrics in both finite- and infinite-dimensional settings, it previously lacked the structural components necessary to realize these generalized forms. We introduce a formalism based on unitized Hilbert-Schmidt operators and an extended Mahalanobis norm that allows the construction of robust, infinite-dimensional generalizations of GBW and Log-Hilbert-Schmidt distances. Our approach also incorporates a learnable regularization parameter that enhances geometric stability in high-dimensional comparisons. Preliminary experiments reproducing benchmarks from the literature demonstrate the improved performance of our generalized metrics, particularly in scenarios involving comparisons between datasets of varying dimension and scale. This work lays a theoretical and computational foundation for advancing robust geometric methods in machine learning, statistical inference, and functional data analysis.
Fonte: arXiv stat.ML
RL • Score 92
Constraint-based causal discovery with tiered background knowledge and latent variables in single or overlapping datasets
arXiv:2503.21526v3 Announce Type: replace
Abstract: In this paper we consider the use of tiered background knowledge within constraint based causal discovery. Our focus is on settings relaxing causal sufficiency, i.e. allowing for latent variables which may arise because relevant information could not be measured at all, or not jointly, as in the case of multiple overlapping datasets. We first present novel insights into the properties of the 'tiered FCI' (tFCI) algorithm. Building on this, we introduce a new extension of the IOD (integrating overlapping datasets) algorithm incorporating tiered background knowledge, the 'tiered IOD' (tIOD) algorithm. We show that under full usage of the tiered background knowledge tFCI and tIOD are sound, while simple versions of the tIOD and tFCI are sound and complete. We further show that the tIOD algorithm can often be expected to be considerably more efficient and informative than the IOD algorithm even beyond the obvious restriction of the Markov equivalence classes. We provide a formal result on the conditions for this gain in efficiency and informativeness. Our results are accompanied by a series of examples illustrating the exact role and usefulness of tiered background knowledge.
Fonte: arXiv stat.ML
NLP/LLMs • Score 95
Spatially-informed transformers: Injecting geostatistical covariance biases into self-attention for spatio-temporal forecasting
arXiv:2512.17696v1 Announce Type: cross
Abstract: The modeling of high-dimensional spatio-temporal processes presents a fundamental dichotomy between the probabilistic rigor of classical geostatistics and the flexible, high-capacity representations of deep learning. While Gaussian processes offer theoretical consistency and exact uncertainty quantification, their prohibitive computational scaling renders them impractical for massive sensor networks. Conversely, modern transformer architectures excel at sequence modeling but inherently lack a geometric inductive bias, treating spatial sensors as permutation-invariant tokens without a native understanding of distance. In this work, we propose a spatially-informed transformer, a hybrid architecture that injects a geostatistical inductive bias directly into the self-attention mechanism via a learnable covariance kernel. By formally decomposing the attention structure into a stationary physical prior and a non-stationary data-driven residual, we impose a soft topological constraint that favors spatially proximal interactions while retaining the capacity to model complex dynamics. We demonstrate the phenomenon of "Deep Variography", where the network successfully recovers the true spatial decay parameters of the underlying process end-to-end via backpropagation. Extensive experiments on synthetic Gaussian random fields and real-world traffic benchmarks confirm that our method outperforms state-of-the-art graph neural networks. Furthermore, rigorous statistical validation confirms that the proposed method delivers not only superior predictive accuracy but also well-calibrated probabilistic forecasts, effectively bridging the gap between physics-aware modeling and data-driven learning.
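The covariance bias amounts to adding the log of a stationary distance kernel to the attention logits alongside the usual data-driven term; the sketch below uses an exponential kernel with a learnable decay length as an illustrative choice.

```python
import torch

n, d = 6, 16
coords = torch.rand(n, 2)                    # sensor locations
dist = torch.cdist(coords, coords)           # pairwise distances

log_rho = torch.nn.Parameter(torch.tensor(0.0))   # learnable decay length
q, k = torch.randn(n, d), torch.randn(n, d)

logits = q @ k.T / d**0.5                    # non-stationary residual term
logits = logits - dist / log_rho.exp()       # log of exp(-dist/rho) prior
attn = torch.softmax(logits, dim=-1)         # nearby sensors attend more
print(attn)
```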
Fonte: arXiv stat.ML
RL • Score 95
Generative Multi-Objective Bayesian Optimization with Scalable Batch Evaluations for Sample-Efficient De Novo Molecular Design
arXiv:2512.17659v1 Announce Type: new
Abstract: Designing molecules that must satisfy multiple, often conflicting objectives is a central challenge in molecular discovery. The enormous size of chemical space and the cost of high-fidelity simulations have driven the development of machine learning-guided strategies for accelerating design with limited data. Among these, Bayesian optimization (BO) offers a principled framework for sample-efficient search, while generative models provide a mechanism to propose novel, diverse candidates beyond fixed libraries. However, existing methods that couple the two often rely on continuous latent spaces, which introduces both architectural entanglement and scalability challenges. This work introduces an alternative, modular "generate-then-optimize" framework for de novo multi-objective molecular design/discovery. At each iteration, a generative model is used to construct a large, diverse pool of candidate molecules, after which a novel acquisition function, qPMHI (multi-point Probability of Maximum Hypervolume Improvement), is used to optimally select a batch of candidates most likely to induce the largest Pareto front expansion. The key insight is that qPMHI decomposes additively, enabling exact, scalable batch selection via only simple ranking of probabilities that can be easily estimated with Monte Carlo sampling. We benchmark the framework against state-of-the-art latent-space and discrete molecular optimization methods, demonstrating significant improvements across synthetic benchmarks and application-driven tasks. Specifically, in a case study related to sustainable energy storage, we show that our approach quickly uncovers novel, diverse, and high-performing organic (quinone-based) cathode materials for aqueous redox flow battery applications.
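The ranking-based selection can be sketched with a 2-D hypervolume and Monte Carlo posterior draws: count how often each candidate attains the maximal hypervolume improvement, then take the top q. The posterior and objectives below are toys, not the paper's surrogate.

```python
import numpy as np

def hv2d(points, ref):
    # Hypervolume dominated by maximization points relative to ref (2-D).
    pts = sorted([p for p in points if (p > ref).all()], key=lambda p: -p[0])
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

rng = np.random.default_rng(0)
front, ref = [np.array([0.6, 0.4]), np.array([0.3, 0.7])], np.array([0.0, 0.0])
base = hv2d(front, ref)

cand_mean = rng.uniform(0.2, 0.9, size=(5, 2))
wins = np.zeros(5)
for _ in range(500):                          # Monte Carlo posterior draws
    draws = cand_mean + 0.05 * rng.normal(size=(5, 2))
    hvi = [hv2d(front + [d], ref) - base for d in draws]
    wins[int(np.argmax(hvi))] += 1
print("estimated P(max HVI) per candidate:", wins / 500)   # rank, take top q
```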
Fonte: arXiv stat.ML
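Since qPMHI is stated to decompose additively and to reduce to ranking Monte Carlo-estimated probabilities, a schematic implementation is short. The independent Gaussian posterior per candidate and the externally supplied hvi_fn are simplifying assumptions; this is a sketch of the selection rule, not the authors' code.

```python
import numpy as np

def qpmhi_select(mu, sigma, batch_size, hvi_fn, n_mc=256, seed=0):
    """Estimate, for each candidate, the probability that it yields the maximum
    hypervolume improvement, then select the batch by ranking. mu and sigma are
    posterior means/stds of shape (n_candidates, n_objectives); hvi_fn maps a
    sampled objective matrix to per-candidate hypervolume improvements."""
    rng = np.random.default_rng(seed)
    wins = np.zeros(len(mu))
    for _ in range(n_mc):
        y = rng.normal(mu, sigma)          # one posterior draw per candidate
        wins[np.argmax(hvi_fn(y))] += 1.0  # which candidate expands the front most?
    return np.argsort(-wins / n_mc)[:batch_size]  # additivity: rank, take top q
```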
RL • Score 95
Alternating Direction Method of Multipliers for Nonlinear Matrix Decompositions
We present an algorithm based on the alternating direction method of multipliers (ADMM) for solving nonlinear matrix decompositions (NMD). Given an input matrix $X \in \mathbb{R}^{m \times n}$ and a factorization rank $r \ll \min(m, n)$, NMD seeks matrices $W \in \mathbb{R}^{m \times r}$ and $H \in \mathbb{R}^{r \times n}$ such that $X \approx f(WH)$, where $f$ is an elementwise nonlinear function.
Fonte: arXiv stat.ML
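To make the ADMM splitting concrete, here is a minimal sketch for the ReLU instance f(z) = max(z, 0): an auxiliary variable Z = WH, least-squares factor updates, an exact elementwise prox for Z, and dual ascent. The choice of f, the penalty rho, and the update order are illustrative assumptions; the paper's algorithm may differ in detail.

```python
import numpy as np

def nmd_admm(X, r, rho=1.0, n_iter=200, seed=0):
    """ADMM sketch for X ~ f(WH) with f = ReLU (assumed variant; the paper
    treats general elementwise f). Split Z = WH, then alternate least-squares
    updates of W and H, an exact elementwise prox update of Z, and dual ascent."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W, H = rng.standard_normal((m, r)), rng.standard_normal((r, n))
    Z, Lam = W @ H, np.zeros((m, n))
    for _ in range(n_iter):
        T = Z + Lam / rho                  # consensus target for the factors
        W = T @ np.linalg.pinv(H)          # least-squares update of W
        H = np.linalg.pinv(W) @ T          # least-squares update of H
        P = W @ H - Lam / rho
        # elementwise prox of (X - relu(Z))^2 + (rho/2)(Z - P)^2, per branch:
        z_pos = np.maximum((2 * X + rho * P) / (2 + rho), 0.0)  # branch f(z) = z
        z_neg = np.minimum(P, 0.0)                              # branch f(z) = 0
        cost_pos = (X - z_pos) ** 2 + 0.5 * rho * (z_pos - P) ** 2
        cost_neg = X ** 2 + 0.5 * rho * (z_neg - P) ** 2
        Z = np.where(cost_pos <= cost_neg, z_pos, z_neg)
        Lam += rho * (Z - W @ H)           # dual update on the multiplier
    return W, H
```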
Vision • Score 95
Disentangled representations via score-based variational autoencoders
arXiv:2512.17127v1 Announce Type: new
Abstract: We present the Score-based Autoencoder for Multiscale Inference (SAMI), a method for unsupervised representation learning that combines the theoretical frameworks of diffusion models and VAEs. By unifying their respective evidence lower bounds, SAMI formulates a principled objective that learns representations through score-based guidance of the underlying diffusion process. The resulting representations automatically capture meaningful structure in the data: SAMI recovers ground truth generative factors in our synthetic dataset, learns factorized, semantic latent dimensions from complex natural images, and encodes video sequences into latent trajectories that are straighter than those of alternative encoders, despite training exclusively on static images. Furthermore, SAMI can extract useful representations from pre-trained diffusion models with minimal additional training. Finally, the explicitly probabilistic formulation provides new ways to identify semantically meaningful axes in the absence of supervised labels, and its mathematical exactness allows us to make formal statements about the nature of the learned representation. Overall, these results indicate that implicit structural information in diffusion models can be made explicit and interpretable through synergistic combination with a variational autoencoder.
Fonte: arXiv stat.ML
RL • Score 96
Sharing Knowledge without Sharing Data: Stitches can improve ensembles of disjointly trained models
arXiv:2512.17592v1 Announce Type: new
Abstract: Deep learning has been shown to be very capable at performing many real-world tasks. However, this performance is often dependent on the presence of large and varied datasets. In some settings, like in the medical domain, data is often fragmented across parties and cannot be readily shared. While federated learning addresses this situation, it requires the parties to synchronously train a single model together, exchanging information about model weights. We investigate how asynchronous collaboration, where only already-trained models are shared (e.g. as part of a publication), affects performance, and propose to use stitching as a method for combining models.
Taking a multi-objective perspective, in which performance on each party's data is viewed independently, we find that a model trained solely on one party's data performs about as well on that party's data as a model trained on it merged with another party's data, while performance on the other parties' data is notably worse. Moreover, while an ensemble of such individually trained networks generalizes better, performance on each party's own dataset suffers. We find that combining intermediate representations of individually trained models with a well-placed pair of stitching layers allows this performance to recover to a competitive degree while maintaining the improved generalization, showing that asynchronous collaboration can yield competitive results.
Fonte: arXiv cs.LG
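A stitching layer in this sense is a small trainable adapter placed between two frozen, disjointly trained networks. The sketch below assumes the two halves expose flat features of sizes d_a and d_b; for convolutional feature maps a 1x1 convolution would play the same role. Names are illustrative.

```python
import torch
import torch.nn as nn

class StitchedPair(nn.Module):
    """Sketch of 'stitching': the front of model A (frozen) is connected to
    the back of model B (frozen) through a small trainable linear stitch, so
    knowledge is shared via published weights rather than data."""

    def __init__(self, front_a: nn.Module, back_b: nn.Module, d_a: int, d_b: int):
        super().__init__()
        self.front, self.back = front_a, back_b
        for p in list(self.front.parameters()) + list(self.back.parameters()):
            p.requires_grad = False          # only the stitch is trained
        self.stitch = nn.Linear(d_a, d_b)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.back(self.stitch(self.front(x)))
```

Only the stitch is trained, so no raw data or gradients cross party boundaries beyond the already-published model weights.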
MLOps/Systems • Score 96
NetworkFF: Unified Layer Optimization in Forward-Only Neural Networks
arXiv:2512.17531v1 Announce Type: new
Abstract: The Forward-Forward algorithm eliminates backpropagation's memory constraints and biological implausibility through dual forward passes with positive and negative data. However, conventional implementations suffer from critical inter-layer isolation, where layers optimize goodness functions independently without leveraging collective learning dynamics. This isolation constrains representational coordination and limits convergence efficiency in deeper architectures. This paper introduces Collaborative Forward-Forward (CFF) learning, extending the original algorithm through inter-layer cooperation mechanisms that preserve forward-only computation while enabling global context integration. Our framework implements two collaborative paradigms: Fixed CFF (F-CFF) with constant inter-layer coupling and Adaptive CFF (A-CFF) with learnable collaboration parameters that evolve during training. The collaborative goodness function incorporates weighted contributions from all layers, enabling coordinated feature learning while maintaining memory efficiency and biological plausibility. Comprehensive evaluation on MNIST and Fashion-MNIST demonstrates significant performance improvements over baseline Forward-Forward implementations. These findings establish inter-layer collaboration as a fundamental enhancement to Forward-Forward learning, with immediate applicability to neuromorphic computing architectures and energy-constrained AI systems.
Fonte: arXiv cs.LG
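One plausible reading of the collaborative goodness function is a weighted sum of per-layer goodness terms plugged into the usual Forward-Forward threshold objective; the sketch below follows that reading (mean squared activation as goodness, fixed weights for F-CFF, learnable weights for A-CFF). The exact coupling used in the paper may differ.

```python
import torch

def collaborative_goodness(layer_acts, weights):
    """Assumed form of the collaborative goodness: per-layer goodness is the
    mean squared activation, and each sample's combined goodness mixes weighted
    contributions from all layers (fixed in F-CFF, learnable in A-CFF)."""
    g = torch.stack([(a ** 2).mean(dim=1) for a in layer_acts], dim=1)
    return g @ weights  # (batch,) combined goodness per sample

def ff_loss(goodness, is_positive, theta=2.0):
    """Standard Forward-Forward objective: push goodness above the threshold
    theta for positive data and below it for negative data."""
    sign = torch.where(is_positive, -1.0, 1.0)
    return torch.nn.functional.softplus(sign * (goodness - theta)).mean()
```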
Evaluation/Benchmarks • Score 93
meval: A Statistical Toolbox for Fine-Grained Model Performance Analysis
arXiv:2512.17409v1 Announce Type: new
Abstract: Analyzing machine learning model performance stratified by patient and recording properties is becoming the accepted norm and often yields crucial insights about important model failure modes. Performing such analyses in a statistically rigorous manner is non-trivial, however. Appropriate performance metrics must be selected that allow for valid comparisons between groups of different sample sizes and base rates; metric uncertainty must be determined and multiple comparisons be corrected for, in order to assess whether any observed differences may be purely due to chance; and in the case of intersectional analyses, mechanisms must be implemented to find the most 'interesting' subgroups within combinatorially many subgroup combinations. We here present a statistical toolbox that addresses these challenges and enables practitioners to easily yet rigorously assess their models for potential subgroup performance disparities. While broadly applicable, the toolbox is specifically designed for medical imaging applications. The analyses provided by the toolbox are illustrated in two case studies, one in skin lesion malignancy classification on the ISIC2020 dataset and one in chest X-ray-based disease classification on the MIMIC-CXR dataset.
Fonte: arXiv cs.LG
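The statistical recipe described here, per-subgroup metrics with uncertainty and a multiple-comparison correction, can be sketched as follows. This illustrates the methodology only; function and argument names are hypothetical and do not reflect meval's actual API.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auroc_ci(y_true, y_score, groups, n_boot=1000, alpha=0.05, seed=0):
    """Per-subgroup AUROC with bootstrap confidence intervals and a Bonferroni
    correction over the number of subgroups tested."""
    rng = np.random.default_rng(seed)
    results = {}
    subgroups = np.unique(groups)
    a = alpha / len(subgroups)                  # Bonferroni-corrected level
    for g in subgroups:
        idx = np.flatnonzero(groups == g)
        boots = []
        for _ in range(n_boot):
            b = rng.choice(idx, size=len(idx), replace=True)
            if len(np.unique(y_true[b])) == 2:  # AUROC needs both classes present
                boots.append(roc_auc_score(y_true[b], y_score[b]))
        lo, hi = np.percentile(boots, [100 * a / 2, 100 * (1 - a / 2)])
        results[g] = (roc_auc_score(y_true[idx], y_score[idx]), lo, hi)
    return results
```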
RL • Score 96
Learning Safe Autonomous Driving Policies Using Predictive Safety Representations
arXiv:2512.17586v1 Announce Type: new
Abstract: Safe reinforcement learning (SafeRL) is a prominent paradigm for autonomous driving, where agents are required to optimize performance under strict safety requirements. This dual objective creates a fundamental tension, as overly conservative policies limit driving efficiency while aggressive exploration risks safety violations. The Safety Representations for Safer Policy Learning (SRPL) framework addresses this challenge by equipping agents with a predictive model of future constraint violations and has shown promise in controlled environments. This paper investigates whether SRPL extends to real-world autonomous driving scenarios. Systematic experiments on the Waymo Open Motion Dataset (WOMD) and NuPlan demonstrate that SRPL can improve the reward-safety tradeoff, achieving statistically significant improvements in success rate (effect sizes r = 0.65-0.86) and cost reduction (effect sizes r = 0.70-0.83), with p < 0.05 for observed improvements. However, its effectiveness depends on the underlying policy optimizer and the dataset distribution. The results further show that predictive safety representations play a critical role in improving robustness to observation noise. Additionally, in zero-shot cross-dataset evaluation, SRPL-augmented agents demonstrate improved generalization compared to non-SRPL methods. These findings collectively demonstrate the potential of predictive safety representations to strengthen SafeRL for autonomous driving.
Fonte: arXiv cs.LG
RL • Score 96
SafeBench-Seq: A Homology-Clustered, CPU-Only Baseline for Protein Hazard Screening with Physicochemical/Composition Features and Cluster-Aware Confidence Intervals
arXiv:2512.17527v1 Announce Type: new
Abstract: Foundation models for protein design raise concrete biosecurity risks, yet the community lacks a simple, reproducible baseline for sequence-level hazard screening that is explicitly evaluated under homology control and runs on commodity CPUs. We introduce SafeBench-Seq, a metadata-only, reproducible benchmark and baseline classifier built entirely from public data (SafeProtein hazards and UniProt benigns) and interpretable features (global physicochemical descriptors and amino-acid composition). To approximate "never-before-seen" threats, we homology-cluster the combined dataset at <=40% identity and perform cluster-level holdouts (no cluster overlap between train/test). We report discrimination (AUROC/AUPRC) and screening-operating points (TPR@1% FPR; FPR@95% TPR) with 95% bootstrap confidence intervals (n=200), and we provide calibrated probabilities via CalibratedClassifierCV (isotonic for Logistic Regression / Random Forest; Platt sigmoid for Linear SVM). We quantify probability quality using Brier score, Expected Calibration Error (ECE; 15 bins), and reliability diagrams. Shortcut susceptibility is probed via composition-preserving residue shuffles and length-/composition-only ablations. Empirically, random splits substantially overestimate robustness relative to homology-clustered evaluation; calibrated linear models exhibit comparatively good calibration, while tree ensembles retain slightly higher Brier/ECE. SafeBench-Seq is CPU-only, reproducible, and releases metadata only (accessions, cluster IDs, split labels), enabling rigorous evaluation without distributing hazardous sequences.
Fonte: arXiv cs.LG
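The screening operating point TPR@1% FPR reduces to a threshold sweep over negative-class scores; the sketch below shows one assumed convention (strict inequality at the threshold, floor-based tolerance), which may differ from the benchmark's exact tie-handling.

```python
import numpy as np

def tpr_at_fpr(y_true, y_score, target_fpr=0.01):
    """Find the largest threshold whose false-positive rate stays at or below
    target_fpr, then report the true-positive rate at that threshold."""
    neg = np.sort(y_score[y_true == 0])[::-1]      # negative scores, descending
    k = int(np.floor(target_fpr * len(neg)))       # negatives allowed above threshold
    thresh = neg[k] if k < len(neg) else -np.inf
    return float((y_score[y_true == 1] > thresh).mean())
```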
NLP/LLMs • Score 96
Deep Learning-Based Surrogate Creep Modelling in Inconel 625: A High-Temperature Alloy Study
arXiv:2512.17477v1 Announce Type: new
Abstract: Time-dependent deformation, particularly creep, in high-temperature alloys such as Inconel 625 is a key factor in the long-term reliability of components used in aerospace and energy systems. Although Inconel 625 shows excellent creep resistance, finite-element creep simulations in tools such as ANSYS remain computationally expensive, often requiring tens of minutes for a single 10,000-hour run. This work proposes deep learning based surrogate models to provide fast and accurate replacements for such simulations. Creep strain data was generated in ANSYS using the Norton law under uniaxial stresses of 50 to 150 MPa and temperatures of 700 to 1000 $^\circ$C, and this temporal dataset was used to train two architectures: a BiLSTM Variational Autoencoder for uncertainty-aware and generative predictions, and a BiLSTM Transformer hybrid that employs self-attention to capture long-range temporal behavior. Both models act as surrogate predictors, with the BiLSTM-VAE offering probabilistic output and the BiLSTM-Transformer delivering high deterministic accuracy. Performance is evaluated using RMSE, MAE, and $R^2$. Results show that the BiLSTM-VAE provides stable and reliable creep strain forecasts, while the BiLSTM-Transformer achieves strong accuracy across the full time range. Latency tests indicate substantial speedup: while each ANSYS simulation requires 30 to 40 minutes for a given stress-temperature condition, the surrogate models produce predictions within seconds. The proposed framework enables rapid creep assessment for design optimization and structural health monitoring, and provides a scalable solution for high-temperature alloy applications.
Fonte: arXiv cs.LG
Evaluation/Benchmarks • Score 89
Enhancing Long Document Long Form Summarisation with Self-Planning
arXiv:2512.17179v1 Announce Type: new
Abstract: We introduce a novel approach for long context summarisation, highlight-guided generation, that leverages sentence-level information as a content plan to improve the traceability and faithfulness of generated summaries. Our framework applies self-planning methods to identify important content and then generates a summary conditioned on the plan. We explore both end-to-end and two-stage variants of the approach, finding that the two-stage pipeline performs better on long and information-dense documents. Experiments on long-form summarisation datasets demonstrate that our method consistently improves factual consistency while preserving relevance and overall quality. On GovReport, our best approach improves ROUGE-L by 4.1 points and achieves about 35% gains in SummaC scores. Qualitative analysis shows that highlight-guided summarisation helps preserve important details, leading to more accurate and insightful summaries across domains.
Fonte: arXiv cs.CL
NLP/LLMs • Score 93
A lightweight Spatial-Temporal Graph Neural Network for Long-term Time Series Forecasting
arXiv:2512.17453v1 Announce Type: new
Abstract: We propose Lite-STGNN, a lightweight spatial-temporal graph neural network for long-term multivariate forecasting that integrates decomposition-based temporal modeling with learnable sparse graph structure. The temporal module applies trend-seasonal decomposition, while the spatial module performs message passing with low-rank Top-$K$ adjacency learning and conservative horizon-wise gating, enabling spatial corrections that enhance a strong linear baseline. Lite-STGNN achieves state-of-the-art accuracy on four benchmark datasets for horizons up to 720 steps, while being parameter-efficient and substantially faster to train than transformer-based methods. Ablation studies show that the spatial module yields a 4.6% improvement over the temporal baseline, Top-$K$ enhances locality by 3.3%, and learned adjacency matrices reveal domain-specific interaction dynamics. Lite-STGNN thus offers a compact, interpretable, and efficient framework for long-term multivariate time series forecasting.
Fonte: arXiv cs.LG
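The low-rank Top-$K$ adjacency learning can be sketched as node-embedding products with row-wise Top-K masking and normalization; the embedding parameterization, masking, and softmax normalization below are assumptions standing in for Lite-STGNN's exact design.

```python
import torch
import torch.nn as nn

class LowRankTopKAdjacency(nn.Module):
    """Sketch of learnable sparse graph structure: adjacency logits come from
    a low-rank product of node embeddings, and each row is sparsified to its
    Top-K strongest neighbors before row normalization."""

    def __init__(self, n_nodes: int, rank: int, k: int):
        super().__init__()
        self.E1 = nn.Parameter(torch.randn(n_nodes, rank) * 0.1)
        self.E2 = nn.Parameter(torch.randn(n_nodes, rank) * 0.1)
        self.k = k

    def forward(self) -> torch.Tensor:
        logits = self.E1 @ self.E2.t()                # low-rank (N, N) scores
        topk = torch.topk(logits, self.k, dim=-1)
        mask = torch.full_like(logits, float("-inf"))
        mask.scatter_(-1, topk.indices, topk.values)  # keep Top-K per row
        return torch.softmax(mask, dim=-1)            # row-normalized adjacency
```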
NLP/LLMs • Score 95
Data Augmentation Supporting a Conversational Agent Designed for Smoking Cessation Support Groups
arXiv:2512.17092v1 Announce Type: new
Abstract: Online support groups for smoking cessation are economical and accessible, yet they often face challenges with low user engagement and stigma. An automatic conversational agent can improve engagement by ensuring that all user comments receive a timely response. We address the challenge of insufficient high-quality data by employing a two-level data augmentation strategy: synthetic data augmentation and real data augmentation. First, we fine-tuned an open-source LLM to classify posts from our existing smoking cessation support groups and identify intents with low F1 scores (the harmonic mean of precision and recall). Then, for these intents, we generate additional synthetic data using prompt engineering with the GPT model, with an average of 87% of the generated synthetic posts deemed high quality by human annotators. Overall, the synthetic augmentation process resulted in 43% of the original posts being selected for augmentation, followed by 140% synthetic expansion of these posts. Additionally, we scraped more than 10,000 real posts from a related online support context, of which 73% were validated as good quality by human annotators. Each synthetic or scraped post underwent rigorous validation involving human reviewers to ensure quality and relevance. The validated new data, combined with the original support group posts, formed an augmented dataset used to retrain the intent classifier. Performance evaluation of the retrained model demonstrated a 32% improvement in F1, confirming the effectiveness of our data augmentation approach. Synthetic and real post augmentation led to similar performance improvements. This study provides a replicable framework for enhancing conversational agent performance in domains where data scarcity is a critical issue.
Fonte: arXiv cs.CL
Evaluation/Benchmarks • Score 90
Bayesian Optimisation: Which Constraints Matter?
arXiv:2512.17569v1 Announce Type: new
Abstract: Bayesian optimisation has proven to be a powerful tool for expensive global black-box optimisation problems. In this paper, we propose new Bayesian optimisation variants of the popular Knowledge Gradient acquisition functions for problems with decoupled black-box constraints, in which subsets of the objective and constraint functions may be evaluated independently. In particular, our methods aim to take into account that often only a handful of the constraints may be binding at the optimum, and hence we should evaluate only relevant constraints when trying to optimise a function. We empirically benchmark these methods against existing methods and demonstrate their superiority over the state-of-the-art.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
Adversarially Robust Detection of Harmful Online Content: A Computational Design Science Approach
arXiv:2512.17367v1 Announce Type: new
Abstract: Social media platforms are plagued by harmful content such as hate speech, misinformation, and extremist rhetoric. Machine learning (ML) models are widely adopted to detect such content; however, they remain highly vulnerable to adversarial attacks, wherein malicious users subtly modify text to evade detection. Enhancing adversarial robustness is therefore essential, requiring detectors that can defend against diverse attacks (generalizability) while maintaining high overall accuracy. However, simultaneously achieving both optimal generalizability and accuracy is challenging. Following the computational design science paradigm, this study takes a sequential approach that first proposes a novel framework (Large Language Model-based Sample Generation and Aggregation, LLM-SGA) by identifying the key invariances of textual adversarial attacks and leveraging them to ensure that a detector instantiated within the framework has strong generalizability. Second, we instantiate our detector (Adversarially Robust Harmful Online Content Detector, ARHOCD) with three novel design components to improve detection accuracy: (1) an ensemble of multiple base detectors that exploits their complementary strengths; (2) a novel weight assignment method that dynamically adjusts weights based on each sample's predictability and each base detector's capability, with weights initialized using domain knowledge and updated via Bayesian inference; and (3) a novel adversarial training strategy that iteratively optimizes both the base detectors and the weight assignor. We addressed several limitations of existing adversarial robustness enhancement research and empirically evaluated ARHOCD across three datasets spanning hate speech, rumor, and extremist content. Results show that ARHOCD offers strong generalizability and improves detection accuracy under adversarial conditions.
Fonte: arXiv cs.LG
RL • Score 96
DiffeoMorph: Learning 3D Shape Morphing Using Differentiable Agent-Based Simulations
Biological systems can form complex three-dimensional structures through the collective behavior of identical agents. In this work, we present DiffeoMorph, a differentiable framework for learning a morphogenesis protocol that guides a population of agents to transform into a target 3D shape, using an attention-based SE(3)-equivariant graph neural network.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
Turn-PPO: Turn-Level Advantage Estimation with PPO for Improving Multi-Turn RL in Agentic LLMs
Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks reveals significant limitations. To overcome these challenges, we investigate more stable and effective advantage estimation strategies, introducing turn-PPO, a variant that operates on a turn-level MDP formulation.
Fonte: arXiv cs.LG
Vision • Score 96
Lightweight, Physics-Informed Machine Learning for Aviation Visibility Nowcasting Across Diverse Climate Regimes
Short-term forecasting (nowcasting) of low-visibility and precipitation events is crucial for aviation safety and operational efficiency. This study presents a lightweight gradient boosting (XGBoost) framework trained exclusively on surface observation data (METAR) and enhanced with feature engineering guided by thermodynamic principles.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
QSMOTE-PGM/kPGM: Imbalanced Dataset Classification Based on QSMOTE and kPGM
Quantum-inspired machine learning (QiML) uses mathematical structures from quantum theory to enhance classical algorithms, focusing on inner-product structures in high-dimensional feature spaces. This work presents a unified theoretical and empirical comparison of PGM- and kPGM-based classifiers, analyzing their performance in synthetic oversampling scenarios using Quantum SMOTE (QSMOTE) variants.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
Adversarial VR: An Open-Source Testbed for Evaluating Adversarial Robustness in VR Cybersickness Detection and Mitigation
Automated deep learning (DL)-based cybersickness detection methods can improve user comfort and interaction. However, these systems are susceptible to adversarial attacks, which can degrade model performance and disrupt the immersive experience. This paper presents Adversarial-VR, a real-time testbed for evaluating cybersickness detection and mitigation strategies under adversarial conditions.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
A Women's Health Benchmark for Large Language Models
As large language models (LLMs) become primary sources of health information for millions, their accuracy on women's health remains critically unexplored. We present the Women's Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically on women's health.
Fonte: arXiv cs.AI
RL • Score 96
Privacy-Preserving Synthetic Dataset of Individual Daily Trajectories for City-Scale Mobility Analytics
Urban mobility data are essential for urban planning, transport demand forecasting, and pandemic modeling. This study presents a privacy-preserving synthetic dataset that reconstructs daily trajectories from aggregated inputs, without requiring personal identifiers.
Fonte: arXiv cs.AI
Vision • Score 96
Bots Don't Stand Still: A Longitudinal Study of Bot Behavior Change, Temporal Drift, and Feature-Structure Evolution
Social bots are deeply embedded in online platforms for promotion, persuasion, and manipulation. Most bot detection systems still treat behavioral features as static, implicitly assuming that bots behave stationarily over time. This study analyzes change in individual behavioral signals and their interrelations, using 2,615 promotional bot accounts and 2.8 million tweets.
Fonte: arXiv cs.AI
Vision • Score 96
Another Fit Falls: Conformal Prediction as a Calibration Standard for Machine Learning in High-Energy Physics
Machine learning techniques are essential in modern collider research, but their probabilistic outputs often lack calibrated uncertainty estimates. Conformal prediction (CP) offers a simple, distribution-free framework for calibrating arbitrary predictive models, enabling rigorous uncertainty quantification. In this work, we investigate CP as a unifying calibration layer for machine learning applications in high-energy physics.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
FORMSpoT: A Decade of Tree-Level, Country-Scale Forest Monitoring
arXiv:2512.17021v1 Announce Type: new
Abstract: The recent decline of the European forest carbon sink highlights the need for spatially explicit and frequently updated forest monitoring tools. Yet, existing satellite-based disturbance products remain too coarse to detect changes at the scale of individual trees, typically below 100 m$^{2}$. Here, we introduce FORMSpoT (Forest Mapping with SPOT Time series), a decade-long (2014-2024) nationwide mapping of forest canopy height at 1.5 m resolution, together with annual disturbance polygons (FORMSpoT-$\Delta$) covering mainland France. Canopy heights were derived from annual SPOT-6/7 composites using a hierarchical transformer model (PVTv2) trained on high-resolution airborne laser scanning (ALS) data. To enable robust change detection across heterogeneous acquisitions, we developed a dedicated post-processing pipeline combining co-registration and spatio-temporal total variation denoising. Validation against ALS revisits across 19 sites and 5,087 National Forest Inventory plots shows that FORMSpoT-$\Delta$ substantially outperforms existing disturbance products. In mountainous forests, where disturbances are small and spatially fragmented, FORMSpoT-$\Delta$ achieves an F1-score of 0.44, an order of magnitude higher than existing benchmarks. By enabling tree-level monitoring of forest dynamics at national scale, FORMSpoT-$\Delta$ provides a unique tool to analyze management practices, detect early signals of forest decline, and better quantify carbon losses from subtle disturbances such as thinning or selective logging. These results underscore the critical importance of sustaining very high-resolution satellite missions like SPOT and open-data initiatives such as DINAMIS for monitoring forests under climate change.
Fonte: arXiv cs.CV
NLP/LLMs • Score 96
MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval
Large Language Model (LLM) agents increasingly rely on long-term memory and Retrieval-Augmented Generation (RAG) to persist experiences and refine future performance. This work presents MemoryGraft, an indirect injection attack that compromises agent behavior by implanting malicious experiences into its long-term memory.
Fonte: arXiv cs.AI
RL • Score 96
Adaptive Graph Pruning with Sudden-Events Evaluation for Traffic Prediction using Online Semi-Decentralized ST-GNNs
arXiv:2512.17352v1 Announce Type: new
Abstract: Spatio-Temporal Graph Neural Networks (ST-GNNs) are well-suited for processing high-frequency data streams from geographically distributed sensors in smart mobility systems. However, their deployment at the edge across distributed compute nodes (cloudlets) creates substantial communication overhead due to repeated transmission of overlapping node features between neighbouring cloudlets. To address this, we propose an adaptive pruning algorithm that dynamically filters redundant neighbour features while preserving the most informative spatial context for prediction. The algorithm adjusts pruning rates based on recent model performance, allowing each cloudlet to focus on regions experiencing traffic changes without compromising accuracy. Additionally, we introduce the Sudden Event Prediction Accuracy (SEPA), a novel event-focused metric designed to measure responsiveness to traffic slowdowns and recoveries, which are often missed by standard error metrics. We evaluate our approach in an online semi-decentralized setting with traditional FL, server-free FL, and Gossip Learning on two large-scale traffic datasets, PeMS-BAY and PeMSD7-M, across short-, mid-, and long-term prediction horizons. Experiments show that, in contrast to standard metrics, SEPA exposes the true value of spatial connectivity in predicting dynamic and irregular traffic. Our adaptive pruning algorithm maintains prediction accuracy while significantly lowering communication cost in all online semi-decentralized settings, demonstrating that communication can be reduced without compromising responsiveness to critical traffic events.
Fonte: arXiv cs.LG
MLOps/Systems • Score 96
UmniBench: An Omnidimensional Benchmark for Unified Understanding and Generation Models
UmniBench is a benchmark designed for Unified Multimodal Models (UMMs), enabling omnidimensional evaluation. It assesses understanding, generation, and editing capabilities in a single process, using human-examined prompts and QA pairs. The benchmark covers 13 major domains and more than 200 concepts, offering a comprehensive and objective view of unified models.
Fonte: arXiv cs.AI
Evaluation/Benchmarks • Score 93
Optimizing Text Search: A New Pattern-Matching Algorithm Based on Ukkonen's Approach
In computer science, the efficiency of text search algorithms is crucial for processing large volumes of data in areas such as natural language processing and bioinformatics. This study investigates text search algorithms, focusing on optimizing suffix trees through methods such as splitting and Ukkonen's algorithm, demonstrating linear time and space efficiency.
Fonte: arXiv cs.AI
Vision • Score 95
SynergyWarpNet: Attention-Guided Cooperative Warping for Neural Portrait Animation
arXiv:2512.17331v1 Announce Type: new
Abstract: Recent advances in neural portrait animation have demonstrated remarkable potential for applications in virtual avatars, telepresence, and digital content creation. However, traditional explicit warping approaches often struggle with accurate motion transfer or recovering missing regions, while recent attention-based warping methods, though effective, frequently suffer from high complexity and weak geometric grounding. To address these issues, we propose SynergyWarpNet, an attention-guided cooperative warping framework designed for high-fidelity talking head synthesis. Given a source portrait, a driving image, and a set of reference images, our model progressively refines the animation in three stages. First, an explicit warping module performs coarse spatial alignment between the source and driving image using 3D dense optical flow. Next, a reference-augmented correction module leverages cross-attention across 3D keypoints and texture features from multiple reference images to semantically complete occluded or distorted regions. Finally, a confidence-guided fusion module integrates the warped outputs with spatially-adaptive fusing, using a learned confidence map to balance structural alignment and visual consistency. Comprehensive evaluations on benchmark datasets demonstrate state-of-the-art performance.
Fonte: arXiv cs.CV
RL • Score 96
On Time: Model-Free Reinforcement Learning with Timed Reward Machines
Reward specification plays a central role in reinforcement learning (RL), guiding agent behavior. In this paper, we propose timed reward machines (TRMs), which extend traditional reward machines by incorporating timing constraints into the reward structure, enabling more expressive and tunable specifications.
Fonte: arXiv cs.AI
RL • Score 96
Alzheimer's Disease Brain Network Mining
arXiv:2512.17276v1 Announce Type: new
Abstract: Machine learning approaches for Alzheimer's disease (AD) diagnosis face a fundamental challenge. Clinical assessments are expensive and invasive, leaving ground truth labels available for only a fraction of neuroimaging datasets. We introduce Multi-view Adaptive Transport Clustering for Heterogeneous Alzheimer's Disease (MATCH-AD), a semi-supervised framework that integrates deep representation learning, graph-based label propagation, and optimal transport theory to address this limitation. The framework leverages manifold structure in neuroimaging data to propagate diagnostic information from limited labeled samples to larger unlabeled populations, while using Wasserstein distances to quantify disease progression between cognitive states. Evaluated on nearly five thousand subjects from the National Alzheimer's Coordinating Center, encompassing structural MRI measurements from hundreds of brain regions, cerebrospinal fluid biomarkers, and clinical variables, MATCH-AD achieves near-perfect diagnostic accuracy despite ground truth labels for less than one-third of subjects. The framework substantially outperforms all baseline methods, achieving a kappa indicating almost-perfect agreement compared to weak agreement for the best baseline, a qualitative transformation in diagnostic reliability. Performance remains clinically useful even under severe label scarcity, and we provide theoretical convergence guarantees with proven bounds on label propagation error and transport stability. These results demonstrate that principled semi-supervised learning can unlock the diagnostic potential of the vast repositories of partially annotated neuroimaging data accumulating worldwide, substantially reducing annotation burden while maintaining accuracy suitable for clinical deployment.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
ScoutGPT: Capturing Player Impact from Team Action Sequences Using a GPT-Based Framework
Transfers play a key role in a football club's success, but predicting whether a transfer will succeed remains difficult due to the strong context dependence of on-pitch performance. To address this gap, we introduce EventGPT, a player-conditioned, value-aware next-event prediction model built on a GPT-style autoregressive transformer.
Fonte: arXiv cs.AI
RL • Score 96
Electric Vehicle Charging Load Forecasting: An Experimental Comparison of Machine Learning Methods
arXiv:2512.17257v1 Announce Type: new
Abstract: With the growing popularity of electric vehicles as a means of addressing climate change, concerns have emerged regarding their impact on electric grid management. As a result, predicting EV charging demand has become a timely and important research problem. While substantial research has addressed energy load forecasting in transportation, relatively few studies systematically compare multiple forecasting methods across different temporal horizons and spatial aggregation levels in diverse urban settings. This work investigates the effectiveness of five time series forecasting models, ranging from traditional statistical approaches to machine learning and deep learning methods. Forecasting performance is evaluated for short-, mid-, and long-term horizons (on the order of minutes, hours, and days, respectively), and across spatial scales ranging from individual charging stations to regional and city-level aggregations. The analysis is conducted on four publicly available real-world datasets, with results reported independently for each dataset. To the best of our knowledge, this is the first work to systematically evaluate EV charging demand forecasting across such a wide range of temporal horizons and spatial aggregation levels using multiple real-world datasets.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
A Solver-in-the-Loop Framework for Improving LLMs at Answer Set Programming for Logic Puzzle Solving
The rise of large language models (LLMs) has sparked interest in coding assistants. This paper presents a novel ASP-solver-in-the-loop approach to solver-guided instruction tuning, focusing on code generation for Answer Set Programming (ASP) with the aim of solving complex combinatorial search problems.
Fonte: arXiv cs.AI
MLOps/Systems • Score 96
MINPO: Memory-Informed Neural Pseudo-Operator to Resolve Nonlocal Spatiotemporal Dynamics
arXiv:2512.17273v1 Announce Type: new
Abstract: Many physical systems exhibit nonlocal spatiotemporal behaviors described by integro-differential equations (IDEs). Classical methods for solving IDEs require repeatedly evaluating convolution integrals, whose cost increases quickly with kernel complexity and dimensionality. Existing neural solvers can accelerate selected instances of these computations, yet they do not generalize across diverse nonlocal structures. In this work, we introduce the Memory-Informed Neural Pseudo-Operator (MINPO), a unified framework for modeling nonlocal dynamics arising from long-range spatial interactions and/or long-term temporal memory. MINPO, employing either Kolmogorov-Arnold Networks (KANs) or multilayer perceptron networks (MLPs) as encoders, learns the nonlocal operator and its inverse directly through neural representations, and then explicitly reconstructs the unknown solution fields. The learning is guarded by a lightweight nonlocal consistency loss term to enforce coherence between the learned operator and reconstructed solution. The MINPO formulation naturally captures and efficiently resolves nonlocal spatiotemporal dependencies governed by a wide spectrum of IDEs and their subsets, including fractional PDEs. We evaluate the efficacy of MINPO in comparison with classical techniques and state-of-the-art neural-based strategies based on MLPs, such as A-PINN and fPINN, along with their newly-developed KAN variants, A-PIKAN and fPIKAN, designed to facilitate a fair comparison. Our study offers compelling evidence of the accuracy of MINPO and demonstrates its robustness in handling (i) diverse kernel types, (ii) different kernel dimensionalities, and (iii) the substantial computational demands arising from repeated evaluations of kernel integrals. MINPO, thus, generalizes beyond problem-specific formulations, providing a unified framework for systems governed by nonlocal operators.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
The Effect of Negation on CLIP in Medical Imaging: Limitations of Contrastive Language-Image Pretraining
arXiv:2512.17121v1 Announce Type: new
Abstract: Large vision-language models like CLIP are increasingly used in medical imaging tasks due to their ability to align images and text without the need for extensive labeled data. This makes them particularly useful for applications like image retrieval, report generation, and classification in clinical settings. A potential issue with this approach is that CLIP-based models often underperform when interpreting negated phrases, which is especially problematic in the context of medical diagnosis. In this study, we evaluate the Stanford AIMI CheXagent model on its ability to correctly retrieve chest X-ray images using prompts with and without negation. The goal of this project is to understand where this model fails and then use it as a base model to improve its retrieval accuracy via fine-tuning methods outlined in previous work. Results from this study show improvement in the CLIP model's handling of negation, with a slight decrease in accuracy on positive-prompt evaluation. Alongside retrieval accuracy, we examined internal model behavior through token attribution, t-SNE projection, and attention-head ablation to better characterize how each fine-tuning approach reshaped the text encoder's representation of negated clinical language. Through this work, we hope to better understand the internal behavior of CLIP and improve its handling of negation using clinically relevant language, improving its reliability in medical AI devices.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
MMRAG-RFT: Two-Stage Reinforcement Fine-Tuning for Explainable Multi-modal Retrieval-Augmented Generation
Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly reliable generation by integrating external multi-modal knowledge and has demonstrated impressive performance in complex scenarios. However, existing methods fail to clarify the reasoning logic behind retrieval and answer generation. We propose introducing reinforcement learning to enhance the reasoning capabilities of multi-modal language models.
Fonte: arXiv cs.AI
Vision • Score 95
Towards Pixel-Wise Anomaly Location for High-Resolution PCBA via Self-Supervised Image Reconstruction
arXiv:2512.17296v1 Announce Type: new
Abstract: Automated defect inspection of assembled Printed Circuit Board Assemblies (PCBA) is quite challenging due to insufficient labeled data and micro-defects spanning just a few pixels in visually complex, high-resolution images. To address these challenges, we present HiSIR-Net, a High-resolution Self-supervised Image Reconstruction framework for pixel-wise PCBA anomaly localization. Our design combines two lightweight modules that make this practical on real 4K-resolution boards: (i) a Selective Input-Reconstruction Gate (SIR-Gate) that lets the model decide where to trust reconstruction versus the original input, thereby reducing irrelevant reconstruction artifacts and false alarms; and (ii) a Region-level Optimized Patch Selection (ROPS) scheme with positional cues to select overlapping patch reconstructions coherently across arbitrary resolutions. Organically integrating these mechanisms yields clean, high-resolution anomaly maps with a low false-positive (FP) rate. To bridge the gap in high-resolution PCBA datasets, we further contribute a self-collected dataset named SIPCBA-500 of 500 images. We conduct extensive experiments on our SIPCBA-500 as well as public benchmarks, demonstrating the superior localization performance of our method while running at practical speed. Full code and dataset will be made available upon acceptance.
Fonte: arXiv cs.CV
MLOps/Systems • Score 93
Digitizing Nepal's Written Heritage: A Comprehensive HTR Pipeline for Old Nepali Manuscripts
arXiv:2512.17111v1 Announce Type: new
Abstract: This paper presents the first end-to-end pipeline for Handwritten Text Recognition (HTR) for Old Nepali, a historically significant but low-resource language. We adopt a line-level transcription approach and systematically explore encoder-decoder architectures and data-centric techniques to improve recognition accuracy. Our best model achieves a Character Error Rate (CER) of 4.9%. In addition, we implement and evaluate decoding strategies and analyze token-level confusions to better understand model behaviour and error patterns. While the dataset we used for evaluation is confidential, we release our training code, model configurations, and evaluation scripts to support further research in HTR for low-resource historical scripts.
Fonte: arXiv cs.LG
Vision • Score 95
Infinite Homography as Robust Conditioning for Camera-Controlled Video Generation
Recent progress in video diffusion models has sparked growing interest in camera-controlled novel-view video generation for dynamic scenes. A key challenge is ensuring fidelity to the specified camera pose while maintaining view consistency and reasoning about occluded geometry from limited observations. We present InfCam, a camera-controlled video-to-video generation framework with high pose fidelity.
Fonte: arXiv cs.CV
NLP/LLMs • Score 96
Dynamic Tool Dependency Retrieval for Efficient Function Calling
arXiv:2512.17052v1 Announce Type: new
Abstract: Function calling agents powered by Large Language Models (LLMs) select external tools to automate complex tasks. On-device agents typically use a retrieval module to select relevant tools, improving performance and reducing context length. However, existing retrieval methods rely on static and limited inputs, failing to capture multi-step tool dependencies and evolving task context. This limitation often introduces irrelevant tools that mislead the agent, degrading efficiency and accuracy. We propose Dynamic Tool Dependency Retrieval (DTDR), a lightweight retrieval method that conditions on both the initial query and the evolving execution context. DTDR models tool dependencies from function calling demonstrations, enabling adaptive retrieval as plans unfold. We benchmark DTDR against state-of-the-art retrieval methods across multiple datasets and LLM backbones, evaluating retrieval precision, downstream task accuracy, and computational efficiency. Additionally, we explore strategies to integrate retrieved tools into prompts. Our results show that dynamic tool retrieval improves function calling success rates between $23\%$ and $104\%$ compared to state-of-the-art static retrievers.
Fonte: arXiv cs.LG
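One plausible reading of retrieval conditioned on both the initial query and the evolving execution context is to mix embedding similarity with dependency statistics mined from function-calling demonstrations. The sketch below is purely illustrative; every name and the mixing rule are assumptions, not DTDR's actual method.

```python
import numpy as np

def retrieve_tools(query_vec, executed, tool_vecs, co_occur, top_k=5, alpha=0.5):
    """Score tools by a mix of (i) similarity to the initial query and
    (ii) dependency strength with already-executed tools, where co_occur[i, j]
    estimates how often tool j follows tool i in demonstrations. Re-run after
    each step so retrieval adapts as the plan unfolds."""
    sim = tool_vecs @ query_vec / (
        np.linalg.norm(tool_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    dep = np.zeros(len(tool_vecs))
    for t in executed:            # evolving execution context
        dep += co_occur[t]
    if executed:
        dep /= len(executed)
    return np.argsort(-(alpha * sim + (1 - alpha) * dep))[:top_k]
```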
MLOps/Systems • Score 95
Regularized Random Fourier Features and Finite Element Reconstruction for Operator Learning in Sobolev Space
arXiv:2512.17884v1 Announce Type: cross
Abstract: Operator learning is a data-driven approximation of mappings between infinite-dimensional function spaces, such as the solution operators of partial differential equations. Kernel-based operator learning can offer accurate, theoretically justified approximations that require less training than standard methods. However, they can become computationally prohibitive for large training sets and can be sensitive to noise. We propose a regularized random Fourier feature (RRFF) approach, coupled with a finite element reconstruction map (RRFF-FEM), for learning operators from noisy data. The method uses random features drawn from multivariate Student's $t$ distributions, together with frequency-weighted Tikhonov regularization that suppresses high-frequency noise. We establish high-probability bounds on the extreme singular values of the associated random feature matrix and show that when the number of features $N$ scales like $m \log m$ with the number of training samples $m$, the system is well-conditioned, which yields estimation and generalization guarantees. Detailed numerical experiments on benchmark PDE problems, including advection, Burgers', Darcy flow, Helmholtz, Navier-Stokes, and structural mechanics, demonstrate that RRFF and RRFF-FEM are robust to noise and achieve improved performance with reduced training time compared to the unregularized random feature model, while maintaining competitive accuracy relative to kernel and neural operator tests.
Fonte: arXiv stat.ML
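The core RRFF recipe, random features with multivariate Student's t frequencies plus frequency-weighted Tikhonov regularization, fits in a few lines. The specific weighting 1 + ||w||^2 and the cosine feature map are assumptions standing in for the paper's exact choices.

```python
import numpy as np

def rrff_fit(X, y, n_features=512, df=4.0, lam=1e-3, seed=0):
    """Draw frequencies from a Student's t distribution (normal / sqrt(chi2/df)),
    build cosine random features, and solve a Tikhonov system whose penalty
    grows with frequency magnitude, suppressing high-frequency noise."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_features, X.shape[1])) / np.sqrt(
        rng.chisquare(df, (n_features, 1)) / df)
    b = rng.uniform(0, 2 * np.pi, n_features)
    Phi = np.cos(X @ W.T + b)                        # (m, N) feature matrix
    weights = 1.0 + np.linalg.norm(W, axis=1) ** 2   # penalize high frequencies more
    coef = np.linalg.solve(Phi.T @ Phi + lam * np.diag(weights), Phi.T @ y)
    return W, b, coef  # predict with np.cos(X_new @ W.T + b) @ coef
```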
Vision • Score 95
Any-Optical-Model: A Universal Foundation Model for Optical Remote Sensing
Optical satellites, with their diverse band configurations and ground sampling distances, provide indispensable evidence for tasks ranging from ecosystem surveillance to emergency response. However, significant discrepancies in band composition and spatial resolution across different optical sensors pose major challenges for existing Remote Sensing Foundation Models (RSFMs).
Fonte: arXiv cs.CV
NLP/LLMs • Score 96
Solomonoff-Inspired Hypothesis Ranking with LLMs for Prediction under Uncertainty
Reasoning under uncertainty is a fundamental challenge in AI, especially in real-world tasks where data-scarce problems demand systematic generalization. We propose a Solomonoff-inspired method that weights LLM-generated hypotheses by simplicity and predictive fit, producing conservative, uncertainty-aware predictions.
Fonte: arXiv cs.AI
Vision • Score 95
ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching
arXiv:2512.17178v1 Announce Type: new
Abstract: Contrastive Language-Image Pretraining (CLIP) has achieved remarkable performance in various multimodal tasks. However, it still struggles with compositional image-text matching, particularly in accurately associating objects with their corresponding attributes, because its inherent global representation often overlooks fine-grained semantics for attribute binding. Existing methods often require additional training or extensive hard negative sampling, yet they frequently show limited generalization to novel compositional concepts and fail to fundamentally address the drawbacks of global representations. In this paper, we propose ABE-CLIP, a novel training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. Specifically, we employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in the text, thereby mitigating attribute confusion and improving semantic precision. We further introduce a Local Token-Patch Alignment strategy that computes similarity scores between refined textual tokens and their most relevant image patches. By aggregating localized similarity scores, ABE-CLIP computes the final image-text similarity. Experiments on multiple datasets demonstrate that ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training.
Fonte: arXiv cs.CV
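The Local Token-Patch Alignment step can be read as a max-over-patches similarity per refined text token, aggregated into the final image-text score; the sketch below shows that reading and omits the Semantic Refinement Mechanism that produces the token embeddings.

```python
import torch

def local_token_patch_score(token_embs, patch_embs):
    """Match each refined text token (e.g., an attribute or object phrase) to
    its single most similar image patch, then aggregate these localized
    similarities instead of comparing only global embeddings."""
    t = torch.nn.functional.normalize(token_embs, dim=-1)  # (n_tokens, d)
    p = torch.nn.functional.normalize(patch_embs, dim=-1)  # (n_patches, d)
    sim = t @ p.t()                                        # token-to-patch similarities
    return sim.max(dim=1).values.mean()                    # best patch per token, averaged
```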
Vision • Score 95
AnyCXR: Human Anatomy Segmentation in Chest Radiographs at Any Acquisition Position Using Multi-Stage Domain-Randomized Synthetic Data with Imperfect Annotations and Conditional Joint Annotation Regularization Learning
Robust anatomical segmentation of chest radiographs (CXRs) remains challenging due to the scarcity of comprehensive annotations and the substantial variability of real-world acquisition conditions. We propose AnyCXR, a unified framework that enables generalizable multi-organ segmentation at arbitrary CXR projection angles using only synthetic supervision.
Fonte: arXiv cs.CV
Vision • Score 92
MatLat: Material Latent Space for PBR Texture Generation
arXiv:2512.17302v1 Announce Type: new
Abstract: We propose a generative framework for producing high-quality PBR textures on a given 3D mesh. As large-scale PBR texture datasets are scarce, our approach focuses on effectively leveraging the embedding space and diffusion priors of pretrained latent image generative models while learning a material latent space, MatLat, through targeted fine-tuning. Unlike prior methods that freeze the embedding network and thus lead to distribution shifts when encoding additional PBR channels and hinder subsequent diffusion training, we fine-tune the pretrained VAE so that new material channels can be incorporated with minimal latent distribution deviation. We further show that correspondence-aware attention alone is insufficient for cross-view consistency unless the latent-to-image mapping preserves locality. To enforce this locality, we introduce a regularization in the VAE fine-tuning that crops latent patches, decodes them, and aligns the corresponding image regions to maintain strong pixel-latent spatial correspondence. Ablation studies and comparison with previous baselines demonstrate that our framework improves PBR texture fidelity and that each component is critical for achieving state-of-the-art performance.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images
arXiv:2512.17306v1 Announce Type: new
Abstract: Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.
Fonte: arXiv cs.CV
Vision • Score 95
Multi-Level Distortion-Aware Deformable Network for Omnidirectional Image Super-Resolution
With the growing popularity of augmented and virtual reality applications, image processing for Omnidirectional Images (ODIs) has attracted increasing attention. Omnidirectional Image Super-Resolution (ODISR) is a promising technique for improving the visual quality of ODIs. We propose a novel Multi-Level Distortion-Aware Deformable Network (MDDN) for ODISR, designed to expand the sampling range and receptive field.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Mitty: Diffusion-Based Human-to-Robot Video Generation
Learning directly from human demonstration videos is an important milestone for scalable and generalizable robot learning. We present Mitty, a Diffusion Transformer that enables video In-Context Learning for Human2Robot video generation, built on a pretrained video diffusion model and requiring no action labels.
Fonte: arXiv cs.CV
Vision • Score 95
Robust Scene Coordinate Regression via Geometrically-Consistent Global Descriptors
arXiv:2512.17226v1 Announce Type: new
Abstract: Recent learning-based visual localization methods use global descriptors to disambiguate visually similar places, but existing approaches often derive these descriptors from geometric cues alone (e.g., covisibility graphs), limiting their discriminative power and reducing robustness in the presence of noisy geometric constraints. We propose an aggregator module that learns global descriptors consistent with both geometrical structure and visual similarity, ensuring that images are close in descriptor space only when they are visually similar and spatially connected. This corrects erroneous associations caused by unreliable overlap scores. Using a batch-mining strategy based solely on the overlap scores and a modified contrastive loss, our method trains without manual place labels and generalizes across diverse environments. Experiments on challenging benchmarks show substantial localization gains in large-scale environments while preserving computational and memory efficiency. Code is available at https://github.com/sontung/robust_scr
Fonte: arXiv cs.CV
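The label-free objective, descriptors close only for pairs that are both visually similar and spatially connected, resembles a contrastive loss gated by overlap scores. The sketch below is one assumed form of such a modified contrastive loss; the thresholds and margin are illustrative, not the paper's values.

```python
import torch

def overlap_contrastive_loss(desc, overlap, pos_thresh=0.5, neg_thresh=0.1, margin=0.5):
    """Pull together descriptor pairs whose geometric overlap score is high,
    push apart (beyond a margin) pairs whose overlap is low; mid-range pairs
    are ignored, which tolerates noisy overlap estimates."""
    d = torch.cdist(desc, desc)                 # pairwise descriptor distances
    pos = overlap > pos_thresh
    neg = overlap < neg_thresh
    loss_pos = (d[pos] ** 2).mean() if pos.any() else d.new_zeros(())
    loss_neg = (torch.clamp(margin - d[neg], min=0) ** 2).mean() if neg.any() else d.new_zeros(())
    return loss_pos + loss_neg
```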
NLP/LLMs • Score 95
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
arXiv:2512.17012v1 Announce Type: new
Abstract: Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Rotterdam artery-vein segmentation (RAV) dataset
arXiv:2512.17322v1 Announce Type: new
Abstract: Purpose: To provide a diverse, high-quality dataset of color fundus images (CFIs) with detailed artery-vein (A/V) segmentation annotations, supporting the development and evaluation of machine learning algorithms for vascular analysis in ophthalmology.
Methods: CFIs were sampled from the longitudinal Rotterdam Study (RS), encompassing a wide range of ages, devices, and capture conditions. Images were annotated using a custom interface that allowed graders to label arteries, veins, and unknown vessels on separate layers, starting from an initial vessel segmentation mask. Connectivity was explicitly verified and corrected using connected component visualization tools.
Results: The dataset includes 1024x1024-pixel PNG images in three modalities: original RGB fundus images, contrast-enhanced versions, and RGB-encoded A/V masks. Image quality varied widely, including challenging samples typically excluded by automated quality assessment systems, but judged to contain valuable vascular information.
Conclusion: This dataset offers a rich and heterogeneous source of CFIs with high-quality segmentations. It supports robust benchmarking and training of machine learning models under real-world variability in image quality and acquisition settings.
Translational Relevance: By including connectivity-validated A/V masks and diverse image conditions, this dataset enables the development of clinically applicable, generalizable machine learning tools for retinal vascular analysis, potentially improving automated screening and diagnosis of systemic and ocular diseases.
Fonte: arXiv cs.CV
Vision • Score 95
EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories
arXiv:2512.17320v1 Announce Type: new
Abstract: The widespread adoption of text-to-image (T2I) generation has raised concerns about privacy, bias, and copyright violations. Concept erasure techniques offer a promising solution by selectively removing undesired concepts from pre-trained models without requiring full retraining. However, these methods are often evaluated on a limited set of concepts, relying on overly simplistic and direct prompts. To test the boundaries of concept erasure techniques, and assess whether they truly remove targeted concepts from model representations, we introduce EMMA, a benchmark that evaluates five key dimensions of concept erasure over 12 metrics. EMMA goes beyond standard metrics like image quality and time efficiency, testing robustness under challenging conditions, including indirect descriptions, visually similar non-target concepts, and potential gender and ethnicity bias, providing a socially aware analysis of method behavior. Using EMMA, we analyze five concept erasure methods across five domains (objects, celebrities, art styles, NSFW, and copyright). Our results show that existing methods struggle with implicit prompts (i.e., generating the erased concept when it is indirectly referenced) and visually similar non-target concepts (i.e., failing to generate non-targeted concepts resembling the erased one), while some amplify gender and ethnicity bias compared to the original model.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs
arXiv:2512.17319v1 Announce Type: new
Abstract: Multimodal large language models (MLLMs) demonstrate strong perception and reasoning performance on existing remote sensing (RS) benchmarks. However, most prior benchmarks rely on low-resolution imagery, and some high-resolution benchmarks suffer from flawed reasoning-task designs. We show that text-only LLMs can perform competitively with multimodal vision-language models on RS reasoning tasks without access to images, revealing a critical mismatch between current benchmarks and the intended evaluation of visual understanding. To enable faithful assessment, we introduce RSHR-Bench, a super-high-resolution benchmark for RS visual understanding and reasoning. RSHR-Bench contains 5,329 full-scene images with a long side of at least 4,000 pixels, with up to about 3 x 10^8 pixels per image, sourced from widely used RS corpora and UAV collections. We design four task families: multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation. These tasks cover nine perception categories and four reasoning types, supporting multi-turn and multi-image dialog. To reduce reliance on language priors, we apply adversarial filtering with strong LLMs followed by rigorous human verification. Overall, we construct 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or verified single-image evaluation VQA pairs. Evaluations across open-source, closed-source, and RS-specific VLMs reveal persistent performance gaps in super-high-resolution scenarios. Code: https://github.com/Yunkaidang/RSHR
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
CheXPO-v2: Preference Optimization for Chest X-ray VLMs with Knowledge Graph Consistency
arXiv:2512.17213v1 Announce Type: new
Abstract: Medical Vision-Language Models (VLMs) are prone to hallucinations, compromising clinical reliability. While reinforcement learning methods like Group Relative Policy Optimization (GRPO) offer a low-cost alignment solution, their reliance on sparse, outcome-based rewards inadvertently encourages models to "overthink" -- generating verbose, convoluted, and unverifiable Chain-of-Thought reasoning to justify answers. This focus on outcomes obscures factual errors and poses significant safety risks. To address this, we propose CheXPO-v2, a novel alignment framework that shifts from outcome to process supervision. Our core innovation is a Knowledge Graph Consistency Reward mechanism driven by Entity-Relation Matching. By explicitly parsing reasoning steps into structured "Disease, Relation, Anatomy" triplets, we provide fine-grained supervision that penalizes incoherent logic and hallucinations at the atomic level. Integrating this with a hard-example mining strategy, our approach significantly outperforms GRPO and state-of-the-art models on benchmarks like MIMIC-CXR-VQA. Crucially, CheXPO-v2 achieves new state-of-the-art accuracy using only 5k samples, demonstrating exceptional data efficiency while producing clinically sound and verifiable reasoning. The project source code is publicly available at: https://github.com/ecoxial2007/CheX-Phi4MM.
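The abstract describes the reward only at the level of triplet matching, so the following is a minimal, hypothetical sketch of an Entity-Relation Matching reward: score a reasoning trace by the F1 overlap between its extracted (Disease, Relation, Anatomy) triplets and a reference graph. Triplet extraction is assumed to happen upstream, and the paper's exact matching and weighting scheme may differ.

```python
def kg_consistency_reward(pred_triplets, ref_triplets):
    pred, ref = set(pred_triplets), set(ref_triplets)
    if not pred or not ref:
        return 0.0
    precision = len(pred & ref) / len(pred)   # penalizes hallucinated triplets
    recall = len(pred & ref) / len(ref)       # penalizes missing findings
    return 2 * precision * recall / (precision + recall + 1e-9)

ref = [("effusion", "located_in", "left pleura")]
pred = [("effusion", "located_in", "left pleura"), ("edema", "located_in", "lungs")]
print(round(kg_consistency_reward(pred, ref), 3))  # 0.667
```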
Fonte: arXiv cs.CV
Vision • Score 95
WDFFU-Mamba: A Wavelet-guided Dual-attention Feature Fusion Mamba for Breast Tumor Segmentation in Ultrasound Images
arXiv:2512.17278v1 Announce Type: new
Abstract: Breast ultrasound (BUS) image segmentation plays a vital role in assisting clinical diagnosis and early tumor screening. However, challenges such as speckle noise, imaging artifacts, irregular lesion morphology, and blurred boundaries severely hinder accurate segmentation. To address these challenges, this work aims to design a robust and efficient model capable of automatically segmenting breast tumors in BUS images. We propose a novel segmentation network named WDFFU-Mamba, which integrates wavelet-guided enhancement and dual-attention feature fusion within a U-shaped Mamba architecture. A Wavelet-denoised High-Frequency-guided Feature (WHF) module is employed to enhance low-level representations through noise-suppressed high-frequency cues. A Dual Attention Feature Fusion (DAFF) module is also introduced to effectively merge skip-connected and semantic features, improving contextual consistency. Extensive experiments on two public BUS datasets demonstrate that WDFFU-Mamba achieves superior segmentation accuracy, significantly outperforming existing methods in terms of Dice coefficient and 95th percentile Hausdorff Distance (HD95). The combination of wavelet-domain enhancement and attention-based fusion greatly improves both the accuracy and robustness of BUS image segmentation, while maintaining computational efficiency. The proposed WDFFU-Mamba model not only delivers strong segmentation performance but also exhibits desirable generalization ability across datasets, making it a promising solution for real-world clinical applications in breast tumor ultrasound analysis.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
arXiv:2512.17221v1 Announce Type: new
Abstract: While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Can Synthetic Images Serve as Effective and Efficient Class Prototypes?
Vision-Language Models (VLMs) have shown strong performance on zero-shot image classification tasks. However, existing methods such as Contrastive Language-Image Pre-training (CLIP) rely on annotated text-image pairs, which raises the cost and precision requirements of preparing high-quality datasets. We present the LGCLIP framework, which uses a Large Language Model (LLM) to generate class-specific prompts, enabling lightweight and efficient synthesis of reference images.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
ResSVD: Residual Compensated SVD for Large Language Model Compression
arXiv:2505.20112v3 Announce Type: replace
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in a wide range of downstream natural language processing tasks. Nevertheless, their considerable sizes and memory demands hinder practical deployment, underscoring the importance of developing efficient compression strategies. Singular value decomposition (SVD) decomposes a matrix into orthogonal components, enabling efficient low-rank approximation. This is particularly suitable for LLM compression, where weight matrices often exhibit significant redundancy. However, current SVD-based methods neglect the residual matrix from truncation, resulting in significant truncation loss. Additionally, compressing all layers of the model results in severe performance degradation. To overcome these limitations, we propose ResSVD, a new post-training SVD-based LLM compression method. Specifically, we leverage the residual matrix generated during the truncation process to reduce truncation loss. Moreover, under a fixed overall compression ratio, we selectively compress the last few layers of the model, which mitigates error propagation and significantly improves the performance of compressed models. Comprehensive evaluations of ResSVD on diverse LLM families and multiple benchmark datasets indicate that ResSVD consistently achieves superior performance over existing counterpart methods, demonstrating its practical effectiveness.
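To make the "residual matrix from truncation" concrete, here is a small NumPy sketch under assumed random weights: it computes a rank-k approximation and the residual it leaves behind, which is the quantity ResSVD is described as exploiting. The compensation step itself is the paper's contribution and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic weight matrix with a decaying spectrum, mimicking redundancy.
W = rng.standard_normal((256, 256)) @ np.diag(0.97 ** np.arange(256))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
for k in (16, 64, 128):
    W_k = (U[:, :k] * s[:k]) @ Vt[:k]   # rank-k truncated SVD
    R = W - W_k                          # residual matrix left by truncation
    loss = np.linalg.norm(R) / np.linalg.norm(W)
    print(f"rank {k:3d}: relative truncation loss {loss:.4f}")
```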
Fonte: arXiv cs.CL
NLP/LLMs • Score 96
V-Agent: An Interactive Video Search System Using Vision-Language Models
We present V-Agent, a new multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) on a small video-preference dataset and augmenting it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
Democratizing Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling
arXiv:2512.17326v1 Announce Type: new
Abstract: Vision-language models (VLMs) have the potential to become co-pilots for pathologists. However, most VLMs either focus on small regions of interest within whole-slide images, provide only static slide-level outputs, or rely on data that is not publicly available, limiting reproducibility. Furthermore, training data containing WSIs paired with detailed clinical reports is scarce, restricting progress toward transparent and generalisable VLMs. We address these limitations with three main contributions. First, we introduce Polysome, a standardised tool for synthetic instruction generation. Second, we apply Polysome to the public HISTAI dataset, generating HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, we use HISTAI-Instruct to train ANTONI-α, a VLM capable of visual-question answering (VQA). We show that ANTONI-α outperforms MedGemma on WSI-level VQA tasks of tissue identification, neoplasm detection, and differential diagnosis. We also compare the performance of multiple incarnations of ANTONI-α trained with different amounts of data. All methods, data, and code are publicly available.
Fonte: arXiv cs.CV
Vision • Score 95
Pro-Pose: Unpaired Full-Body Portrait Synthesis via Canonical UV Maps
arXiv:2512.17143v1 Announce Type: new
Abstract: Photographs of people taken by professional photographers typically present the person in beautiful lighting, with an interesting pose, and flattering quality. This is unlike common photos people can take of themselves. In this paper, we explore how to create a "professional" version of a person's photograph, i.e., in a chosen pose, in a simple environment, with good lighting, and standard black top/bottom clothing. A key challenge is to preserve the person's unique identity, face and body features while transforming the photo. If a large paired dataset existed of the same person photographed both "in the wild" and by a professional photographer, the problem would potentially be easier to solve. However, such data does not exist, especially for a large variety of identities. To that end, we propose two key insights: 1) Our method transforms the input photo and person's face to a canonical UV space, which is further coupled with reposing methodology to model occlusions and novel view synthesis. Operating in UV space allows us to leverage existing unpaired datasets. 2) We personalize the output photo via multi-image finetuning. Our approach yields high-quality, reposed portraits and achieves strong qualitative and quantitative performance on real-world imagery.
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Towards Safer Chatbots: Automated Policy Compliance Evaluation of Custom GPTs
arXiv:2502.01436v3 Announce Type: replace
Abstract: User-configured chatbots built on top of large language models are increasingly available through centralized marketplaces such as OpenAI's GPT Store. While these platforms enforce usage policies intended to prevent harmful or inappropriate behavior, the scale and opacity of customized chatbots make systematic policy enforcement challenging. As a result, policy-violating chatbots continue to remain publicly accessible despite existing review processes. This paper presents a fully automated method for evaluating the compliance of Custom GPTs with the marketplace's usage policy using black-box interaction. The method combines large-scale GPT discovery, policy-driven red-teaming prompts, and automated compliance assessment using an LLM-as-a-judge. We focus on three policy-relevant domains explicitly addressed in OpenAI's usage policies: Romantic, Cybersecurity, and Academic GPTs. We validate our compliance assessment component against a human-annotated ground-truth dataset, achieving an F1 score of 0.975 for binary policy violation detection. We then apply the method in a large-scale empirical study of 782 Custom GPTs retrieved from the GPT Store. The results show that 58.7% of the evaluated GPTs exhibit at least one policy-violating response, with substantial variation across policy domains. A comparison with the base models (GPT-4 and GPT-4o) indicates that most violations originate from model-level behavior, while customization tends to amplify these tendencies rather than create new failure modes. Our findings reveal limitations in current review mechanisms for user-configured chatbots and demonstrate the feasibility of scalable, behavior-based policy compliance evaluation.
Fonte: arXiv cs.CL
MLOps/Systems • Score 95
Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus
arXiv:2411.07892v2 Announce Type: replace
Abstract: Podcasts provide highly diverse content to a massive listener base through a unique on-demand modality. However, limited data has prevented large-scale computational analysis of the podcast ecosystem. To fill this gap, we introduce a massive dataset of over 1.1M podcast transcripts that is largely comprehensive of all English language podcasts available through public RSS feeds from May and June of 2020. This data is not limited to text, but rather includes audio features and speaker turns for a subset of 370K episodes, and speaker role inferences and other metadata for all 1.1M episodes. Using this data, we also conduct a foundational investigation into the content, structure, and responsiveness of this ecosystem. Together, our data and analyses open the door to continued computational research of this popular and impactful medium.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering
arXiv:2512.17396v1 Announce Type: cross
Abstract: In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
LLM-as-a-qualitative-judge: automating error analysis in natural language generation
arXiv:2506.09147v4 Announce Type: replace
Abstract: Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be done to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that instance-specific issues output by LLM-as-a-qualitative-judge match those annotated by humans in 2/3 cases, and that LLM-as-a-qualitative-judge is capable of producing error type reports resembling the reports composed by human annotators. We also demonstrate in a case study how the use of LLM-as-a-qualitative-judge can substantially improve the performance of NLG systems. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.
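The "intuitive cumulative algorithm" is not spelled out in the abstract; the toy sketch below shows one plausible reading, where each newly discovered issue joins the first sufficiently similar cluster or founds a new one. The similarity function here is a lexical stand-in for whatever embedding- or LLM-based matcher the system actually uses.

```python
def cumulative_cluster(issues, similarity, threshold=0.4):
    """Greedy one-pass clustering of discovered issue descriptions."""
    clusters = []                      # each cluster is a list of issues
    for issue in issues:
        for cluster in clusters:
            if similarity(issue, cluster[0]) >= threshold:
                cluster.append(issue)
                break
        else:                          # no cluster matched: start a new one
            clusters.append([issue])
    return clusters

def jaccard(a, b):                     # toy stand-in similarity
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

issues = ["hallucinated entity", "hallucinated entity name", "wrong number format"]
print(cumulative_cluster(issues, jaccard))
```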
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents
arXiv:2503.10689v2 Announce Type: replace
Abstract: Recent advances in large language models (LLMs) have led to a growing interest in developing LLM-based agents for automating web tasks. However, these agents often struggle with even simple tasks on real-world websites due to their limited capability to understand and process complex web page structures. In this work, we introduce LCoW, a framework for Learning language models to Contextualize complex Web pages into a more comprehensible form, thereby enhancing decision making by LLM agents. LCoW decouples web page understanding from decision making by training a separate contextualization module to transform complex web pages into comprehensible format, which are then utilized by the decision-making agent. We demonstrate that our contextualization module effectively integrates with LLM agents of various scales to significantly enhance their decision-making capabilities in web automation tasks. Notably, LCoW improves the success rates of closed-source LLMs (e.g., Gemini-1.5-flash, GPT-4o, Claude-3.5-Sonnet) by an average of 15.6%, and demonstrates a 23.7% average improvement in success rates for open-source LMs (e.g., Llama-3.1-8B, Llama-3.1-70B) on the WorkArena benchmark. Moreover, the Gemini-1.5-flash agent with LCoW achieves state-of-the-art results on the WebShop benchmark, outperforming human experts. The relevant code materials are available at our project page: https://lcowiclr2025.github.io.
Fonte: arXiv cs.CL
RL • Score 96
When Reasoning Meets Its Laws
arXiv:2512.17901v1 Announce Type: new. This paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in Large Reasoning Models (LRMs). We propose the compute law and a supplementary accuracy law, and introduce LoRe-Bench to measure these properties in reasoning models. Evaluations show that most models exhibit reasonable monotonicity but lack compositionality.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens
arXiv:2512.17375v1 Announce Type: new
Abstract: Reward models and LLM-as-a-Judge systems are central to modern post-training pipelines such as RLHF, DPO, and RLAIF, where they provide scalar feedback and binary decisions that guide model selection and RL-based fine-tuning. We show that these judge systems exhibit a recurring vulnerability: short sequences of low-perplexity control tokens can flip many binary evaluations from correct "No" judgments to incorrect "Yes" judgments by steering the last-layer logit gap. These control tokens are patterns that a policy model could plausibly generate during post-training, and thus represent realistic reward-hacking risks rather than worst-case adversarial strings. Our method, AdvJudge-Zero, uses the model's next-token distribution and beam-search exploration to discover diverse control-token sequences from scratch, and our analysis shows that the induced hidden-state perturbations concentrate in a low-rank "soft mode" that is anti-aligned with the judge's refusal direction. Empirically, these tokens cause very high false positive rates when large open-weight and specialized judge models score incorrect answers on math and reasoning benchmarks. Finally, we show that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives while preserving evaluation quality.
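For concreteness, the sketch below shows the quantity the attack is said to steer: the last-layer logit gap between the judge's "Yes" and "No" verdict tokens. The model identifier is a placeholder, and the single-token verdict convention is an assumption about the judging format.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-judge-model")        # placeholder id
model = AutoModelForCausalLM.from_pretrained("your-judge-model")

@torch.no_grad()
def yes_no_gap(prompt: str) -> float:
    ids = tok(prompt, return_tensors="pt").input_ids
    logits = model(ids).logits[0, -1]      # next-token logits at the verdict position
    yes_id, no_id = tok.encode(" Yes")[0], tok.encode(" No")[0]
    return (logits[yes_id] - logits[no_id]).item()

# A successful control-token suffix moves this gap from negative (a correct
# "No") to positive (an incorrect "Yes") without touching the question itself.
```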
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
PAACE: A Planning-Aware Automated Context Engineering Framework
Large Language Models (LLMs) are increasingly used in complex workflows involving planning, tool use, reflection, and interaction with external knowledge systems. This work presents PAACE, a unified framework for optimizing the evolving state of LLM agents through next-task relevance modeling, planning-structure analysis, instruction co-refinement, and function-preserving compression.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters
arXiv:2505.14886v2 Announce Type: replace
Abstract: Winning competitive debates requires sophisticated reasoning and argument skills. There are unique challenges in competitive debate: (1) Time constraints force debaters to make strategic choices about which points to pursue rather than covering all possible arguments; (2) The persuasiveness of the debate relies on the back-and-forth interaction between arguments, which a single final game status cannot evaluate. To address these challenges, we propose TreeDebater, a novel debate framework that excels in competitive debate. We introduce two tree structures: the Rehearsal Tree and Debate Flow Tree. The Rehearsal Tree anticipates attacks and defenses to evaluate the strength of a claim, while the Debate Flow Tree tracks the debate status to identify the active actions. TreeDebater allocates its time budget among candidate actions and uses the speech time controller and feedback from the simulated audience to revise its statement. Human evaluation on both stage-level and debate-level comparisons shows that TreeDebater outperforms the state-of-the-art multi-agent debate system, with a +15.6% improvement in stage-level persuasiveness with DeepSeek and a +10% win rate in debate-level opinion shift. Further investigation shows that TreeDebater employs better strategies in allocating limited time to important debate actions, aligning with the strategies of human debate experts.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
arXiv:2512.17206v1 Announce Type: new
Abstract: Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection for diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable modulation of the (vision-) language model's strategic behavior, thereby achieving consistent performance gains over standard RL methods.
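A minimal sketch of the latent-to-prefix mechanism, under assumed dimensions: a latent sampled via the reparameterization trick is decoded into a few soft-prefix vectors and prepended to the token embeddings. Module names and sizes are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class LatentPrefix(nn.Module):
    def __init__(self, d_latent=64, d_model=768, n_prefix=8):
        super().__init__()
        self.decode = nn.Linear(d_latent, n_prefix * d_model)
        self.n_prefix, self.d_model = n_prefix, d_model

    def forward(self, mu, logvar, token_embeds):
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized sample
        prefix = self.decode(z).view(-1, self.n_prefix, self.d_model)
        return torch.cat([prefix, token_embeds], dim=1)        # prepend soft prefixes

mod = LatentPrefix()
out = mod(torch.zeros(2, 64), torch.zeros(2, 64), torch.randn(2, 10, 768))
print(out.shape)  # torch.Size([2, 18, 768])
```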
Fonte: arXiv cs.CV
NLP/LLMs • Score 92
SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories
arXiv:2512.17419v1 Announce Type: cross
Abstract: Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on Python-based bug fixes. We introduce SWE-Bench++, an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike synthetic approaches, our pipeline harvests live pull requests to cover both bug fixes and feature requests across 11 languages. SWE-Bench++ turns GitHub pull requests (PRs) into reproducible, execution-based tasks via four stages: programmatic sourcing, environment synthesis, test oracle extraction, and quality assurance. A final hint-guided trajectory synthesis step converts instances that strong models fail on into training trajectories. Our initial benchmark consists of 11,133 instances from 3,971 repositories across 11 languages. On a subset of 1,782 instances of this benchmark, today's strongest models perform as follows: claude-sonnet-4.5 achieves 36.20% pass@10, gpt-5-2025-08-07 34.57%, gemini/gemini-2.5-pro 24.92%, and gpt-4o 16.89%. We further demonstrate the utility of our dataset by showing that fine-tuning on SWE-Bench++ instances yields measurable improvements on the SWE-bench Multilingual benchmark. SWE-Bench++ provides a scalable, multilingual benchmark for evaluating and improving repository-level code generation.
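The pass@10 figures above are presumably computed with the standard unbiased estimator from Chen et al. (2021), an assumption the abstract does not confirm: with n samples per task of which c pass, pass@k = 1 - C(n-c, k)/C(n, k).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct."""
    if n - c < k:          # every size-k subset contains a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=20, c=3, k=10), 4))
```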
Fonte: arXiv cs.CL
Vision • Score 95
The Grass Is Not Always Greener on the Other Side: Perception of Green Areas Across Demographics and Personalities in Multiple Cities
Quantifying and assessing urban vegetation is crucial for planning and development, reflecting the continued importance of green spaces for multiple climate and well-being dimensions in cities. This work measures the differences between subjective perceptions and objective measures of green areas, using data from a comprehensive survey of 1,000 people across five countries.
Fonte: arXiv cs.CV
NLP/LLMs • Score 96
Large Language Models as Pokémon Battle Agents: Strategic Play and Content Generation
arXiv:2512.17308v1 Announce Type: new
Abstract: Strategic decision-making in Pokémon battles presents a unique testbed for evaluating large language models. Pokémon battles demand reasoning about type matchups, statistical trade-offs, and risk assessment, skills that mirror human strategic thinking. This work examines whether Large Language Models (LLMs) can serve as competent battle agents, capable of both making tactically sound decisions and generating novel, balanced game content. We developed a turn-based Pokémon battle system where LLMs select moves based on battle state rather than pre-programmed logic. The framework captures essential Pokémon mechanics: type effectiveness multipliers, stat-based damage calculations, and multi-Pokémon team management. Through systematic evaluation across multiple model architectures, we measured win rates, decision latency, type-alignment accuracy, and token efficiency. These results suggest LLMs can function as dynamic game opponents without domain-specific training, offering a practical alternative to reinforcement learning for turn-based strategic games. The dual capability of tactical reasoning and content creation positions LLMs as both players and designers, with implications for procedural generation and adaptive difficulty systems in interactive entertainment.
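As a flavor of the mechanics the framework captures, here is a toy damage routine combining a type-effectiveness multiplier with the classic stat-based formula; the chart and constants are illustrative, not the paper's implementation.

```python
# Toy type chart; unlisted matchups default to neutral (1.0).
TYPE_CHART = {
    ("water", "fire"): 2.0, ("fire", "water"): 0.5,
    ("fire", "grass"): 2.0, ("grass", "water"): 2.0,
}

def effectiveness(move_type: str, defender_type: str) -> float:
    return TYPE_CHART.get((move_type, defender_type), 1.0)

def damage(level, power, attack, defense, move_type, defender_type):
    # Classic stat-based base damage, scaled by type effectiveness.
    base = ((2 * level / 5 + 2) * power * attack / defense) / 50 + 2
    return int(base * effectiveness(move_type, defender_type))

print(damage(50, 90, 110, 80, "water", "fire"))   # super effective hit
```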
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
ShareChat: A Dataset of Chatbot Conversations in the Wild
arXiv:2512.17843v1 Announce Type: new
Abstract: While Large Language Models (LLMs) have evolved into distinct platforms with unique interface designs and capabilities, existing public datasets treat models as generic text generators, stripping away the interface context that actively shapes user interaction. To address this limitation, we present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms: ChatGPT, Claude, Gemini, Perplexity, and Grok. ShareChat distinguishes itself by preserving native platform affordances often lost in standard logs, including reasoning traces, source links, and code artifacts, while spanning 101 languages over the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. We demonstrate the dataset's multifaceted utility through three representative analyses: (1) analyzing conversation completeness to measure user intent satisfaction; (2) evaluating source citation behaviors in content generation; and (3) conducting temporal analysis to track evolving usage patterns. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild.
Fonte: arXiv cs.CL
NLP/LLMs • Score 96
Investigating the Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the capacity to autonomously conceive, investigate, and reason across scientific domains, is still missing. We present an operational definition of SGI grounded in the Practical Inquiry Model (PIM) and operationalize it through four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning.
Fonte: arXiv cs.AI
RL • Score 95
Affect, Body, Cognition, Demographics, and Emotion: The ABCDE of Text Features for Computational Affective Science
arXiv:2512.17752v1 Announce Type: new
Abstract: Work in Computational Affective Science and Computational Social Science explores a wide variety of research questions about people, emotions, behavior, and health. Such work often relies on language data that is first labeled with relevant information, such as the use of emotion words or the age of the speaker. Although many resources and algorithms exist to enable this type of labeling, discovering, accessing, and using them remains a substantial impediment, particularly for practitioners outside of computer science. Here, we present the ABCDE dataset (Affect, Body, Cognition, Demographics, and Emotion), a large-scale collection of over 400 million text utterances drawn from social media, blogs, books, and AI-generated sources. The dataset is annotated with a wide range of features relevant to computational affective and social science. ABCDE facilitates interdisciplinary research across numerous fields, including affective science, cognitive science, the digital humanities, sociology, political science, and computational linguistics.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports
arXiv:2512.17776v1 Announce Type: new
Abstract: As large language models (LLMs) advance, deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert reporting, evaluations that rely heavily on LLM judges can fail to capture issues that require expert judgment, and source verification typically covers only a limited subset of explicitly cited statements rather than report-wide factual reliability. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimensions) operationalized into 130 fine-grained rubric items. DEER further provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies all claims across the entire report, including both cited and uncited ones, and quantifies external-evidence quality. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.
Fonte: arXiv cs.CL
Vision • Score 95
Interpretable Similarity of Synthetic Image Utility
arXiv:2512.17080v1 Announce Type: new
Abstract: Synthetic medical image data can unlock the potential of deep learning (DL)-based clinical decision support (CDS) systems through the creation of large scale, privacy-preserving, training sets. Despite the significant progress in this field, there is still a largely unanswered research question: "How can we quantitatively assess the similarity of a synthetically generated set of images with a set of real images in a given application domain?". Today, answers to this question are mainly provided via user evaluation studies, inception-based measures, and the classification performance achieved on synthetic images. This paper proposes a novel measure to assess the similarity between synthetically generated and real sets of images, in terms of their utility for the development of DL-based CDS systems. Inspired by generalized neural additive models, and unlike inception-based measures, the proposed measure is interpretable (Interpretable Utility Similarity, IUS), explaining why a synthetic dataset could be more useful than another one in the context of a CDS system based on clinically relevant image features. The experimental results on publicly available datasets from various color medical imaging modalities including endoscopic, dermoscopic and fundus imaging, indicate that selecting synthetic images of high utility similarity using IUS can result in relative improvements of up to 54.6% in terms of classification performance. The generality of IUS for synthetic data assessment is demonstrated also for greyscale X-ray and ultrasound imaging modalities. IUS implementation is available at https://github.com/innoisys/ius
Fonte: arXiv cs.CV
NLP/LLMs • Score 95
Toward Ethical AI Through Bayesian Uncertainty in Neural Question Answering
arXiv:2512.17677v1 Announce Type: new
Abstract: We explore Bayesian reasoning as a means to quantify uncertainty in neural networks for question answering. Starting with a multilayer perceptron on the Iris dataset, we show how posterior inference conveys confidence in predictions. We then extend this to language models, applying Bayesian inference first to a frozen head and finally to LoRA-adapted transformers, evaluated on the CommonsenseQA benchmark. Rather than aiming for state-of-the-art accuracy, we compare Laplace approximations against maximum a posteriori (MAP) estimates to highlight uncertainty calibration and selective prediction. This allows models to abstain when confidence is low. An "I don't know" response not only improves interpretability but also illustrates how Bayesian methods can contribute to more responsible and ethical deployment of neural question-answering systems.
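A compact sketch of the selective-prediction idea: average softmax probabilities over posterior samples (e.g., drawn from a Laplace approximation) and abstain when the posterior predictive confidence is low. The threshold and shapes are illustrative, and the posterior-sampling machinery is assumed to exist upstream.

```python
import numpy as np

def selective_answer(sampled_logits, choices, threshold=0.6):
    # sampled_logits: (num_posterior_samples, num_choices)
    probs = np.exp(sampled_logits - sampled_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    mean_probs = probs.mean(axis=0)              # posterior predictive
    best = int(mean_probs.argmax())
    return choices[best] if mean_probs[best] >= threshold else "I don't know"

rng = np.random.default_rng(1)
logits = rng.normal(0.0, 1.0, size=(32, 4))      # 32 posterior samples, 4 answers
print(selective_answer(logits, ["A", "B", "C", "D"]))
```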
Fonte: arXiv cs.CL
Evaluation/Benchmarks • Score 96
Bridging Training and Merging Through Momentum-Aware Optimization
arXiv:2512.17109v1 Announce Type: new
Abstract: Training large neural networks and merging task-specific models both exploit low-rank structure and require parameter importance estimation, yet these challenges have been pursued in isolation. Current workflows compute curvature information during training, discard it, then recompute similar information for merging -- wasting computation and discarding valuable trajectory data. We introduce a unified framework that maintains factorized momentum and curvature statistics during training, then reuses this information for geometry-aware model composition. The proposed method achieves memory efficiency comparable to state-of-the-art approaches while accumulating task saliency scores that enable curvature-aware merging without post-hoc Fisher computation. We establish convergence guarantees for non-convex objectives with approximation error bounded by gradient singular value decay. On natural language understanding benchmarks, curvature-aware parameter selection outperforms magnitude-only baselines across all sparsity levels, with multi-task merging improving over strong baselines. The proposed framework exhibits rank-invariant convergence and superior hyperparameter robustness compared to existing low-rank optimizers. By treating the optimization trajectory as a reusable asset rather than discarding it, our approach eliminates redundant computation while enabling more principled model composition.
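One way to picture curvature-aware parameter selection for merging (a hypothetical reading, since the abstract gives no formulas): reuse accumulated squared-gradient saliencies from training to mask each task vector, keeping only high-saliency entries.

```python
import torch

def merge(base, task_models, saliencies, sparsity=0.5):
    """Merge task deltas into a base state dict, keeping top-saliency entries."""
    merged = {k: v.clone() for k, v in base.items()}
    for model, sal in zip(task_models, saliencies):
        for k in merged:
            delta = model[k] - base[k]
            thresh = torch.quantile(sal[k].flatten(), sparsity)
            mask = (sal[k] >= thresh).float()    # keep high-curvature entries
            merged[k] += mask * delta / len(task_models)
    return merged

base = {"w": torch.zeros(4)}
tasks = [{"w": torch.tensor([1.0, 2.0, 3.0, 4.0])}]
sals = [{"w": torch.tensor([0.1, 0.9, 0.2, 0.8])}]
print(merge(base, tasks, sals)["w"])   # only high-saliency coordinates move
```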
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?
arXiv:2512.17394v1 Announce Type: new
Abstract: Theory of Mind (ToM) -- the ability to attribute beliefs, desires, and emotions to others -- is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, where human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse theory of mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Subjective Question Generation and Answer Evaluation using NLP
arXiv:2512.17289v1 Announce Type: new
Abstract: Natural Language Processing (NLP) is one of the most revolutionary technologies today. It uses artificial intelligence to understand human text and spoken words. It is used for text summarization, grammar checking, sentiment analysis, and advanced chatbots, and has many more potential use cases. Furthermore, it has also made its mark on the education sector. Much research and many advancements have already been made on objective question generation; however, automated subjective question generation and answer evaluation are still in progress. An automated system to generate subjective questions and evaluate the answers can help teachers assess student work and enhance the student's learning experience by allowing them to self-assess their understanding after reading an article or a chapter of a book. This research aims to improve current NLP models or develop a novel one for automated subjective question generation and answer evaluation from text input.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
CIFE: Code Instruction-Following Evaluation
arXiv:2512.17387v1 Announce Type: cross
Abstract: Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment; developers also expect adherence to explicit requirements for robustness, formatting, and security. Existing benchmarks primarily assess correctness through test-case execution, offering limited insight into how reliably models follow such constraints. We introduce a benchmark of 1,000 Python tasks, each paired with an average of 7 developer-specified constraints spanning 13 categories. Constraints are curated through a four-stage human-LLM pipeline to ensure they are atomic, relevant, and objective. We evaluate 14 open- and closed-source models using complementary adherence metrics and propose the C2A Score, a composite measure that jointly captures correctness and constraint compliance. Results reveal a substantial gap between partial and strict satisfaction: while strong models achieve over 90% partial adherence, strict adherence remains between 39% and 66%. These findings highlight that trustworthy code generation requires not only correctness but also consistent adherence to developer intent.
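The abstract does not define the C2A Score; as a loud assumption, one natural composite is a harmonic mean of test-pass rate and constraint adherence, so that neither axis can compensate for the other.

```python
def c2a_score(correctness: float, adherence: float) -> float:
    """Hypothetical composite of correctness and constraint compliance."""
    if correctness + adherence == 0:
        return 0.0
    return 2 * correctness * adherence / (correctness + adherence)

# Strong partial correctness with weak strict adherence drags the score down.
print(round(c2a_score(correctness=0.9, adherence=0.55), 3))
```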
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Bangla MedER: Multi-BERT Ensemble Approach for the Recognition of Bangla Medical Entity
arXiv:2512.17769v1 Announce Type: new
Abstract: Medical Entity Recognition (MedER) is an essential NLP task for extracting meaningful entities from the medical corpus. Nowadays, MedER-based research outcomes can remarkably contribute to the development of automated systems in the medical sector, ultimately enhancing patient care and outcomes. While extensive research has been conducted on MedER in English, low-resource languages like Bangla remain underexplored. Our work aims to bridge this gap. For Bangla medical entity recognition, this study first examined a number of transformer models, including BERT, DistilBERT, ELECTRA, and RoBERTa. We also propose a novel Multi-BERT Ensemble approach that outperformed all baseline models with the highest accuracy of 89.58%. Notably, it provides an 11.80% accuracy improvement over the single-layer BERT model, demonstrating its effectiveness for this task. A major challenge in MedER for low-resource languages is the lack of annotated datasets. To address this issue, we developed a high-quality dataset tailored for the Bangla MedER task. The dataset was used to evaluate the effectiveness of our model through multiple performance metrics, demonstrating its robustness and applicability. Our findings highlight the potential of Multi-BERT Ensemble models in improving MedER for Bangla and set the foundation for further advancements in low-resource medical NLP.
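The combination rule behind the Multi-BERT Ensemble is not given in the abstract; a simple baseline reading is per-token majority voting over the fine-tuned encoders' label sequences, sketched below.

```python
from collections import Counter

def ensemble_tags(per_model_tags):
    """Majority vote per token across aligned label sequences."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*per_model_tags)]

preds = [
    ["B-DISEASE", "O", "B-DRUG"],
    ["B-DISEASE", "O", "O"],
    ["B-DISEASE", "B-SYMPTOM", "B-DRUG"],
]
print(ensemble_tags(preds))  # ['B-DISEASE', 'O', 'B-DRUG']
```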
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models
arXiv:2512.17385v1 Announce Type: new
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness heavily relies on supervised training with extensive labeled (e.g., question-answering pairs) or unlabeled datasets (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, even unlabeled code snippets. We introduce problem-space probing, test-understanding probing, solution-space probing, and knowledge consolidation and reinforcement to probe the internal knowledge and confidence patterns existing in LLMs. Further, IPC identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation to train UCoder (coder with unsupervised learning). We validate the proposed approach across multiple code benchmarks, demonstrating that unsupervised methods can achieve competitive performance compared to supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation tasks, opening new directions for training code LLMs in resource-constrained scenarios.
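A toy rendering of the self-consistency idea for selecting reliable candidates (the representation-based quality estimation is omitted, and the probing stages are assumed to run upstream): execute each candidate on probe inputs and keep the one whose behaviour agrees with the most others.

```python
from collections import Counter

def most_consistent(candidates, run):
    """run(code) -> hashable behaviour signature; pick the majority candidate."""
    sigs = [run(c) for c in candidates]
    majority_sig, _ = Counter(sigs).most_common(1)[0]
    return candidates[sigs.index(majority_sig)]

def run(code, probes=(0, 1, 2, 3)):
    ns = {}
    exec(code, ns)                  # toy only; never exec untrusted code
    return tuple(ns["f"](p) for p in probes)

cands = ["def f(x): return x*2", "def f(x): return x+x", "def f(x): return x**2"]
print(most_consistent(cands, run))  # the two doubling variants agree
```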
Fonte: arXiv cs.CL
Vision • Score 95
Endo-SemiS: Toward Robust Semi-Supervised Image Segmentation for Endoscopic Video
In this paper, we present Endo-SemiS, a semi-supervised segmentation framework that provides reliable segmentation of endoscopic video frames with limited annotation. Endo-SemiS uses four strategies to improve performance, effectively exploiting all available data, especially unlabeled data.
Fonte: arXiv cs.CV