Applications

See the articles under this label, with PT-BR translations.


Articles

🚀 Applications: 169 article(s) found


NLP/LLMs • Score 85

ElecTwit: A Framework for Studying Persuasion in Multi-Agent Social Systems

arXiv:2601.00994v1 Announce Type: new Abstract: This paper introduces ElecTwit, a simulation framework designed to study persuasion within multi-agent systems, specifically emulating the interactions on social media platforms during a political election. By grounding our experiments in a realistic environment, we aimed to overcome the limitations of game-based simulations often used in prior research. We observed the comprehensive use of 25 specific persuasion techniques across most tested LLMs, encompassing a wider range than previously reported. The variations in technique usage and overall persuasion output between models highlight how different model architectures and training can impact the dynamics in realistic social simulations. Additionally, we observed unique phenomena such as "kernel of truth" messages and spontaneous developments with an "ink" obsession, where agents collectively demanded written proof. Our study provides a foundation for evaluating persuasive LLM agents in real-world contexts, ensuring alignment and preventing dangerous outcomes.

Source: arXiv cs.AI

RL • Score 85

Higher-Order Action Regularization in Deep Reinforcement Learning: From Continuous Control to Building Energy Management

Deep reinforcement learning agents often exhibit erratic, high-frequency control behaviors that hinder real-world deployment due to excessive energy consumption and mechanical wear. We systematically investigate action-smoothness regularization via higher-order derivative penalties, validated theoretically on continuous-control benchmarks and in practice on building energy management.
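The exact penalty form is not spelled out in this summary; as a rough illustration, action smoothness can be regularized by penalizing finite differences of the action trajectory. A minimal sketch, where the function name and weights are assumptions rather than the authors' code:

```python
import numpy as np

def smoothness_penalty(actions, w1=0.1, w2=0.05):
    """Penalize first- and second-order finite differences of an
    action trajectory (shape: [T, action_dim]).

    First difference  ~ action velocity (erratic switching),
    second difference ~ action acceleration (jerky control).
    """
    a = np.asarray(actions, dtype=float)
    d1 = np.diff(a, n=1, axis=0)   # a[t+1] - a[t]
    d2 = np.diff(a, n=2, axis=0)   # a[t+2] - 2*a[t+1] + a[t]
    return w1 * np.mean(d1 ** 2) + w2 * np.mean(d2 ** 2)

# A constant trajectory incurs zero penalty; an oscillating one does not.
smooth = np.zeros((10, 2))
jerky = np.tile([[1.0, -1.0], [-1.0, 1.0]], (5, 1))
```

In practice such a term would be subtracted from the reward (or added to the loss) during training, trading a little return for deployable control.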

Source: arXiv cs.AI

NLP/LLMs • Score 85

Agentic AI for Autonomous, Explainable, Real-Time Credit Risk Decision-Making

The sweeping digitalization of financial services has created urgent demand for credit risk decision-making systems that are autonomous, transparent, and real-time. This paper presents an agentic AI framework in which AI agents assess credit dynamically, minimizing human intervention and improving decision speed and transparency.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Simulated Reasoning is Reasoning

arXiv:2601.02043v1 Announce Type: new Abstract: Reasoning has long been understood as a pathway between stages of understanding. Proper reasoning leads to understanding of a given subject. This reasoning was conceptualized as a process of understanding in a particular way, i.e., "symbolic reasoning". Foundational Models (FM) demonstrate that this is not a necessary condition for many reasoning tasks: they can "reason" by way of imitating the process of "thinking out loud", testing the produced pathways, and iterating on these pathways on their own. This leads to some form of reasoning that can solve problems on its own or with few-shot learning, but appears fundamentally different from human reasoning due to its lack of grounding and common sense, leading to brittleness of the reasoning process. These insights promise to substantially alter our assessment of reasoning and its necessary conditions, but also inform the approaches to safety and robust defences against this brittleness of FMs. This paper offers and discusses several philosophical interpretations of this phenomenon, argues that the previously apt metaphor of the "stochastic parrot" has lost its relevance and thus should be abandoned, and reflects on different normative elements in the safety- and appropriateness-considerations emerging from these reasoning models and their growing capacity.

Source: arXiv cs.AI

MLOps/Systems • Score 85

A New Benchmark for the Appropriate Evaluation of RTL Code Optimization

arXiv:2601.01765v1 Announce Type: new Abstract: The rapid progress of artificial intelligence increasingly relies on efficient integrated circuit (IC) design. Recent studies have explored the use of large language models (LLMs) for generating Register Transfer Level (RTL) code, but existing benchmarks mainly evaluate syntactic correctness rather than optimization quality in terms of power, performance, and area (PPA). This work introduces RTL-OPT, a benchmark for assessing the capability of LLMs in RTL optimization. RTL-OPT contains 36 handcrafted digital designs that cover diverse implementation categories including combinational logic, pipelined datapaths, finite state machines, and memory interfaces. Each task provides a pair of RTL codes, a suboptimal version and a human-optimized reference that reflects industry-proven optimization patterns not captured by conventional synthesis tools. Furthermore, RTL-OPT integrates an automated evaluation framework to verify functional correctness and quantify PPA improvements, enabling standardized and meaningful assessment of generative models for hardware design optimization.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Empowering Small Language Models with Factual Hallucination-Aware Reasoning for Financial Classification

arXiv:2601.01378v1 Announce Type: new Abstract: Small language models (SLMs) are increasingly used for financial classification due to their fast inference and local deployability. However, compared with large language models, SLMs are more prone to factual hallucinations in reasoning and exhibit weaker classification performance. This raises a natural question: Can mitigating factual hallucinations improve SLMs' financial classification? To address this, we propose a three-step pipeline named AAAI (Association Identification, Automated Detection, and Adaptive Inference). Experiments on three representative SLMs reveal that: (1) factual hallucinations are positively correlated with misclassifications; (2) encoder-based verifiers effectively detect factual hallucinations; and (3) incorporating feedback on factual errors enables SLMs' adaptive inference that enhances classification performance. We hope this pipeline contributes to trustworthy and effective applications of SLMs in finance.

Source: arXiv cs.AI

Multimodal • Score 85

KGCE: Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models

arXiv:2601.01366v1 Announce Type: new Abstract: With the rapid adoption of multimodal large language models (MLMs) in autonomous agents, cross-platform task execution capabilities in educational settings have garnered significant attention. However, existing benchmark frameworks still exhibit notable deficiencies in supporting cross-platform tasks in educational contexts, especially when dealing with school-specific software (such as XiaoYa Intelligent Assistant, HuaShi XiaZi, etc.), where the efficiency of agents often significantly decreases due to a lack of understanding of the structural specifics of these private-domain software. Additionally, current evaluation methods heavily rely on coarse-grained metrics like goal orientation or trajectory matching, making it challenging to capture the detailed execution and efficiency of agents in complex tasks. To address these issues, we propose KGCE (Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models), a novel benchmarking platform that integrates knowledge base enhancement and a dual-graph evaluation framework. We first constructed a dataset comprising 104 education-related tasks, covering Windows, Android, and cross-platform collaborative tasks. KGCE introduces a dual-graph evaluation framework that decomposes tasks into multiple sub-goals and verifies their completion status, providing fine-grained evaluation metrics. To overcome the execution bottlenecks of existing agents in private-domain tasks, we developed an enhanced agent system incorporating a knowledge base specific to school-specific software. The code can be found at https://github.com/Kinginlife/KGCE.

Source: arXiv cs.AI

RL • Score 85

Reinforcement Learning Enhanced Multi-hop Reasoning for Temporal Knowledge Question Answering

arXiv:2601.01195v1 Announce Type: new Abstract: Temporal knowledge graph question answering (TKGQA) involves multi-hop reasoning over temporally constrained entity relationships in the knowledge graph to answer a given question. However, at each hop, large language models (LLMs) retrieve subgraphs with numerous temporally similar and semantically complex relations, increasing the risk of suboptimal decisions and error propagation. To address these challenges, we propose the multi-hop reasoning enhanced (MRE) framework, which enhances both forward and backward reasoning to improve the identification of globally optimal reasoning trajectories. Specifically, MRE begins with prompt engineering to guide the LLM in generating diverse reasoning trajectories for a given question. Valid reasoning trajectories are then selected for supervised fine-tuning, serving as a cold-start strategy. Finally, we introduce Tree-Group Relative Policy Optimization (T-GRPO), a recursive, tree-structured learning-by-exploration approach. At each hop, exploration establishes strong causal dependencies on the previous hop, while evaluation is informed by multi-path exploration feedback from subsequent hops. Experimental results on two TKGQA benchmarks indicate that the proposed MRE-based model consistently surpasses state-of-the-art (SOTA) approaches in handling complex multi-hop queries. Further analysis highlights improved interpretability and robustness to noisy temporal annotations.

Source: arXiv cs.AI

NLP/LLMs • Score 85

CogCanvas: Verbatim-Grounded Artifact Extraction for Long LLM Conversations

The abstract of arXiv:2601.00821v2 presents CogCanvas, a training-free framework that extracts verbatim-grounded artifacts from long conversations, outperforming traditional methods in precision. On benchmarks, CogCanvas achieves the highest precision among training-free methods, standing out on complex reasoning tasks.

Source: arXiv cs.AI

NLP/LLMs • Score 85

OpenSocInt: A Multi-modal Training Environment for Human-Aware Social Navigation

arXiv:2601.01939v1 Announce Type: new Abstract: In this paper, we introduce OpenSocInt, an open-source software package providing a simulator for multi-modal social interactions and a modular architecture to train social agents. We describe the software package and demonstrate its utility via an experimental protocol based on the task of social navigation. Our framework allows for exploring the use of different perceptual features, their encoding and fusion, as well as the use of different agents. The software is publicly available under the GPL at https://gitlab.inria.fr/robotlearn/OpenSocInt/.

Source: arXiv cs.AI

Multimodal • Score 92

A Unified Multimodal Understanding and Generation Model for Interdisciplinary Scientific Research

Scientific discovery increasingly depends on integrating heterogeneous, high-dimensional data across disciplines. We present FuXi-Uni, a unified native model supporting scientific understanding and high-fidelity multimodal data generation, which aligns interdisciplinary scientific tokens with natural-language tokens and employs a scientific decoder. We validate FuXi-Uni on Earth sciences and biomedicine.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Toward Auditable Neuro-Symbolic Reasoning in Pathology: SQL as an Explicit Trace of Evidence

arXiv:2601.01875v1 Announce Type: new Abstract: Automated pathology image analysis is central to clinical diagnosis, but clinicians still ask which slide features drive a model's decision and why. Vision-language models can produce natural language explanations, but these are often correlational and lack verifiable evidence. In this paper, we introduce an SQL-centered agentic framework that enables both feature measurement and reasoning to be auditable. Specifically, after extracting human-interpretable cellular features, Feature Reasoning Agents compose and execute SQL queries over feature tables to aggregate visual evidence into quantitative findings. A Knowledge Comparison Agent then evaluates these findings against established pathological knowledge, mirroring how pathologists justify diagnoses from measurable observations. Extensive experiments evaluated on two pathology visual question answering datasets demonstrate our method improves interpretability and decision traceability while producing executable SQL traces that link cellular measurements to diagnostic conclusions.
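The schema and query below are invented for illustration, but they show the pattern the abstract describes: measurable cellular features live in a table, and the agent's SQL both computes the finding and serves as its audit trail.

```python
import sqlite3

# Illustrative feature table; the paper's actual schema is not given here.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE cell_features (
    cell_id      INTEGER PRIMARY KEY,
    nucleus_area REAL,     -- um^2
    mitotic      INTEGER   -- 1 if a mitotic figure was detected
)""")
conn.executemany(
    "INSERT INTO cell_features VALUES (?, ?, ?)",
    [(1, 42.0, 0), (2, 88.5, 1), (3, 91.2, 1), (4, 40.3, 0)],
)

# A Feature-Reasoning-Agent-style query: aggregate visual evidence
# into a quantitative finding. The SQL text itself is the evidence trace.
query = """SELECT COUNT(*) AS n_mitotic, AVG(nucleus_area) AS mean_area
           FROM cell_features WHERE mitotic = 1"""
n_mitotic, mean_area = conn.execute(query).fetchone()
```

Because the query is executable, a pathologist can re-run it against the same feature table to verify exactly how the finding was derived.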

Source: arXiv cs.AI

Multimodal • Score 85

OmniNeuro: A Multimodal HCI Framework for Explainable BCI Feedback via Generative AI and Sonification

Although deep learning has improved the decoding accuracy of brain-computer interfaces (BCIs), clinical adoption is hampered by the "black-box" nature of these algorithms, leading to user frustration and poor neuroplasticity outcomes. We propose OmniNeuro, a novel HCI framework that transforms the BCI from a silent decoder into a transparent feedback partner.

Source: arXiv cs.AI

RL • Score 85

Accelerating Monte-Carlo Tree Search with Optimized Posterior Policies

We introduce a recursive, AlphaZero-style Monte-Carlo tree search algorithm called 'RMCTS'. RMCTS's advantage over AlphaZero's MCTS-UCB is speed: it explores the search tree breadth-wise, allowing network inferences to run in large batches and significantly reducing the GPU latency cost.
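The systems idea, batching network inference over a breadth-wise frontier of leaves instead of issuing one call per simulation, can be sketched as follows; the `network` interface and all names are assumptions, not the paper's API.

```python
import numpy as np

def evaluate_batched(leaves, network, batch_size=64):
    """Breadth-wise MCTS trick: rather than one network call per leaf
    (latency-bound), gather a whole frontier of leaf states and run
    the network in large batches (throughput-bound).

    `network` maps an array of states [B, ...] -> (policies, values).
    """
    states = np.stack(list(leaves))
    policies, values = [], []
    for i in range(0, len(states), batch_size):
        p, v = network(states[i:i + batch_size])
        policies.append(p)
        values.append(v)
    return np.concatenate(policies), np.concatenate(values)

# Toy "network": uniform policy over 4 moves, value = mean of the state.
net = lambda s: (np.full((len(s), 4), 0.25), s.mean(axis=1))
pol, val = evaluate_batched([np.ones(3) * k for k in range(100)], net)
```

The same frontier-then-batch pattern is what lets GPU inference amortize its per-call latency across many positions.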

Source: arXiv cs.AI

NLP/LLMs • Score 85

CaveAgent: Transforming LLMs into Stateful Runtime Operators

arXiv:2601.01569v1 Announce Type: new Abstract: LLM-based agents are increasingly capable of complex task execution, yet current agentic systems remain constrained by text-centric paradigms. Traditional approaches rely on procedural JSON-based function calling, which often struggles with long-horizon tasks due to fragile multi-turn dependencies and context drift. In this paper, we present CaveAgent, a framework that transforms the paradigm from "LLM-as-Text-Generator" to "LLM-as-Runtime-Operator." We introduce a Dual-stream Context Architecture that decouples state management into a lightweight semantic stream for reasoning and a persistent, deterministic Python Runtime stream for execution. In addition to leveraging code generation to efficiently resolve interdependent sub-tasks (e.g., loops, conditionals) in a single step, we introduce \textit{Stateful Runtime Management} in CaveAgent. Distinct from existing code-based approaches that remain text-bound and lack the support for external object injection and retrieval, CaveAgent injects, manipulates, and retrieves complex Python objects (e.g., DataFrames, database connections) that persist across turns. This persistence mechanism acts as a high-fidelity external memory to eliminate context drift, avoid catastrophic forgetting, while ensuring that processed data flows losslessly to downstream applications. Comprehensive evaluations on Tau$^2$-bench, BFCL and various case studies across representative SOTA LLMs demonstrate CaveAgent's superiority. Specifically, our framework achieves a 10.5\% success rate improvement on retail tasks and reduces total token consumption by 28.4\% in multi-turn scenarios. On data-intensive tasks, direct variable storage and retrieval reduces token consumption by 59\%, allowing CaveAgent to handle large-scale data that causes context overflow failures in both JSON-based and Code-based agents.
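A toy illustration of the stateful-runtime idea: a persistent namespace into which external objects are injected and from which results are retrieved losslessly across turns. This is a sketch of the concept only, not CaveAgent's actual API.

```python
class StatefulRuntime:
    """Minimal persistent execution stream: one namespace survives
    across turns, so complex objects (DataFrames, connections, ...)
    can be injected, manipulated, and retrieved without round-tripping
    through text."""

    def __init__(self):
        self.ns = {}                 # persists across turns

    def inject(self, name, obj):
        self.ns[name] = obj          # external object injection

    def run(self, code):
        exec(code, self.ns)          # model-generated code executes here

    def retrieve(self, name):
        return self.ns[name]         # lossless hand-off downstream

rt = StatefulRuntime()
rt.inject("orders", [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 5}])
rt.run("total = sum(o['qty'] for o in orders)")   # turn 1
rt.run("total_doubled = total * 2")               # turn 2: state persisted
```

Because `total` never has to be serialized back into the prompt, later turns neither drift from nor re-tokenize earlier results.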

Source: arXiv cs.AI

NLP/LLMs • Score 85

MindChat: A Privacy-preserving Large Language Model for Mental Health Support

arXiv:2601.01993v1 Announce Type: new Abstract: Large language models (LLMs) have shown promise for mental health support, yet training such models is constrained by the scarcity and sensitivity of real counseling dialogues. In this article, we present MindChat, a privacy-preserving LLM for mental health support, together with MindCorpus, a synthetic multi-turn counseling dataset constructed via a multi-agent role-playing framework. To synthesize high-quality counseling data, the developed dialogue-construction framework employs a dual closed-loop feedback design to integrate psychological expertise and counseling techniques through role-playing: (i) turn-level critique-and-revision to improve coherence and counseling appropriateness within a session, and (ii) session-level strategy refinement to progressively enrich counselor behaviors across sessions. To mitigate privacy risks under decentralized data ownership, we fine-tune the base model using federated learning with parameter-efficient LoRA adapters and incorporate differentially private optimization to reduce membership and memorization risks. Experiments on synthetic-data quality assessment and counseling capability evaluation show that MindCorpus improves training effectiveness and that MindChat is competitive with existing general and counseling-oriented LLM baselines under both automatic LLM-judge and human evaluation protocols, while exhibiting reduced privacy leakage under membership inference attacks.

Source: arXiv cs.AI

Multimodal • Score 85

MMP-A*: Multimodal Perception Enhanced Incremental Heuristic Search on Path Planning

arXiv:2601.01910v1 Announce Type: new Abstract: Autonomous path planning requires a synergy between global reasoning and geometric precision, especially in complex or cluttered environments. While classical A* is valued for its optimality, it incurs prohibitive computational and memory costs in large-scale scenarios. Recent attempts to mitigate these limitations by using Large Language Models for waypoint guidance remain insufficient, as they rely only on text-based reasoning without spatial grounding. As a result, such models often produce incorrect waypoints in topologically complex environments with dead ends, and lack the perceptual capacity to interpret ambiguous physical boundaries. These inconsistencies lead to costly corrective expansions and undermine the intended computational efficiency. We introduce MMP-A*, a multimodal framework that integrates the spatial grounding capabilities of vision-language models with a novel adaptive decay mechanism. By anchoring high-level reasoning in physical geometry, the framework produces coherent waypoint guidance that addresses the limitations of text-only planners. The adaptive decay mechanism dynamically regulates the influence of uncertain waypoints within the heuristic, ensuring geometric validity while substantially reducing memory overhead. To evaluate robustness, we test the framework in challenging environments characterized by severe clutter and topological complexity. Experimental results show that MMP-A* achieves near-optimal trajectories with significantly reduced operational costs, demonstrating its potential as a perception-grounded and computationally efficient paradigm for autonomous navigation.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Admissibility Alignment

This paper introduces Admissibility Alignment: a reformulation of AI alignment as a property of selecting admissible actions and decisions over outcome distributions under uncertainty, evaluated through the behavior of candidate policies.

Source: arXiv cs.AI

Theory/Optimization • Score 85

CNC-TP: Classifier Nominal Concept Based on Top-Pertinent Attributes

arXiv:2601.01976v1 Announce Type: new Abstract: Knowledge Discovery in Databases (KDD) aims to exploit the vast amounts of data generated daily across various domains of computer applications. Its objective is to extract hidden and meaningful knowledge from datasets through a structured process comprising several key steps: data selection, preprocessing, transformation, data mining, and visualization. Among the core data mining techniques are classification and clustering. Classification involves predicting the class of new instances using a classifier trained on labeled data. Several approaches have been proposed in the literature, including Decision Tree Induction, Bayesian classifiers, Nearest Neighbor search, Neural Networks, Support Vector Machines, and Formal Concept Analysis (FCA). The latter is recognized as an effective approach for interpretable and explainable learning. It is grounded in the mathematical structure of the concept lattice, which enables the generation of formal concepts and the discovery of hidden relationships among them. In this paper, we present a state-of-the-art review of FCA-based classifiers. We explore various methods for computing closure operators from nominal data and introduce a novel approach for constructing a partial concept lattice that focuses on the most relevant concepts. Experimental results are provided to demonstrate the efficiency of the proposed method.

Source: arXiv cs.AI

NLP/LLMs • Score 85

XAI-MeD: Explainable Knowledge Guided Neuro-Symbolic Framework for Domain Generalization and Rare Class Detection in Medical Imaging

arXiv:2601.02008v1 Announce Type: new Abstract: Explainability, domain generalization, and rare-class reliability are critical challenges in medical AI, where deep models often fail under real-world distribution shifts and exhibit bias against infrequent clinical conditions. This paper introduces XAI-MeD, an explainable medical AI framework that integrates clinically accurate expert knowledge into deep learning through a unified neuro-symbolic architecture. XAI-MeD is designed to improve robustness under distribution shift, enhance rare-class sensitivity, and deliver transparent, clinically aligned interpretations. The framework encodes clinical expertise as logical connectives over atomic medical propositions, transforming them into machine-checkable, class-specific rules. Their diagnostic utility is quantified through weighted feature-satisfaction scores, enabling a symbolic reasoning branch that complements neural predictions. A confidence-weighted fusion integrates symbolic and deep outputs, while a Hunt-inspired adaptive routing mechanism guided by Entropy Imbalance Gain (EIG) and Rare-Class Gini mitigates class imbalance, high intra-class variability, and uncertainty. We evaluate XAI-MeD across diverse modalities on four challenging tasks, including (i) Seizure Onset Zone (SOZ) localization from rs-fMRI and (ii) Diabetic Retinopathy grading; experiments across 6 multicenter datasets demonstrate substantial performance improvements, including 6 percent gains in cross-domain generalization and a 10 percent improved rare-class F1 score, far outperforming state-of-the-art deep learning baselines. Ablation studies confirm that the clinically grounded symbolic components act as effective regularizers, ensuring robustness to distribution shifts. XAI-MeD thus provides a principled, clinically faithful, and interpretable approach to multimodal medical AI.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Improving Behavioral Alignment in LLM Social Simulations via Context Formation and Navigation

arXiv:2601.01546v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to simulate human behavior in experimental settings, but they systematically diverge from human decisions in complex decision-making environments, where participants must anticipate others' actions and form beliefs based on observed behavior. We propose a two-stage framework for improving behavioral alignment. The first stage, context formation, explicitly specifies the experimental design to establish an accurate representation of the decision task and its context. The second stage, context navigation, guides the reasoning process within that representation to make decisions. We validate this framework through a focal replication of a sequential purchasing game with quality signaling (Kremer and Debo, 2016), extending to a crowdfunding game with costly signaling (Cason et al., 2025) and a demand-estimation task (Gui and Toubia, 2025) to test generalizability across decision environments. Across four state-of-the-art (SOTA) models (GPT-4o, GPT-5, Claude-4.0-Sonnet-Thinking, DeepSeek-R1), we find that complex decision-making environments require both stages to achieve behavioral alignment with human benchmarks, whereas the simpler demand-estimation task requires only context formation. Our findings clarify when each stage is necessary and provide a systematic approach for designing and diagnosing LLM social simulations as complements to human subjects in behavioral research.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Cultural Encoding in Large Language Models: The Existence Gap in AI-Mediated Brand Discovery

arXiv:2601.00869v1 Announce Type: new Abstract: As artificial intelligence systems increasingly mediate consumer information discovery, brands face algorithmic invisibility. This study investigates Cultural Encoding in Large Language Models (LLMs) -- systematic differences in brand recommendations arising from training data composition. Analyzing 1,909 pure-English queries across 6 LLMs (GPT-4o, Claude, Gemini, Qwen3, DeepSeek, Doubao) and 30 brands, we find Chinese LLMs exhibit 30.6 percentage points higher brand mention rates than International LLMs (88.9% vs. 58.3%, p<.001). This disparity persists in identical English queries, indicating training data geography -- not language -- drives the effect. We introduce the Existence Gap: brands absent from LLM training corpora lack "existence" in AI responses regardless of quality. Through a case study of Zhizibianjie (OmniEdge), a collaboration platform with 65.6% mention rate in Chinese LLMs but 0% in International models (p<.001), we demonstrate how Linguistic Boundary Barriers create invisible market entry obstacles. Theoretically, we contribute the Data Moat Framework, conceptualizing AI-visible content as a VRIN strategic resource. We operationalize Algorithmic Omnipresence -- comprehensive brand visibility across LLM knowledge bases -- as the strategic objective for Generative Engine Optimization (GEO). Managerially, we provide an 18-month roadmap for brands to build Data Moats through semantic coverage, technical depth, and cultural localization. Our findings reveal that in AI-mediated markets, the limits of a brand's "Data Boundaries" define the limits of its "Market Frontiers."

Source: arXiv cs.AI

NLP/LLMs • Score 85

Clinical Knowledge Graph Construction and Evaluation with Multi-LLMs via Retrieval-Augmented Generation

arXiv:2601.01844v1 Announce Type: new Abstract: Large language models (LLMs) offer new opportunities for constructing knowledge graphs (KGs) from unstructured clinical narratives. However, existing approaches often rely on structured inputs and lack robust validation of factual accuracy and semantic consistency, limitations that are especially problematic in oncology. We introduce an end-to-end framework for clinical KG construction and evaluation directly from free text using multi-agent prompting and a schema-constrained Retrieval-Augmented Generation (KG-RAG) strategy. Our pipeline integrates (1) prompt-driven entity, attribute, and relation extraction; (2) entropy-based uncertainty scoring; (3) ontology-aligned RDF/OWL schema generation; and (4) multi-LLM consensus validation for hallucination detection and semantic refinement. Beyond static graph construction, the framework supports continuous refinement and self-supervised evaluation, enabling iterative improvement of graph quality. Applied to two oncology cohorts (PDAC and BRCA), our method produces interpretable, SPARQL-compatible, and clinically grounded knowledge graphs without relying on gold-standard annotations. Experimental results demonstrate consistent gains in precision, relevance, and ontology compliance over baseline methods.
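The entropy-based uncertainty scoring step can be illustrated with plain Shannon entropy over an extractor's candidate distribution; the probabilities and any flagging threshold below are hypothetical.

```python
import math

def entropy_score(probs):
    """Shannon entropy (nats) of a candidate-extraction distribution.
    High entropy means the extractor is torn between candidates, so the
    triple would be routed to consensus validation rather than accepted."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.97, 0.02, 0.01]   # one dominant relation candidate
uncertain = [0.40, 0.35, 0.25]   # competing candidates -> flag for review
```

A pipeline would compare `entropy_score` against a tuned threshold to decide which extracted triples need multi-LLM consensus checking.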

Source: arXiv cs.AI

NLP/LLMs • Score 85

Can Large Language Models Solve Engineering Equations? A Systematic Comparison of Direct Prediction and Solver-Assisted Approaches

arXiv:2601.01774v1 Announce Type: new Abstract: Transcendental equations requiring iterative numerical solution pervade engineering practice, from fluid mechanics friction factor calculations to orbital position determination. We systematically evaluate whether Large Language Models can solve these equations through direct numerical prediction or whether a hybrid architecture combining LLM symbolic manipulation with classical iterative solvers proves more effective. Testing six state-of-the-art models (GPT-5.1, GPT-5.2, Gemini-3-Flash, Gemini-2.5-Lite, Claude-Sonnet-4.5, Claude-Opus-4.5) on 100 problems spanning seven engineering domains, we compare direct prediction against solver-assisted computation where LLMs formulate governing equations and provide initial conditions while Newton-Raphson iteration performs numerical solution. Direct prediction yields mean relative errors of 0.765 to 1.262 across models, while solver-assisted computation achieves 0.225 to 0.301, representing error reductions of 67.9% to 81.8%. Domain-specific analysis reveals dramatic improvements in Electronics (93.1%) due to exponential equation sensitivity, contrasted with modest gains in Fluid Mechanics (7.2%) where LLMs exhibit effective pattern recognition. These findings establish that contemporary LLMs excel at symbolic manipulation and domain knowledge retrieval but struggle with precision-critical iterative arithmetic, suggesting their optimal deployment as intelligent interfaces to classical numerical solvers rather than standalone computational engines.
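The division of labor in the solver-assisted setup is easy to sketch: the model supplies the governing equation, its derivative, and an initial guess symbolically, while Newton-Raphson performs the precision-critical iteration. The example equation here is chosen for illustration, not taken from the paper's test set.

```python
import math

def newton(f, df, x0, tol=1e-10, max_iter=50):
    """Classical Newton-Raphson iteration. In the hybrid architecture
    the LLM only formulates f, df, and x0; the numerically sensitive
    arithmetic happens here, not inside the model."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Transcendental example: solve x = e^{-x}, i.e. f(x) = x - e^{-x} = 0
root = newton(lambda x: x - math.exp(-x),
              lambda x: 1 + math.exp(-x),
              x0=0.5)
```

This mirrors the paper's finding: the solver converges to machine precision in a few iterations, whereas direct numeric prediction by the model has no such guarantee.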

Source: arXiv cs.AI

NLP/LLMs • Score 85

Jenius Agent: Towards Experience-Driven Accuracy Optimization in Real-World Scenarios

arXiv:2601.01857v1 Announce Type: new Abstract: As agent systems powered by large language models (LLMs) advance, improving the task performance of an autonomous agent, especially in context understanding, tool usage, and response generation, has become increasingly critical. Although prior studies have advanced the overall design of LLM-based agents, systematic optimization of their internal reasoning and tool-use pipelines remains underexplored. This paper introduces an agent framework grounded in real-world practical experience, with three key innovations: (1) an adaptive prompt generation strategy that aligns with the agent's state and task goals to improve reliability and robustness; (2) a context-aware tool orchestration module that performs tool categorization, semantic retrieval, and adaptive invocation based on user intent and context; and (3) a layered memory mechanism that integrates session memory, task history, and external summaries to improve relevance and efficiency through dynamic summarization and compression. An end-to-end framework named Jenius-Agent has been integrated with three key optimizations, including tools based on the Model Context Protocol (MCP), file input/output (I/O), and execution feedback. The experiments show a 20 percent improvement in task accuracy, along with a reduced token cost, response latency, and invocation failures. The framework is already deployed in Jenius (https://www.jenius.cn), providing a lightweight and scalable solution for robust, protocol-compatible autonomous agents.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications

arXiv:2601.01718v1 Announce Type: new Abstract: We introduce Yuan3.0 Flash, an open-source Mixture-of-Experts (MoE) MultiModal Large Language Model featuring 3.7B activated parameters and 40B total parameters, specifically designed to enhance performance on enterprise-oriented tasks while maintaining competitive capabilities on general-purpose tasks. To address the overthinking phenomenon commonly observed in Large Reasoning Models (LRMs), we propose Reflection-aware Adaptive Policy Optimization (RAPO), a novel RL training algorithm that effectively regulates overthinking behaviors. In enterprise-oriented tasks such as retrieval-augmented generation (RAG), complex table understanding, and summarization, Yuan3.0 Flash consistently achieves superior performance. Moreover, it also demonstrates strong reasoning capabilities in domains such as mathematics and science, attaining accuracy comparable to frontier models while requiring only approximately 1/4 to 1/2 of the average tokens. Yuan3.0 Flash has been fully open-sourced to facilitate further research and real-world deployment: https://github.com/Yuan-lab-LLM/Yuan3.0.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Structured Decomposition for LLM Reasoning: Cross-Domain Validation and Semantic Web Integration

arXiv:2601.01609v1 Announce Type: new Abstract: Rule-based reasoning over natural language input arises in domains where decisions must be auditable and justifiable: clinical protocols specify eligibility criteria in prose, evidence rules define admissibility through textual conditions, and scientific standards dictate methodological requirements. Applying rules to such inputs demands both interpretive flexibility and formal guarantees. Large language models (LLMs) provide flexibility but cannot ensure consistent rule application; symbolic systems provide guarantees but require structured input. This paper presents an integration pattern that combines these strengths: LLMs serve as ontology population engines, translating unstructured text into ABox assertions according to expert-authored TBox specifications, while SWRL-based reasoners apply rules with deterministic guarantees. The framework decomposes reasoning into entity identification, assertion extraction, and symbolic verification, with task definitions grounded in OWL 2 ontologies. Experiments across three domains (legal hearsay determination, scientific method-task application, clinical trial eligibility) and eleven language models validate the approach. Structured decomposition achieves statistically significant improvements over few-shot prompting in aggregate, with gains observed across all three domains. An ablation study confirms that symbolic verification provides substantial benefit beyond structured prompting alone. The populated ABox integrates with standard semantic web tooling for inspection and querying, positioning the framework for richer inference patterns that simpler formalisms cannot express.
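
The decomposition described above (entity identification, assertion extraction, symbolic verification) can be illustrated with the legal hearsay domain. The sketch below is a toy with hypothetical names: the LLM population step is stubbed with keyword matching, while the rule stage stays deterministic, mirroring how a SWRL reasoner would apply it:

```python
def extract_assertions(text):
    """Stand-in for the LLM stages: identify entities and emit ABox-style facts."""
    facts = set()
    if "out of court" in text:
        facts.add(("stmt1", "madeOutOfCourt", True))
    if "for its truth" in text:
        facts.add(("stmt1", "offeredForTruth", True))
    return facts

def hearsay_rule(facts):
    """Symbolic stage: madeOutOfCourt(s) AND offeredForTruth(s) -> Hearsay(s)."""
    return (("stmt1", "madeOutOfCourt", True) in facts
            and ("stmt1", "offeredForTruth", True) in facts)

text = "A remark made out of court was offered for its truth at trial."
is_hearsay = hearsay_rule(extract_assertions(text))
```

The key property the paper exploits survives even in this toy: however flexible (or noisy) the extraction stage is, the rule application itself is deterministic and auditable.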

Source: arXiv cs.AI

NLP/LLMs • Score 90

Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement

arXiv:2601.01562v1 Announce Type: new Abstract: We present Logics-STEM, a state-of-the-art reasoning model fine-tuned on Logics-STEM-SFT-Dataset, a high-quality and diverse dataset at 10M scale that represents one of the largest-scale open-source long chain-of-thought corpora. Logics-STEM targets reasoning tasks in the domains of Science, Technology, Engineering, and Mathematics (STEM), and exhibits exceptional performance on STEM-related benchmarks with an average improvement of 4.68% over the next-best model at 8B scale. We attribute the gains to our data-algorithm co-design engine, where they are jointly optimized to fit a gold-standard distribution behind reasoning. Data-wise, the Logics-STEM-SFT-Dataset is constructed from a meticulously designed data curation engine with 5 stages to ensure the quality, diversity, and scalability, including annotation, deduplication, decontamination, distillation, and stratified sampling. Algorithm-wise, our failure-driven post-training framework leverages targeted knowledge retrieval and data synthesis around model failure regions in the Supervised Fine-tuning (SFT) stage to effectively guide the second-stage SFT or the reinforcement learning (RL) for better fitting the target distribution. The superior empirical performance of Logics-STEM reveals the vast potential of combining large-scale open-source data with carefully designed synthetic data, underscoring the critical role of data-algorithm co-design in enhancing reasoning capabilities through post-training. We make both the Logics-STEM models (8B and 32B) and the Logics-STEM-SFT-Dataset (10M and downsampled 2.2M versions) publicly available to support future research in the open-source community.

Source: arXiv cs.AI

NLP/LLMs • Score 85

AI Agent Systems: Architectures, Applications, and Evaluation

arXiv:2601.01743v1 Announce Type: new Abstract: AI agents, systems that combine foundation models with reasoning, planning, memory, and tool use, are rapidly becoming a practical interface between natural-language intent and real-world computation. This survey synthesizes the emerging landscape of AI agent architectures across: (i) deliberation and reasoning (e.g., chain-of-thought-style decomposition, self-reflection and verification, and constraint-aware decision making), (ii) planning and control (from reactive policies to hierarchical and multi-step planners), and (iii) tool calling and environment interaction (retrieval, code execution, APIs, and multimodal perception). We organize prior work into a unified taxonomy spanning agent components (policy/LLM core, memory, world models, planners, tool routers, and critics), orchestration patterns (single-agent vs. multi-agent; centralized vs. decentralized coordination), and deployment settings (offline analysis vs. online interactive assistance; safety-critical vs. open-ended tasks). We discuss key design trade-offs (latency vs. accuracy, autonomy vs. controllability, and capability vs. reliability) and highlight how evaluation is complicated by non-determinism, long-horizon credit assignment, tool and environment variability, and hidden costs such as retries and context growth. Finally, we summarize measurement and benchmarking practices (task suites, human preference and utility metrics, success under constraints, robustness and security) and identify open challenges including verification and guardrails for tool actions, scalable memory and context management, interpretability of agent decisions, and reproducible evaluation under realistic workloads.

Source: arXiv cs.AI

Applications • Score 85

PsychEval: A Multi-Session and Multi-Therapy Benchmark for High-Realism AI Psychological Counselor

arXiv:2601.01802v2 Announce Type: new Abstract: To develop a reliable AI for psychological assessment, we introduce PsychEval, a multi-session, multi-therapy, and highly realistic benchmark designed to address three key challenges. 1) Can we train a highly realistic AI counselor? Realistic counseling is a longitudinal task requiring sustained memory and dynamic goal tracking. We propose a multi-session benchmark (spanning 6-10 sessions across three distinct stages) that demands critical capabilities such as memory continuity, adaptive reasoning, and longitudinal planning. The dataset is annotated with extensive professional skills, comprising over 677 meta-skills and 4577 atomic skills. 2) How to train a multi-therapy AI counselor? While existing models often focus on a single therapy, complex cases frequently require flexible strategies among various therapies. We construct a diverse dataset covering five therapeutic modalities (Psychodynamic, Behaviorism, CBT, Humanistic Existentialist, and Postmodernist) alongside an integrative therapy with a unified three-stage clinical framework across six core psychological topics. 3) How to systematically evaluate an AI counselor? We establish a holistic evaluation framework with 18 therapy-specific and therapy-shared metrics across Client-Level and Counselor-Level dimensions. To support this, we also construct over 2,000 diverse client profiles. Extensive experimental analysis fully validates the superior quality and clinical fidelity of our dataset. Crucially, PsychEval transcends static benchmarking to serve as a high-fidelity reinforcement learning environment that enables the self-evolutionary training of clinically responsible and adaptive AI counselors.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Digital Twin AI: Opportunities and Challenges from Large Language Models to World Models

arXiv:2601.01321v1 Announce Type: new Abstract: Digital twins, as precise digital representations of physical systems, have evolved from passive simulation tools into intelligent and autonomous entities through the integration of artificial intelligence technologies. By synthesizing existing technologies and practices, we distill a unified four-stage framework that systematically characterizes how AI methodologies are embedded across the digital twin lifecycle: (1) modeling the physical twin through physics-based and physics-informed AI approaches, (2) mirroring the physical system into a digital twin with real-time synchronization, (3) intervening in the physical twin through predictive modeling, anomaly detection, and optimization strategies, and (4) achieving autonomous management through large language models, foundation models, and intelligent agents. We analyze the synergy between physics-based modeling and data-driven learning, highlighting the shift from traditional numerical solvers to physics-informed and foundation models for physical systems. Furthermore, we examine how generative AI technologies, including large language models and generative world models, transform digital twins into proactive and self-improving cognitive systems capable of reasoning, communication, and creative scenario generation. Through a cross-domain review spanning eleven application domains, including healthcare, aerospace, smart manufacturing, robotics, and smart cities, we identify common challenges related to scalability, explainability, and trustworthiness, and outline directions for responsible AI-driven digital twin systems.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Bayesian Orchestration of Multi-LLM Agents for Cost-Aware Sequential Decision-Making

arXiv:2601.01522v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as autonomous decision agents in settings with asymmetric error costs: hiring (missed talent vs wasted interviews), medical triage (missed emergencies vs unnecessary escalation), and fraud detection (approved fraud vs declined legitimate payments). The dominant design queries a single LLM for a posterior over states, thresholds "confidence," and acts; we prove this is inadequate for sequential decisions with costs. We propose a Bayesian, cost-aware multi-LLM orchestration framework that treats LLMs as approximate likelihood models rather than classifiers. For each candidate state, we elicit likelihoods via contrastive prompting, aggregate across diverse models with robust statistics, and update beliefs with Bayes rule under explicit priors as new evidence arrives. This enables coherent belief updating, expected-cost action selection, principled information gathering via value of information, and fairness gains via ensemble bias mitigation. In resume screening with costs of 40000 USD per missed hire, 2500 USD per interview, and 150 USD per phone screen, experiments on 1000 resumes using five LLMs (GPT-4o, Claude 4.5 Sonnet, Gemini Pro, Grok, DeepSeek) reduce total cost by 294000 USD (34 percent) versus the best single-LLM baseline and improve demographic parity by 45 percent (max group gap 22 to 5 percentage points). Ablations attribute 51 percent of savings to multi-LLM aggregation, 43 percent to sequential updating, and 20 percent to disagreement-triggered information gathering, consistent with the theoretical benefits of correct probabilistic foundations.
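
The core loop, treating each LLM as an approximate likelihood model and choosing the action that minimizes expected cost, can be sketched as follows. The state names, likelihood values, and cost figures below are illustrative stand-ins, not the paper's elicited values:

```python
def bayes_update(prior, likelihoods):
    """Posterior proportional to prior x P(evidence | state), renormalized."""
    unnorm = {s: prior[s] * likelihoods[s] for s in prior}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

def expected_cost_action(belief, costs):
    """Pick the action minimizing sum_s belief(s) * cost(action, s)."""
    return min(costs, key=lambda a: sum(belief[s] * costs[a][s] for s in belief))

# Hiring toy: two states, evidence from two LLMs elicited via contrastive
# prompting (stubbed here as fixed per-state likelihood values).
belief = {"strong_hire": 0.3, "weak_hire": 0.7}
for lik in [{"strong_hire": 0.8, "weak_hire": 0.3},   # model A
            {"strong_hire": 0.7, "weak_hire": 0.4}]:  # model B
    belief = bayes_update(belief, lik)

# Asymmetric costs in USD: rejecting a strong candidate is far costlier
# than scheduling one more interview.
costs = {
    "interview": {"strong_hire": 2500, "weak_hire": 2500},
    "reject":    {"strong_hire": 40000, "weak_hire": 0},
}
action = expected_cost_action(belief, costs)
```

Because beliefs and costs are kept separate, the same belief can drive different actions under different cost structures, which is exactly what a confidence threshold on a single classifier cannot express.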

Source: arXiv cs.AI

Applications • Score 89

MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing

arXiv:2601.00204v1 Announce Type: new Abstract: 3D morphing remains challenging due to the difficulty of generating semantically consistent and temporally smooth deformations, especially across categories. We present MorphAny3D, a training-free framework that leverages Structured Latent (SLAT) representations for high-quality 3D morphing. Our key insight is that intelligently blending source and target SLAT features within the attention mechanisms of 3D generators naturally produces plausible morphing sequences. To this end, we introduce Morphing Cross-Attention (MCA), which fuses source and target information for structural coherence, and Temporal-Fused Self-Attention (TFSA), which enhances temporal consistency by incorporating features from preceding frames. An orientation correction strategy further mitigates the pose ambiguity within the morphing steps. Extensive experiments show that our method generates state-of-the-art morphing sequences, even for challenging cross-category cases. MorphAny3D further supports advanced applications such as decoupled morphing and 3D style transfer, and can be generalized to other SLAT-based generative models. Project page: https://xiaokunsun.github.io/MorphAny3D.github.io/.

Source: arXiv cs.CV

RL • Score 96

Can Semantic Methods Enhance Tactics in Team Sports? A Methodology for Football with Broader Applications

This paper explores how semantic-space reasoning, traditionally used in computational linguistics, can be extended to tactical decision-making in team sports. The proposed methodology models tactical configurations as compositional semantic structures, representing each player as a multidimensional vector that integrates technical, physical, and psychological attributes.

Source: arXiv cs.AI

Vision • Score 96

Adaptive Causal Coordination Detection for Social Media: A Memory-Guided Framework with Semi-Supervised Learning

Detecting coordinated inauthentic behavior on social media is a critical challenge. We propose the Adaptive Causal Coordination Detection (ACCD) framework, which uses a progressive three-stage architecture to learn and retain optimized detection configurations. ACCD improves the identification of causal relationships and reduces the need for manual labeling, achieving an F1-score of 87.3% on coordinated-attack detection.

Source: arXiv cs.AI

Vision • Score 96

Evaluating Anomaly Detectors for Simulated, Highly Imbalanced Industrial Classification Problems

Machine learning offers potential solutions to current problems in industrial systems, such as quality control and predictive maintenance, but faces unique barriers in industrial applications. This paper presents a comprehensive evaluation of anomaly detection algorithms on a simulated dataset that reflects real-world engineering constraints.

Source: arXiv cs.AI

RL • Score 93

Next-Generation Intelligent Low-Altitude Economy Deployments: The O-RAN Perspective

Despite growing interest in low-altitude economy (LAE) applications such as UAV-based logistics and emergency response, fundamental challenges remain in orchestrating these missions in complex, signal-constrained environments. This paper presents an O-RAN-enabled LAE framework that optimizes critical operations through coordination between the disaggregated RAN architecture and intelligent controllers.

Source: arXiv cs.AI

RL • Score 96

Early Prediction of Liver Cirrhosis Up to Three Years in Advance: A Machine Learning Study Compared with FIB-4

Objective: To develop and evaluate machine learning (ML) models that predict incident liver cirrhosis one, two, and three years before diagnosis using routinely collected electronic health record (EHR) data, and to compare their performance with the FIB-4 score. Methods: We conducted a retrospective cohort study using de-identified EHR data from a large academic health system.

Source: arXiv cs.LG

RL • Score 96

Can Optimal Transport Improve Federated Inverse Reinforcement Learning?

In this paper, we introduce an optimal-transport-based approach to federated inverse reinforcement learning (IRL). Each client locally performs Maximum Entropy IRL, respecting its computational and privacy constraints. The resulting reward functions are fused via a Wasserstein barycenter, which accounts for their underlying geometric structure. This work offers a communication-efficient framework for deriving a shared reward that generalizes across heterogeneous agents and environments.
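
The barycentric fusion step has a simple closed form in one dimension: the W2 barycenter of 1-D empirical measures with equal sample counts is the weighted elementwise average of their sorted samples (i.e., of their quantile functions). A minimal sketch under that assumption, with hypothetical client data:

```python
def wasserstein_barycenter_1d(sample_sets, weights=None):
    """W2 barycenter of 1-D empirical measures with equal sample counts:
    weighted elementwise average of the sorted samples (quantiles)."""
    k, n = len(sample_sets), len(sample_sets[0])
    assert all(len(s) == n for s in sample_sets)
    weights = weights or [1.0 / k] * k
    sorted_sets = [sorted(s) for s in sample_sets]
    return [sum(w * s[i] for w, s in zip(weights, sorted_sets))
            for i in range(n)]

# Two clients' locally learned reward samples over the same set of states.
client_a = [2.0, 0.0, 1.0]
client_b = [4.0, 6.0, 5.0]
shared_reward = wasserstein_barycenter_1d([client_a, client_b])
```

Unlike a naive pointwise average of densities, the barycenter interpolates distributions geometrically, which is what makes the fused reward respect the structure of each client's solution.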

Source: arXiv cs.LG

RL • Score 96

Sequential Reservoir Computing for Efficient High-Dimensional Spatiotemporal Forecasting

Forecasting high-dimensional spatiotemporal systems remains a computational challenge for recurrent neural networks (RNNs) and long short-term memory (LSTM) models. We introduce a Sequential Reservoir Computing (Sequential RC) architecture that decomposes a large reservoir into a series of smaller, interconnected reservoirs, improving efficiency and reducing computational cost.

Source: arXiv cs.LG

RL • Score 96

A Comparative Analysis of Interpretable Machine Learning Methods

In recent years, machine learning (ML) has been widely adopted across many sectors, including critical areas such as healthcare, finance, and law. This growing reliance has raised concerns about model interpretability and accountability, especially as legal and regulatory constraints are imposed on the use of black-box models. This study presents a comparative evaluation of 16 inherently interpretable methods across 216 real-world tabular datasets.

Source: arXiv cs.LG

Vision • Score 95

Sparse Additive Contextual Bandits: A Nonparametric Approach to Online Decision-Making with High-Dimensional Covariates

Personalized services are central to today's digital economy, and their sequential decisions are often modeled as contextual bandits. Modern applications face two main challenges: high-dimensional covariates and the need for nonparametric models to capture complex relationships between reward and covariates. We propose a contextual bandit algorithm based on a sparse additive reward model that addresses both challenges.

Source: arXiv stat.ML

NLP/LLMs • Score 96

Neural Chains and Discrete Dynamical Systems

In this work, we examine the analogy between machine learning (ML) applications based on the transformer architecture without self-attention, termed neural chains, and discrete dynamical systems associated with discretized versions of neural integral and partial differential equations (NIEs, PDEs). We present a comparative analysis of the numerical solution of the Burgers and Eikonal equations via standard numerical discretization and PINN learning.

Source: arXiv cs.LG

Applications • Score 90

Deep Networks Learn Deep Hierarchical Models

We consider supervised learning with $n$ labels and show that layerwise SGD on residual networks can efficiently learn a class of hierarchical models. This model class surpasses those previously shown to be learnable by deep learning algorithms, reaching the depth limit of efficient learnability.

Source: arXiv cs.LG

Evaluation/Benchmarks • Score 96

A Sparse-Attention Deep Learning Model Integrating Heterogeneous Multimodal Features for Parkinson's Disease Severity Profiling

Characterizing the heterogeneous presentation of Parkinson's disease (PD) requires integrating biological and clinical markers within a unified predictive framework. We propose the Class-Weighted Sparse-Attention Fusion Network (SAFN), an interpretable deep learning framework for robust multimodal profiling that overcomes limitations of interpretability and class imbalance.

Source: arXiv cs.LG

NLP/LLMs • Score 95

Knowledge Distillation for Temporal Knowledge Graph Reasoning with Large Language Models

arXiv:2601.00202v1 Announce Type: new Abstract: Reasoning over temporal knowledge graphs (TKGs) is fundamental to improving the efficiency and reliability of intelligent decision-making systems and has become a key technological foundation for future artificial intelligence applications. Despite recent progress, existing TKG reasoning models typically rely on large parameter sizes and intensive computation, leading to high hardware costs and energy consumption. These constraints hinder their deployment on resource-constrained, low-power, and distributed platforms that require real-time inference. Moreover, most existing model compression and distillation techniques are designed for static knowledge graphs and fail to adequately capture the temporal dependencies inherent in TKGs, often resulting in degraded reasoning performance. To address these challenges, we propose a distillation framework specifically tailored for temporal knowledge graph reasoning. Our approach leverages large language models as teacher models to guide the distillation process, enabling effective transfer of both structural and temporal reasoning capabilities to lightweight student models. By integrating large-scale public knowledge with task-specific temporal information, the proposed framework enhances the student model's ability to model temporal dynamics while maintaining a compact and efficient architecture. Extensive experiments on multiple publicly available benchmark datasets demonstrate that our method consistently outperforms strong baselines, achieving a favorable trade-off between reasoning accuracy, computational efficiency, and practical deployability.

Source: arXiv cs.CL

Theory/Optimization • Score 92

Sparse-Input Neural Networks Using Group Concave Regularization

In this paper, we investigate the problem of feature selection in neural networks, proposing a framework of sparse-input neural networks with group concave regularization. The method avoids selecting irrelevant variables by applying a suitable concave penalty to the $l_2$ norm of the weights, yielding a model that uses only a reduced subset of the original variables.
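
The penalty structure can be sketched directly: each input feature's outgoing first-layer weights form a group, and the group's l2 norm is passed through a concave function. The sketch below uses MCP as one common concave choice, an assumption on my part since the abstract does not name the penalty; all values are illustrative:

```python
import math

def group_norms(W):
    """l2 norm of each input feature's outgoing weight group (one row per input)."""
    return [math.sqrt(sum(w * w for w in row)) for row in W]

def mcp(t, lam=1.0, gamma=3.0):
    """Minimax concave penalty: behaves like lam*|t| near zero but flattens
    beyond gamma*lam, so large (relevant) groups are not over-shrunk."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2 * gamma)
    return gamma * lam * lam / 2

def sparse_input_penalty(W, lam=1.0, gamma=3.0):
    """Group concave regularizer on the input layer; a zero group drops a feature."""
    return sum(mcp(g, lam, gamma) for g in group_norms(W))

# First-layer weight matrix: row j = weights from input j to the hidden units.
W = [
    [0.0, 0.0, 0.0],  # irrelevant input: zero group norm, zero penalty
    [3.0, 4.0, 0.0],  # relevant input: norm 5 > gamma*lam, penalty saturated
]
```

The concavity is what distinguishes this from a group lasso: past the saturation point the penalty adds no further shrinkage, so selected features keep their full weight magnitudes.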

Source: arXiv stat.ML

Vision • Score 95

Spectral Density Estimation of Functional Time Series over Large Domains Using Deep Learning

We derive an estimator of the spectral density of a functional time series that is the output of a multilayer perceptron neural network. The estimator is motivated by the difficulty of computing existing spectral density estimators for time series of functions defined on very large grids, as in climate models and medical scans.

Source: arXiv stat.ML

Vision • Score 96

From Clay to Code: Typological and Material Reasoning in AI Interpretations of Iranian Pigeon Towers

This study investigates how generative AI systems interpret the architectural intelligence embedded in vernacular form. Using the Iranian pigeon tower as a case study, the research tests three diffusion models (Midjourney v6, DALL-E 3, and the Stable Diffusion XL (SDXL)-based DreamStudio) across three prompting stages: referential, adaptive, and speculative.

Source: arXiv cs.AI

RL • Score 95

Mitigating Optimistic Bias in Entropic Risk Estimation and Optimization

The entropic risk measure is widely used for high-stakes decisions in economics, management science, finance, and safety-critical control systems, as it captures the extreme risks associated with uncertain losses. This work presents a parametric bootstrap procedure that corrects the bias of the empirical entropic risk estimator, improving the accuracy of downstream decisions.
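
The mechanism can be sketched for Gaussian losses, where the entropic risk mu + alpha*sigma^2/2 is known in closed form, so a parametric bootstrap can estimate the plug-in estimator's bias and subtract it. This is a minimal sketch under the Gaussian assumption, with hypothetical function names, not the paper's exact procedure:

```python
import math
import random
import statistics

def entropic_risk(losses, alpha):
    """Plug-in estimator of (1/alpha) * log E[exp(alpha * X)], with a
    log-sum-exp shift for numerical stability."""
    c = alpha * max(losses)
    n = len(losses)
    return (c + math.log(sum(math.exp(alpha * x - c) for x in losses) / n)) / alpha

def bias_corrected_entropic_risk(losses, alpha, n_boot=200, seed=0):
    """Parametric bootstrap under a fitted Gaussian. By Jensen's inequality the
    plug-in estimator is biased low (optimistic), so subtract the estimated bias."""
    rng = random.Random(seed)
    mu, sd = statistics.fmean(losses), statistics.stdev(losses)
    exact_fitted = mu + alpha * sd * sd / 2          # closed form for N(mu, sd^2)
    n = len(losses)
    boot = [entropic_risk([rng.gauss(mu, sd) for _ in range(n)], alpha)
            for _ in range(n_boot)]
    bias = statistics.fmean(boot) - exact_fitted
    return entropic_risk(losses, alpha) - bias
```

For constant losses the plug-in estimator is exact; for noisy samples the correction typically raises the estimate toward the true risk, countering the optimistic bias.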

Source: arXiv stat.ML

NLP/LLMs • Score 96

A Vision- and Knowledge-Enhanced Large Language Model for Generalizable Pedestrian Crossing Behavior Inference

Existing paradigms for inferring pedestrian crossing behavior generalize poorly and perform inadequately at new sites. This study introduces the Pedestrian Crossing LLM (PedX-LLM), an enhanced framework that shifts crossing inference from site-specific patterns to generalizable behavioral reasoning, achieving 82.0% balanced accuracy and outperforming traditional methods.

Source: arXiv cs.AI

RL • Score 95

Designing an Optimal Sensor Network by Minimizing Information Loss

Optimal experimental design is a classical topic in statistics, with many well-studied problems and solutions. This work investigates sensor placement for monitoring spatiotemporal processes, accounting for the temporal dimension in our modeling and optimization. We present a new model-based sensor placement criterion together with a highly efficient optimization algorithm.

Source: arXiv stat.ML

NLP/LLMs • Score 96

A Large Empirical Case Study: Go-Explore Adapted for AI Red-Team Testing

Production LLM agents with tool-use capabilities require safety testing despite their safety training. We adapt Go-Explore to evaluate GPT-4o-mini across 28 experimental runs addressing six research questions. Our results show that random-seed variation dominates the algorithmic parameters, producing an 8x spread in outcomes.

Source: arXiv cs.AI

Vision • Score 95

DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection

arXiv:2601.00303v1 Announce Type: new Abstract: Speech is a scalable and non-invasive biomarker for early mental health screening. However, widely used depression datasets like DAIC-WOZ exhibit strong coupling between linguistic sentiment and diagnostic labels, encouraging models to learn semantic shortcuts. As a result, model robustness may be compromised in real-world scenarios, such as Camouflaged Depression, where individuals maintain socially positive or neutral language despite underlying depressive states. To mitigate this semantic bias, we propose DepFlow, a three-stage depression-conditioned text-to-speech framework. First, a Depression Acoustic Encoder learns speaker- and content-invariant depression embeddings through adversarial training, achieving effective disentanglement while preserving depression discriminability (ROC-AUC: 0.693). Second, a flow-matching TTS model with FiLM modulation injects these embeddings into synthesis, enabling control over depressive severity while preserving content and speaker identity. Third, a prototype-based severity mapping mechanism provides smooth and interpretable manipulation across the depression continuum. Using DepFlow, we construct a Camouflage Depression-oriented Augmentation (CDoA) dataset that pairs depressed acoustic patterns with positive/neutral content from a sentiment-stratified text bank, creating acoustic-semantic mismatches underrepresented in natural data. Evaluated across three depression detection architectures, CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming conventional augmentation strategies in depression Detection. Beyond enhancing robustness, DepFlow provides a controllable synthesis platform for conversational systems and simulation-based evaluation, where real clinical data remains limited by ethical and coverage constraints.

Source: arXiv cs.CL

NLP/LLMs • Score 95

Comparative Efficiency Analysis of Lightweight Transformer Models: A Multi-Domain Empirical Benchmark for Enterprise NLP Deployment

arXiv:2601.00444v1 Announce Type: new Abstract: In the rapidly evolving landscape of enterprise natural language processing (NLP), the demand for efficient, lightweight models capable of handling multi-domain text automation tasks has intensified. This study conducts a comparative analysis of three prominent lightweight Transformer models - DistilBERT, MiniLM, and ALBERT - across three distinct domains: customer sentiment classification, news topic classification, and toxicity and hate speech detection. Utilizing datasets from IMDB, AG News, and the Measuring Hate Speech corpus, we evaluated performance using accuracy-based metrics including accuracy, precision, recall, and F1-score, as well as efficiency metrics such as model size, inference time, throughput, and memory usage. Key findings reveal that no single model dominates all performance dimensions. ALBERT achieves the highest task-specific accuracy in multiple domains, MiniLM excels in inference speed and throughput, and DistilBERT demonstrates the most consistent accuracy across tasks while maintaining competitive efficiency. All results reflect controlled fine-tuning under fixed enterprise-oriented constraints rather than exhaustive hyperparameter optimization. These results highlight trade-offs between accuracy and efficiency, recommending MiniLM for latency-sensitive enterprise applications, DistilBERT for balanced performance, and ALBERT for resource-constrained environments.
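
The efficiency side of such a benchmark reduces to a small timing harness like the one below. This is a generic sketch, not the paper's protocol; the stand-in model is a trivial function where real use would wrap DistilBERT, MiniLM, or ALBERT inference calls:

```python
import time

def benchmark(model_fn, batch, warmup=3, iters=20):
    """Mean per-call latency and items/second throughput for one model."""
    for _ in range(warmup):            # warm caches before timing
        model_fn(batch)
    start = time.perf_counter()
    for _ in range(iters):
        model_fn(batch)
    latency = (time.perf_counter() - start) / iters
    return {"latency_s": latency, "throughput_items_per_s": len(batch) / latency}

# Stand-in "model": length of each input text, just to exercise the harness.
stats = benchmark(lambda texts: [len(t) for t in texts], ["good film", "dull plot"])
```

Warmup iterations matter in practice: the first calls to a real model pay one-off costs (weight loading, kernel compilation) that would otherwise skew the measured latency.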

Source: arXiv cs.CL

NLP/LLMs • Score 93

Can Large Language Models Still Explain Themselves? Investigating the Impact of Quantization on Self-Explanations

Quantization is widely used to accelerate inference and streamline the deployment of large language models (LLMs), but its effects on self-explanations (SEs) remain unexplored. This study investigates the degradation in SE quality and faithfulness under quantization, analyzing natural language explanations (NLEs) and counterfactual examples generated by LLMs quantized with three common techniques.

Source: arXiv cs.AI

NLP/LLMs • Score 95

A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR

arXiv:2601.00557v1 Announce Type: new Abstract: Large-scale multilingual ASR (mASR) models such as Whisper achieve strong performance but incur high computational and latency costs, limiting their deployment on resource-constrained edge devices. In this study, we propose a lightweight and language-agnostic multilingual ASR system based on a CTC architecture with domain adaptation. Specifically, we introduce a Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework integrated into an mHuBERT-CTC model, enabling end-to-end decoding via LID-posterior-driven LoRA routing. The hierarchical design consists of a multilingual shared LoRA for learning language-invariant acoustic representations and language-specific LoRA experts for modeling language-dependent characteristics. The proposed routing mechanism removes the need for prior language identity information or explicit language labels during inference, achieving true language-agnostic decoding. Experiments on MSR-86K and the MLC-SLM 2025 Challenge datasets demonstrate that HLoRA achieves competitive performance with state-of-the-art two-stage inference methods using only single-pass decoding, significantly improving decoding efficiency for low-resource mASR applications.

Source: arXiv cs.CL

NLP/LLMs • Score 95

Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak

arXiv:2601.00213v1 Announce Type: cross Abstract: The widespread deployment of large language models (LLMs) has raised growing concerns about their misuse risks and associated safety issues. While prior studies have examined the safety of LLMs in general usage, code generation, and agent-based applications, their vulnerabilities in automated algorithm design remain underexplored. To fill this gap, this study investigates this overlooked safety vulnerability, with a particular focus on intelligent optimization algorithm design, given its prevalent use in complex decision-making scenarios. We introduce MalOptBench, a benchmark consisting of 60 malicious optimization algorithm requests, and propose MOBjailbreak, a jailbreak method tailored for this scenario. Through extensive evaluation of 13 mainstream LLMs including the latest GPT-5 and DeepSeek-V3.1, we reveal that most models remain highly susceptible to such attacks, with an average attack success rate of 83.59% and an average harmfulness score of 4.28 out of 5 on original harmful prompts, and near-complete failure under MOBjailbreak. Furthermore, we assess state-of-the-art plug-and-play defenses that can be applied to closed-source models, and find that they are only marginally effective against MOBjailbreak and prone to exaggerated safety behaviors. These findings highlight the urgent need for stronger alignment techniques to safeguard LLMs against misuse in algorithm design.

Fonte: arXiv cs.CL

Applications • Score 90

The Illusion of Insight in Reasoning Models

Reasoning models such as DeepSeek-R1-Zero can exhibit 'Aha!' moments, but the relationship between intrinsic shifts in reasoning strategy and performance gains remains unclear. This study analyzes mid-reasoning shifts across more than 1 million traces, finding that such shifts are rare and do not improve accuracy, although their effect varies with model uncertainty.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

arXiv:2601.00588v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in cost-sensitive and on-device scenarios, and safety guardrails have advanced mainly in English. However, real-world Chinese malicious queries typically conceal intent via homophones, pinyin, symbol-based splitting, and other Chinese-specific patterns. These Chinese-specific adversarial patterns create the safety evaluation gap that is not well captured by existing benchmarks focused on English. This gap is particularly concerning for lightweight models, which may be more vulnerable to such specific adversarial perturbations. To bridge this gap, we introduce the Chinese-Specific Safety Benchmark (CSSBench) that emphasizes these adversarial patterns and evaluates the safety of lightweight LLMs in Chinese. Our benchmark covers six domains that are common in real Chinese scenarios, including illegal activities and compliance, privacy leakage, health and medical misinformation, fraud and hate, adult content, and public and political safety, and organizes queries into multiple task types. We evaluate a set of popular lightweight LLMs and measure over-refusal behavior to assess safety-induced performance degradation. Our results show that the Chinese-specific adversarial pattern is a critical challenge for lightweight LLMs. This benchmark offers a comprehensive evaluation of LLM safety in Chinese, assisting robust deployments in practice.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning

arXiv:2601.00791v1 Announce Type: cross Abstract: We present a training-free method for detecting valid mathematical reasoning in large language models through spectral analysis of attention patterns. By treating attention matrices as adjacency matrices of dynamic graphs over tokens, we extract four interpretable spectral diagnostics, the Fiedler value (algebraic connectivity), high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy, that exhibit statistically significant differences between valid and invalid mathematical proofs. Experiments across seven transformer models from four independent architectural families (Meta Llama, Alibaba Qwen, Microsoft Phi, and Mistral AI) demonstrate that this spectral signature produces effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling 85.0--95.6\% classification accuracy under rigorous evaluation, with calibrated thresholds reaching 93--95\% on the full dataset. The method requires no training data, fine-tuning, or learned classifiers: a single threshold on a spectral metric suffices for high accuracy. Through systematic label correction, we discover that the spectral method detects logical coherence rather than compiler acceptance, identifying mathematically valid proofs that formal verifiers reject due to technical failures. We further identify an architectural dependency: Mistral-7B's Sliding Window Attention shifts the discriminative signal from HFER to late-layer Smoothness ($d = 2.09$, $p_{\text{MW}} = 1.16 \times 10^{-48}$), revealing that attention mechanism design affects which spectral features capture reasoning validity. These findings establish spectral graph analysis as a principled framework for reasoning verification with immediate applications to hallucination detection and AI safety monitoring.

Fonte: arXiv cs.CL
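Two of the four diagnostics named above are easy to reproduce from a single attention matrix: symmetrize it, form the graph Laplacian, and read off the Fiedler value and the spectral entropy of the eigenvalue distribution. A sketch with numpy only; the symmetrization and normalization choices here are ours, not necessarily the paper's.

```python
import numpy as np

def spectral_diagnostics(attn):
    """Treat a symmetrized attention matrix as weighted-graph adjacency and
    return (Fiedler value, spectral entropy) of its Laplacian spectrum."""
    A = (attn + attn.T) / 2.0
    np.fill_diagonal(A, 0.0)
    L = np.diag(A.sum(axis=1)) - A        # combinatorial graph Laplacian
    eig = np.linalg.eigvalsh(L)           # ascending; eig[0] is ~0 for any graph
    fiedler = eig[1]                      # algebraic connectivity
    p = eig[eig > 1e-12]
    p = p / p.sum()
    entropy = float(-(p * np.log(p)).sum())
    return fiedler, entropy

# Row-stochastic toy "attention" over 6 tokens
rng = np.random.default_rng(0)
raw = rng.random((6, 6))
attn = raw / raw.sum(axis=1, keepdims=True)
fiedler, entropy = spectral_diagnostics(attn)
```

A single threshold on one of these scalars is all the classification rule in the paper requires, which is why the method needs no training.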

Applications • Score 89

Mask-Conditioned Voxel Diffusion for Joint Geometry and Color Inpainting

arXiv:2601.00368v1 Announce Type: new Abstract: We present a lightweight two-stage framework for joint geometry and color inpainting of damaged 3D objects, motivated by the digital restoration of cultural heritage artifacts. The pipeline separates damage localization from reconstruction. In the first stage, a 2D convolutional network predicts damage masks on RGB slices extracted from a voxelized object, and these predictions are aggregated into a volumetric mask. In the second stage, a diffusion-based 3D U-Net performs mask-conditioned inpainting directly on voxel grids, reconstructing geometry and color while preserving observed regions. The model jointly predicts occupancy and color using a composite objective that combines occupancy reconstruction with masked color reconstruction and perceptual regularization. We evaluate the approach on a curated set of textured artifacts with synthetically generated damage using standard geometric and color metrics. Compared to symmetry-based baselines, our method produces more complete geometry and more coherent color reconstructions at a fixed 32^3 resolution. Overall, the results indicate that explicit mask conditioning is a practical way to guide volumetric diffusion models for joint 3D geometry and color inpainting.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

FCMBench: A Comprehensive Multimodal Financial Credit Benchmark for Real-World Applications

As multimodal AI becomes widely used for credit risk assessment and document review, a domain-specific benchmark is urgently needed. We present FCMBench-V1.0, a large-scale multimodal financial credit benchmark covering 18 core certificate types, with 4,043 privacy-compliant images and 8,446 QA samples.

Fonte: arXiv cs.AI

Vision • Score 95

Context-Aware Pesticide Recommendation via Few-Shot Pest Recognition for Precision Agriculture

arXiv:2601.00243v1 Announce Type: new Abstract: Effective pest management is crucial for enhancing agricultural productivity, especially for crops such as sugarcane and wheat that are highly vulnerable to pest infestations. Traditional pest management methods depend heavily on manual field inspections and the use of chemical pesticides. These approaches are often costly, time-consuming, labor-intensive, and can have a negative impact on the environment. To overcome these challenges, this study presents a lightweight framework for pest detection and pesticide recommendation, designed for low-resource devices such as smartphones and drones, making it suitable for use by small and marginal farmers. The proposed framework includes two main components. The first is a Pest Detection Module that uses a compact, lightweight convolutional neural network (CNN) combined with prototypical meta-learning to accurately identify pests even when only a few training samples are available. The second is a Pesticide Recommendation Module that incorporates environmental factors like crop type and growth stage to suggest safe and eco-friendly pesticide recommendations. To train and evaluate our framework, a comprehensive pest image dataset was developed by combining multiple publicly available datasets. The final dataset contains samples with different viewing angles, pest sizes, and background conditions to ensure strong generalization. Experimental results show that the proposed lightweight CNN achieves high accuracy, comparable to state-of-the-art models, while significantly reducing computational complexity. The Decision Support System additionally improves pest management by reducing dependence on traditional chemical pesticides and encouraging sustainable practices, demonstrating its potential for real-time applications in precision agriculture.

Fonte: arXiv cs.CV
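At its core, the prototypical meta-learning component reduces to nearest-prototype classification: each pest class is represented by the mean embedding of its few support images, and queries are matched to the closest prototype. A generic sketch; the embeddings and shapes are hypothetical stand-ins for the paper's CNN features.

```python
import numpy as np

def build_prototypes(support_emb, support_labels):
    """One prototype per class: the mean of that class's support embeddings."""
    classes = np.unique(support_labels)
    protos = np.stack([support_emb[support_labels == c].mean(axis=0) for c in classes])
    return classes, protos

def nearest_prototype(query_emb, classes, protos):
    """Label each query with the class of its nearest prototype (Euclidean)."""
    dists = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

# Two well-separated toy classes, 3 support shots each
support = np.vstack([np.zeros((3, 4)), np.ones((3, 4)) * 5])
labels = np.array([0, 0, 0, 1, 1, 1])
classes, protos = build_prototypes(support, labels)
queries = np.array([[0.2, 0.1, 0.0, 0.1], [4.8, 5.1, 5.0, 4.9]])
pred = nearest_prototype(queries, classes, protos)
# pred → [0, 1]
```

Because only class means are stored, new pest classes can be added from a handful of images without retraining the backbone, which is what makes the approach few-shot.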

NLP/LLMs • Score 95

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

arXiv:2601.00264v1 Announce Type: new Abstract: Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts. Technical validation demonstrates that this enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment. S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Application Research of a Deep Learning Model Integrating CycleGAN and YOLO in PCB Infrared Defect Detection

arXiv:2601.00237v1 Announce Type: new Abstract: This paper addresses the critical bottleneck of infrared (IR) data scarcity in Printed Circuit Board (PCB) defect detection by proposing a cross-modal data augmentation framework integrating CycleGAN and YOLOv8. Unlike conventional methods relying on paired supervision, we leverage CycleGAN to perform unpaired image-to-image translation, mapping abundant visible-light PCB images into the infrared domain. This generative process synthesizes high-fidelity pseudo-IR samples that preserve the structural semantics of defects while accurately simulating thermal distribution patterns. Subsequently, we construct a heterogeneous training strategy that fuses generated pseudo-IR data with limited real IR samples to train a lightweight YOLOv8 detector. Experimental results demonstrate that this method effectively enhances feature learning under low-data conditions. The augmented detector significantly outperforms models trained on limited real data alone and approaches the performance benchmarks of fully supervised training, proving the efficacy of pseudo-IR synthesis as a robust augmentation strategy for industrial inspection.

Fonte: arXiv cs.CV
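The heterogeneous training strategy amounts to mixing scarce real IR samples with abundant CycleGAN-generated pseudo-IR samples in each batch. A minimal sampler sketch; the per-batch real fraction below is our illustrative assumption, not a value from the paper.

```python
import random

def mixed_batch(real_ir, pseudo_ir, batch_size=8, real_fraction=0.25, rng=None):
    """Draw one training batch mixing real IR images with pseudo-IR images
    synthesized by CycleGAN from visible-light PCB photos."""
    rng = rng or random.Random()
    n_real = min(len(real_ir), max(1, round(batch_size * real_fraction)))
    batch = rng.sample(real_ir, n_real)                     # without replacement
    batch += rng.choices(pseudo_ir, k=batch_size - n_real)  # with replacement
    rng.shuffle(batch)
    return batch
```

Sampling real IR without replacement but pseudo-IR with replacement reflects the asymmetry the abstract describes: a small pool of real thermal images stretched by a large synthetic pool.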

NLP/LLMs • Score 96

FaithSCAN: Single-Pass Model-Based Hallucination Detection for Faithful Visual Question Answering

Faithfulness hallucinations in VQA arise when vision-language models produce fluent but visually ungrounded answers, undermining their reliability in safety-critical applications. We propose FaithSCAN, a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, overcoming the efficiency and detection-performance limitations of existing methods.

Fonte: arXiv cs.AI

Vision • Score 95

NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos

arXiv:2601.00393v1 Announce Type: new Abstract: In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks. Our project page is available at https://neoverse-4d.github.io

Fonte: arXiv cs.CV

Theory/Optimization • Score 92

SlingBAG Pro: Accelerating point cloud-based iterative reconstruction for 3D photoacoustic imaging under arbitrary array

arXiv:2601.00551v1 Announce Type: new Abstract: High-quality three-dimensional (3D) photoacoustic imaging (PAI) is gaining increasing attention in clinical applications. To address the challenges of limited space and high costs, irregular geometric transducer arrays that conform to specific imaging regions are promising for achieving high-quality 3D PAI with fewer transducers. However, traditional iterative reconstruction algorithms struggle with irregular array configurations, suffering from high computational complexity, substantial memory requirements, and lengthy reconstruction times. In this work, we introduce SlingBAG Pro, an advanced reconstruction algorithm based on the point cloud iteration concept of the Sliding ball adaptive growth (SlingBAG) method, while extending its compatibility to arbitrary array geometries. SlingBAG Pro maintains high reconstruction quality, reduces the number of required transducers, and employs a hierarchical optimization strategy that combines zero-gradient filtering with progressively increased temporal sampling rates during iteration. This strategy rapidly removes redundant spatial point clouds, accelerates convergence, and significantly shortens overall reconstruction time. Compared to the original SlingBAG algorithm, SlingBAG Pro achieves up to a 2.2-fold speed improvement in point cloud-based 3D PA reconstruction under irregular array geometries. The proposed method is validated through both simulation and in vivo mouse experiments, and the source code is publicly available at https://github.com/JaegerCQ/SlingBAG_Pro.

Fonte: arXiv cs.CV
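One ingredient of the hierarchical strategy, zero-gradient filtering, can be sketched directly: point-cloud entries whose accumulated gradient magnitude is (near) zero contribute nothing to the reconstruction and are pruned between iterations, shrinking later passes. The tolerance below is an illustrative choice, not the paper's.

```python
import numpy as np

def prune_zero_gradient(points, grads, tol=1e-8):
    """Keep only points whose gradient norm exceeds `tol`; returns the
    filtered points and the boolean keep-mask."""
    keep = np.linalg.norm(grads, axis=1) > tol
    return points[keep], keep

pts = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
grd = np.array([[0.0, 0.0, 0.0], [0.3, 0.0, 0.0], [0.0, 0.0, 1e-12]])
kept, mask = prune_zero_gradient(pts, grd)
# kept → [[1., 1., 1.]]
```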

Vision • Score 95

A Cascaded Information Interaction Network for Precise Image Segmentation

arXiv:2601.00562v1 Announce Type: new Abstract: Visual perception plays a pivotal role in enabling autonomous behavior, offering a cost-effective and efficient alternative to complex multi-sensor systems. However, robust segmentation remains a challenge in complex scenarios. To address this, this paper proposes a cascaded convolutional neural network integrated with a novel Global Information Guidance Module. This module is designed to effectively fuse low-level texture details with high-level semantic features across multiple layers, thereby overcoming the inherent limitations of single-scale feature extraction. This architectural innovation significantly enhances segmentation accuracy, particularly in visually cluttered or blurred environments where traditional methods often fail. Experimental evaluations on benchmark image segmentation datasets demonstrate that the proposed framework achieves superior precision, outperforming existing state-of-the-art methods. The results highlight the effectiveness of the approach and its promising potential for deployment in practical robotic applications.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

Fine-Tuning Large Language Models for Automated Depression Screening in Nigerian Pidgin: The GENSCORE Pilot Study

Depression is a major contributor to Nigeria's mental health burden, yet screening coverage is limited by poor access to clinicians, stigma, and language barriers. This study presents a novel approach to automated depression screening using large language models (LLMs) fine-tuned for conversational Nigerian Pidgin.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

A Chain-of-Thought Approach to Semantic Query Categorization in e-Commerce Taxonomies

arXiv:2601.00510v1 Announce Type: cross Abstract: Search in e-Commerce is powered at the core by a structured representation of the inventory, often formulated as a category taxonomy. An important capability in e-Commerce with hierarchical taxonomies is to select a set of relevant leaf categories that are semantically aligned with a given user query. In this scope, we address a fundamental problem of search query categorization in real-world e-Commerce taxonomies. A correct categorization of a query not only provides a way to zoom into the correct inventory space, but opens the door to multiple intent understanding capabilities for a query. A practical and accurate solution to this problem has many applications in e-commerce, including constraining retrieved items and improving the relevance of the search results. For this task, we explore a novel Chain-of-Thought (CoT) paradigm that combines simple tree-search with LLM semantic scoring. Assessing its classification performance on human-judged query-category pairs, relevance tests, and LLM-based reference methods, we find that the CoT approach performs better than a benchmark that uses embedding-based query category predictions. We show how the CoT approach can detect problems within a hierarchical taxonomy. Finally, we also propose LLM-based approaches for query-categorization of the same spirit, but which scale better at the range of millions of queries.

Fonte: arXiv cs.CL
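The tree-search half of the approach can be sketched independently of the model: descend the taxonomy level by level, keeping only the highest-scoring children, until leaf categories are reached. Below, `score(query, path)` is a stand-in for the LLM semantic-scoring call; the toy taxonomy and keyword score are our own illustrations.

```python
def categorize(query, taxonomy, score, beam=2):
    """Greedy beam descent over a nested-dict taxonomy (leaf = empty dict);
    returns the leaf category paths reached."""
    frontier, leaves = [[]], []
    while frontier:
        path = frontier.pop()
        node = taxonomy
        for step in path:
            node = node[step]
        if not node:                     # leaf category reached
            leaves.append(path)
            continue
        ranked = sorted(node, key=lambda c: score(query, path + [c]), reverse=True)
        frontier.extend(path + [c] for c in ranked[:beam])
    return leaves

taxonomy = {"Electronics": {"Phones": {}, "Laptops": {}}, "Fashion": {"Shoes": {}}}
keyword_score = lambda q, path: sum(w.lower() in path[-1].lower() for w in q.split())
print(categorize("laptops under $500", taxonomy, keyword_score, beam=1))
```

Pruning to `beam` children per level is what keeps the number of LLM calls logarithmic in the taxonomy size rather than linear in the number of leaves.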

Vision • Score 95

Boosting Segment Anything Model to Generalize Visually Non-Salient Scenarios

arXiv:2601.00537v1 Announce Type: new Abstract: Segment Anything Model (SAM), known for its remarkable zero-shot segmentation capabilities, has garnered significant attention in the community. Nevertheless, its performance is challenged when dealing with what we refer to as visually non-salient scenarios, where there is low contrast between the foreground and background. In these cases, existing methods often cannot capture accurate contours and fail to produce promising segmentation results. In this paper, we propose Visually Non-Salient SAM (VNS-SAM), aiming to enhance SAM's perception of visually non-salient scenarios while preserving its original zero-shot generalizability. We achieve this by effectively exploiting SAM's low-level features through two designs: Mask-Edge Token Interactive decoder and Non-Salient Feature Mining module. These designs help the SAM decoder gain a deeper understanding of non-salient characteristics with only marginal parameter increments and computational requirements. The additional parameters of VNS-SAM can be optimized within 4 hours, demonstrating its feasibility and practicality. In terms of data, we established VNS-SEG, a unified dataset for various VNS scenarios, with more than 35K images, in contrast to previous single-task adaptations. It is designed to make the model learn more robust VNS features and comprehensively benchmark the model's segmentation performance and generalizability on VNS scenarios. Extensive experiments across various VNS segmentation tasks demonstrate the superior performance of VNS-SAM, particularly under zero-shot settings, highlighting its potential for broad real-world applications. Codes and datasets are publicly available at https://guangqian-guo.github.io/VNS-SAM.

Fonte: arXiv cs.CV

Vision • Score 96

Real-Time Human Detection in Aerially Captured Video Sequences via Deep Models

Human detection in videos plays an important role in many real-world applications. Traditional approaches rely on hand-crafted features, which are problem- and task-specific. This work uses automatic feature-learning methods instead, combining optical flow with three different deep models to detect humans in videos captured by a non-static camera on aerial platforms.

Fonte: arXiv cs.LG

RL • Score 93

Controllable Concept Bottleneck Models

Concept Bottleneck Models (CBMs) have attracted attention for their ability to make the prediction process transparent through a human-understandable concept layer. However, most prior studies have focused on static scenarios. We propose Controllable Concept Bottleneck Models (CCBMs), which support edits at different granularities, enabling continuous maintenance without retraining.

Fonte: arXiv cs.LG

Evaluation/Benchmarks • Score 93

Optimizing LSTM Neural Networks for Retail Sales Forecasting under Limited Resources: A Model Compression Study

Standard Long Short-Term Memory (LSTM) neural networks deliver accurate forecasts for retail sales data but demand substantial computational power, which can be prohibitive for small and mid-sized retail businesses. This paper investigates compressing the LSTM model by gradually reducing the number of hidden units from 128 to 16.

Fonte: arXiv cs.LG
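The compression lever here is the hidden-unit count: a single-layer LSTM's parameter count grows roughly quadratically in it, since each of the four gates owns an input kernel, a recurrent kernel, and a bias. A quick sketch of that arithmetic; the input dimension of 10 is an illustrative assumption, not from the study.

```python
def lstm_params(input_dim, hidden, output_dim=1):
    """Trainable parameters of one LSTM layer plus a dense forecasting head.
    The 4 gates each have an input kernel, recurrent kernel, and bias."""
    gates = 4 * (hidden * input_dim + hidden * hidden + hidden)
    head = hidden * output_dim + output_dim
    return gates + head

for h in (128, 64, 32, 16):
    print(h, lstm_params(input_dim=10, hidden=h))
```

With these assumptions, going from 128 to 16 hidden units cuts the parameter count by roughly a factor of 40 (71,297 vs 1,745), which is the trade-off regime the study probes.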

NLP/LLMs • Score 95

Robust Uncertainty Quantification for Factual Generation of Large Language Models

arXiv:2601.00348v1 Announce Type: new Abstract: The rapid advancement of large language model (LLM) technology has facilitated its integration into various domains of professional and daily life. However, the persistent challenge of LLM hallucination has emerged as a critical limitation, significantly compromising the reliability and trustworthiness of AI-generated content. This challenge has garnered significant attention within the scientific community, prompting extensive research efforts in hallucination detection and mitigation strategies. Current methodological frameworks reveal a critical limitation: traditional uncertainty quantification approaches demonstrate effectiveness primarily within conventional question-answering paradigms, yet exhibit notable deficiencies when confronted with non-canonical or adversarial questioning strategies. This performance gap raises substantial concerns regarding the dependability of LLM responses in real-world applications requiring robust critical thinking capabilities. This study aims to fill this gap by proposing an uncertainty quantification scenario for the task of generating with multiple facts. We have meticulously constructed a set of trap questions containing fake names. Based on this scenario, we propose a novel and robust uncertainty quantification method (RU). A series of experiments has been conducted to verify its effectiveness. The results show that the constructed set of trap questions performs excellently. Moreover, when compared with the baseline methods on four different models, our proposed method demonstrates strong performance, with an average increase of 0.1-0.2 in ROCAUC values over the best-performing baseline method, providing new insights and methods for addressing the hallucination issue of LLMs.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Memory Bank Compression for Continual Adaptation of Large Language Models

arXiv:2601.00756v1 Announce Type: cross Abstract: Large Language Models (LLMs) have become a mainstay for many everyday applications. However, as data evolve their knowledge quickly becomes outdated. Continual learning aims to update LLMs with new information without erasing previously acquired knowledge. Although methods such as full fine-tuning can incorporate new data, they are computationally expensive and prone to catastrophic forgetting, where prior knowledge is overwritten. Memory-augmented approaches address this by equipping LLMs with a memory bank, that is an external memory module which stores information for future use. However, these methods face a critical limitation, in particular, the memory bank constantly grows in the real-world scenario when large-scale data streams arrive. In this paper, we propose MBC, a model that compresses the memory bank through a codebook optimization strategy during online adaptation learning. To ensure stable learning, we also introduce an online resetting mechanism that prevents codebook collapse. In addition, we employ Key-Value Low-Rank Adaptation in the attention layers of the LLM, enabling efficient utilization of the compressed memory representations. Experiments with benchmark question-answering datasets demonstrate that MBC reduces the memory bank size to 0.3% when compared against the most competitive baseline, while maintaining high retention accuracy during online adaptation learning. Our code is publicly available at https://github.com/Thomkat/MBC.

Fonte: arXiv cs.CL
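The storage saving comes from vector quantization: each memory vector is replaced by the index of its nearest codebook entry, so the bank shrinks from n·d floats to n small integer codes plus the codebook itself. A minimal sketch with illustrative shapes; the paper's online codebook optimization and resetting mechanism are omitted.

```python
import numpy as np

def compress(bank, codebook):
    """Map each memory vector to its nearest codebook index (Euclidean)."""
    d2 = ((bank[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1).astype(np.uint16)

def decompress(codes, codebook):
    """Approximate reconstruction: look the codes back up in the codebook."""
    return codebook[codes]

rng = np.random.default_rng(0)
bank = rng.normal(size=(1000, 32))                     # 1000 memories, 32-dim
codebook = bank[rng.choice(1000, 16, replace=False)]   # stands in for a learned codebook
codes = compress(bank, codebook)
approx = decompress(codes, codebook)
```

Here 1000×32 floats become 1000 two-byte codes plus a 16×32 codebook, which is the kind of order-of-magnitude reduction (down to a fraction of a percent) the abstract reports.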

RL • Score 95

VisNet: Efficient Person Re-Identification via Alpha-Divergence Loss, Feature Fusion and Dynamic Multi-Task Learning

arXiv:2601.00307v1 Announce Type: new Abstract: Person re-identification (ReID) is an extremely important area in both surveillance and mobile applications, requiring strong accuracy with minimal computational cost. State-of-the-art methods give good accuracy but with high computational budgets. To remedy this, this paper proposes VisNet, a computationally efficient and effective re-identification model suitable for real-world scenarios. It is the culmination of conceptual contributions, including feature fusion at multiple scales with automatic attention on each, semantic clustering with anatomical body partitioning, a dynamic weight averaging technique to balance classification semantic regularization, and the use of loss function FIDI for improved metric learning tasks. The multiple scales fuse ResNet50's stages 1 through 4 without the use of parallel paths, with semantic clustering introducing spatial constraints through the use of rule-based pseudo-labeling. VisNet achieves 87.05% Rank-1 and 77.65% mAP on the Market-1501 dataset, having 32.41M parameters and 4.601 GFLOPs, hence, proposing a practical approach for real-time deployment in surveillance and mobile applications where computational resources are limited.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers

arXiv:2601.00359v1 Announce Type: new Abstract: In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer - an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text-based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time requirements, operating at 26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin. Additionally, we show qualitative results that highlight the effectiveness and possible use cases in real-world applications. Overall, our method serves as a drop-in replacement for traditional segmentation approaches while enabling flexible natural-language querying and seamless integration into 3D mapping pipelines for mobile robotics.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications

arXiv:2505.14918v2 Announce Type: replace-cross Abstract: This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.

Fonte: arXiv stat.ML
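The intra-rater consistency figure quoted above (perfect agreement on 90-98% of examples) is simply the fraction of items on which all replicate runs of one model return the same label. A sketch of that metric plus a basic invalid-response rate; the function names and label set are our own, not the framework's.

```python
def intra_rater_agreement(replicates):
    """Fraction of items on which every replicate run agrees.
    `replicates` is a list of equal-length label lists, one per run."""
    total = len(replicates[0])
    unanimous = sum(1 for labels in zip(*replicates) if len(set(labels)) == 1)
    return unanimous / total

def invalid_rate(labels, valid=("positive", "negative")):
    """Share of responses outside the allowed binary label set."""
    return sum(l not in valid for l in labels) / len(labels)

runs = [["positive", "negative", "positive", "positive"],
        ["positive", "negative", "negative", "positive"],
        ["positive", "negative", "positive", "positive"]]
print(intra_rater_agreement(runs))  # 0.75: the runs disagree only on item 3
```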

Vision • Score 96

EIA-SEC: An Enhanced Actor-Critic Framework for Collaborative Multi-UAV Control in Smart Agriculture

The widespread adoption of wireless communication technology has driven the development of smart agriculture, in which unmanned aerial vehicles (UAVs) play a multifunctional role. In this work, we model the multi-UAV trajectory-planning problem as a Markov decision process and propose the novel Elite Imitation Actor-Shared Ensemble Critic (EIA-SEC) framework. Experimental results show that EIA-SEC outperforms state-of-the-art baselines in reward performance, training stability, and convergence speed.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Benchmarking Neural Surrogates on Realistic Spatio-Temporal Multiphysics Flows

Predicting multiphysics dynamics is computationally expensive and challenging because heterogeneous, multiscale physical processes are tightly coupled. We present REALM (REalistic AI Learning for Multiphysics), a rigorous benchmarking framework for testing neural surrogates on challenging reactive flows, with 11 high-fidelity datasets and a standardized training and evaluation protocol.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

The Condition Number as a Scale-Invariant Proxy for Information Encoding in Neural Units

This paper explores, from an information-theoretic perspective, the relationship between the condition number of a neural network's weight tensor and the amount of information encoded by the associated processing unit. It argues that a high condition number can indicate that the unit has learned to selectively amplify and compress information.

Fonte: arXiv stat.ML
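The proxy itself is one line: the ratio of a weight matrix's largest to smallest singular value, which is unchanged when the weights are rescaled by any nonzero constant. A quick check of that scale invariance; the layer shape is illustrative.

```python
import numpy as np

def condition_number(W):
    """kappa(W) = sigma_max / sigma_min. Scale-invariant: multiplying W by a
    nonzero scalar multiplies every singular value by the same factor."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return s[0] / s[-1]

W = np.random.default_rng(1).normal(size=(64, 32))
print(np.isclose(condition_number(W), condition_number(2.5 * W)))  # True
```

The same quantity is available directly as `np.linalg.cond(W)` with the default 2-norm.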

Vision • Score 96

Self-Organizing Maps for Water Quality Assessment in Reservoirs and Lakes: A Systematic Literature Review

Sustainable water quality is fundamental to ecological balance and water security. This review examines the application of the Self-Organizing Map (SOM), an unsupervised AI technique, to water quality assessment, covering parameter selection, sampling strategies, and clustering approaches.

Fonte: arXiv cs.LG

MLOps/Systems • Score 95

GenUQ: Predictive Uncertainty Estimates via Generative Hyper-Networks

Operator learning is a recent generalization of regression to mappings between functions, promising to replace expensive numerical integration of PDEs with fast evaluations of mappings between functional states of a system. In this paper, we present GenUQ, a measure-theoretic approach to UQ that avoids constructing a likelihood by introducing a generative hyper-network model.

Fonte: arXiv stat.ML

NLP/LLMs • Score 96

HARBOR: A Holistic and Adaptive Risk Assessment Model for Behavioral Healthcare

Risk assessment in behavioral healthcare remains challenging due to the multimodal nature of patient data and the temporal dynamics of mood and affective disorders. In this work, we present HARBOR, a behavioral-health-aware language model designed to predict a discrete mood-and-risk score, termed the Harbor Risk Score (HRS).

Fonte: arXiv cs.AI

RL • Score 96

Forecasting and Predicting Short-Term Drought Impacts Using Machine Learning to Support Mitigation and Adaptation Efforts

Drought is a complex natural hazard that affects ecological and human systems, often resulting in significant environmental and economic losses. This study applies machine learning techniques to link drought indices to historical records of drought impacts, generating short-term impact forecasts and improving the predictability of these impacts on actionable timescales.

Source: arXiv cs.LG

NLP/LLMs • Score 96

Automatic Adaptation to Concept Complexity and Subjective Natural Concepts: A Chunking-Based Cognitive Model

A central problem in cognitive science concerns the fundamental psychological processes underlying the formation and retrieval of multiple types of concepts in short- and long-term memory (STM and LTM, respectively). We propose that chunking mechanisms play an essential role and show how the CogAct computational model grounds concept learning in fundamental cognitive processes and structures.

Source: arXiv cs.AI

NLP/LLMs • Score 95

Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?

arXiv:2512.19117v1 Announce Type: new Abstract: This paper proposes an epistemological shift in the analysis of large generative models, replacing the category "Large Language Models" (LLM) with that of "Large Discourse Models" (LDM), and then with that of Artificial Discursive Agent (ADA). The theoretical framework is based on an ontological triad distinguishing three regulatory instances: the apprehension of the phenomenal regularities of the referential world, the structuring of embodied cognition, and the structural-linguistic sedimentation of the utterance within a socio-historical context. LDMs, operating on the product of these three instances (the document), model the discursive projection of a portion of human experience reified by the learning corpus. The proposed program aims to replace the "fascination/fear" dichotomy with public trials and procedures that make the place, uses, and limits of artificial discursive agents in contemporary social space decipherable, situating this approach within a perspective of governance and co-regulation involving the State, industry, civil society, and academia.

Source: arXiv cs.CL

RL • Score 96

FairExpand: Individual Fairness on Graphs with Partial Similarity Information

Individual fairness, which requires that similar individuals be treated similarly by algorithmic systems, is a central principle in fair machine learning. This work presents FairExpand, a flexible framework that promotes individual fairness under partial information, overcoming the limitation of existing methods that require predefined similarity information for all node pairs.

Source: arXiv cs.LG

NLP/LLMs • Score 96

CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning

Large language models (LLMs) often solve challenging mathematical exercises yet fail to apply the relevant concept when a problem demands genuine understanding. We present CORE (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal.

Source: arXiv cs.AI

NLP/LLMs • Score 96

$\beta(3,4)$ 'Attention' in Cognitive Agents: Ontology-Free Knowledge Representations with Promise-Theoretic Semantics

The semantics and dynamics of 'attention' are closely related to promise-theoretic notions developed for autonomous agents and can be readily expressed in a promise framework. This makes it possible to bridge vectorized Machine Learning and Knowledge Graph representations without implicit reliance on language models.

Source: arXiv cs.AI

NLP/LLMs • Score 96

Vibe Reasoning: Eliciting Mathematical Capabilities from Frontier AI -- A Case Study on IMO 2025 Problem 6

We present Vibe Reasoning, a human-AI collaborative paradigm for solving complex mathematical problems. Our key insight is that frontier AI models already possess the knowledge needed to solve challenging problems but do not know how, what, or when to apply it. This work demonstrates the approach on Problem 6 of IMO 2025.

Source: arXiv cs.AI

RL • Score 96

APC-GNN++: An Adaptive Patient-Centric GNN with Context-Aware Attention and Mini-Graph Explainability for Diabetes Classification

We propose APC-GNN++, a patient-centric Graph Neural Network for diabetes classification. Our model integrates context-aware edge attention, confidence-guided mixing of node features and graph representations, and neighborhood-consistency regularization to better capture clinically meaningful relationships between patients.

Source: arXiv cs.LG

RL • Score 96

Unifying Causal Reinforcement Learning: Review, Taxonomy, Algorithms, and Applications

Integrating causal inference (CI) with reinforcement learning (RL) has become a powerful paradigm for addressing critical limitations of classical RL, such as poor explainability and lack of robustness. This work reviews recent advances at the intersection of CI and RL, categorizing existing approaches and discussing challenges, empirical successes, and future research directions.

Source: arXiv cs.AI

Applications • Score 89

Structure of Classifier Boundaries: Case Study for a Naive Bayes Classifier

Classifiers assign complex input data points to a small number of output categories. In this study, we analyze the structure of the boundary of a Bayes classifier, which comprises the points whose neighbors are classified differently. We present a new uncertainty measure, Neighbor Similarity, which compares the outcome at an input point with the distribution of outcomes over its neighbors.

Source: arXiv stat.ML
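
To illustrate the idea, here is one plausible instantiation of a Neighbor Similarity score (hypothetical; the paper's exact definition may differ): the fraction of random neighbors of a point that receive the same label as the point itself, so values near 1 indicate a point far from the boundary.

```python
import random

def neighbor_similarity(classify, x, eps=0.1, n_neighbors=200, seed=0):
    """Fraction of random neighbors of x (uniform perturbations of
    radius eps per coordinate) that share x's predicted label.
    Illustrative stand-in for the paper's Neighbor Similarity measure."""
    rng = random.Random(seed)
    label = classify(x)
    agree = 0
    for _ in range(n_neighbors):
        neighbor = [xi + rng.uniform(-eps, eps) for xi in x]
        if classify(neighbor) == label:
            agree += 1
    return agree / n_neighbors

# Toy 1-D classifier with a decision boundary at 0.
classify = lambda x: int(x[0] > 0.0)
print(neighbor_similarity(classify, [5.0]))  # deep inside one class → 1.0
print(neighbor_similarity(classify, [0.0]))  # on the boundary → roughly 0.5
```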

RL • Score 95

Stochastic Optimization with Optimal Importance Sampling

arXiv:2504.03560v2 Announce Type: replace-cross Abstract: Importance Sampling (IS) is a widely used variance reduction technique for enhancing the efficiency of Monte Carlo methods, particularly in rare-event simulation and related applications. Despite its effectiveness, the performance of IS is highly sensitive to the choice of the proposal distribution and often requires stochastic calibration. While the design and analysis of IS have been extensively studied in estimation settings, applying IS within stochastic optimization introduces a fundamental challenge: the decision variable and the importance sampling distribution are mutually dependent, creating a circular optimization structure. This interdependence complicates both convergence analysis and variance control. We consider convex stochastic optimization problems with linear constraints and propose a single-loop stochastic approximation algorithm, based on a joint variant of Nesterov's dual averaging, that jointly updates the decision variable and the importance sampling distribution, without time-scale separation or nested optimization. The method is globally convergent and achieves minimal asymptotic variance among stochastic gradient schemes, matching the performance of an oracle sampler adapted to the optimal solution.

Source: arXiv stat.ML
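
The IS mechanism the paper builds on can be shown in a few lines: estimating a Gaussian tail probability by sampling from a shifted proposal and reweighting with the exact likelihood ratio. This is a static illustration of variance reduction only, not the paper's joint decision-variable/sampler update.

```python
import math
import random

def is_tail_prob(threshold=3.0, shift=3.0, n=50000, seed=1):
    """Estimate p = P(Z > threshold) for Z ~ N(0, 1) by sampling from
    the shifted proposal N(shift, 1) and reweighting each sample with
    the likelihood ratio p(x)/q(x) = exp(shift^2/2 - shift*x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(shift, 1.0)
        if x > threshold:
            total += math.exp(0.5 * shift * shift - shift * x)
    return total / n

# Naive Monte Carlo would see the event only ~0.13% of the time;
# the shifted proposal hits it about half the time, so the weighted
# estimate is far more stable at the same sample budget.
print(is_tail_prob())  # close to the true value 1.35e-3
```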

NLP/LLMs • Score 96

MoE-TransMov: A Transformer-Based Model for Next Point-of-Interest (POI) Prediction in Familiar and Unfamiliar Movements

Accurately predicting the next point of interest (POI) in human mobility trajectories is crucial for location-based services, enabling more timely and personalized recommendations. We propose MoE-TransMov, a Transformer-based model with a Mixture-of-Experts (MoE) architecture that captures distinct mobility patterns across different movement contexts, improving prediction accuracy.

Source: arXiv cs.LG

NLP/LLMs • Score 95

Q-KVComm: Efficient Multi-Agent Communication via Adaptive KV-Cache Compression

Multi-agent Large Language Model (LLM) systems face a critical bottleneck: redundant transmission of contextual information between agents consumes excessive bandwidth and computational resources. We present Q-KVComm, a novel protocol that enables direct transmission of compressed key-value (KV) cache representations between LLM agents.

Source: arXiv cs.CL
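
The abstract does not specify the compression scheme; as a generic stand-in, a per-vector min-max quantizer shows the kind of compact KV payload an agent might transmit instead of raw floats (the function names and the 8-bit scheme are illustrative assumptions, not the paper's method).

```python
def quantize(vec, bits=8):
    """Uniform min-max quantization of a float vector to `bits`-bit
    integers plus (scale, offset) metadata -- a generic stand-in for
    an adaptive KV-cache compression payload."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # avoid zero scale for constant vectors
    q = [round((v - lo) / scale) for v in vec]
    return q, scale, lo

def dequantize(q, scale, lo):
    """Reconstruct approximate floats from the quantized payload."""
    return [qi * scale + lo for qi in q]

kv = [0.12, -1.5, 3.3, 0.0, 2.71]       # one toy KV-cache vector
q, scale, lo = quantize(kv)
restored = dequantize(q, scale, lo)
max_err = max(abs(a - b) for a, b in zip(kv, restored))
print(max_err < scale)  # reconstruction error stays under one quantization step
```

Each float becomes one byte plus two shared metadata values, roughly a 4x reduction over 32-bit floats for long vectors.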

NLP/LLMs • Score 95

On Finding Inconsistencies in Documents

arXiv:2512.18601v1 Announce Type: new Abstract: Professionals in academia, law, and finance audit their documents because inconsistencies can result in monetary, reputational, and scientific costs. Language models (LMs) have the potential to dramatically speed up this auditing process. To understand their abilities, we introduce a benchmark, FIND (Finding INconsistencies in Documents), where each example is a document with an inconsistency inserted manually by a domain expert. Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies. Surprisingly, gpt-5 also found undiscovered inconsistencies present in the original documents. For example, on 50 arXiv papers, we judged 136 out of 196 of the model's suggestions to be legitimate inconsistencies missed by the original authors. However, despite these findings, even the best models miss almost half of the inconsistencies in FIND, demonstrating that inconsistency detection is still a challenging task.

Source: arXiv cs.CL

NLP/LLMs • Score 95

Research on a hybrid LSTM-CNN-Attention model for text-based web content classification

arXiv:2512.18475v1 Announce Type: new Abstract: This study presents a hybrid deep learning architecture that integrates LSTM, CNN, and an Attention mechanism to enhance the classification of web content based on text. Pretrained GloVe embeddings are used to represent words as dense vectors that preserve semantic similarity. The CNN layer extracts local n-gram patterns and lexical features, while the LSTM layer models long-range dependencies and sequential structure. The integrated Attention mechanism enables the model to focus selectively on the most informative parts of the input sequence. A 5-fold cross-validation setup was used to assess the robustness and generalizability of the proposed solution. Experimental results show that the hybrid LSTM-CNN-Attention model achieved outstanding performance, with an accuracy of 0.98, precision of 0.94, recall of 0.92, and F1-score of 0.93. These results surpass the performance of baseline models based solely on CNNs, LSTMs, or transformer-based classifiers such as BERT. The combination of neural network components enabled the model to effectively capture both fine-grained text structures and broader semantic context. Furthermore, the use of GloVe embeddings provided an efficient and effective representation of textual data, making the model suitable for integration into systems with real-time or near-real-time requirements. The proposed hybrid architecture demonstrates high effectiveness in text-based web content classification, particularly in tasks requiring both syntactic feature extraction and semantic interpretation. By combining presented mechanisms, the model addresses the limitations of individual architectures and achieves improved generalization. These findings support the broader use of hybrid deep learning approaches in NLP applications, especially where complex, unstructured textual data must be processed and classified with high reliability.

Source: arXiv cs.CL
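
The Attention component of such a hybrid can be illustrated in isolation: score each hidden state, softmax the scores, and pool by the weighted sum. The abstract does not specify the attention variant, so the scoring vector below is a made-up stand-in for a learned parameter.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_pool(states, score_w):
    """Attention pooling over a sequence of hidden states: dot each
    state with a scoring vector, softmax the scores, and return the
    weighted sum plus the attention weights."""
    scores = [sum(w * h for w, h in zip(score_w, state)) for state in states]
    weights = softmax(scores)
    pooled = [sum(w * state[j] for w, state in zip(weights, states))
              for j in range(len(states[0]))]
    return pooled, weights

# Three 2-d hidden states; the scoring vector favors the second dimension,
# so the second state (the "most informative" one) dominates the pooling.
states = [[1.0, 0.0], [0.0, 3.0], [0.5, 0.5]]
pooled, weights = attention_pool(states, score_w=[0.0, 1.0])
print(weights[1] == max(weights))  # prints True
```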

Theory/Optimization • Score 92

Generalized Data Fission Using Sufficient Statistics

Our goal is to develop a general strategy for decomposing a random variable $X$ into multiple independent random variables without sacrificing information about unknown parameters. This work generalizes a recent procedure, allowing exact reconstruction of $X$ from known functions of the independent variables.

Source: arXiv stat.ML
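
A classical instance of such a decomposition is binomial thinning of a Poisson count: the two pieces are independent Poissons, and the original variable is recovered exactly as their sum.

```python
import random

def fission_poisson(x, p=0.5, seed=0):
    """Binomial thinning: split an observed count X ~ Poisson(lam) into
    X1 ~ Poisson(p*lam) and X2 = X - X1 ~ Poisson((1-p)*lam). The two
    pieces are independent, and X is reconstructed exactly as X1 + X2."""
    rng = random.Random(seed)
    x1 = sum(1 for _ in range(x) if rng.random() < p)
    return x1, x - x1

x = 17                      # an observed Poisson count
x1, x2 = fission_poisson(x)
assert x1 + x2 == x         # exact reconstruction from the two pieces
print(x1, x2)
```

One piece can then be used for model selection and the other for inference, without reusing the same randomness twice.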

NLP/LLMs • Score 95

Keep It Light! Simplifying Image Clustering via Text-Free Adapters

In the setting of pre-trained models, effective classification can be achieved with lightweight readout layers. This work demonstrates that, in deep clustering, performance competitive with more complex methods can be obtained using a highly simplified, text-free training pipeline. Our approach, Simple Clustering via Pre-trained models (SCP), leverages feature representations from pre-trained vision models and positive data pairs.

Source: arXiv stat.ML

RL • Score 96

Demonstration-Guided Continual Reinforcement Learning in Dynamic Environments

Reinforcement learning (RL) excels in many applications but struggles in dynamic environments. Continual reinforcement learning (CRL) allows RL agents to learn and adapt continually, yet balancing stability and plasticity remains a challenge. We propose demonstration-guided continual reinforcement learning (DGCRL), which uses a repository of external demonstrations to guide RL exploration and adaptation.

Source: arXiv cs.LG

Applications • Score 89

Learning Confidence Ellipsoids and Applications to Robust Subspace Recovery

arXiv:2512.16875v2 Announce Type: replace-cross Abstract: We study the problem of finding confidence ellipsoids for an arbitrary distribution in high dimensions. Given samples from a distribution $D$ and a confidence parameter $\alpha$, the goal is to find the smallest volume ellipsoid $E$ which has probability mass $\Pr_{D}[E] \ge 1-\alpha$. Ellipsoids are a highly expressive class of confidence sets as they can capture correlations in the distribution, and can approximate any convex set. This problem has been studied in many different communities. In statistics, this is the classic minimum volume estimator introduced by Rousseeuw as a robust non-parametric estimator of location and scatter. However in high dimensions, it becomes NP-hard to obtain any non-trivial approximation factor in volume when the condition number $\beta$ of the ellipsoid (ratio of the largest to the smallest axis length) goes to $\infty$. This motivates the focus of our paper: can we efficiently find confidence ellipsoids with volume approximation guarantees when compared to ellipsoids of bounded condition number $\beta$? Our main result is a polynomial time algorithm that finds an ellipsoid $E$ whose volume is within a $O(\beta)^{\gamma d}$ multiplicative factor of the volume of best $\beta$-conditioned ellipsoid while covering at least $1-O(\alpha/\gamma)$ probability mass for any $\gamma < \alpha$. We complement this with a computational hardness result that shows that such a dependence seems necessary up to constants in the exponent. The algorithm and analysis uses the rich primal-dual structure of the minimum volume enclosing ellipsoid and the geometric Brascamp-Lieb inequality. As a consequence, we obtain the first polynomial time algorithm with approximation guarantees on worst-case instances of the robust subspace recovery problem.

Source: arXiv stat.ML

Evaluation/Benchmarks • Score 93

Comparative Evaluation of Explainable Machine Learning Versus Linear Regression for Predicting County-Level Lung Cancer Mortality Rates in the United States

Lung cancer (LC) is a leading cause of cancer-related mortality in the United States. Accurate prediction of LC mortality rates is crucial for guiding targeted interventions and addressing health disparities. This study applied three models: random forest (RF), gradient boosting regression (GBR), and linear regression (LR), to predict county-level LC mortality rates.

Source: arXiv cs.LG

RL • Score 95

Central Limit Theorems for Ergodic Averages of Markov Chains and the Comparison of Sampling Algorithms for Heavy-Tailed Distributions

Establishing central limit theorems (CLTs) for ergodic averages of Markov chains is a fundamental problem in probability and its applications. This paper provides verifiable necessary conditions for CLTs of ergodic averages on general state spaces, focusing on drift conditions that also yield lower bounds on the rates of convergence to stationarity.

Source: arXiv stat.ML

RecSys • Score 96

Probabilistic Digital Twins of Users: Latent Representation Learning with Statistically Validated Semantics

Understanding user identity and behavior is central to applications such as personalization, recommendation, and decision support. We propose a probabilistic digital twin framework in which each user is modeled as a latent stochastic state that generates the observed behavioral data. The framework is applied to a dataset of user responses to capture stable aspects of user identity.

Source: arXiv cs.LG

NLP/LLMs • Score 95

Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset

arXiv:2512.17915v1 Announce Type: new Abstract: The recently published Loquacious dataset aims to be a replacement for established English automatic speech recognition (ASR) datasets such as LibriSpeech or TED-Lium. The main goal of the Loquacious dataset is to provide properly defined training and test partitions across many acoustic and language domains, with an open license suitable for both academia and industry. To further promote the benchmarking and usability of this new dataset, we present additional resources in the form of n-gram language models (LMs), a grapheme-to-phoneme (G2P) model and pronunciation lexica, with open and public access. Utilizing those additional resources we show experimental results across a wide range of ASR architectures with different label units and topologies. Our initial experimental results indicate that the Loquacious dataset offers a valuable study case for a variety of common challenges in ASR.

Source: arXiv cs.CL

NLP/LLMs • Score 96

Toward Evaluating Privacy Vulnerabilities in Selective Forgetting with Large Language Models

Rapid advances in artificial intelligence (AI) have centered on learning from data to build informed learning systems. As these systems are deployed in critical domains, ensuring their privacy and their alignment with human values is essential. Selective forgetting, or machine unlearning, emerges as a promising approach but also raises significant privacy concerns, especially in sensitive domains.

Source: arXiv cs.LG

RL • Score 93

Microstructure-Based Variational Neural Networks for Robust Uncertainty Quantification in Material Digital Twins

Aleatoric uncertainties, i.e., irreducible variability in microstructure morphology, constituent behavior, and processing conditions, pose a major challenge for developing uncertainty-robust digital twins. We present the Variational Deep Material Network (VDMN), a physics-informed surrogate model that enables efficient, probabilistic predictions of material behavior.

Source: arXiv cs.LG

NLP/LLMs • Score 96

Programmatic Rule Generation for Document Forgery Detection Using Large Language Models

Document forgery poses a growing threat to legal, economic, and governmental processes, demanding increasingly sophisticated verification mechanisms. This work investigates how large language models (LLMs) can be adapted to generate rule-based plausibility checks for forgery detection using limited hardware resources.

Source: arXiv cs.AI

Vision • Score 96

A Dataset and Benchmarks for Atrial Fibrillation Detection from Electrocardiograms of Intensive Care Unit Patients

Objective: Atrial fibrillation (AF) is the most common cardiac arrhythmia among intensive care unit (ICU) patients and can cause adverse health effects. In this study, we release a labeled ICU dataset and benchmarks for AF detection, comparing machine learning models across three data-driven artificial intelligence (AI) approaches.

Source: arXiv cs.LG

NLP/LLMs • Score 95

LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction

Large language models (LLMs) often generate hallucinated content that lacks factual or contextual grounding, limiting their reliability in critical applications. LLM-CAS frames real-time hallucination correction as a hierarchical reinforcement learning problem, enabling adaptive corrections without permanent parameter modification.

Source: arXiv cs.CL

Vision • Score 96

FedOAED: Federated On-Device Autoencoder Denoiser for Heterogeneous Data under Limited Client Availability

In recent years, machine learning (ML) and deep learning (DL) solutions have shown their potential across diverse applications, but strict data-sharing regulations such as GDPR and HIPAA have hindered the deployment of data-driven applications. Federated Learning (FL) emerges as a solution but still faces challenges related to heterogeneity. In this work, we present FedOAED, a novel algorithm that aims to mitigate client drift and the variance induced by partial client participation.

Source: arXiv cs.LG

Vision • Score 96

Another Fit Falls: Conformal Prediction as a Calibration Standard for Machine Learning in High-Energy Physics

Machine learning techniques are essential in modern collider research, but their probabilistic outputs often lack calibrated uncertainty estimates. Conformal prediction (CP) offers a simple, distribution-free framework for calibrating arbitrary predictive models, enabling rigorous uncertainty quantification. In this work, we investigate CP as a unifying calibration layer for machine learning applications in high-energy physics.

Source: arXiv cs.AI
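
Split conformal prediction, the simplest CP variant, needs only calibration residuals and one quantile. A minimal sketch (illustrative; not tied to any specific collider analysis):

```python
import math

def split_conformal_interval(residuals, y_hat, alpha=0.1):
    """Split conformal prediction: given calibration residuals
    |y_i - f(x_i)| and a new point prediction y_hat, return an
    interval with marginal coverage >= 1 - alpha. Distribution-free
    and model-agnostic, assuming only exchangeability."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1.0 - alpha))   # conformal quantile rank
    q = sorted(residuals)[min(k, n) - 1]     # k-th smallest residual
    return y_hat - q, y_hat + q

# With 9 calibration residuals and alpha = 0.1, the conformal rank
# selects the largest residual as the interval half-width.
res = [0.5, 1.2, 0.3, 0.9, 0.1, 0.7, 1.0, 0.4, 0.6]
lo, hi = split_conformal_interval(res, y_hat=5.0)
print(lo, hi)  # → 3.8 6.2
```

The same wrapper applies to any classifier or regressor score, which is what makes CP attractive as a shared calibration layer.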

RL • Score 96

On Time: Model-Free Reinforcement Learning with Timed Reward Machines

Reward specification plays a central role in reinforcement learning (RL), guiding agent behavior. In this paper, we propose timed reward machines (TRMs), which extend traditional reward machines by incorporating timing constraints into the reward structure, enabling more expressive and tunable specifications.

Source: arXiv cs.AI
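
A toy sketch of the idea (a hypothetical simplification, not the paper's formal TRM definition): transitions fire on event labels, but the reward is paid only when the transition happens within a deadline measured from entry into the current machine state.

```python
class TimedRewardMachine:
    """A minimal illustrative timed reward machine: each transition
    carries a deadline, and its reward is granted only if the elapsed
    time since entering the current state meets that deadline."""

    def __init__(self):
        # state -> {event: (next_state, deadline, reward)}
        self.delta = {
            "u0": {"got_key": ("u1", 10, 1.0)},
            "u1": {"opened_door": ("u_acc", 5, 10.0)},
        }
        self.state, self.entered_at = "u0", 0

    def step(self, event, t):
        """Advance on an event at time t; return the (timed) reward."""
        edge = self.delta.get(self.state, {}).get(event)
        if edge is None:
            return 0.0  # event irrelevant in this machine state
        next_state, deadline, reward = edge
        on_time = (t - self.entered_at) <= deadline
        self.state, self.entered_at = next_state, t
        return reward if on_time else 0.0

trm = TimedRewardMachine()
print(trm.step("got_key", t=4))       # within the 10-step window → 1.0
print(trm.step("opened_door", t=12))  # 12 - 4 = 8 > 5, too late → 0.0
```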

NLP/LLMs • Score 96

Toward Explainable Conversational AI for Early Diagnosis with Large Language Models

Healthcare systems worldwide face problems such as inefficient diagnosis, rising costs, and limited access to specialists. This study presents a diagnostic chatbot powered by a Large Language Model (LLM), built on GPT-4o and explainable-AI techniques, that interacts with patients to extract and normalize symptoms and rank potential diagnoses.

Source: arXiv cs.AI

RL • Score 93

A Theoretical Analysis of State Similarity Between Markov Decision Processes

arXiv:2512.17265v1 Announce Type: new Abstract: The bisimulation metric (BSM) is a powerful tool for analyzing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to state similarity between multiple MDPs remains challenging. Prior work has attempted to extend BSM to pairs of MDPs, but a lack of well-established mathematical properties has limited further theoretical analysis between MDPs. In this work, we formally establish a generalized bisimulation metric (GBSM) for measuring state similarity between arbitrary pairs of MDPs, which is rigorously proven with three fundamental metric properties, i.e., GBSM symmetry, inter-MDP triangle inequality, and a distance bound on identical spaces. Leveraging these properties, we theoretically analyze policy transfer, state aggregation, and sampling-based estimation across MDPs, obtaining explicit bounds that are strictly tighter than existing ones derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.

Source: arXiv cs.LG

NLP/LLMs • Score 96

The Effect of Negation on CLIP in Medical Imaging: Limitations of Contrastive Language-Image Pretraining

arXiv:2512.17121v1 Announce Type: new Abstract: Large vision-language models like CLIP are increasingly used in medical imaging tasks due to their ability to align images and text without the need for extensive labeled data. This makes them particularly useful for applications like image retrieval, report generation, and classification in clinical settings. A potential issue with this approach is that CLIP-based models often underperform when interpreting negated phrases, which is especially problematic in the context of medical diagnosis. In this study, we evaluate the Stanford AIMI CheXagent model on its ability to correctly retrieve chest X-ray images using prompts with and without negation. The goal of this project is to understand where this model fails and then use it as a base model whose retrieval accuracy we improve via fine-tuning methods outlined in previous work. Results from this study show improved handling of negation in the CLIP model, with a slight decrease in accuracy on positive-prompt evaluation. Alongside retrieval accuracy, we examined internal model behavior through token attribution, t-SNE projection, and attention-head ablation to better characterize how each fine-tuning approach reshaped the text encoder's representation of negated clinical language. Through this work, we hope to better understand the internal behavior of CLIP and improve its handling of negation using clinically relevant language, improving its reliability in medical AI devices.

Source: arXiv cs.LG

MLOps/Systems • Score 93

Differentiable Physics-Based Fault Diagnosis and Quantification for Photovoltaic Arrays

Accurate fault diagnosis and quantification are essential for the reliable operation and intelligent maintenance of photovoltaic (PV) arrays. This paper proposes a novel fault quantification approach for PV strings based on a differentiable fast fault simulation model (DFFSM), which accurately models I-V characteristics under multiple faults and provides analytical gradients with respect to the fault parameters.

Source: arXiv cs.LG

MLOps/Systems • Score 95

Penalized Fair Regression for Multiple Groups in Chronic Kidney Disease

arXiv:2512.17340v1 Announce Type: cross Abstract: Fair regression methods have the potential to mitigate societal bias concerns in health care, but there has been little work on penalized fair regression when multiple groups experience such bias. We propose a general regression framework that addresses this gap with unfairness penalties for multiple groups. Our approach is demonstrated for binary outcomes with true positive rate disparity penalties. It can be efficiently implemented through reduction to a cost-sensitive classification problem. We additionally introduce novel score functions for automatically selecting penalty weights. Our penalized fair regression methods are empirically studied in simulations, where they achieve a fairness-accuracy frontier beyond that of existing comparison methods. Finally, we apply these methods to a national multi-site primary care study of chronic kidney disease to develop a fair classifier for end-stage renal disease. There we find substantial improvements in fairness for multiple race and ethnicity groups who experience societal bias in the health care system without any appreciable loss in overall fit.

Source: arXiv stat.ML

MLOps/Systems • Score 95

A Systems-Theoretic View on the Convergence of Algorithms under Disturbances

arXiv:2512.17598v1 Announce Type: cross Abstract: Algorithms increasingly operate within complex physical, social, and engineering systems where they are exposed to disturbances, noise, and interconnections with other dynamical systems. This article extends known convergence guarantees of an algorithm operating in isolation (i.e., without disturbances) and systematically derives stability bounds and convergence rates in the presence of such disturbances. By leveraging converse Lyapunov theorems, we derive key inequalities that quantify the impact of disturbances. We further demonstrate how our result can be utilized to assess the effects of disturbances on algorithmic performance in a wide variety of applications, including communication constraints in distributed learning, sensitivity in machine learning generalization, and intentional noise injection for privacy. This underpins the role of our result as a unifying tool for algorithm analysis in the presence of noise, disturbances, and interconnections with other dynamical systems.

Source: arXiv stat.ML

Privacy/Security/Fairness • Score 92

Low-Rank Filtering and Smoothing for Sequential Deep Learning

arXiv:2410.06800v2 Announce Type: replace-cross Abstract: Learning multiple tasks sequentially requires neural networks to balance retaining knowledge, yet being flexible enough to adapt to new tasks. Regularizing network parameters is a common approach, but it rarely incorporates prior knowledge about task relationships, and limits information flow to future tasks only. We propose a Bayesian framework that treats the network's parameters as the state space of a nonlinear Gaussian model, unlocking two key capabilities: (1) A principled way to encode domain knowledge about task relationships, allowing, e.g., control over which layers should adapt between tasks. (2) A novel application of Bayesian smoothing, allowing task-specific models to also incorporate knowledge from models learned later. This does not require direct access to their data, which is crucial, e.g., for privacy-critical applications. These capabilities rely on efficient filtering and smoothing operations, for which we propose diagonal plus low-rank approximations of the precision matrix in the Laplace approximation (LR-LGF). Empirical results demonstrate the efficiency of LR-LGF and the benefits of the unlocked capabilities.

Source: arXiv stat.ML

Privacy/Security/Fairness • Score 92

Mitigating Forgetting in Low Rank Adaptation

arXiv:2512.17720v1 Announce Type: cross Abstract: Parameter-efficient fine-tuning methods, such as Low-Rank Adaptation (LoRA), enable fast specialization of large pre-trained models to different downstream applications. However, this process often leads to catastrophic forgetting of the model's prior domain knowledge. We address this issue with LaLoRA, a weight-space regularization technique that applies a Laplace approximation to Low-Rank Adaptation. Our approach estimates the model's confidence in each parameter and constrains updates in high-curvature directions, preserving prior knowledge while enabling efficient target-domain learning. By applying the Laplace approximation only to the LoRA weights, the method remains lightweight. We evaluate LaLoRA by fine-tuning a Llama model for mathematical reasoning and demonstrate an improved learning-forgetting trade-off, which can be directly controlled via the method's regularization strength. We further explore different loss landscape curvature approximations for estimating parameter confidence, analyze the effect of the data used for the Laplace approximation, and study robustness across hyperparameters.

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

The Stochastic Occupation Kernel (SOCK) Method for Learning Stochastic Differential Equations

arXiv:2505.11622v2 Announce Type: replace Abstract: We present a novel kernel-based method for learning multivariate stochastic differential equations (SDEs). The method follows a two-step procedure: we first estimate the drift term function, then the (matrix-valued) diffusion function given the drift. Occupation kernels are integral functionals on a reproducing kernel Hilbert space (RKHS) that aggregate information over a trajectory. Our approach leverages vector-valued occupation kernels for estimating the drift component of the stochastic process. For diffusion estimation, we extend this framework by introducing operator-valued occupation kernels, enabling the estimation of an auxiliary matrix-valued function as a positive semi-definite operator, from which we readily derive the diffusion estimate. This enables us to avoid common challenges in SDE learning, such as intractable likelihoods, by optimizing a reconstruction-error-based objective. We propose a simple learning procedure that retains strong predictive accuracy while using Fenchel duality to promote efficiency. We validate the method on simulated benchmarks and a real-world dataset of Amyloid imaging in healthy and Alzheimer's disease subjects.

Fonte: arXiv stat.ML

RL • Score 95

A Survey on Archetypal Analysis

arXiv:2504.12392v2 Announce Type: replace-cross Abstract: Archetypal analysis (AA) was originally proposed in 1994 by Adele Cutler and Leo Breiman as a computational procedure for extracting distinct aspects, so-called archetypes, from observations, with each observational record approximated as a mixture (i.e., convex combination) of these archetypes. AA thereby provides straightforward, interpretable, and explainable representations for feature extraction and dimensionality reduction, facilitating the understanding of the structure of high-dimensional data and enabling wide applications across the sciences. However, AA also faces challenges, particularly as the associated optimization problem is non-convex. This is the first survey that provides researchers and data mining practitioners with an overview of the methodologies and opportunities that AA offers, surveying the many applications of AA across disparate fields of science, as well as best practices for modeling data with AA and its limitations. The survey concludes by explaining crucial future research directions concerning AA.

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

Towards Human-Guided, Data-Centric LLM Co-Pilots

arXiv:2501.10321v3 Announce Type: replace-cross Abstract: Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the disconnect between the needs of domain experts and translating these needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in complex real-world settings where raw data often contains complex issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel, multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets we demonstrate CliMB-DC's ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines for handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains -- healthcare, finance, social sciences and more -- to actively participate in driving real-world impact using ML.

Fonte: arXiv stat.ML

Applications • Score 89

On Agnostic PAC Learning in the Small Error Regime

arXiv:2502.09496v2 Announce Type: replace-cross Abstract: Binary classification in the classic PAC model exhibits a curious phenomenon: Empirical Risk Minimization (ERM) learners are suboptimal in the realizable case yet optimal in the agnostic case. Roughly speaking, this owes itself to the fact that non-realizable distributions $\mathcal{D}$ are simply more difficult to learn than realizable distributions -- even when one discounts a learner's error by $\mathrm{err}(h^*_{\mathcal{D}})$, the error of the best hypothesis in $\mathcal{H}$ for $\mathcal{D}$. Thus, optimal agnostic learners are permitted to incur excess error on (easier-to-learn) distributions $\mathcal{D}$ for which $\tau = \mathrm{err}(h^*_{\mathcal{D}})$ is small. Recent work of Hanneke, Larsen, and Zhivotovskiy (FOCS `24) addresses this shortcoming by including $\tau$ itself as a parameter in the agnostic error term. In this more fine-grained model, they demonstrate tightness of the error lower bound $\tau + \Omega \left(\sqrt{\frac{\tau (d + \log(1 / \delta))}{m}} + \frac{d + \log(1 / \delta)}{m} \right)$ in a regime where $\tau > d/m$, and leave open the question of whether there may be a higher lower bound when $\tau \approx d/m$, with $d$ denoting $\mathrm{VC}(\mathcal{H})$. In this work, we resolve this question by exhibiting a learner which achieves error $c \cdot \tau + O \left(\sqrt{\frac{\tau (d + \log(1 / \delta))}{m}} + \frac{d + \log(1 / \delta)}{m} \right)$ for a constant $c \leq 2.1$, thus matching the lower bound when $\tau \approx d/m$. Further, our learner is computationally efficient and is based upon careful aggregations of ERM classifiers, making progress on two other questions of Hanneke, Larsen, and Zhivotovskiy (FOCS `24). We leave open the interesting question of whether our approach can be refined to lower the constant from 2.1 to 1, which would completely settle the complexity of agnostic learning.

Fonte: arXiv stat.ML

RL • Score 93

UniCoMTE: A Universal Counterfactual Framework for Explaining Time-Series Classifiers on ECG Data

arXiv:2512.17100v1 Announce Type: cross Abstract: Machine learning models, particularly deep neural networks, have demonstrated strong performance in classifying complex time series data. However, their black-box nature limits trust and adoption, especially in high-stakes domains such as healthcare. To address this challenge, we introduce UniCoMTE, a model-agnostic framework for generating counterfactual explanations for multivariate time series classifiers. The framework identifies temporal features that most heavily influence a model's prediction by modifying the input sample and assessing its impact on the model's prediction. UniCoMTE is compatible with a wide range of model architectures and operates directly on raw time series inputs. In this study, we evaluate UniCoMTE's explanations on a time series ECG classifier. We quantify explanation quality by comparing our explanations' comprehensibility to comprehensibility of established techniques (LIME and SHAP) and assessing their generalizability to similar samples. Furthermore, clinical utility is assessed through a questionnaire completed by medical experts who review counterfactual explanations presented alongside original ECG samples. Results show that our approach produces concise, stable, and human-aligned explanations that outperform existing methods in both clarity and applicability. By linking model predictions to meaningful signal patterns, the framework advances the interpretability of deep learning models for real-world time series applications.
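
A toy version of "modify the input and assess the impact on the prediction": swap time segments from a donor sample of the target class and keep the swap that flips the model's prediction with the smallest change. This is an assumed mechanism for illustration, not the UniCoMTE procedure:

```python
import numpy as np

# Hedged sketch (assumed mechanics, not the UniCoMTE implementation).

def segment_counterfactual(model, x, donor, n_segments=10):
    """Try replacing each equal-length segment of `x` with `donor`'s segment."""
    bounds = np.linspace(0, len(x), n_segments + 1, dtype=int)
    target = model(donor)
    best = None
    for a, b in zip(bounds[:-1], bounds[1:]):
        cand = x.copy()
        cand[a:b] = donor[a:b]
        if model(cand) == target:                 # prediction flipped
            change = np.linalg.norm(cand - x)
            if best is None or change < best[0]:
                best = (change, (int(a), int(b)), cand)
    return best

# Toy 1-D "classifier": positive mean -> class 1, else class 0.
model = lambda ts: int(ts.mean() > 0)
x = -np.ones(100)                    # predicted class 0
donor = np.full(100, 100.0)          # a class-1 example
result = segment_counterfactual(model, x, donor)
assert result is not None and model(result[2]) == 1
```

The segment that flips the prediction is the candidate "temporal feature that most heavily influences the model's prediction" in this toy setting.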

Fonte: arXiv cs.AI

Evaluation/Benchmarks • Score 92

A Generic Machine Learning Framework for Radio Frequency Fingerprinting

arXiv:2510.09775v3 Announce Type: replace-cross Abstract: Fingerprinting radio frequency (RF) emitters typically involves finding unique characteristics that are featured in their received signal. These fingerprints are nuanced, but sufficiently detailed, motivating the pursuit of methods that can successfully extract them. The downstream task that requires the most meticulous RF fingerprinting (RFF) is known as specific emitter identification (SEI), which entails recognising each individual transmitter. RFF and SEI have a long history, with numerous defence and civilian applications, such as signal intelligence, electronic surveillance, and physical-layer authentication of wireless devices. In recent years, data-driven RFF approaches have become popular due to their ability to automatically learn intricate fingerprints. They generally deliver superior performance when compared to traditional RFF techniques that are often labour-intensive, inflexible, and only applicable to a particular emitter type or transmission scheme. In this paper, we present a generic and versatile machine learning (ML) framework for data-driven RFF with several popular downstream tasks such as SEI, data association (EDA) and RF emitter clustering (RFEC). It is emitter-type agnostic. We then demonstrate the introduced framework on several tasks using real RF datasets for spaceborne surveillance, signal intelligence, and counter-drone applications.

Fonte: arXiv stat.ML

Applications • Score 90

BumpNet: A Sparse Neural Network Framework for Learning PDE Solutions

arXiv:2512.17198v1 Announce Type: new Abstract: We introduce BumpNet, a sparse neural network framework for PDE numerical solution and operator learning. BumpNet is based on meshless basis function expansion, in a similar fashion to radial-basis function (RBF) networks. Unlike RBF networks, the basis functions in BumpNet are constructed from ordinary sigmoid activation functions. This enables the efficient use of modern training techniques optimized for such networks. All parameters of the basis functions, including shape, location, and amplitude, are fully trainable. Model parsimony and h-adaptivity are effectively achieved through dynamically pruning basis functions during training. BumpNet is a general framework that can be combined with existing neural architectures for learning PDE solutions: here, we propose Bump-PINNs (BumpNet with physics-informed neural networks) for solving general PDEs; Bump-EDNN (BumpNet with evolutionary deep neural networks) to solve time-evolution PDEs; and Bump-DeepONet (BumpNet with deep operator networks) for PDE operator learning. Bump-PINNs are trained using the same collocation-based approach used by PINNs, Bump-EDNN uses a BumpNet only in the spatial domain and uses EDNNs to advance the solution in time, while Bump-DeepONets employ a BumpNet regression network as the trunk network of a DeepONet. Extensive numerical experiments demonstrate the efficiency and accuracy of the proposed architecture.
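
One way to build a localized, fully trainable basis function from ordinary sigmoid activations, as the abstract describes, is the difference of two shifted sigmoids; the exact parameterization below is our assumption, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bump(x, center=0.0, width=1.0, sharpness=5.0, amplitude=1.0):
    """A localized basis function from two sigmoids (hypothetical form).

    All four parameters (shape, location, amplitude) are plain scalars and
    could be made trainable, mirroring the abstract's description.
    """
    return amplitude * (sigmoid(sharpness * (x - center + width))
                        - sigmoid(sharpness * (x - center - width)))

x = np.linspace(-5, 5, 201)
y = bump(x, center=0.0)
assert y[100] == y.max()               # peaks at the center
assert y[0] < 0.05 and y[-1] < 0.05    # decays away from the center
```

Because each basis function is built from standard sigmoids, gradients flow through it like through any dense layer, which is what lets modern training tooling be reused.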

Fonte: arXiv cs.LG

Applications • Score 89

Fast and Robust: Computationally Efficient Covariance Estimation for Sub-Weibull Vectors

arXiv:2512.17632v1 Announce Type: new Abstract: High-dimensional covariance estimation is notoriously sensitive to outliers. While statistically optimal estimators exist for general heavy-tailed distributions, they often rely on computationally expensive techniques like semidefinite programming or iterative M-estimation ($O(d^3)$). In this work, we target the specific regime of Sub-Weibull distributions (characterized by stretched exponential tails $\exp(-t^\alpha)$). We investigate a computationally efficient alternative: the Cross-Fitted Norm-Truncated Estimator. Unlike element-wise truncation, our approach preserves the spectral geometry while requiring $O(Nd^2)$ operations, which represents the theoretical lower bound for constructing a full covariance matrix. Although spherical truncation is geometrically suboptimal for anisotropic data, we prove that within the Sub-Weibull class, the exponential tail decay compensates for this mismatch. Leveraging weighted Hanson-Wright inequalities, we derive non-asymptotic error bounds showing that our estimator recovers the optimal sub-Gaussian rate $\tilde{O}(\sqrt{r(\Sigma)/N})$ with high probability. This provides a scalable solution for high-dimensional data that exhibits tails heavier than Gaussian but lighter than polynomial decay.
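
A minimal sketch of norm truncation with cross-fitting, under our own assumptions about the split and the radius choice (a norm quantile); the paper's estimator may differ in both:

```python
import numpy as np

# Hedged sketch: the truncation radius is chosen on one fold and applied to
# the other, then the roles are swapped and the two estimates averaged.

def norm_truncated_cov(X, radius):
    """Shrink samples with ||x|| > radius onto the sphere, then average x x^T."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norms, 1e-12))
    Xt = X * scale
    return Xt.T @ Xt / len(X)       # O(N d^2): the dominant cost

def cross_fitted_cov(X, quantile=0.95):
    n = len(X)
    A, B = X[: n // 2], X[n // 2 :]
    rA = np.quantile(np.linalg.norm(B, axis=1), quantile)  # radius fit on B
    rB = np.quantile(np.linalg.norm(A, axis=1), quantile)  # radius fit on A
    return 0.5 * (norm_truncated_cov(A, rA) + norm_truncated_cov(B, rB))

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 10))
S = cross_fitted_cov(X)
assert S.shape == (10, 10)
assert np.allclose(S, S.T)          # symmetric by construction
```

Note the estimator never forms per-entry thresholds: the whole sample vector is shrunk at once, which is what preserves the spectral geometry the abstract emphasizes.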

Fonte: arXiv stat.ML

Vision • Score 95

Multi-Level Distortion-Aware Deformable Network for Omnidirectional Image Super-Resolution

With the growing popularity of augmented- and virtual-reality applications, image processing for Omnidirectional Images (ODIs) has attracted increasing attention. Omnidirectional Image Super-Resolution (ODISR) is a promising technique for improving the visual quality of ODIs. We propose a novel Multi-Level Distortion-Aware Deformable Network (MDDN) for ODISR, designed to expand the sampling range and the receptive field.

Fonte: arXiv cs.CV

RL • Score 95

Generative Multi-Objective Bayesian Optimization with Scalable Batch Evaluations for Sample-Efficient De Novo Molecular Design

arXiv:2512.17659v1 Announce Type: new Abstract: Designing molecules that must satisfy multiple, often conflicting objectives is a central challenge in molecular discovery. The enormous size of chemical space and the cost of high-fidelity simulations have driven the development of machine learning-guided strategies for accelerating design with limited data. Among these, Bayesian optimization (BO) offers a principled framework for sample-efficient search, while generative models provide a mechanism to propose novel, diverse candidates beyond fixed libraries. However, existing methods that couple the two often rely on continuous latent spaces, which introduces both architectural entanglement and scalability challenges. This work introduces an alternative, modular "generate-then-optimize" framework for de novo multi-objective molecular design/discovery. At each iteration, a generative model is used to construct a large, diverse pool of candidate molecules, after which a novel acquisition function, qPMHI (multi-point Probability of Maximum Hypervolume Improvement), is used to optimally select a batch of candidates most likely to induce the largest Pareto front expansion. The key insight is that qPMHI decomposes additively, enabling exact, scalable batch selection via only simple ranking of probabilities that can be easily estimated with Monte Carlo sampling. We benchmark the framework against state-of-the-art latent-space and discrete molecular optimization methods, demonstrating significant improvements across synthetic benchmarks and application-driven tasks. Specifically, in a case study related to sustainable energy storage, we show that our approach quickly uncovers novel, diverse, and high-performing organic (quinone-based) cathode materials for aqueous redox flow battery applications.
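
The additive decomposition the abstract mentions can be sketched as follows: for each Monte Carlo draw of the candidates' objectives, record which candidate maximizes the hypervolume improvement, and rank candidates by their win frequency. The 2-D hypervolume routine and the toy fixed "posterior draws" below are illustrative, not the paper's formulation:

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Hypervolume dominated by a 2-D maximization front w.r.t. a reference point."""
    pts = sorted({tuple(p) for p in front if (p >= ref).all()}, reverse=True)
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:                      # sweep in descending x
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def qpmhi_select(samples, front, ref, q):
    """samples: (S, C, 2) Monte Carlo draws of C candidates' two objectives."""
    base = hypervolume_2d(front, ref)
    wins = np.zeros(samples.shape[1])
    for draw in samples:                  # one joint draw over all candidates
        hvi = [hypervolume_2d(np.vstack([front, c[None]]), ref) - base
               for c in draw]
        wins[int(np.argmax(hvi))] += 1    # credit the max-HVI candidate
    probs = wins / len(samples)           # P(candidate maximizes HV improvement)
    return np.argsort(-probs)[:q], probs  # batch = simple ranking by probability

front = np.array([[1.0, 1.0]])            # current Pareto front (maximization)
ref = np.zeros(2)
draws = np.tile(np.array([[[2.0, 2.0], [0.5, 0.5]]]), (10, 1, 1))
batch, probs = qpmhi_select(draws, front, ref, q=1)
assert list(batch) == [0] and probs[0] == 1.0
```

The key point mirrored here is that batch selection reduces to ranking estimated probabilities, with no combinatorial search over batches.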

Fonte: arXiv stat.ML

Vision • Score 95

Interpretable Similarity of Synthetic Image Utility

arXiv:2512.17080v1 Announce Type: new Abstract: Synthetic medical image data can unlock the potential of deep learning (DL)-based clinical decision support (CDS) systems through the creation of large scale, privacy-preserving, training sets. Despite the significant progress in this field, there is still a largely unanswered research question: "How can we quantitatively assess the similarity of a synthetically generated set of images with a set of real images in a given application domain?". Today, answers to this question are mainly provided via user evaluation studies, inception-based measures, and the classification performance achieved on synthetic images. This paper proposes a novel measure to assess the similarity between synthetically generated and real sets of images, in terms of their utility for the development of DL-based CDS systems. Inspired by generalized neural additive models, and unlike inception-based measures, the proposed measure is interpretable (Interpretable Utility Similarity, IUS), explaining why a synthetic dataset could be more useful than another one in the context of a CDS system based on clinically relevant image features. The experimental results on publicly available datasets from various color medical imaging modalities including endoscopic, dermoscopic and fundus imaging, indicate that selecting synthetic images of high utility similarity using IUS can result in relative improvements of up to 54.6% in terms of classification performance. The generality of IUS for synthetic data assessment is demonstrated also for greyscale X-ray and ultrasound imaging modalities. IUS implementation is available at https://github.com/innoisys/ius

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora

arXiv:2512.17756v1 Announce Type: new Abstract: Comprehension of ancient texts plays an important role in archaeology and in understanding Chinese history and civilization. The rapid development of large language models calls for benchmarks that can evaluate their comprehension of ancient characters. Existing Chinese benchmarks mostly target modern Chinese and transmitted documents in ancient Chinese, but excavated documents in ancient Chinese are not covered. To meet this need, we propose AncientBench, which aims to evaluate the comprehension of ancient characters, especially in the scenario of excavated documents. AncientBench is divided into four dimensions, corresponding to the four competencies of ancient character comprehension: glyph comprehension, pronunciation comprehension, meaning comprehension, and contextual comprehension. The benchmark also contains ten tasks, including radical, phonetic radical, homophone, cloze, translation, and more, providing a comprehensive framework for evaluation. We convened archaeological researchers to conduct experimental evaluations, proposed an ancient-Chinese model as a baseline, and conducted extensive experiments on the currently best-performing large language models. The experimental results reveal the great potential of large language models in ancient textual scenarios, as well as the gap with humans. Our research aims to promote the development and application of large language models in the field of archaeology and ancient Chinese language.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

V-Agent: An Interactive Video Search System Using Vision-Language Models

We present V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversation. By fine-tuning a vision-language model (VLM) on a small video-preference dataset and augmenting it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios.

Fonte: arXiv cs.AI

Vision • Score 96

BIONIX: A Low-Cost, Wireless Prosthetic Arm with Dual-Signal EEG and EMG Control

Affordable upper-limb prostheses often lack intuitive control systems, limiting functionality and accessibility for amputees in low-income settings. This project presents a low-cost neuromuscular control system that integrates electroencephalography (EEG) and electromyography (EMG) to enable real-time control of a prosthetic arm.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

When the Gold Standard isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content

arXiv:2512.17738v1 Announce Type: new Abstract: User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation particularly challenging: what counts as a "good" translation depends on the level of standardness desired in the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. Through a case study on large language models (LLMs), we show that translation scores are highly sensitive to prompts with explicit translation instructions for UGC, and that they improve when these align with the dataset's guidelines. We argue that when preserving UGC style is important, fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.

Fonte: arXiv cs.CL

Vision • Score 95

Endo-SemiS: Towards Robust Semi-Supervised Image Segmentation for Endoscopic Video

In this paper we present Endo-SemiS, a semi-supervised segmentation framework that delivers reliable segmentation of endoscopic video frames with limited annotation. Endo-SemiS combines four strategies to improve performance, effectively exploiting all available data, especially unlabeled data.

Fonte: arXiv cs.CV

Applications • Score 89

Dion2: A Simple Method for Reducing Matrices in Muon

The Muon optimizer shows strong empirical performance and theoretical grounding. However, the super-linear cost of its orthonormalization step introduces overhead that grows with scale. We present Dion2, a much simpler method than previous approaches for reducing the matrix involved in Muon's computation.

Fonte: arXiv cs.LG

RL • Score 95

Alternating Direction Method of Multipliers for Nonlinear Matrix Decompositions

We present an algorithm based on the alternating direction method of multipliers (ADMM) for solving nonlinear matrix decompositions (NMD). Given an input matrix $X \in \mathbb{R}^{m \times n}$ and a factorization rank $r \ll \min(m, n)$, NMD seeks matrices $W \in \mathbb{R}^{m \times r}$ and $H \in \mathbb{R}^{r \times n}$ such that $X \approx f(WH)$, where $f$ is an elementwise nonlinear function.
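
For intuition, a naive alternating scheme for the ReLU instance f(z) = max(z, 0) (not the paper's ADMM solver) alternates a closed-form latent update with two least-squares solves; each step cannot increase the residual:

```python
import numpy as np

# Hedged sketch: a latent matrix Z satisfying max(Z, 0) = X is updated in
# closed form, then W and H by least squares. The residual ||Z - WH||_F is
# monotonically non-increasing across iterations.

def relu_nmd(X, r, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.standard_normal((m, r))
    H = rng.standard_normal((r, n))
    residuals = []
    for _ in range(iters):
        WH = W @ H
        Z = np.where(X > 0, X, np.minimum(WH, 0.0))     # closest Z with max(Z,0)=X
        residuals.append(np.linalg.norm(Z - WH))
        W = np.linalg.lstsq(H.T, Z.T, rcond=None)[0].T  # argmin_W ||Z - WH||_F
        H = np.linalg.lstsq(W, Z, rcond=None)[0]        # argmin_H ||Z - WH||_F
    return W, H, residuals

rng = np.random.default_rng(3)
X = np.maximum(rng.standard_normal((20, 3)) @ rng.standard_normal((3, 15)), 0.0)
W, H, res = relu_nmd(X, r=3)
assert res[-1] <= res[0] + 1e-6      # monotonically non-increasing residual
```

An ADMM solver would instead split the problem with an auxiliary variable and dual updates, but the latent-variable reformulation above is the common starting point for avoiding intractable likelihoods.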

Fonte: arXiv stat.ML

RL • Score 96

The Role of Islamic Ethics in Preventing the Abuse of Artificial Intelligence (AI)-Based Deepfakes

The rapid advance of deepfake technology driven by artificial intelligence (AI) has raised global concerns about the spread of false information and the theft of online identities. This study proposes a comprehensive Islamic ethical framework for mitigating the risks of deepfake misuse, addressing ethical shortcomings and regulatory needs.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators

arXiv:2512.17267v1 Announce Type: new Abstract: Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too-slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward to equal effect as a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
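
The regression-based composition step can be sketched with ordinary least squares over metric scores against a handful of human ratings; this is our illustration, not the released toolkit:

```python
import numpy as np

# Illustrative sketch (assumed, not the AutoMetrics implementation): compose
# several automatic metric scores into one by fitting least-squares weights
# against a small set of human ratings, then score new outputs with the fit.

def fit_composite(metric_scores, human_ratings):
    """metric_scores: (n, m) matrix; returns weights plus an intercept."""
    A = np.hstack([metric_scores, np.ones((len(metric_scores), 1))])
    w, *_ = np.linalg.lstsq(A, human_ratings, rcond=None)
    return w

def composite_score(metric_scores, w):
    A = np.hstack([metric_scores, np.ones((len(metric_scores), 1))])
    return A @ w

rng = np.random.default_rng(2)
M = rng.uniform(size=(80, 3))                  # three automatic metrics
human = 2.0 * M[:, 0] - 0.5 * M[:, 2] + 1.0    # latent human preference (toy)
w = fit_composite(M, human)
pred = composite_score(M, w)
assert np.corrcoef(pred, human)[0, 1] > 0.99
```

With fewer than 100 feedback points, a low-dimensional linear fit like this is about as much model capacity as the data supports, which matches the paper's low-data setting.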

Fonte: arXiv cs.CL

RL • Score 96

Conservative Bias in Multi-Teacher Learning: Why Agents Prefer Low-Reward Advisors

Interactive reinforcement learning (IRL) has shown promise for enabling autonomous agents and robots to learn complex behaviors from human teachers, but the dynamics of teacher selection remain poorly understood. This paper reveals an unexpected phenomenon in IRL: learning agents prefer conservative, low-reward teachers over those offering rewards 20 times higher.

Fonte: arXiv cs.AI

Vision • Score 95

SynergyWarpNet: Attention-Guided Cooperative Warping for Neural Portrait Animation

arXiv:2512.17331v1 Announce Type: new Abstract: Recent advances in neural portrait animation have demonstrated remarkable potential for applications in virtual avatars, telepresence, and digital content creation. However, traditional explicit warping approaches often struggle with accurate motion transfer or recovering missing regions, while recent attention-based warping methods, though effective, frequently suffer from high complexity and weak geometric grounding. To address these issues, we propose SynergyWarpNet, an attention-guided cooperative warping framework designed for high-fidelity talking-head synthesis. Given a source portrait, a driving image, and a set of reference images, our model progressively refines the animation in three stages. First, an explicit warping module performs coarse spatial alignment between the source and driving image using 3D dense optical flow. Next, a reference-augmented correction module leverages cross-attention across 3D keypoints and texture features from multiple reference images to semantically complete occluded or distorted regions. Finally, a confidence-guided fusion module integrates the warped outputs with spatially adaptive fusion, using a learned confidence map to balance structural alignment and visual consistency. Comprehensive evaluations on benchmark datasets demonstrate state-of-the-art performance.

Fonte: arXiv cs.CV

Multimodal • Score 95

Peeking Into The Future For Contextual Biasing

arXiv:2512.17657v1 Announce Type: new Abstract: While end-to-end (E2E) automatic speech recognition (ASR) models excel at general transcription, they struggle to recognize rare or unseen named entities (e.g., contact names, locations), which are critical for downstream applications like virtual assistants. In this paper, we propose a contextual biasing method for attention-based encoder-decoder (AED) models using a list of candidate named entities. Instead of predicting only the next token, we simultaneously predict multiple future tokens, enabling the model to "peek into the future" and score potential candidate entities in the entity list. Moreover, our approach leverages the multi-token prediction logits directly without requiring additional entity encoders or cross-attention layers, significantly reducing architectural complexity. Experiments on Librispeech demonstrate that our approach achieves up to 50.34% relative improvement in named entity word error rate compared to the baseline AED model.
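
The "peek into the future" scoring can be sketched as summing each entity token's log-probability under the corresponding future-token head; the exact scoring rule here is our assumption, not the paper's:

```python
import numpy as np

# Hedged sketch (assumed, not the paper's exact formulation): with K
# future-token distributions predicted at the current step, a candidate
# entity of up to K tokens is scored by summing the log-probabilities of
# its tokens under the corresponding heads.

def score_entity(future_logits, entity_token_ids):
    """future_logits: (K, V) logits for the next K positions."""
    logprobs = future_logits - np.log(
        np.exp(future_logits).sum(axis=1, keepdims=True))   # per-head log-softmax
    return sum(logprobs[k, t] for k, t in enumerate(entity_token_ids))

rng = np.random.default_rng(4)
V, K = 50, 3
logits = rng.standard_normal((K, V))
logits[0, 7] += 5.0
logits[1, 12] += 5.0                  # boost the tokens of entity A
entity_a, entity_b = [7, 12], [3, 4]
assert score_entity(logits, entity_a) > score_entity(logits, entity_b)
```

No extra encoder or cross-attention is involved: the score is computed directly from logits the model already produces, which is the architectural simplification the abstract claims.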

Fonte: arXiv cs.CL

Applications • Score 90

Research on a Dead-Reckoning Algorithm for Self-Propelled Pipeline Robots in Complex Three-Dimensional Pipelines

In gas-pipeline localization, existing methods generally rely on dedicated localization instruments. This work presents a self-propelled pipeline robot that localizes complex pipelines without external dragging, using a method that integrates inertial navigation with wheel odometers and thereby overcomes the limitations of traditional mapping methods.

Fonte: arXiv cs.AI

Vision • Score 95

Stakeholder Suite: A Unified AI Framework for Mapping Actors, Topics and Arguments in Public Debates

arXiv:2512.17347v1 Announce Type: new Abstract: Public debates surrounding infrastructure and energy projects involve complex networks of stakeholders, arguments, and evolving narratives. Understanding these dynamics is crucial for anticipating controversies and informing engagement strategies, yet existing tools in media intelligence largely rely on descriptive analytics with limited transparency. This paper presents Stakeholder Suite, a framework deployed in operational contexts for mapping actors, topics, and arguments within public debates. The system combines actor detection, topic modeling, argument extraction and stance classification in a unified pipeline. Tested on multiple energy infrastructure projects as a case study, the approach delivers fine-grained, source-grounded insights while remaining adaptable to diverse domains. The framework achieves strong retrieval precision and stance accuracy, producing arguments judged relevant in 75% of pilot use cases. Beyond quantitative metrics, the tool has proven effective for operational use: helping project teams visualize networks of influence, identify emerging controversies, and support evidence-based decision-making.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

PILAR: Personalizing Augmented-Reality Interactions with Human-Centric, Trustworthy LLM-Based Explanations for Everyday Use Cases

Augmented-reality (AR) systems driven by artificial intelligence (AI) are increasingly woven into everyday life, raising the need for real-time explanations. PILAR is a novel framework that uses a pre-trained large language model (LLM) to generate personalized, contextual explanations, improving the user experience in AI-based AR systems.

Fonte: arXiv cs.AI

RL • Score 95

Affect, Body, Cognition, Demographics, and Emotion: The ABCDE of Text Features for Computational Affective Science

arXiv:2512.17752v1 Announce Type: new Abstract: Work in Computational Affective Science and Computational Social Science explores a wide variety of research questions about people, emotions, behavior, and health. Such work often relies on language data that is first labeled with relevant information, such as the use of emotion words or the age of the speaker. Although many resources and algorithms exist to enable this type of labeling, discovering, accessing, and using them remains a substantial impediment, particularly for practitioners outside of computer science. Here, we present the ABCDE dataset (Affect, Body, Cognition, Demographics, and Emotion), a large-scale collection of over 400 million text utterances drawn from social media, blogs, books, and AI-generated sources. The dataset is annotated with a wide range of features relevant to computational affective and social science. ABCDE facilitates interdisciplinary research across numerous fields, including affective science, cognitive science, the digital humanities, sociology, political science, and computational linguistics.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Globally Optimal Solution to the Generalized Relative Pose Estimation Problem using Affine Correspondences

arXiv:2512.17188v1 Announce Type: new Abstract: Platforms equipped with a multi-camera system and an inertial measurement unit (IMU), such as self-driving cars, are widely used nowadays. The task of relative pose estimation using visual and inertial information has important applications in various fields. To improve the accuracy of relative pose estimation for multi-camera systems, we propose a globally optimal solver that uses affine correspondences to estimate the generalized relative pose with a known vertical direction. First, a cost function in the relative rotation angle is established after decoupling the rotation matrix and translation vector; it minimizes the algebraic error of the geometric constraints derived from affine correspondences. Then, the global optimization problem is converted into two polynomials in two unknowns based on the characteristic equation and the condition that its first derivative is zero. Finally, the relative rotation angle can be solved using a polynomial eigenvalue solver, and the translation vector can be obtained from the eigenvector. In addition, a new linear solution is proposed for the case where the relative rotation is small. The proposed solver is evaluated on synthetic data and real-world datasets. The experimental results demonstrate that our method outperforms comparable state-of-the-art methods in accuracy.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

LLM-as-a-qualitative-judge: automating error analysis in natural language generation

arXiv:2506.09147v4 Announce Type: replace Abstract: Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but it is primarily used as a quantitative tool, i.e. with numerical scores as the main output. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach whose main output is a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights into what improvements can be made to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that instance-specific issues output by LLM-as-a-qualitative-judge match those annotated by humans in 2/3 of cases, and that LLM-as-a-qualitative-judge is capable of producing error type reports resembling those composed by human annotators. We also demonstrate in a case study how the use of LLM-as-a-qualitative-judge can substantially improve NLG system performance. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.
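The clustering step can be sketched as a greedy cumulative pass; here plain string similarity stands in for the LLM-based matching the paper uses, and the threshold is an arbitrary assumption:

```python
from difflib import SequenceMatcher

def cluster_issues(issues, threshold=0.6):
    """Greedy cumulative clustering: each issue joins the first
    existing cluster whose representative is similar enough,
    otherwise it starts a new cluster. (In the paper this matching
    is done by an LLM; string similarity is a stand-in here.)"""
    clusters = []  # list of (representative, members)
    for issue in issues:
        for rep, members in clusters:
            if SequenceMatcher(None, issue.lower(), rep.lower()).ratio() >= threshold:
                members.append(issue)
                break
        else:
            clusters.append((issue, [issue]))
    return clusters

report = cluster_issues([
    "output repeats the input question",
    "output repeats the input sentence",
    "answer is factually wrong",
])
```

The cumulative structure means new issues are folded into the report as they are discovered, which is what lets the final output read as a ranked list of common error types.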

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Understanding Generalization in Role-Playing Models via Information Theory

arXiv:2512.17270v1 Announce Type: new Abstract: Role-playing models (RPMs) are widely used in real-world applications but underperform when deployed in the wild. This degradation can be attributed to distribution shifts, including user, character, and dialogue compositional shifts. Existing methods like LLM-as-a-judge fall short in providing a fine-grained diagnosis of how these shifts affect RPM generalization, and thus formal frameworks to characterize RPM generalization behavior are lacking. To bridge these gaps, we introduce an information-theoretic metric, named reasoning-based effective mutual information difference (R-EMID), to measure RPM performance degradation in an interpretable way. We also derive an upper bound on R-EMID to predict the worst-case generalization performance of RPMs and theoretically reveal how various shifts contribute to the RPM performance degradation. Moreover, we propose a co-evolving reinforcement learning framework to adaptively model the connection among user, character, and dialogue context and thus enhance the estimation of dialogue response generation probability, which is critical for calculating R-EMID. Finally, we evaluate the generalization performance of various RPMs using R-EMID, finding that user shift poses the highest risk among all shifts and reinforcement learning is the most effective approach for enhancing RPM generalization.
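R-EMID is built from mutual-information quantities estimated from dialogue data. As a simplified illustration of the underlying ingredient (not the paper's estimator), a plug-in estimate of discrete mutual information from observed pairs looks like:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in nats from observed (x, y)
    pairs. A metric like R-EMID compares such mutual-information
    terms between in-distribution and shifted data (simplified)."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), with the n's folded in
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi

# Perfectly dependent binary variables: I(X;Y) = H(X) = log 2
mi = mutual_information([(0, 0), (1, 1)] * 50)
```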

Fonte: arXiv cs.LG

RL • Score 96

Learning to Plan, Planning to Learn: Adaptive Hierarchical RL-MPC for Sample-Efficient Decision Making

arXiv:2512.17091v1 Announce Type: cross Abstract: We propose a new approach for solving planning problems with a hierarchical structure, fusing reinforcement learning and MPC planning. Our formulation tightly and elegantly couples the two planning paradigms. It leverages reinforcement learning actions to inform the MPPI sampler, and adaptively aggregates MPPI samples to inform the value estimation. The resulting adaptive process leverages further MPPI exploration where value estimates are uncertain, and improves training robustness and the overall resulting policies. This results in a robust planning approach that can handle complex planning problems and easily adapts to different applications, as demonstrated over several domains, including race driving, modified Acrobot, and Lunar Lander with added obstacles. Our results in these domains show better data efficiency and overall performance in terms of both rewards and task success, with up to a 72% increase in success rate compared to existing approaches, as well as accelerated convergence (2.1x) compared to non-adaptive sampling.
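For context, the standard (non-adaptive) MPPI step that the paper builds on weights sampled control perturbations by exponentiated rollout cost; a minimal numpy sketch, with the RL coupling and adaptive aggregation omitted:

```python
import numpy as np

def mppi_update(nominal_u, costs, noises, lam=1.0):
    """One standard MPPI step: exponentially weight sampled control
    perturbations by rollout cost and average them into the nominal
    control sequence. In the paper's scheme the RL policy would
    inform the nominal sequence and the sampling distribution."""
    beta = costs.min()                         # shift for numerical stability
    w = np.exp(-(costs - beta) / lam)
    w /= w.sum()                               # softmin weights over rollouts
    return nominal_u + np.einsum('k,kth->th', w, noises)

rng = np.random.default_rng(0)
K, T, H = 64, 10, 2                            # rollouts, horizon, control dim
noises = rng.normal(size=(K, T, H))
costs = np.square(noises).sum(axis=(1, 2))     # toy cost: prefer small controls
u = mppi_update(np.zeros((T, H)), costs, noises)
```

The temperature `lam` trades off averaging many rollouts against committing to the cheapest one; the adaptive variant in the paper effectively modulates this exploration where value estimates are uncertain.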

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

DAVE: A VLM Vision Encoder for Document Understanding and Web Agents

arXiv:2512.17221v1 Announce Type: new Abstract: While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.
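Model merging in weight space can be as simple as a convex combination of parameter tensors; the sketch below is a naive average under that assumption, not necessarily the paper's novel merging scheme:

```python
import numpy as np

def merge_encoders(state_dicts, weights=None):
    """Naive weight-space merge of several encoder checkpoints
    (e.g. encoders trained with different text decoders). Keys and
    shapes are assumed to match across checkpoints."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged

# Illustrative two-checkpoint merge of a single parameter tensor
a = {"layer.w": np.ones((2, 2))}
b = {"layer.w": 3 * np.ones((2, 2))}
m = merge_encoders([a, b])
```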

Fonte: arXiv cs.CV

Vision • Score 95

WDFFU-Mamba: A Wavelet-guided Dual-attention Feature Fusion Mamba for Breast Tumor Segmentation in Ultrasound Images

arXiv:2512.17278v1 Announce Type: new Abstract: Breast ultrasound (BUS) image segmentation plays a vital role in assisting clinical diagnosis and early tumor screening. However, challenges such as speckle noise, imaging artifacts, irregular lesion morphology, and blurred boundaries severely hinder accurate segmentation. To address these challenges, this work aims to design a robust and efficient model capable of automatically segmenting breast tumors in BUS images. We propose a novel segmentation network named WDFFU-Mamba, which integrates wavelet-guided enhancement and dual-attention feature fusion within a U-shaped Mamba architecture. A Wavelet-denoised High-Frequency-guided Feature (WHF) module is employed to enhance low-level representations through noise-suppressed high-frequency cues. A Dual Attention Feature Fusion (DAFF) module is also introduced to effectively merge skip-connected and semantic features, improving contextual consistency. Extensive experiments on two public BUS datasets demonstrate that WDFFU-Mamba achieves superior segmentation accuracy, significantly outperforming existing methods in terms of Dice coefficient and 95th percentile Hausdorff Distance (HD95). The combination of wavelet-domain enhancement and attention-based fusion greatly improves both the accuracy and robustness of BUS image segmentation, while maintaining computational efficiency. The proposed WDFFU-Mamba model not only delivers strong segmentation performance but also exhibits desirable generalization ability across datasets, making it a promising solution for real-world clinical applications in breast tumor ultrasound analysis.
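The wavelet-guided idea rests on splitting a signal into low- and high-frequency sub-bands; a one-level 1-D Haar split (the 2-D, denoised version used by the WHF module is analogous) looks like:

```python
import numpy as np

def haar_split(signal):
    """One-level Haar wavelet split of an even-length 1-D signal
    into a low-frequency (approximation) half and a high-frequency
    (detail) half. Edge cues live in the detail coefficients."""
    x = np.asarray(signal, dtype=float)
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)  # pairwise averages
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)  # pairwise differences
    return lo, hi

# A step edge shows up only in the high-frequency band
lo, hi = haar_split([1, 1, 4, 2, 8, 8, 3, 1])
```

Denoising the `hi` band before using it as guidance is what suppresses speckle while keeping boundary cues, which is the intent of the WHF module.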

Fonte: arXiv cs.CV

Applications • Score 90

How to Square Tensor Networks and Circuits Without Squaring Them

arXiv:2512.17090v1 Announce Type: cross Abstract: Squared tensor networks (TNs) and their extension as computational graphs--squared circuits--have been used as expressive distribution estimators, while still supporting closed-form marginalization. However, the squaring operation introduces additional complexity when computing the partition function or marginalizing variables, which hinders their applicability in ML. To solve this issue, canonical forms of TNs are parameterized via unitary matrices to simplify the computation of marginals. However, these canonical forms do not apply to circuits, as they can represent factorizations that do not directly map to a known TN. Inspired by the ideas of orthogonality in canonical forms and determinism in circuits enabling tractable maximization, we show how to parameterize squared circuits to overcome their marginalization overhead. Our parameterizations unlock efficient marginalization even in factorizations different from TNs, but encoded as circuits, whose structure would otherwise make marginalization computationally hard. Finally, our experiments on distribution estimation show how our proposed conditions in squared circuits come with no expressiveness loss, while enabling more efficient learning.
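The tractability at stake can be seen in a squared matrix-product state: its partition function is computable site by site with transfer matrices, avoiding the exponential sum over states. A minimal numpy sketch with open boundaries and illustrative shapes (the paper's circuits generalize beyond this TN case):

```python
import numpy as np

def squared_mps_partition(cores):
    """Partition function Z = sum_x |psi(x)|^2 of a squared MPS,
    contracted site by site via transfer matrices instead of
    enumerating all d^n states and squaring each amplitude."""
    E = np.ones((1, 1))
    for A in cores:                    # each core A: (D_left, d, D_right)
        Dl, _, Dr = A.shape
        # Transfer matrix T[(a,c),(b,k)] = sum_d A[a,d,b] * A[c,d,k]
        T = np.einsum('adb,cdk->acbk', A, A).reshape(Dl * Dl, Dr * Dr)
        E = E @ T
    return float(E[0, 0])

rng = np.random.default_rng(1)
cores = [rng.normal(size=(1, 2, 2)),
         rng.normal(size=(2, 2, 2)),
         rng.normal(size=(2, 2, 1))]
Z = squared_mps_partition(cores)
```

The cost is polynomial in the bond dimension; the overhead the paper targets is that the squared contraction works with D^2-sized transfer objects rather than D-sized ones.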

Fonte: arXiv cs.AI

Evaluation/Benchmarks • Score 93

meval: A Statistical Toolbox for Fine-Grained Model Performance Analysis

arXiv:2512.17409v1 Announce Type: new Abstract: Analyzing machine learning model performance stratified by patient and recording properties is becoming the accepted norm and often yields crucial insights about important model failure modes. Performing such analyses in a statistically rigorous manner is non-trivial, however. Appropriate performance metrics must be selected that allow for valid comparisons between groups of different sample sizes and base rates; metric uncertainty must be determined and multiple comparisons be corrected for, in order to assess whether any observed differences may be purely due to chance; and in the case of intersectional analyses, mechanisms must be implemented to find the most `interesting' subgroups within combinatorially many subgroup combinations. We here present a statistical toolbox that addresses these challenges and enables practitioners to easily yet rigorously assess their models for potential subgroup performance disparities. While broadly applicable, the toolbox is specifically designed for medical imaging applications. The analyses provided by the toolbox are illustrated in two case studies, one in skin lesion malignancy classification on the ISIC2020 dataset and one in chest X-ray-based disease classification on the MIMIC-CXR dataset.
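Two of the ingredients such a toolbox must combine, uncertainty intervals that stay valid for small subgroups and correction for many comparisons, can be sketched as follows; these are standard textbook procedures, not necessarily meval's exact implementation:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; behaves sensibly
    even for small subgroup sizes and extreme base rates."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def holm_correction(pvals, alpha=0.05):
    """Holm-Bonferroni step-down: flags which subgroup comparisons
    remain significant after multiple-comparison correction."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    reject = [False] * len(pvals)
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (len(pvals) - rank):
            reject[i] = True
        else:
            break                      # step-down: stop at first failure
    return reject

lo, hi = wilson_interval(45, 50)       # e.g. subgroup accuracy 45/50
flags = holm_correction([0.001, 0.03, 0.04])
```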

Fonte: arXiv cs.LG

RL • Score 96

Privacy-Preserving Synthetic Dataset of Individual Daily Trajectories for City-Scale Mobility Analytics

Urban mobility data are essential for urban planning, transport demand forecasting, and pandemic modeling. This study presents a privacy-preserving synthetic dataset that reconstructs daily trajectories from aggregated inputs, without requiring personal identifiers.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Deep Learning-Based Surrogate Creep Modelling in Inconel 625: A High-Temperature Alloy Study

arXiv:2512.17477v1 Announce Type: new Abstract: Time-dependent deformation, particularly creep, in high-temperature alloys such as Inconel 625 is a key factor in the long-term reliability of components used in aerospace and energy systems. Although Inconel 625 shows excellent creep resistance, finite-element creep simulations in tools such as ANSYS remain computationally expensive, often requiring tens of minutes for a single 10,000-hour run. This work proposes deep learning based surrogate models to provide fast and accurate replacements for such simulations. Creep strain data was generated in ANSYS using the Norton law under uniaxial stresses of 50 to 150 MPa and temperatures of 700 to 1000 $^\circ$C, and this temporal dataset was used to train two architectures: a BiLSTM Variational Autoencoder for uncertainty-aware and generative predictions, and a BiLSTM Transformer hybrid that employs self-attention to capture long-range temporal behavior. Both models act as surrogate predictors, with the BiLSTM-VAE offering probabilistic output and the BiLSTM-Transformer delivering high deterministic accuracy. Performance is evaluated using RMSE, MAE, and $R^2$. Results show that the BiLSTM-VAE provides stable and reliable creep strain forecasts, while the BiLSTM-Transformer achieves strong accuracy across the full time range. Latency tests indicate substantial speedup: while each ANSYS simulation requires 30 to 40 minutes for a given stress-temperature condition, the surrogate models produce predictions within seconds. The proposed framework enables rapid creep assessment for design optimization and structural health monitoring, and provides a scalable solution for high-temperature alloy applications.
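The training data follows the Norton (power-law) creep model, in which the steady-state strain rate is A * sigma^n * exp(-Q/(R*T)). A sketch of generating such curves; the constants here are illustrative placeholders, not fitted Inconel 625 parameters:

```python
import math

def norton_creep_strain(stress_mpa, temp_c, hours,
                        A=1e-20, n=5.0, Q=280e3, R=8.314):
    """Secondary-stage creep strain under the Norton law
    eps_dot = A * sigma^n * exp(-Q / (R*T)), integrated over a
    constant-condition hold. Material constants are illustrative."""
    T = temp_c + 273.15                       # absolute temperature [K]
    rate = A * stress_mpa ** n * math.exp(-Q / (R * T))
    return rate * hours                       # constant rate -> linear strain

# One point of a (stress, temperature, time) -> strain training sample
eps = norton_creep_strain(100.0, 900.0, 10_000.0)
```

Sweeping stress over 50-150 MPa and temperature over 700-1000 degrees C, as in the paper's data generation, yields the temporal dataset the surrogate models are trained on.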

Fonte: arXiv cs.LG

RL • Score 96

Navigating Knowledge-Base-Driven Taxonomic Expansions of Entity Sets

Recognizing similarities between entities is central to both human cognition and computational intelligence. Entity Set Expansion is the task of identifying additional entities that share relevant semantic properties. A new logic-based framework introduces the concept of an expansion graph, which supports taxonomic expansions of entity sets, although the size of these graphs can make full materialization impractical.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Predictive Modeling of Maritime Radar Data Using Transformer Architecture

arXiv:2512.17098v1 Announce Type: new Abstract: Maritime autonomous systems require robust predictive capabilities to anticipate vessel motion and environmental dynamics. While transformer architectures have revolutionized AIS-based trajectory prediction and demonstrated feasibility for sonar frame forecasting, their application to maritime radar frame prediction remains unexplored, creating a critical gap given radar's all-weather reliability for navigation. This survey systematically reviews predictive modeling approaches relevant to maritime radar, with emphasis on transformer architectures for spatiotemporal sequence forecasting, where existing representative methods are analyzed according to data type, architecture, and prediction horizon. Our review shows that, while the literature has demonstrated transformer-based frame prediction for sonar sensing, no prior work addresses transformer-based maritime radar frame prediction, thereby defining a clear research gap and motivating a concrete research direction for future work in this area.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

Atom: Efficient On-Device Video-Language Pipelines Through Modular Reuse

arXiv:2512.17108v1 Announce Type: new Abstract: Recent advances in video-language models have enabled powerful applications like video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging due to redundant model loads and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and reuses them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency without sacrificing performance. On commodity smartphones, Atom achieves 27--33% faster execution compared to non-reuse baselines, with only marginal performance drop ($\leq$ 2.3 Recall@1 in retrieval, $\leq$ 1.5 CIDEr in captioning). These results position Atom as a practical, scalable approach for efficient video-language understanding on edge devices.
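The reuse-centric design amounts to loading each shared component once and handing the same instance to every subtask, instead of reloading the full model per pipeline stage. A minimal sketch with illustrative names (the loaders here stand in for real weight loading):

```python
class ModuleRegistry:
    """Cache of decomposed model components (e.g. visual encoder,
    language decoder) shared across pipeline subtasks. Each loader
    runs at most once, however many subtasks request the module."""

    def __init__(self, loaders):
        self._loaders = loaders
        self._cache = {}
        self.load_count = 0            # how many real loads happened

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
            self.load_count += 1
        return self._cache[name]

registry = ModuleRegistry({
    "visual_encoder": lambda: "encoder-weights",    # placeholder loaders
    "language_decoder": lambda: "decoder-weights",
})
for task in ["caption", "retrieve", "reason"]:      # three subtasks, two loads
    enc = registry.get("visual_encoder")
    dec = registry.get("language_decoder")
```

With per-stage loading, the loop above would pay for six loads; the registry pays for two, which is the kind of redundancy the latency numbers reflect.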

Fonte: arXiv cs.LG

Applications • Score 90

M2RU: Memristive Minion Recurrent Unit for Continual Learning at the Edge

arXiv:2512.17299v1 Announce Type: new Abstract: Continual learning on edge platforms remains challenging because recurrent networks depend on energy-intensive training procedures and frequent data movement that are impractical for embedded deployments. This work introduces M2RU, a mixed-signal architecture that implements the minion recurrent unit for efficient temporal processing with on-chip continual learning. The architecture integrates weighted-bit streaming, which enables multi-bit digital inputs to be processed in crossbars without high-resolution conversion, and an experience replay mechanism that stabilizes learning under domain shifts. M2RU achieves 15 GOPS at 48.62 mW, corresponding to 312 GOPS per watt, and maintains accuracy within 5 percent of software baselines on sequential MNIST and CIFAR-10 tasks. Compared with a CMOS digital design, the accelerator provides a 29x improvement in energy efficiency. Device-aware analysis shows an expected operational lifetime of 12.2 years under continual learning workloads. These results establish M2RU as a scalable and energy-efficient platform for real-time adaptation in edge-level temporal intelligence.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

A Women's Health Benchmark for Large Language Models

As large language models (LLMs) become primary sources of health information for millions, their accuracy on women's health remains critically underexplored. We present the Women's Health Benchmark (WHB), the first benchmark that evaluates LLM performance specifically on women's health.

Fonte: arXiv cs.AI

RL • Score 96

DiffeoMorph: Learning to Morph 3D Shapes Using Differentiable Agent-Based Simulations

Biological systems can form complex three-dimensional structures through the collective behavior of identical agents. In this work, we present DiffeoMorph, a differentiable framework for learning a morphogenesis protocol that guides a population of agents to morph into a target 3D shape, using an attention-based SE(3)-equivariant graph neural network.

Fonte: arXiv cs.LG