MLOps/Systems • Score 85
Energy-Aware Routing for Large Reasoning Models
Large reasoning models (LRMs) incur heterogeneous inference energy costs, depending on which model is used and how much reasoning it performs. To reduce energy consumption, it is crucial to choose the right LRM and operate it efficiently. The performance of systems that distribute tasks across individual LRMs depends on the balance between average energy supply and stochastic fluctuations.
Source: arXiv cs.AI
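The trade-off between average energy draw and its fluctuations can be illustrated with a toy risk-adjusted router. This is a minimal sketch under assumed per-model statistics; the model names and energy figures are hypothetical, not from the paper.

```python
# Hypothetical sketch: route a reasoning request to the LRM with the lowest
# risk-adjusted energy estimate (mean draw plus a penalty on fluctuation).
# Model names and numbers are illustrative, not taken from the paper.

def route(models, risk_aversion=1.0):
    """Pick the model minimizing mean energy + risk_aversion * std deviation."""
    def score(stats):
        mean, std = stats
        return mean + risk_aversion * std
    return min(models, key=lambda name: score(models[name]))

# (mean joules per request, std deviation) -- illustrative figures only
MODELS = {
    "small-lrm":    (120.0, 15.0),
    "large-lrm":    (900.0, 40.0),
    "volatile-lrm": (100.0, 200.0),  # cheap on average but highly variable
}

print(route(MODELS, risk_aversion=0.0))  # ignores fluctuations
print(route(MODELS, risk_aversion=2.0))  # penalizes variability
```

With no risk penalty the cheap-but-volatile model wins; once fluctuations are penalized, the steadier small model is preferred.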
NLP/LLMs • Score 85
Agentic AI for Autonomous, Explainable, and Real-Time Credit Risk Decision-Making
The significant digitalization of financial services has created urgent demand for credit risk decision-making systems that are autonomous, transparent, and real-time. This paper presents an agentic AI framework in which AI agents assess credit dynamically, minimizing human intervention and improving the speed and transparency of decisions.
Source: arXiv cs.AI
NLP/LLMs • Score 85
CaveAgent: Transforming LLMs into Stateful Runtime Operators
arXiv:2601.01569v1 Announce Type: new
Abstract: LLM-based agents are increasingly capable of complex task execution, yet current agentic systems remain constrained by text-centric paradigms. Traditional approaches rely on procedural JSON-based function calling, which often struggles with long-horizon tasks due to fragile multi-turn dependencies and context drift. In this paper, we present CaveAgent, a framework that shifts the paradigm from "LLM-as-Text-Generator" to "LLM-as-Runtime-Operator." We introduce a Dual-stream Context Architecture that decouples state management into a lightweight semantic stream for reasoning and a persistent, deterministic Python Runtime stream for execution. In addition to leveraging code generation to efficiently resolve interdependent sub-tasks (e.g., loops, conditionals) in a single step, we introduce Stateful Runtime Management in CaveAgent. Distinct from existing code-based approaches, which remain text-bound and lack support for external object injection and retrieval, CaveAgent injects, manipulates, and retrieves complex Python objects (e.g., DataFrames, database connections) that persist across turns. This persistence mechanism acts as a high-fidelity external memory that eliminates context drift and avoids catastrophic forgetting, while ensuring that processed data flows losslessly to downstream applications. Comprehensive evaluations on Tau²-bench, BFCL, and various case studies across representative SOTA LLMs demonstrate CaveAgent's superiority. Specifically, our framework achieves a 10.5% success-rate improvement on retail tasks and reduces total token consumption by 28.4% in multi-turn scenarios. On data-intensive tasks, direct variable storage and retrieval reduces token consumption by 59%, allowing CaveAgent to handle large-scale data that causes context-overflow failures in both JSON-based and code-based agents.
Source: arXiv cs.AI
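The "runtime operator" idea — heavyweight objects persisting across turns while the LLM-facing context carries only handles — can be sketched in a few lines. This is an illustrative toy, not the CaveAgent implementation; the class and method names are assumptions.

```python
# Illustrative sketch (not the paper's code): a stateful runtime that keeps
# live Python objects across agent turns, while the LLM-facing context only
# needs to mention lightweight string handles.

class StatefulRuntime:
    def __init__(self):
        self._objects = {}          # persistent store: handle -> live object

    def inject(self, handle, obj):
        """Store an object under a handle; it survives across turns."""
        self._objects[handle] = obj
        return handle

    def run(self, code, result_handle=None):
        """Execute model-generated code with the persistent store in scope."""
        env = dict(self._objects)
        exec(code, {}, env)
        self._objects.update(env)   # persist anything the code (re)bound
        if result_handle is not None:
            return self._objects[result_handle]

rt = StatefulRuntime()
rt.inject("orders", [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 5}])
# Turn 1: model-written code operates on the injected object
rt.run("total = sum(o['qty'] for o in orders)")
# Turn 2: later code reuses 'total' without re-serializing anything
print(rt.run("doubled = total * 2", result_handle="doubled"))  # prints 14
```

Because `total` lives in the runtime rather than the prompt, no turn needs to echo intermediate data back through the context window.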
NLP/LLMs • Score 85
Temporal Attack Pattern Detection in Multi-Agent AI Workflows: An Open Framework for Training Trace-Based Security Models
arXiv:2601.00848v1 Announce Type: new
Abstract: We present an openly documented methodology for fine-tuning language models to detect temporal attack patterns in multi-agent AI workflows using OpenTelemetry trace analysis. We curate a dataset of 80,851 examples from 18 public cybersecurity sources and 35,026 synthetic OpenTelemetry traces. We apply iterative QLoRA fine-tuning on resource-constrained ARM64 hardware (NVIDIA DGX Spark) through three training iterations with strategic augmentation. Our custom benchmark accuracy improves from 42.86% to 74.29%, a statistically significant 31.4-point gain. Targeted examples addressing specific knowledge gaps outperform indiscriminate scaling. Key contributions include: (1) synthetic trace generation methodology for multi-agent coordination attacks and regulatory violations, (2) empirical evidence that training data composition fundamentally determines behavior, and (3) complete open release of datasets, training scripts, and evaluation benchmarks on HuggingFace. While practical deployment requires human oversight due to false positive rates, this work establishes the first reproducible framework enabling practitioners to build custom agentic security models adapted to their threat landscapes.
Source: arXiv cs.AI
MLOps/Systems • Score 85
A New Benchmark for the Appropriate Evaluation of RTL Code Optimization
arXiv:2601.01765v1 Announce Type: new
Abstract: The rapid progress of artificial intelligence increasingly relies on efficient integrated circuit (IC) design. Recent studies have explored the use of large language models (LLMs) for generating Register Transfer Level (RTL) code, but existing benchmarks mainly evaluate syntactic correctness rather than optimization quality in terms of power, performance, and area (PPA). This work introduces RTL-OPT, a benchmark for assessing the capability of LLMs in RTL optimization. RTL-OPT contains 36 handcrafted digital designs that cover diverse implementation categories including combinational logic, pipelined datapaths, finite state machines, and memory interfaces. Each task provides a pair of RTL codes, a suboptimal version and a human-optimized reference that reflects industry-proven optimization patterns not captured by conventional synthesis tools. Furthermore, RTL-OPT integrates an automated evaluation framework to verify functional correctness and quantify PPA improvements, enabling standardized and meaningful assessment of generative models for hardware design optimization.
Source: arXiv cs.AI
NLP/LLMs • Score 85
Jenius Agent: Towards Experience-Driven Accuracy Optimization in Real-World Scenarios
arXiv:2601.01857v1 Announce Type: new
Abstract: As agent systems powered by large language models (LLMs) advance, improving the task performance of an autonomous agent, especially in context understanding, tool usage, and response generation, has become increasingly critical. Although prior studies have advanced the overall design of LLM-based agents, systematic optimization of their internal reasoning and tool-use pipelines remains underexplored. This paper introduces an agent framework grounded in real-world practical experience, with three key innovations: (1) an adaptive prompt generation strategy that aligns with the agent's state and task goals to improve reliability and robustness; (2) a context-aware tool orchestration module that performs tool categorization, semantic retrieval, and adaptive invocation based on user intent and context; and (3) a layered memory mechanism that integrates session memory, task history, and external summaries to improve relevance and efficiency through dynamic summarization and compression. An end-to-end framework named Jenius-Agent integrates these three key optimizations, including tools based on the Model Context Protocol (MCP), file input/output (I/O), and execution feedback. The experiments show a 20 percent improvement in task accuracy, along with reductions in token cost, response latency, and invocation failures. The framework is already deployed in Jenius (https://www.jenius.cn), providing a lightweight and scalable solution for robust, protocol-compatible autonomous agents.
Source: arXiv cs.AI
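The layered memory idea — verbatim recent turns plus compressed older history — can be sketched simply. This is an assumed structure for illustration only; the Jenius-Agent implementation and its summarization rule are not public in this abstract.

```python
# Illustrative sketch of a layered memory: recent turns are kept verbatim up
# to a budget, then compressed into a running summary. The structure and the
# crude compression rule are assumptions, not the Jenius-Agent implementation.

class LayeredMemory:
    def __init__(self, session_budget=3):
        self.session = []        # recent turns, verbatim
        self.summary = []        # compressed older turns
        self.budget = session_budget

    def add_turn(self, text):
        self.session.append(text)
        while len(self.session) > self.budget:
            oldest = self.session.pop(0)
            # stand-in "summarization": keep only the first few words
            self.summary.append(" ".join(oldest.split()[:4]) + " ...")

    def context(self):
        """Compressed history first, then the verbatim recent window."""
        return self.summary + self.session

m = LayeredMemory(session_budget=2)
for t in ["user asks for a refund on order 42",
          "agent looks up the order status",
          "user confirms the shipping address"]:
    m.add_turn(t)
print(m.context())
```

In a real agent, the stand-in truncation would be replaced by an LLM-generated summary, but the layering logic is the same.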
NLP/LLMs • Score 85
Digital Twin AI: Opportunities and Challenges from Large Language Models to World Models
arXiv:2601.01321v1 Announce Type: new
Abstract: Digital twins, as precise digital representations of physical systems, have evolved from passive simulation tools into intelligent and autonomous entities through the integration of artificial intelligence technologies. By synthesizing existing technologies and practices, this paper distills a unified four-stage framework that systematically characterizes how AI methodologies are embedded across the digital twin lifecycle: (1) modeling the physical twin through physics-based and physics-informed AI approaches, (2) mirroring the physical system into a digital twin with real-time synchronization, (3) intervening in the physical twin through predictive modeling, anomaly detection, and optimization strategies, and (4) achieving autonomous management through large language models, foundation models, and intelligent agents. We analyze the synergy between physics-based modeling and data-driven learning, highlighting the shift from traditional numerical solvers to physics-informed and foundation models for physical systems. Furthermore, we examine how generative AI technologies, including large language models and generative world models, transform digital twins into proactive and self-improving cognitive systems capable of reasoning, communication, and creative scenario generation. Through a cross-domain review spanning eleven application domains, including healthcare, aerospace, smart manufacturing, robotics, and smart cities, we identify common challenges related to scalability, explainability, and trustworthiness, and outline directions for responsible AI-driven digital twin systems.
Source: arXiv cs.AI
NLP/LLMs • Score 85
MathLedger: A Verifiable Learning Substrate with Ledger-Attested Feedback
Contemporary AI systems achieve extraordinary performance yet remain opaque and unverifiable, creating a trust crisis for safety-critical deployments. We present MathLedger, a substrate for verifiable machine cognition that integrates formal verification, cryptographic attestation, and learning dynamics into a single epistemic loop.
Source: arXiv cs.AI
RL • Score 96
Cloud-Native Generative AI for Automated Planogram Synthesis: A Diffusion Model Approach for Multi-Store Retail Optimization
Planogram creation is a significant challenge in retail, requiring on average 30 hours per complex layout. This paper presents a cloud-native architecture that uses diffusion models to automatically generate store-specific planograms by learning from successful shelf arrangements across multiple retail locations.
Source: arXiv cs.LG
Vision • Score 95
Robust Assembly Progress Estimation via Deep Metric Learning
arXiv:2601.00422v1 Announce Type: new
Abstract: In recent years, the advancement of AI technologies has accelerated the development of smart factories. In particular, the automatic monitoring of product assembly progress is crucial for improving operational efficiency, minimizing the cost of discarded parts, and maximizing factory productivity. However, in cases where assembly tasks are performed manually over multiple days, implementing smart factory systems remains a challenge. Previous work has proposed Anomaly Triplet-Net, which estimates assembly progress by applying deep metric learning to the visual features of products. Nevertheless, when visual changes between consecutive tasks are subtle, misclassification often occurs. To address this issue, this paper proposes a robust system for estimating assembly progress, even in cases of occlusion or minimal visual change, using a small-scale dataset. Our method leverages a Quadruplet Loss-based learning approach for anomaly images and introduces a custom data loader that strategically selects training samples to enhance estimation accuracy. We evaluated our approach using an image dataset captured during desktop PC assembly. The proposed Anomaly Quadruplet-Net outperformed existing methods on this dataset. Specifically, it improved estimation accuracy by 1.3% and reduced misclassification between adjacent tasks by 1.9% on the desktop PC dataset, demonstrating the effectiveness of the proposed method.
Source: arXiv cs.CV
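A quadruplet loss extends the familiar triplet loss with a second margin term over a pair of negatives. The sketch below is one common formulation on plain embedding vectors; the margins, distance function, and exact term structure are assumptions, not necessarily those of Anomaly Quadruplet-Net.

```python
# A minimal sketch of a quadruplet loss on embedding vectors, in the style of
# metric-learning progress estimators such as Anomaly Quadruplet-Net. The
# margins and the second (negative-pair) term are illustrative assumptions.
import math

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def quadruplet_loss(anchor, positive, negative1, negative2,
                    margin1=1.0, margin2=0.5):
    """Triplet term plus a term pushing the two negatives apart from each
    other relative to the anchor-positive distance."""
    d_ap = dist(anchor, positive)
    term1 = max(0.0, d_ap - dist(anchor, negative1) + margin1)
    term2 = max(0.0, d_ap - dist(negative1, negative2) + margin2)
    return term1 + term2

a, p = [0.0, 0.0], [0.1, 0.0]
n1, n2 = [3.0, 0.0], [0.0, 3.0]
print(quadruplet_loss(a, p, n1, n2))  # prints 0.0: both margins satisfied
```

When the negatives crowd the anchor (or each other), one or both hinge terms become positive, giving the extra gradient signal that helps separate visually similar adjacent assembly stages.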
Vision • Score 95
A Cascaded Information Interaction Network for Precise Image Segmentation
arXiv:2601.00562v1 Announce Type: new
Abstract: Visual perception plays a pivotal role in enabling autonomous behavior, offering a cost-effective and efficient alternative to complex multi-sensor systems. However, robust segmentation remains a challenge in complex scenarios. To address this, this paper proposes a cascaded convolutional neural network integrated with a novel Global Information Guidance Module. This module is designed to effectively fuse low-level texture details with high-level semantic features across multiple layers, thereby overcoming the inherent limitations of single-scale feature extraction. This architectural innovation significantly enhances segmentation accuracy, particularly in visually cluttered or blurred environments where traditional methods often fail. Experimental evaluations on benchmark image segmentation datasets demonstrate that the proposed framework achieves superior precision, outperforming existing state-of-the-art methods. The results highlight the effectiveness of the approach and its promising potential for deployment in practical robotic applications.
Source: arXiv cs.CV
Vision • Score 95
A Comprehensive Dataset for Human vs. AI Generated Image Detection
arXiv:2601.00553v1 Announce Type: new
Abstract: Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI-generated image detection consisting of 96,000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.
Source: arXiv cs.CV
NLP/LLMs • Score 95
Intelligent Traffic Surveillance for Real-Time Vehicle Detection, License Plate Recognition, and Speed Estimation
arXiv:2601.00344v1 Announce Type: new
Abstract: Speeding is a major contributor to road fatalities, particularly in developing countries such as Uganda, where road safety infrastructure is limited. This study proposes a real-time intelligent traffic surveillance system tailored to such regions, using computer vision techniques to address vehicle detection, license plate recognition, and speed estimation. The study collected a rich dataset using a speed gun, a Canon camera, and a mobile phone to train the models. License plate detection using YOLOv8 achieved a mean average precision (mAP) of 97.9%. For character recognition of the detected license plate, the CNN model achieved a character error rate (CER) of 3.85%, while the transformer model significantly reduced the CER to 1.79%. Speed estimation used source and target regions of interest, yielding good performance with a margin of error of 10 km/h. Additionally, a database was established to correlate user information with vehicle detection data, enabling automated ticket issuance via SMS through Africa's Talking API. This system addresses critical traffic management needs in resource-constrained environments and shows potential to reduce road accidents through automated traffic enforcement in developing countries where such interventions are urgently needed.
Source: arXiv cs.CV
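ROI-based speed estimation boils down to mapping image positions into a target region of known real-world size and dividing distance by elapsed time. The sketch below assumes a simple linear pixel-to-metre mapping; the paper's actual geometry (e.g., a perspective transform) may differ.

```python
# Hedged sketch of ROI-based speed estimation: a detection is mapped from the
# source region (pixels) into a target region of known real-world size, and
# speed is distance over elapsed time between two frames. The linear scaling
# below is a simplification of the usual perspective mapping.

def to_metres(px, py, img_w, img_h, road_w_m, road_len_m):
    """Linearly map a pixel in the source ROI to the target ROI in metres."""
    return px / img_w * road_w_m, py / img_h * road_len_m

def speed_kmh(p1, p2, t1, t2, img_w, img_h, road_w_m, road_len_m):
    x1, y1 = to_metres(*p1, img_w, img_h, road_w_m, road_len_m)
    x2, y2 = to_metres(*p2, img_w, img_h, road_w_m, road_len_m)
    metres = ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
    return metres / (t2 - t1) * 3.6   # m/s -> km/h

# vehicle moves 20 m along a 50 m road segment in 1.2 s => 60 km/h
print(speed_kmh((500, 100), (500, 500), 0.0, 1.2,
                img_w=1000, img_h=1000, road_w_m=7.0, road_len_m=50.0))
```

A 10 km/h margin of error, as reported, would then stem mostly from detection jitter and the accuracy of the pixel-to-road calibration.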
NLP/LLMs • Score 96
FaithSCAN: Single-Pass Model-Based Hallucination Detection for Faithful Visual Question Answering
Faithfulness hallucinations in VQA occur when vision-language models produce fluent but visually ungrounded answers, undermining their reliability in safety-critical applications. We propose FaithSCAN, a lightweight network that detects hallucinations by exploiting rich internal signals from VLMs, overcoming the efficiency and detection-performance limitations of existing methods.
Source: arXiv cs.AI
NLP/LLMs • Score 95
Application Research of a Deep Learning Model Integrating CycleGAN and YOLO in PCB Infrared Defect Detection
arXiv:2601.00237v1 Announce Type: new
Abstract: This paper addresses the critical bottleneck of infrared (IR) data scarcity in Printed Circuit Board (PCB) defect detection by proposing a cross-modal data augmentation framework integrating CycleGAN and YOLOv8. Unlike conventional methods relying on paired supervision, we leverage CycleGAN to perform unpaired image-to-image translation, mapping abundant visible-light PCB images into the infrared domain. This generative process synthesizes high-fidelity pseudo-IR samples that preserve the structural semantics of defects while accurately simulating thermal distribution patterns. Subsequently, we construct a heterogeneous training strategy that fuses generated pseudo-IR data with limited real IR samples to train a lightweight YOLOv8 detector. Experimental results demonstrate that this method effectively enhances feature learning under low-data conditions. The augmented detector significantly outperforms models trained on limited real data alone and approaches the performance benchmarks of fully supervised training, proving the efficacy of pseudo-IR synthesis as a robust augmentation strategy for industrial inspection.
Source: arXiv cs.CV
NLP/LLMs • Score 95
Focal-RegionFace: Generating Fine-Grained Multi-attribute Descriptions for Arbitrarily Selected Face Focal Regions
arXiv:2601.00156v1 Announce Type: new
Abstract: In this paper, we introduce an underexplored problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, containing facial action units (AUs), emotional states, and age estimation, for arbitrarily selected face regions (termed FaceFocalDesc). We argue that the system's ability to focus on individual facial areas leads to better understanding and control. To achieve this capability, we construct a new multi-attribute description dataset for arbitrarily selected face regions, providing rich region-level annotations and natural language descriptions. Further, we propose a fine-tuned vision-language model based on Qwen2.5-VL, called Focal-RegionFace, for facial state analysis, which incrementally refines its focus on localized facial features through multiple progressive fine-tuning stages, resulting in interpretable age estimation, FAU detection, and emotion detection. Experimental results show that Focal-RegionFace achieves the best performance on the new benchmark in terms of traditional and widely used metrics, as well as newly proposed metrics. This fully verifies its effectiveness and versatility in fine-grained multi-attribute face region-focal analysis scenarios.
Source: arXiv cs.CV
NLP/LLMs • Score 93
Mortar: Evolving Mechanics for Automatic Game Design
We present Mortar, a system that autonomously evolves game mechanics for automatic game design. Game mechanics define the rules and interactions that govern gameplay, and designing them by hand is a time-consuming process that demands expertise. Mortar combines a quality-diversity algorithm with a large language model to explore a diverse set of mechanics.
Source: arXiv cs.AI
NLP/LLMs • Score 95
Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark
arXiv:2601.00092v1 Announce Type: new
Abstract: 4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal Large Language Models (MLLMs) achieve human-level 4D spatial intelligence? In this work, we present Spatial4D-Bench, a versatile 4D spatial intelligence benchmark designed to comprehensively assess the 4D spatial reasoning abilities of MLLMs. Unlike existing spatial intelligence benchmarks that are often small-scale or limited in diversity, Spatial4D-Bench provides a large-scale, multi-task evaluation benchmark consisting of ~40,000 question-answer pairs covering 18 well-defined tasks. We systematically organize these tasks into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning and spatiotemporal reasoning. Spatial4D-Bench thereby offers a structured and comprehensive benchmark for evaluating the spatial cognition abilities of MLLMs, covering a broad spectrum of tasks that parallel the versatility of human spatial intelligence. We benchmark various state-of-the-art open-source and proprietary MLLMs on Spatial4D-Bench and reveal their substantial limitations in a wide variety of 4D spatial reasoning aspects, such as route plan, action recognition, and physical plausibility reasoning. We hope that the findings provided in this work offer valuable insights to the community and that our benchmark can facilitate the development of more capable MLLMs toward human-level 4D spatial intelligence. More resources can be found on our project page.
Source: arXiv cs.CV
Vision • Score 95
Context-Aware Pesticide Recommendation via Few-Shot Pest Recognition for Precision Agriculture
arXiv:2601.00243v1 Announce Type: new
Abstract: Effective pest management is crucial for enhancing agricultural productivity, especially for crops such as sugarcane and wheat that are highly vulnerable to pest infestations. Traditional pest management methods depend heavily on manual field inspections and the use of chemical pesticides. These approaches are often costly, time-consuming, labor-intensive, and can have a negative impact on the environment. To overcome these challenges, this study presents a lightweight framework for pest detection and pesticide recommendation, designed for low-resource devices such as smartphones and drones, making it suitable for use by small and marginal farmers.
The proposed framework includes two main components. The first is a Pest Detection Module that uses a compact, lightweight convolutional neural network (CNN) combined with prototypical meta-learning to accurately identify pests even when only a few training samples are available. The second is a Pesticide Recommendation Module that incorporates environmental factors like crop type and growth stage to suggest safe and eco-friendly pesticide recommendations. To train and evaluate our framework, a comprehensive pest image dataset was developed by combining multiple publicly available datasets. The final dataset contains samples with different viewing angles, pest sizes, and background conditions to ensure strong generalization.
Experimental results show that the proposed lightweight CNN achieves high accuracy, comparable to state-of-the-art models, while significantly reducing computational complexity. The Decision Support System additionally improves pest management by reducing dependence on traditional chemical pesticides and encouraging sustainable practices, demonstrating its potential for real-time applications in precision agriculture.
Source: arXiv cs.CV
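The prototypical meta-learning step the Pest Detection Module relies on is compact: average a few support embeddings per class into a prototype, then assign a query to the nearest one. The sketch below uses toy vectors and hypothetical class names; a real system would feed it CNN embeddings.

```python
# Minimal sketch of nearest-prototype (few-shot) classification as used in
# prototypical networks. Embeddings and class names here are toy stand-ins
# for the CNN features a real pest detector would produce.

def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors (the prototype)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(query, support):
    """support: {class_name: [embedding, ...]} with only a few shots each."""
    prototypes = {c: mean_vec(vs) for c, vs in support.items()}
    return min(prototypes, key=lambda c: sq_dist(query, prototypes[c]))

support = {
    "aphid":  [[0.9, 0.1], [1.1, 0.0]],
    "beetle": [[0.0, 1.0], [0.2, 0.9]],
}
print(classify([0.8, 0.2], support))  # nearest prototype: "aphid"
```

Because only prototype averaging and a distance comparison are needed at inference time, the approach suits the low-resource devices the framework targets.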
Vision • Score 95
TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model
arXiv:2601.00051v1 Announce Type: new
Abstract: World models aim to endow AI systems with the ability to represent, generate, and interact with dynamic environments in a coherent and temporally consistent manner. While recent video generation models have demonstrated impressive visual quality, they remain limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, hindering their evolution into practical world models. In this report, we present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. TeleWorld introduces a novel generation-reconstruction-guidance paradigm, where generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to maintain spatial, temporal, and physical consistency. To support long-horizon generation with low latency, we employ an autoregressive diffusion-based video model enhanced with Macro-from-Micro Planning (MMPL), a hierarchical planning method that reduces error accumulation from the frame level to the segment level, alongside efficient Distribution Matching Distillation (DMD), enabling real-time synthesis under practical computational budgets. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible systems. Extensive experiments demonstrate that TeleWorld achieves strong performance in both static and dynamic world understanding, long-term consistency, and real-time generation efficiency, positioning it as a practical step toward interactive, memory-enabled world models for multimodal generation and embodied intelligence.
Source: arXiv cs.CV
NLP/LLMs • Score 94
Explicit Abstention Controls for Predictable Reliability in Video Question Answering
Deploying vision-language models (VLMs) in critical settings demands selective prediction, where systems abstain when uncertain to avoid costly errors. We investigate whether confidence-based abstention offers reliable control over error rates in video question answering, and whether that control remains robust under distribution shift.
Source: arXiv cs.AI
NLP/LLMs • Score 95
RoLID-11K: A Dashcam Dataset for Small-Object Roadside Litter Detection
arXiv:2601.00398v1 Announce Type: new
Abstract: Roadside litter poses environmental, safety and economic challenges, yet current monitoring relies on labour-intensive surveys and public reporting, providing limited spatial coverage. Existing vision datasets for litter detection focus on street-level still images, aerial scenes or aquatic environments, and do not reflect the unique characteristics of dashcam footage, where litter appears extremely small, sparse and embedded in cluttered road-verge backgrounds. We introduce RoLID-11K, the first large-scale dataset for roadside litter detection from dashcams, comprising over 11k annotated images spanning diverse UK driving conditions and exhibiting pronounced long-tail and small-object distributions. We benchmark a broad spectrum of modern detectors, from accuracy-oriented transformer architectures to real-time YOLO models, and analyse their strengths and limitations on this challenging task. Our results show that while CO-DETR and related transformers achieve the best localisation accuracy, real-time models remain constrained by coarse feature hierarchies. RoLID-11K establishes a challenging benchmark for extreme small-object detection in dynamic driving scenes and aims to support the development of scalable, low-cost systems for roadside-litter monitoring. The dataset is available at https://github.com/xq141839/RoLID-11K.
Source: arXiv cs.CV
NLP/LLMs • Score 95
Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning
arXiv:2601.00791v1 Announce Type: cross
Abstract: We present a training-free method for detecting valid mathematical reasoning in large language models through spectral analysis of attention patterns. By treating attention matrices as adjacency matrices of dynamic graphs over tokens, we extract four interpretable spectral diagnostics: the Fiedler value (algebraic connectivity), high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy. These diagnostics exhibit statistically significant differences between valid and invalid mathematical proofs. Experiments across seven transformer models from four independent architectural families (Meta Llama, Alibaba Qwen, Microsoft Phi, and Mistral AI) demonstrate that this spectral signature produces effect sizes up to Cohen's d = 3.30 (p < 10^-116), enabling 85.0-95.6% classification accuracy under rigorous evaluation, with calibrated thresholds reaching 93-95% on the full dataset. The method requires no training data, fine-tuning, or learned classifiers: a single threshold on a spectral metric suffices for high accuracy. Through systematic label correction, we discover that the spectral method detects logical coherence rather than compiler acceptance, identifying mathematically valid proofs that formal verifiers reject due to technical failures. We further identify an architectural dependency: Mistral-7B's Sliding Window Attention shifts the discriminative signal from HFER to late-layer smoothness (d = 2.09, p_MW = 1.16 x 10^-48), revealing that attention mechanism design affects which spectral features capture reasoning validity. These findings establish spectral graph analysis as a principled framework for reasoning verification with immediate applications to hallucination detection and AI safety monitoring.
Source: arXiv cs.CL
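Two of the four diagnostics are easy to reproduce on a toy attention matrix: symmetrize it into a weighted adjacency, form the graph Laplacian, and read off the Fiedler value and spectral entropy of the eigenvalues. This is an exploratory pure-Python sketch with a tiny matrix and a textbook Jacobi eigensolver; it is not the paper's pipeline, and the other two diagnostics (HFER, smoothness) are omitted.

```python
# Exploratory sketch: treat a symmetrized attention matrix as a weighted
# graph, form its Laplacian L = D - W, and compute the Fiedler value
# (second-smallest eigenvalue) and the spectral entropy of the eigenvalues.
import math

def jacobi_eigenvalues(A, sweeps=100, tol=1e-12):
    """Eigenvalues of a small symmetric matrix via classical Jacobi rotations."""
    A = [row[:] for row in A]
    n = len(A)
    for _ in range(sweeps):
        p, q, off = 0, 1, 0.0
        for i in range(n):                      # largest off-diagonal entry
            for j in range(i + 1, n):
                if abs(A[i][j]) > off:
                    off, p, q = abs(A[i][j]), i, j
        if off < tol:
            break
        theta = 0.5 * math.atan2(2 * A[p][q], A[p][p] - A[q][q])
        c, s = math.cos(theta), math.sin(theta)
        for k in range(n):                      # A <- A @ G
            akp, akq = A[k][p], A[k][q]
            A[k][p], A[k][q] = c * akp + s * akq, -s * akp + c * akq
        for k in range(n):                      # A <- G.T @ A
            apk, aqk = A[p][k], A[q][k]
            A[p][k], A[q][k] = c * apk + s * aqk, -s * apk + c * aqk
    return sorted(A[i][i] for i in range(n))

def laplacian_diagnostics(attn):
    n = len(attn)
    # symmetrize attention into an undirected weighted adjacency
    W = [[0.5 * (attn[i][j] + attn[j][i]) for j in range(n)] for i in range(n)]
    L = [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    eig = jacobi_eigenvalues(L)
    fiedler = eig[1]                            # second-smallest eigenvalue
    total = sum(eig)
    probs = [e / total for e in eig if e > 1e-12]
    entropy = -sum(p * math.log(p) for p in probs)
    return fiedler, entropy

attn = [[0.6, 0.3, 0.1],                        # toy 3-token attention pattern
        [0.3, 0.5, 0.2],
        [0.1, 0.2, 0.7]]
f, h = laplacian_diagnostics(attn)
print(round(f, 3), round(h, 3))
```

A higher Fiedler value indicates a more tightly connected attention graph, which is the kind of signal the paper thresholds to separate valid from invalid proofs.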
RL • Score 95
TeleDoCTR: Domain-Specific and Contextual Troubleshooting for Telecommunications
arXiv:2601.00691v1 Announce Type: cross
Abstract: Ticket troubleshooting refers to the process of analyzing and resolving problems that are reported through a ticketing system. In large organizations offering a wide range of services, this task is highly complex due to the diversity of submitted tickets and the need for specialized domain knowledge. In particular, troubleshooting in telecommunications (telecom) is a very time-consuming task, as it requires experts to interpret ticket content, consult documentation, and search historical records to identify appropriate resolutions. This human-intensive approach not only delays issue resolution but also hinders overall operational efficiency. To enhance the effectiveness and efficiency of ticket troubleshooting in telecom, we propose TeleDoCTR, a novel telecom-related, domain-specific, and contextual troubleshooting system tailored for end-to-end ticket resolution in telecom. TeleDoCTR integrates both domain-specific ranking and generative models to automate key steps of the troubleshooting workflow: routing tickets to the appropriate expert team responsible for resolving the ticket (classification task), retrieving contextually and semantically similar historical tickets (retrieval task), and generating a detailed fault analysis report outlining the issue, root cause, and potential solutions (generation task). We evaluate TeleDoCTR on a real-world dataset from a telecom infrastructure and demonstrate that it achieves superior performance over existing state-of-the-art methods, significantly enhancing the accuracy and efficiency of the troubleshooting process.
Source: arXiv cs.CL
NLP/LLMs • Score 96
The Agentic Leash: Extracting Causal Feedback Fuzzy Cognitive Maps with LLMs
We develop a large language model (LLM) agent that extracts causal feedback fuzzy cognitive maps (FCMs) from raw text. The causal learning or extraction process is agentic both through the LLM's semi-autonomy and through the FCM system dynamics, which guide the LLM agents to seek out and process causal text.
Source: arXiv cs.AI
NLP/LLMs • Score 95
StockBot 2.0: Vanilla LSTMs Outperform Transformer-based Forecasting for Stock Prices
arXiv:2601.00197v1 Announce Type: cross
Abstract: Accurate forecasting of financial markets remains a long-standing challenge due to complex temporal and often latent dependencies, non-linear dynamics, and high volatility. Building on our earlier recurrent neural network framework, we present an enhanced StockBot architecture that systematically evaluates modern attention-based, convolutional, and recurrent time-series forecasting models within a unified experimental setting. While attention-based and transformer-inspired models offer increased modeling flexibility, extensive empirical evaluation reveals that a carefully constructed vanilla LSTM consistently achieves superior predictive accuracy and more stable buy/sell decision-making when trained under a common set of default hyperparameters. These results highlight the robustness and data efficiency of recurrent sequence models for financial time-series forecasting, particularly in the absence of extensive hyperparameter tuning or the availability of sufficient data when discretized to single-day intervals. Additionally, these results underscore the importance of architectural inductive bias in data-limited market prediction tasks.
Source: arXiv cs.CL
NLP/LLMs • Score 95
Adapting Natural Language Processing Models Across Jurisdictions: A pilot Study in Canadian Cancer Registries
arXiv:2601.00787v1 Announce Type: new
Abstract: Population-based cancer registries depend on pathology reports as their primary diagnostic source, yet manual abstraction is resource-intensive and contributes to delays in cancer data. While transformer-based NLP systems have improved registry workflows, their ability to generalize across jurisdictions with differing reporting conventions remains poorly understood. We present the first cross-provincial evaluation of adapting BCCRTron, a domain-adapted transformer model developed at the British Columbia Cancer Registry, alongside GatorTron, a biomedical transformer model, for cancer surveillance in Canada. Our training dataset consisted of approximately 104,000 and 22,000 de-identified pathology reports from the Newfoundland & Labrador Cancer Registry (NLCR) for Tier 1 (cancer vs. non-cancer) and Tier 2 (reportable vs. non-reportable) tasks, respectively. Both models were fine-tuned using complementary synoptic and diagnosis-focused report-section input pipelines. Across NLCR test sets, the adapted models maintained high performance, demonstrating that transformers pretrained in one jurisdiction can be localized to another with modest fine-tuning. To improve sensitivity, we combined the two models using a conservative OR-ensemble, achieving a Tier 1 recall of 0.99 and reducing missed cancers to 24, compared with 48 and 54 for the standalone models. For Tier 2, the ensemble achieved 0.99 recall and reduced missed reportable cancers to 33, compared with 54 and 46 for the individual models. These findings demonstrate that an ensemble combining complementary text representations substantially reduces missed cancers and improves error coverage in cancer-registry NLP. We implement a privacy-preserving workflow in which only model weights are shared between provinces, supporting interoperable NLP infrastructure and a future pan-Canadian foundation model for cancer pathology and registry workflows.
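The conservative OR-ensemble is simple enough to sketch directly: flag a report as cancer if either model flags it, trading a few extra false positives for fewer missed cancers (higher recall). The toy predictions below are illustrative, not registry data.

```python
def or_ensemble(preds_a, preds_b):
    """Combine two binary prediction lists with a logical OR."""
    return [int(a or b) for a, b in zip(preds_a, preds_b)]

def recall(preds, labels):
    """Fraction of true positives among all actual positives."""
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    return tp / (tp + fn) if (tp + fn) else 0.0

labels  = [1, 1, 1, 1, 0, 0]
model_a = [1, 1, 0, 1, 0, 1]   # misses the third cancer report
model_b = [1, 0, 1, 1, 0, 0]   # misses the second cancer report
combo   = or_ensemble(model_a, model_b)

# Each model alone misses one cancer; the OR-ensemble catches both.
print(recall(model_a, labels), recall(model_b, labels), recall(combo, labels))
```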
Source: arXiv cs.CL
NLP/LLMs • Score 95
A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR
arXiv:2601.00557v1 Announce Type: new
Abstract: Large-scale multilingual ASR (mASR) models such as Whisper achieve strong performance but incur high computational and latency costs, limiting their deployment on resource-constrained edge devices. In this study, we propose a lightweight and language-agnostic multilingual ASR system based on a CTC architecture with domain adaptation. Specifically, we introduce a Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework integrated into an mHuBERT-CTC model, enabling end-to-end decoding via LID-posterior-driven LoRA routing. The hierarchical design consists of a multilingual shared LoRA for learning language-invariant acoustic representations and language-specific LoRA experts for modeling language-dependent characteristics. The proposed routing mechanism removes the need for prior language identity information or explicit language labels during inference, achieving true language-agnostic decoding. Experiments on MSR-86K and the MLC-SLM 2025 Challenge datasets demonstrate that HLoRA achieves competitive performance with state-of-the-art two-stage inference methods using only single-pass decoding, significantly improving decoding efficiency for low-resource mASR applications.
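The LID-posterior-driven routing can be sketched as follows: a shared low-rank adapter is always applied, while per-language adapters are mixed by the softmaxed language-ID posterior, so no explicit language label is needed at inference. Dimensions, rank, and initialization here are invented for illustration; this shows the mechanism, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_langs = 16, 4, 3          # hidden size, LoRA rank, number of languages

# Low-rank updates: B @ A, one shared plus one expert per language.
shared = (rng.normal(size=(d, r)) @ rng.normal(size=(r, d))) * 0.01
experts = [(rng.normal(size=(d, r)) @ rng.normal(size=(r, d))) * 0.01
           for _ in range(n_langs)]

def route(h, lid_logits):
    """Apply shared LoRA plus a posterior-weighted mix of language experts."""
    p = np.exp(lid_logits - lid_logits.max())
    p /= p.sum()                              # softmax LID posterior
    delta = h @ shared + sum(p_i * (h @ E) for p_i, E in zip(p, experts))
    return h + delta

h = rng.normal(size=(d,))
out = route(h, np.array([2.0, 0.1, -1.0]))    # posterior favors language 0
print(out.shape)
```

With a near-one-hot posterior this reduces to the single matching language expert, which is the language-agnostic decoding behavior the abstract describes.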
Source: arXiv cs.CL
NLP/LLMs • Score 95
Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence
arXiv:2601.00596v1 Announce Type: new
Abstract: Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent's capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.
Source: arXiv cs.CL
NLP/LLMs • Score 95
Comparative Efficiency Analysis of Lightweight Transformer Models: A Multi-Domain Empirical Benchmark for Enterprise NLP Deployment
arXiv:2601.00444v1 Announce Type: new
Abstract: In the rapidly evolving landscape of enterprise natural language processing (NLP), the demand for efficient, lightweight models capable of handling multi-domain text automation tasks has intensified. This study conducts a comparative analysis of three prominent lightweight Transformer models - DistilBERT, MiniLM, and ALBERT - across three distinct domains: customer sentiment classification, news topic classification, and toxicity and hate speech detection. Utilizing datasets from IMDB, AG News, and the Measuring Hate Speech corpus, we evaluated performance using accuracy-based metrics including accuracy, precision, recall, and F1-score, as well as efficiency metrics such as model size, inference time, throughput, and memory usage. Key findings reveal that no single model dominates all performance dimensions. ALBERT achieves the highest task-specific accuracy in multiple domains, MiniLM excels in inference speed and throughput, and DistilBERT demonstrates the most consistent accuracy across tasks while maintaining competitive efficiency. All results reflect controlled fine-tuning under fixed enterprise-oriented constraints rather than exhaustive hyperparameter optimization. These results highlight trade-offs between accuracy and efficiency, recommending MiniLM for latency-sensitive enterprise applications, DistilBERT for balanced performance, and ALBERT for resource-constrained environments.
Source: arXiv cs.CL
NLP/LLMs • Score 95
Rule-Based Approaches to Atomic Sentence Extraction
arXiv:2601.00506v1 Announce Type: new
Abstract: Natural language often combines multiple ideas into complex sentences. Atomic sentence extraction, the task of decomposing complex sentences into simpler sentences that each express a single idea, improves performance in information retrieval, question answering, and automated reasoning systems. Previous work has formalized the "split-and-rephrase" task and established evaluation metrics, and machine learning approaches using large language models have improved extraction accuracy. However, these methods lack interpretability and provide limited insight into which linguistic structures cause extraction failures. Although some studies have explored dependency-based extraction of subject-verb-object triples and clauses, no principled analysis has examined which specific clause structures and dependencies lead to extraction difficulties. This study addresses this gap by analyzing how complex sentence structures, including relative clauses, adverbial clauses, coordination patterns, and passive constructions, affect the performance of rule-based atomic sentence extraction. Using the WikiSplit dataset, we implemented dependency-based extraction rules in spaCy, generated 100 gold-standard atomic sentence sets, and evaluated performance using ROUGE and BERTScore. The system achieved ROUGE-1 F1 = 0.6714, ROUGE-2 F1 = 0.478, ROUGE-L F1 = 0.650, and BERTScore F1 = 0.5898, indicating moderate-to-high lexical, structural, and semantic alignment. Challenging structures included relative clauses, appositions, coordinated predicates, adverbial clauses, and passive constructions. Overall, rule-based extraction is reasonably accurate but sensitive to syntactic complexity.
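ROUGE-1 F1, one of the metrics reported above, reduces to clipped unigram overlap between candidate and reference. A minimal self-contained version (the study presumably uses a standard ROUGE package with its own tokenization and stemming):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())      # clipped unigram matches
    if not overlap:
        return 0.0
    p = overlap / sum(cand.values())          # precision over candidate tokens
    r = overlap / sum(ref.values())           # recall over reference tokens
    return 2 * p * r / (p + r)

gold = "the committee approved the plan"
pred = "the committee approved a revised plan"
print(round(rouge1_f1(pred, gold), 3))  # 0.727
```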
Source: arXiv cs.CL
NLP/LLMs • Score 95
Pat-DEVAL: Chain-of-Legal-Thought Evaluation for Patent Description
arXiv:2601.00166v1 Announce Type: new
Abstract: Patent descriptions must deliver comprehensive technical disclosure while meeting strict legal standards such as enablement and written description requirements. Although large language models have enabled end-to-end automated patent drafting, existing evaluation approaches fail to assess long-form structural coherence and statutory compliance specific to descriptions. We propose Pat-DEVAL, the first multi-dimensional evaluation framework dedicated to patent description bodies. Leveraging the LLM-as-a-judge paradigm, Pat-DEVAL introduces Chain-of-Legal-Thought (CoLT), a legally-constrained reasoning mechanism that enforces sequential patent-law-specific analysis. Experiments validated by patent experts on our Pap2Pat-EvalGold dataset demonstrate that Pat-DEVAL achieves a Pearson correlation of 0.69, significantly outperforming baseline metrics and existing LLM evaluators. Notably, the framework exhibits a superior correlation of 0.73 in Legal-Professional Compliance, proving that the explicit injection of statutory constraints is essential for capturing nuanced legal validity. By establishing a new standard for ensuring both technical soundness and legal compliance, Pat-DEVAL provides a robust methodological foundation for the practical deployment of automated patent drafting systems.
Source: arXiv cs.CL
MLOps/Systems • Score 96
Benchmarking Preprocessing and Integration Methods in Single-Cell Genomics
Single-cell data analysis has the potential to revolutionize personalized medicine by characterizing disease-associated molecular changes at the cellular level. This study examines a general pipeline for single-cell data analysis, evaluating different normalization, dimensionality-reduction, and integration methods across six varied datasets.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity
We present WildAGTEval, a benchmark designed to evaluate the function-calling capabilities of large language model (LLM) agents under realistic API complexity. Unlike prior work that assumes an idealized API system, WildAGTEval considers both API specification and API execution, offering scenarios of varying complexity to assess LLM performance.
Source: arXiv cs.AI
RL • Score 95
Retrieval--Reasoning Processes for Multi-hop Question Answering: A Four-Axis Design Framework and Empirical Trends
arXiv:2601.00536v1 Announce Type: new
Abstract: Multi-hop question answering (QA) requires systems to iteratively retrieve evidence and reason across multiple hops. While recent RAG and agentic methods report strong results, the underlying retrieval--reasoning \emph{process} is often left implicit, making procedural choices hard to compare across model families. This survey takes the execution procedure as the unit of analysis and introduces a four-axis framework covering (A) overall execution plan, (B) index structure, (C) next-step control (strategies and triggers), and (D) stop/continue criteria. Using this schema, we map representative multi-hop QA systems and synthesize reported ablations and tendencies on standard benchmarks (e.g., HotpotQA, 2WikiMultiHopQA, MuSiQue), highlighting recurring trade-offs among effectiveness, efficiency, and evidence faithfulness. We conclude with open challenges for retrieval--reasoning agents, including structure-aware planning, transferable control policies, and robust stopping under distribution shift.
Source: arXiv cs.CL
MLOps/Systems • Score 92
Noise-Aware Named Entity Recognition for Historical VET Documents
arXiv:2601.00488v1 Announce Type: new
Abstract: This paper addresses Named Entity Recognition (NER) in the domain of Vocational Education and Training (VET), focusing on historical, digitized documents that suffer from OCR-induced noise. We propose a robust NER approach leveraging Noise-Aware Training (NAT) with synthetically injected OCR errors, transfer learning, and multi-stage fine-tuning. Three complementary strategies, training on noisy, clean, and artificial data, are systematically compared. Our method is one of the first to recognize multiple entity types in VET documents. It is applied to German documents but transferable to arbitrary languages. Experimental results demonstrate that domain-specific and noise-aware fine-tuning substantially increases robustness and accuracy under noisy conditions. We provide publicly available code for reproducible noise-aware NER in domain-specific contexts.
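The synthetic OCR-error injection behind noise-aware training can be sketched as below: corrupt clean training text with character confusions and occasional drops so the NER model sees realistic noise at training time. The confusion table and rates here are invented for illustration, not taken from the paper.

```python
import random

# Typical OCR confusions (illustrative; a real table would be derived from
# the scanner/OCR engine's observed error statistics).
CONFUSIONS = {"l": "1", "I": "l", "O": "0", "o": "c", "e": "c", "m": "rn", "u": "ii"}

def inject_ocr_noise(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Return `text` with synthetic OCR-style corruption at the given rate."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < rate:
            out.append(CONFUSIONS[ch])        # character confusion
        elif ch.isalpha() and rng.random() < rate / 5:
            continue                          # occasional dropped character
        else:
            out.append(ch)
    return "".join(out)

clean = "Lehrvertrag der Maschinenbau-Schule, Oktober 1952"
print(inject_ocr_noise(clean, rate=0.2))
```

Training on pairs of clean labels and noised inputs is what makes the tagger robust when it later meets genuinely noisy scans.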
Source: arXiv cs.CL
Vision • Score 95
DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection
arXiv:2601.00303v1 Announce Type: new
Abstract: Speech is a scalable and non-invasive biomarker for early mental health screening. However, widely used depression datasets like DAIC-WOZ exhibit strong coupling between linguistic sentiment and diagnostic labels, encouraging models to learn semantic shortcuts. As a result, model robustness may be compromised in real-world scenarios, such as Camouflaged Depression, where individuals maintain socially positive or neutral language despite underlying depressive states. To mitigate this semantic bias, we propose DepFlow, a three-stage depression-conditioned text-to-speech framework. First, a Depression Acoustic Encoder learns speaker- and content-invariant depression embeddings through adversarial training, achieving effective disentanglement while preserving depression discriminability (ROC-AUC: 0.693). Second, a flow-matching TTS model with FiLM modulation injects these embeddings into synthesis, enabling control over depressive severity while preserving content and speaker identity. Third, a prototype-based severity mapping mechanism provides smooth and interpretable manipulation across the depression continuum. Using DepFlow, we construct a Camouflage Depression-oriented Augmentation (CDoA) dataset that pairs depressed acoustic patterns with positive/neutral content from a sentiment-stratified text bank, creating acoustic-semantic mismatches underrepresented in natural data. Evaluated across three depression detection architectures, CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming conventional augmentation strategies in depression detection. Beyond enhancing robustness, DepFlow provides a controllable synthesis platform for conversational systems and simulation-based evaluation, where real clinical data remains limited by ethical and coverage constraints.
Source: arXiv cs.CL
Vision • Score 95
Simulation as Supervision: Mechanistic Pretraining for Scientific Discovery
Scientific modeling faces a trade-off between the interpretability of mechanistic theory and the predictive power of machine learning. We present Simulation-Grounded Neural Networks (SGNNs), a framework that embeds domain knowledge into the training data, allowing the model to learn broad patterns of physical possibility and become more robust to model misspecification.
Source: arXiv stat.ML
NLP/LLMs • Score 95
Causality-Inspired Safe Residual Correction for Multivariate Time Series
Although modern multivariate predictors such as Transformers and GNNs perform strongly on benchmarks, they often suffer from systematic errors on specific variables or horizons and, critically, lack guarantees against performance degradation at deployment. To close this safety gap, we propose CRC (Causality-Inspired Safe Residual Correction), a framework designed to guarantee non-degradation.
Source: arXiv stat.ML
NLP/LLMs • Score 95
The Infinite-Width Limit of a Single Attention Layer: Analysis via Tensor Programs
In this paper, we rigorously identify the infinite-width limiting distribution of the variables within a single attention layer using the Tensor Programs framework. We derive the exact form of this limiting law, showing that it deviates fundamentally from Gaussianity. Our numerical experiments validate the theoretical predictions, confirming the theory's effectiveness at finite width and its precise description of finite-head attention.
Source: arXiv stat.ML
RL • Score 95
Mitigating Optimistic Bias in Entropic Risk Estimation and Optimization
The entropic risk measure is widely used in high-stakes decision-making in economics, management science, finance, and safety-critical control systems, as it captures the extreme risks associated with uncertain losses. This work presents a parametric bootstrap procedure that corrects the bias of the empirical entropic risk estimator, improving decision-making accuracy.
Source: arXiv stat.ML
NLP/LLMs • Score 95
Language as Mathematical Structure: Examining Semantic Field Theory Against Language Games
arXiv:2601.00448v1 Announce Type: new
Abstract: Large language models (LLMs) offer a new empirical setting in which long-standing theories of linguistic meaning can be examined. This paper contrasts two broad approaches: social constructivist accounts associated with language games, and a mathematically oriented framework we call Semantic Field Theory. Building on earlier work by the author, we formalize the notions of lexical fields (Lexfelder) and linguistic fields (Lingofelder) as interacting structures in a continuous semantic space. We then analyze how core properties of transformer architectures-such as distributed representations, attention mechanisms, and geometric regularities in embedding spaces-relate to these concepts. We argue that the success of LLMs in capturing semantic regularities supports the view that language exhibits an underlying mathematical structure, while their persistent limitations in pragmatic reasoning and context sensitivity are consistent with the importance of social grounding emphasized in philosophical accounts of language use. On this basis, we suggest that mathematical structure and language games can be understood as complementary rather than competing perspectives. The resulting framework clarifies the scope and limits of purely statistical models of language and motivates new directions for theoretically informed AI architectures.
Source: arXiv cs.CL
Vision • Score 95
Compressed Map Priors for 3D Perception
arXiv:2601.00139v1 Announce Type: new
Abstract: Human drivers rarely travel where no person has gone before. After all, thousands of drivers use busy city roads every day, and only one can claim to be the first. The same holds for autonomous computer vision systems. The vast majority of the deployment area of an autonomous vision system will have been visited before. Yet, most autonomous vehicle vision systems act as if they are encountering each location for the first time. In this work, we present Compressed Map Priors (CMP), a simple but effective framework to learn spatial priors from historic traversals. The map priors use a binarized hashmap that requires only $32\text{KB}/\text{km}^2$, a $20\times$ reduction compared to the dense storage. Compressed Map Priors easily integrate into leading 3D perception systems at little to no extra computational costs, and lead to a significant and consistent improvement in 3D object detection on the nuScenes dataset across several architectures.
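The quoted budget is suggestive: 32 KB/km² is exactly 512 × 512 = 262,144 bits, i.e. one occupancy bit per roughly 2 m × 2 m cell. The sketch below stores a binarized prior at that budget under this assumed layout; whether CMP uses this exact grid and indexing is our assumption, not the paper's specification.

```python
CELLS = 512                       # cells per km side -> ~1.95 m resolution

class BinaryMapPrior:
    """One 32 KB bit-grid tile covering a 1 km x 1 km area."""

    def __init__(self):
        self.bits = bytearray(CELLS * CELLS // 8)   # 262,144 bits = 32 KB

    def _index(self, x_m: float, y_m: float) -> int:
        cx = int(x_m / 1000 * CELLS) % CELLS
        cy = int(y_m / 1000 * CELLS) % CELLS
        return cy * CELLS + cx

    def mark(self, x_m: float, y_m: float) -> None:
        """Record that a past traversal observed structure at (x, y) meters."""
        i = self._index(x_m, y_m)
        self.bits[i // 8] |= 1 << (i % 8)

    def query(self, x_m: float, y_m: float) -> bool:
        """Did any historic traversal mark this cell?"""
        i = self._index(x_m, y_m)
        return bool(self.bits[i // 8] >> (i % 8) & 1)

prior = BinaryMapPrior()
prior.mark(123.4, 567.8)
print(len(prior.bits), prior.query(123.4, 567.8), prior.query(900.0, 100.0))
```

A detector can then consume the queried bit as an extra input channel at little computational cost, which matches the integration story in the abstract.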
Source: arXiv cs.CV
NLP/LLMs • Score 95
From Transformers to LLMs: A Systematic Survey of Efficiency Considerations in NLP
arXiv:2406.16893v2 Announce Type: replace
Abstract: The emergence of Transformer-based Large Language Models (LLMs) has substantially augmented the capabilities of Natural Language Processing (NLP), thereby intensifying the demand for computational resources. Therefore, enhancing efficiency based on factors like computational requirements, energy consumption, carbon footprint and financial cost has become a vital area of research. This motivates us to conduct a systematic literature review on Transformer-based LLMs in NLP from the perspective of efficiency. In this survey of 312 articles published between the years 2011 and 2025, efficiency-improvement endeavors have been systematically discussed targeting various aspects such as data curation, model design, model downsizing, and dynamic inferencing. This has been augmented with efficiency considerations in model adaptation strategies like pre-training, fine-tuning, prompt-engineering and Retrieval-Augmented Generation (RAG). Furthermore, a statistical analysis of the articles has been performed, followed by an in-depth evaluation of the efficiency and efficacy of more than 30 renowned NLP models on 13 evaluation benchmarks. This paper offers valuable insights for researchers, professionals, and scholars, and explores the trend of research toward sustainable practices in NLP.
Source: arXiv cs.CL
Vision • Score 96
From Clay to Code: Typological and Material Reasoning in AI Interpretations of Iranian Pigeon Towers
This study investigates how generative AI systems interpret the architectural intelligence embedded in vernacular form. Using the Iranian pigeon tower as a case study, the research tests three diffusion models, Midjourney v6, DALL-E 3, and the Stable Diffusion XL (SDXL)-based DreamStudio, across three prompting stages: referential, adaptive, and speculative.
Source: arXiv cs.AI
RL • Score 96
Toward a Physical Theory of Intelligence
We present a physical theory of intelligence grounded in irreversible information processing in systems subject to conservation laws. An intelligent system is modeled as a coupled agent-environment process whose evolution transforms information into goal-directed work. We introduce the Conservation-Congruent Encoding (CCE) framework to connect information to physical state.
Source: arXiv cs.AI
NLP/LLMs • Score 95
The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining
arXiv:2601.00364v1 Announce Type: new
Abstract: Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, details of its contributions remain unclear. We investigate this question by pretraining models from scratch under controlled conditions, comparing the standard web corpus with a monolingual-only version that removes all multilingual documents. Despite constituting only 2% of the corpus, removing bilingual data causes translation performance to drop 56% in BLEU, while behaviour on cross-lingual QA and general reasoning tasks remains stable, with training curves largely overlapping the baseline. To understand this asymmetry, we categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous documents (14%) based on the semantic relevance of content in different languages. We then conduct granular ablations by reintroducing parallel or code-switching data into the monolingual-only corpus. Our experiments reveal that parallel data almost fully restores translation performance (91% of the unfiltered baseline), whereas code-switching contributes minimally. Other cross-lingual tasks remain largely unaffected by either type. These findings reveal that translation critically depends on systematic token-level alignments from parallel data, whereas cross-lingual understanding and reasoning appear to be achievable even without bilingual data.
Source: arXiv cs.CL
NLP/LLMs • Score 96
Spike-and-Wave Discharge (SWD) Detection Using a 1-Dimensional Residual UNet
Manually labeling events in electroencephalography (EEG) recordings is a time-consuming process, especially when recordings run continuously for weeks to months. A method for automatically labeling relevant EEG events reduces the manual workload. In this study, we compare the performance of 14 machine learning classifiers on a manually annotated dataset, finding a 1D UNet to be the most effective at labeling SWDs.
Source: arXiv cs.LG
RL • Score 95
Integrating Multi-Armed Bandits, Active Learning, and Distributed Computing for Scalable Optimization
Modern optimization problems in scientific and engineering domains often depend on expensive black-box evaluations. We propose ALMAB-DC, a modular, unified framework for scalable black-box optimization that integrates active learning, multi-armed bandits, and distributed computing, with optional GPU acceleration. Empirical results show that ALMAB-DC consistently outperforms state-of-the-art black-box optimizers.
Source: arXiv stat.ML
MLOps/Systems • Score 96
Avatar Forcing: Interactive Real-Time Head Avatar Generation for Natural Conversation
Talking-head generation creates realistic avatars from static portraits for virtual communication and content creation. However, current models fail to convey a sense of truly interactive communication, generating one-sided responses that lack emotional engagement. We propose Avatar Forcing, a new avatar generation framework that models real-time interactions between users and avatars through diffusion forcing.
Source: arXiv cs.LG
NLP/LLMs • Score 95
From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark
arXiv:2601.00216v1 Announce Type: new
Abstract: In medicine, large language models (LLMs) increasingly rely on retrieval-augmented generation (RAG) to ground outputs in up-to-date external evidence. However, current RAG approaches focus primarily on performance improvements while overlooking evidence-based medicine (EBM) principles. This study addresses two key gaps: (1) the lack of PICO alignment between queries and retrieved evidence, and (2) the absence of evidence hierarchy considerations during reranking. We present a generalizable strategy for adapting EBM to graph-based RAG, integrating the PICO framework into knowledge graph construction and retrieval, and proposing a Bayesian-inspired reranking algorithm to calibrate ranking scores by evidence grade without introducing predefined weights. We validated this framework in sports rehabilitation, a literature-rich domain currently lacking RAG systems and benchmarks. We released a knowledge graph (357,844 nodes and 371,226 edges) and a reusable benchmark of 1,637 QA pairs. The system achieved 0.830 nugget coverage, 0.819 answer faithfulness, 0.882 semantic similarity, and 0.788 PICOT match accuracy. In a 5-point Likert evaluation, five expert clinicians rated the system 4.66-4.84 across factual accuracy, faithfulness, relevance, safety, and PICO alignment. These findings demonstrate that the proposed EBM adaptation strategy improves retrieval and answer quality and is transferable to other clinical domains. The released resources also help address the scarcity of RAG datasets in sports rehabilitation.
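One hedged reading of the "Bayesian-inspired reranking" idea: convert each retrieval score into odds, multiply by the odds of relevance given the document's evidence grade (estimated from labeled data rather than hand-set weights), and rerank by the calibrated posterior. The grades, scores, and labeled pairs below are all invented for illustration; the paper's actual algorithm may differ.

```python
def grade_priors(labeled):
    """Estimate P(relevant | grade) from (grade, relevant) pairs."""
    counts = {}
    for grade, rel in labeled:
        n, k = counts.get(grade, (0, 0))
        counts[grade] = (n + 1, k + rel)
    # Laplace smoothing keeps priors strictly between 0 and 1.
    return {g: (k + 1) / (n + 2) for g, (n, k) in counts.items()}

def rerank(docs, priors):
    """docs: list of (doc_id, score in (0,1), grade); sort by calibrated posterior."""
    def posterior(score, grade):
        p = priors.get(grade, 0.5)
        odds = (score / (1 - score)) * (p / (1 - p))   # Bayes in odds form
        return odds / (1 + odds)
    return sorted(docs, key=lambda d: -posterior(d[1], d[2]))

labeled = [("RCT", 1), ("RCT", 1), ("RCT", 0), ("case_report", 0), ("case_report", 1)]
priors = grade_priors(labeled)
docs = [("d1", 0.65, "case_report"), ("d2", 0.60, "RCT")]
print([d[0] for d in rerank(docs, priors)])
```

In this toy run the RCT outranks a case report with a slightly higher raw score, which is the evidence-hierarchy effect the abstract targets, without any predefined grade weights.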
Source: arXiv cs.CL
NLP/LLMs • Score 96
FlashInfer-Bench: Building the Virtuous Cycle for AI-Driven LLM Systems
Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI-generated kernels into real-world inference systems remains a challenge. FlashInfer-Bench addresses this gap by establishing a standardized, closed-loop framework that connects kernel generation, benchmarking, and deployment.
Source: arXiv cs.AI
MLOps/Systems • Score 93
Understanding Emotion in Speech: Recognition Insights and Linguistic Patterns for Generation
Emotion recognition in conversation (ERC) has reached high accuracy, but two critical gaps remain: limited understanding of which architectural choices actually matter, and a lack of linguistic analysis connecting recognition to generation. We address both gaps through a systematic analysis of the IEMOCAP dataset.
Source: arXiv cs.AI
NLP/LLMs • Score 95
Talk Less, Verify More: Improving LLM Assistants with Semantic Checks and Execution Feedback
arXiv:2601.00224v1 Announce Type: new
Abstract: As large language model (LLM) assistants become increasingly integrated into enterprise workflows, their ability to generate accurate, semantically aligned, and executable outputs is critical. However, current conversational business analytics (CBA) systems often lack built-in verification mechanisms, leaving users to manually validate potentially flawed results. This paper introduces two complementary verification techniques: Q*, which performs reverse translation and semantic matching between code and user intent, and Feedback+, which incorporates execution feedback to guide code refinement. Embedded within a generator-discriminator framework, these mechanisms shift validation responsibilities from users to the system. Evaluations on three benchmark datasets, Spider, Bird, and GSM8K, demonstrate that both Q* and Feedback+ reduce error rates and task completion time. The study also identifies reverse translation as a key bottleneck, highlighting opportunities for future improvement. Overall, this work contributes a design-oriented framework for building more reliable, enterprise-grade GenAI systems capable of trustworthy decision support.
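The execution-feedback idea behind Feedback+ can be illustrated with a stubbed generator standing in for the LLM: run the candidate snippet, and on failure hand the error text back for the next draft. The stub, task, and code strings below are invented; only the loop structure reflects the abstract.

```python
def run_candidate(code: str, env: dict) -> tuple[bool, str]:
    """Execute candidate code in `env`; return (ok, error message)."""
    try:
        exec(code, env)
        return True, ""
    except Exception as e:
        return False, f"{type(e).__name__}: {e}"

def stub_generator(task: str, feedback: str) -> str:
    # A real system would prompt an LLM with `task` and `feedback`;
    # this stub just repairs a known typo when told about the NameError.
    if "NameError" in feedback:
        return "result = sum(x * x for x in data)"
    return "result = sum(x * x for x in dat)"   # first draft: misspelled name

def solve_with_feedback(task: str, env: dict, max_turns: int = 3):
    feedback = ""
    for _ in range(max_turns):
        code = stub_generator(task, feedback)
        ok, feedback = run_candidate(code, env)
        if ok:
            return env["result"]
    raise RuntimeError(f"unresolved after {max_turns} turns: {feedback}")

print(solve_with_feedback("sum of squares", {"data": [1, 2, 3]}))  # 14
```

The point of the generator-discriminator framing is that this retry loop runs inside the system, so the user never sees the failing first draft.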
Source: arXiv cs.CL
RL • Score 96
A Multi-Algorithm Approach to Balancing Human-Resource Operational Workload in an Urban Last-Mile Delivery System
Efficiently assigning workload to the workforce is crucial in last-mile package delivery systems. This paper addresses the operational workload-balancing problem in urban delivery systems, proposing a multi-algorithm approach that optimizes delivery time and ensures a balanced distribution of workload among workers.
Source: arXiv cs.AI
RL • Score 96
Traffic-Aware Optimal Taxi Placement Using Graph Neural Network-Based Reinforcement Learning
In smart-city transportation, efficiently matching taxi supply with passenger demand requires real-time integration of urban traffic network data and mobility patterns. This paper presents a graph-based reinforcement learning (RL) framework for optimal taxi placement in metropolitan environments.
Source: arXiv cs.LG
NLP/LLMs • Score 95
Knowledge Distillation for Temporal Knowledge Graph Reasoning with Large Language Models
arXiv:2601.00202v1 Announce Type: new
Abstract: Reasoning over temporal knowledge graphs (TKGs) is fundamental to improving the efficiency and reliability of intelligent decision-making systems and has become a key technological foundation for future artificial intelligence applications. Despite recent progress, existing TKG reasoning models typically rely on large parameter sizes and intensive computation, leading to high hardware costs and energy consumption. These constraints hinder their deployment on resource-constrained, low-power, and distributed platforms that require real-time inference. Moreover, most existing model compression and distillation techniques are designed for static knowledge graphs and fail to adequately capture the temporal dependencies inherent in TKGs, often resulting in degraded reasoning performance. To address these challenges, we propose a distillation framework specifically tailored for temporal knowledge graph reasoning. Our approach leverages large language models as teacher models to guide the distillation process, enabling effective transfer of both structural and temporal reasoning capabilities to lightweight student models. By integrating large-scale public knowledge with task-specific temporal information, the proposed framework enhances the student model's ability to model temporal dynamics while maintaining a compact and efficient architecture. Extensive experiments on multiple publicly available benchmark datasets demonstrate that our method consistently outperforms strong baselines, achieving a favorable trade-off between reasoning accuracy, computational efficiency, and practical deployability.
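A generic sketch of the response-distillation ingredient such a framework likely builds on: soften teacher and student logits with a temperature T and penalize their KL divergence, so the lightweight student inherits the LLM teacher's ranking over candidate entities. The TKG-specific temporal machinery is not shown, and the logits below are invented.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Temperature-scaled KL divergence, scaled by T^2 as in standard KD."""
    p = softmax(np.asarray(teacher_logits, float) / T)   # soft teacher targets
    q = softmax(np.asarray(student_logits, float) / T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))))

teacher = [4.0, 1.0, 0.5]     # teacher's scores over candidate entities
aligned = [3.8, 1.1, 0.4]     # student that mimics the teacher's ranking
random_ = [0.2, 3.0, 1.0]     # student that does not
print(distill_loss(teacher, aligned) < distill_loss(teacher, random_))
```

Minimizing this loss alongside the usual task loss is the standard way to transfer a larger model's behavior into a compact, deployable student.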
Source: arXiv cs.CL
NLP/LLMs • Score 96
Do LLM Chatbots Talk Too Much? The YapBench Benchmark
Large Language Models (LLMs) such as ChatGPT, Claude, and Gemini increasingly act as general-purpose copilots, yet they often respond at excessive length to simple requests, increasing cognitive load and inflating token-based inference costs. We present YapBench, a lightweight benchmark for quantifying user-visible overgeneration on brevity-optimal prompts.
Source: arXiv cs.LG
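The abstract above quantifies user-visible overgeneration on brevity-optimal prompts. As a rough illustration of the phenomenon (not YapBench's actual metric), a length ratio against a brevity-optimal reference already captures the effect; `overgeneration_ratio` is a hypothetical helper:

```python
def overgeneration_ratio(response: str, reference: str) -> float:
    """Ratio of response length to a brevity-optimal reference, in
    whitespace tokens. 1.0 means the response matches the reference
    length; larger values indicate user-visible overgeneration."""
    ref_tokens = len(reference.split())
    if ref_tokens == 0:
        raise ValueError("reference must be non-empty")
    return len(response.split()) / ref_tokens

# A one-word question answered with a hedged paragraph scores high:
ratio = overgeneration_ratio(
    "Certainly! The capital of France is Paris, a city renowned worldwide.",
    "Paris.",
)
```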
NLP/LLMs • Score 96
JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation
We present JP-TL-Bench, a lightweight, open benchmark designed to guide the iterative development of Japanese-English translation systems. The challenge is often 'which of these two good translations is better?' rather than 'is this translation acceptable?'. This distinction is crucial for Japanese-English, where subtle choices in politeness, implicature, ellipsis, and register strongly affect perceived naturalness.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Bio-Inspired Agentic Self-Healing Framework for Resilient Distributed Computing Systems
This paper presents ReCiSt, a bio-inspired self-healing framework designed to achieve resilience in Distributed Computing Systems (DCCS). ReCiSt recasts biological phases as computational layers that perform autonomous fault isolation, causal diagnosis, adaptive recovery, and knowledge consolidation using Language Model (LM)-driven agents.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Trajectory Guard -- A Lightweight, Sequence-Aware Model for Real-Time Anomaly Detection in Agentic AI
Autonomous LLM agents generate multi-step action plans that can fail due to contextual misalignment or structural incoherence. Existing anomaly detection methods are ill-suited to this challenge. We present Trajectory Guard, a Siamese Recurrent Autoencoder that learns task-trajectory alignment, enabling unified detection of incorrect plans and malformed plan structures.
Source: arXiv cs.LG
RL • Score 96
Can Semantic Methods Enhance Tactics in Team Sports? A Methodology for Football with Broader Applications
This paper explores how semantic-space reasoning, traditionally used in computational linguistics, can be extended to tactical decision-making in team sports. The proposed methodology models tactical configurations as compositional semantic structures, representing each player as a multidimensional vector that integrates technical, physical, and psychological attributes.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Neural Chains and Discrete Dynamical Systems
In this work, we examine the analogy between machine learning (ML) applications based on the transformer architecture without self-attention, termed 'neural chains', and discrete dynamical systems associated with discretized versions of neural integral and partial differential equations (NIE, PDE). We present a comparative analysis of the numerical solution of the Burgers and Eikonal equations via standard numerical discretization and PINN learning.
Source: arXiv cs.LG
Vision • Score 93
Intelligent Fault Detection in the Electrical Power System of Nanosatellites
This paper presents a new fault detection method for the electrical power system of nanosatellites without an Attitude Determination and Control Subsystem (ADCS) in LEO orbit. The fault-free system is simulated with a neural network, using solar radiation and solar panel surface temperature as input data.
Source: arXiv cs.LG
RL • Score 96
A Comparative Analysis of Interpretable Machine Learning Methods
In recent years, Machine Learning (ML) has been widely adopted across many sectors, including critical areas such as healthcare, finance, and law. This growing reliance has raised concerns about model interpretability and accountability, especially as legal and regulatory constraints are imposed on the use of black-box models. This study presents a comparative evaluation of 16 inherently interpretable methods across 216 real-world tabular datasets.
Source: arXiv cs.LG
RL • Score 96
Quantum King-Ring Domination in Chess: A QAOA Approach
The Quantum Approximate Optimization Algorithm (QAOA) is widely tested on synthetic random instances, which lack semantic structure and human interpretability. We present Quantum King-Ring Domination (QKRD), a NISQ-scale benchmark derived from tactical chess positions, offering 5,000 structured instances. Using QKRD, we evaluate QAOA design choices and show that problem-informed techniques reveal advantages hidden by random instances.
Source: arXiv cs.LG
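The abstract does not spell out the QKRD instance encoding, but the underlying chess notion is concrete: the 'king ring' is the set of squares adjacent to the king, and a set of attacked squares dominates it when every ring square is covered. A hypothetical sketch of that check (the helper names and the domination criterion are assumptions for illustration):

```python
def king_ring(square: str) -> set[str]:
    """Squares adjacent to the king (the 'king ring') on an 8x8 board.
    `square` uses algebraic notation, e.g. 'e4'."""
    files = "abcdefgh"
    f, r = files.index(square[0]), int(square[1])
    ring = set()
    for df in (-1, 0, 1):
        for dr in (-1, 0, 1):
            if df == dr == 0:
                continue  # skip the king's own square
            nf, nr = f + df, r + dr
            if 0 <= nf < 8 and 1 <= nr <= 8:
                ring.add(files[nf] + str(nr))
    return ring

def dominates(attacked: set[str], square: str) -> bool:
    """True if every king-ring square is covered by `attacked`."""
    return king_ring(square) <= attacked
```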
RL • Score 96
Sequential Reservoir Computing for Efficient High-Dimensional Spatiotemporal Forecasting
Forecasting high-dimensional spatiotemporal systems remains a computational challenge for recurrent neural networks (RNNs) and long short-term memory (LSTM) models. We introduce a Sequential Reservoir Computing (Sequential RC) architecture that decomposes a large reservoir into a series of smaller, interconnected reservoirs, improving efficiency and reducing computational costs.
Source: arXiv cs.LG
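The core idea, a monolithic reservoir split into a chain of small ones, can be sketched as a minimal echo-state network where each stage is driven by the previous stage's state. This is an illustrative reading of the abstract, not the paper's exact architecture; all sizes and names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_reservoir(n, spectral_radius=0.9):
    """Random recurrent weight matrix rescaled to a target spectral radius
    (a standard echo-state stability heuristic)."""
    W = rng.standard_normal((n, n))
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))
    return W

class SequentialReservoir:
    """A chain of small echo-state reservoirs replacing one large one."""

    def __init__(self, n_stages=3, n_units=20, n_in=1):
        self.W = [make_reservoir(n_units) for _ in range(n_stages)]
        self.W_in = [rng.standard_normal((n_units, n_in if i == 0 else n_units))
                     for i in range(n_stages)]
        self.x = [np.zeros(n_units) for _ in range(n_stages)]

    def step(self, u):
        drive = np.atleast_1d(u)
        for i in range(len(self.W)):
            self.x[i] = np.tanh(self.W[i] @ self.x[i] + self.W_in[i] @ drive)
            drive = self.x[i]          # next stage is driven by this state
        return np.concatenate(self.x)  # concatenated states feed a readout
```

As in standard reservoir computing, only a linear readout on the concatenated states would be trained; the recurrent weights stay fixed.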
RL • Score 96
Can Optimal Transport Improve Federated Inverse Reinforcement Learning?
In this paper, we introduce an optimal-transport-based approach to federated inverse reinforcement learning (IRL). Each client locally performs Maximum Entropy IRL, respecting its computational and privacy constraints. The resulting reward functions are fused via a Wasserstein barycenter, which accounts for their underlying geometric structure. This work offers a communication-efficient framework for deriving a shared reward that generalizes across heterogeneous agents and environments.
Source: arXiv cs.LG
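In one dimension the Wasserstein-2 barycenter of empirical measures with equal weights and equal sample counts has a closed form: average the quantile functions, i.e. the sorted samples position-wise. A minimal sketch of that special case (the paper fuses full reward functions; `wasserstein_barycenter_1d` is an illustrative helper):

```python
import numpy as np

def wasserstein_barycenter_1d(samples_per_client):
    """W2 barycenter of 1-D empirical distributions with equal weights and
    equal sample counts: position-wise average of the sorted samples."""
    sorted_samples = [np.sort(np.asarray(s, dtype=float))
                      for s in samples_per_client]
    n = len(sorted_samples[0])
    if any(len(s) != n for s in sorted_samples):
        raise ValueError("equal sample counts assumed in this sketch")
    return np.mean(sorted_samples, axis=0)

# Two clients whose reward samples disagree; the barycenter interpolates.
fused = wasserstein_barycenter_1d([[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]])
```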
Vision • Score 96
Optimized Hybrid Feature Engineering for Resource-Efficient Arrhythmia Detection in ECG Signals: An Optimization Framework
Cardiovascular diseases, especially arrhythmias, remain a leading cause of global mortality, demanding continuous monitoring via the Internet of Medical Things (IoMT). This study proposes a data-centric, resource-efficient framework that prioritizes feature engineering over model complexity, achieving high diagnostic accuracy with a lightweight model.
Source: arXiv cs.LG
RL • Score 96
Early Prediction of Liver Cirrhosis Up to Three Years in Advance: A Machine Learning Study Compared with FIB-4
Objective: To develop and evaluate machine learning (ML) models that predict incident liver cirrhosis one, two, and three years before diagnosis using routinely collected electronic health record (EHR) data, and to compare their performance with the FIB-4 score. Methods: We conducted a retrospective cohort study using de-identified EHR data from a large academic health system.
Source: arXiv cs.LG
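The FIB-4 baseline the study compares against is a standard closed-form index, FIB-4 = (age x AST) / (platelets x sqrt(ALT)), computable directly from routine labs. A minimal sketch (the commonly cited 1.45 / 3.25 cut-offs vary by guideline and age group, and the function name is ours):

```python
import math

def fib4(age_years, ast_u_per_l, alt_u_per_l, platelets_1e9_per_l):
    """FIB-4 index: (age [y] * AST [U/L]) / (platelets [10^9/L] * sqrt(ALT [U/L])).
    Commonly cited cut-offs: < 1.45 argues against and > 3.25 suggests
    advanced fibrosis, though thresholds vary by guideline."""
    return (age_years * ast_u_per_l) / (platelets_1e9_per_l * math.sqrt(alt_u_per_l))

score = fib4(age_years=61, ast_u_per_l=40, alt_u_per_l=36, platelets_1e9_per_l=190)
```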
Vision • Score 96
IMBWatch -- A Spatiotemporal Graph Neural Network Approach to Detecting Illicit Massage Businesses
Illicit Massage Businesses (IMBs) are a covert and persistent form of organized exploitation operating under the guise of legitimate wellness services. Detecting IMBs is difficult due to coded digital advertisements and frequent changes of personnel and locations. We present IMBWatch, a spatiotemporal graph neural network (ST-GNN) framework for large-scale IMB detection that combines graph convolution operations with temporal attention mechanisms.
Source: arXiv cs.LG
RL • Score 93
Reinforcement-Learned Unequal Error Protection for Quantized Semantic Embeddings
This paper addresses the pressing challenge of preserving semantic meaning in bandwidth-limited communication systems. We introduce a novel reinforcement learning framework that achieves per-dimension unequal protection via adaptive repetition coding, using a composite semantic distortion metric that balances global embedding similarity with entity-level preservation.
Source: arXiv cs.LG
RL • Score 92
Active Learning for Data-Driven Reduced-Order Models of Parametric Differential Systems with Bayesian Operator Inference
This work develops an active learning framework for intelligently enriching reduced-order models (ROMs) of parametric dynamical systems, which can serve as the basis for virtual assets in a digital twin. Data-driven ROMs are explainable, computationally efficient scientific machine learning models that aim to preserve the physics underlying complex dynamical simulations.
Source: arXiv stat.ML
Vision • Score 96
Evaluating Anomaly Detectors for Simulated, Highly Imbalanced Industrial Classification Problems
Machine learning offers potential solutions to current problems in industrial systems, such as quality control and predictive maintenance, but faces unique barriers in industrial applications. This paper presents a comprehensive evaluation of anomaly detection algorithms using a simulated dataset that reflects real-world engineering constraints.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Democratizing Electronic-Photonic AI Systems: An Open-Source, AI-Infused Co-Design and Design Automation Toolflow
Photonics is becoming a foundational technology for high-performance AI systems and scientific computing, offering unmatched speed, parallelism, and energy efficiency. However, designing and deploying electronic-photonic AI systems remains challenging due to a steep learning curve. We present a multi-layer co-design and automation framework to democratize the development of photonic AI systems.
Source: arXiv cs.AI
NLP/LLMs • Score 96
An Empirical Evaluation of LLM-Based Approaches to Code Vulnerability Detection: RAG, SFT, and Dual-Agent Systems
The rapid advancement of Large Language Models (LLMs) presents new opportunities for automated software vulnerability detection, a crucial task for securing modern codebases. This paper presents a comparative study of the effectiveness of LLM-based techniques for detecting software vulnerabilities, evaluating three approaches: Retrieval-Augmented Generation (RAG), Supervised Fine-Tuning (SFT), and a Dual-Agent LLM framework.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Toward Large-Scale Photonics-Powered AI Systems: From Physical Design Automation to System-Algorithm Co-Exploration
In this work, we identify three essential considerations for realizing practical photonic AI systems at scale: (1) support for dynamic tensor operations in modern models; (2) systematic management of conversion, control, and data-movement overheads; and (3) robustness under hardware non-idealities. We develop a photonic AI design-support tool spanning early exploration through physical realization.
Source: arXiv cs.AI
NLP/LLMs • Score 96
The Trojan Horse in the Vocabulary: Subtle Sabotage of LLM Composition
The open-weights LLM ecosystem is increasingly defined by model composition techniques that remix capabilities from diverse sources. A critical prerequisite for applying these methods is tokenizer transplantation, which aligns incompatible vocabularies into a shared embedding space. We demonstrate that this interoperability step introduces a supply-chain vulnerability.
Source: arXiv cs.LG
RL • Score 93
Progressive Ideation Using an Agentic AI Framework for Human-AI Co-Creation
Generating truly novel and diverse ideas is crucial to contemporary engineering design, yet it remains a significant cognitive challenge for novice designers. We propose MIDAS (Meta-cognitive Ideation through Distributed Agentic AI System), a novel framework that replaces the single-AI paradigm with a distributed 'team' of specialized AI agents designed to emulate the human meta-cognitive ideation workflow.
Source: arXiv cs.AI
Vision • Score 96
Adaptive Causal Coordination Detection for Social Media: A Memory-Guided Framework with Semi-Supervised Learning
Detecting coordinated inauthentic behavior on social media is a critical challenge. We propose the Adaptive Causal Coordination Detection (ACCD) framework, which uses a progressive three-stage architecture to learn and retain optimized detection configurations. ACCD improves the identification of causal relationships and reduces the need for manual labeling, achieving an F1 score of 87.3% on coordinated-attack detection.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Ask, Clarify, Optimize: Human-LLM Agent Collaboration for Smarter Inventory Control
Inventory management remains a challenge for many small and medium-sized enterprises that lack the expertise to implement advanced optimization methods. This paper investigates whether Large Language Models (LLMs) can help bridge this gap, proposing a hybrid framework that rigorously separates semantic reasoning from mathematical computation.
Source: arXiv cs.AI
Vision • Score 96
A Comparative Study of Adaptation Strategies for Time-Series Foundation Models in Anomaly Detection
Time-series anomaly detection is essential to the reliable operation of complex systems, yet most existing methods require extensive, task-specific training. This study investigates whether time-series foundation models (TSFMs), pre-trained on large heterogeneous data, can serve as universal backbones for anomaly detection.
Source: arXiv cs.LG
NLP/LLMs • Score 95
Universal Adaptive Constraint Propagation: Scaling Structured Inference for Large Language Models via Meta-Reinforcement Learning
arXiv:2601.00095v1 Announce Type: new
Abstract: Large language models increasingly require structured inference, from JSON schema enforcement to multi-lingual parsing, where outputs must satisfy complex constraints. We introduce MetaJuLS, a meta-reinforcement learning approach that learns universal constraint propagation policies applicable across languages and tasks without task-specific retraining. By formulating structured inference as adaptive constraint propagation and training a Graph Attention Network with meta-learning, MetaJuLS achieves 1.5--2.0$\times$ speedups over GPU-optimized baselines while maintaining within 0.2\% accuracy of state-of-the-art parsers. On Universal Dependencies across 10 languages and LLM-constrained generation (LogicBench, GSM8K-Constrained), MetaJuLS demonstrates rapid cross-domain adaptation: a policy trained on English parsing adapts to new languages and tasks with 5--10 gradient steps (5--15 seconds) rather than requiring hours of task-specific training. Mechanistic analysis reveals the policy discovers human-like parsing strategies (easy-first) and novel non-intuitive heuristics. By reducing propagation steps in LLM deployments, MetaJuLS contributes to Green AI by directly reducing inference carbon footprint.
Source: arXiv cs.CL
NLP/LLMs • Score 95
CoPE: A Small Language Model for Steerable and Scalable Content Labeling
arXiv:2512.18027v1 Announce Type: new
Abstract: This paper details the methodology behind CoPE, a policy-steerable small language model capable of fast and accurate content labeling. We present a novel training curricula called Contradictory Example Training that enables the model to learn policy interpretation rather than mere policy memorization. We also present a novel method for generating content policies, called Binocular Labeling, which enables rapid construction of unambiguous training datasets. When evaluated across seven different harm areas, CoPE exhibits equal or superior accuracy to frontier models at only 1% of their size. We openly release a 9 billion parameter version of the model that can be run on a single consumer-grade GPU. Models like CoPE represent a paradigm shift for classifier systems. By turning an ML task into a policy writing task, CoPE opens up new design possibilities for the governance of online platforms.
Source: arXiv cs.CL
MLOps/Systems • Score 95
GenUQ: Predictive Uncertainty Estimates via Hyper-Generative Networks
Operator learning is a recent generalization of regression to mappings between functions, promising to replace expensive numerical integration of PDEs with fast evaluations of mappings between functional states of a system. In this paper, we present GenUQ, a measure-theoretic approach to UQ that avoids constructing a likelihood by introducing a hyper-generative network model.
Source: arXiv stat.ML
NLP/LLMs • Score 95
Toward Reasoning-Preserving Unlearning in Multimodal Large Language Models
Machine unlearning aims to erase requested data from trained models without full retraining. For Reasoning Multimodal Large Language Models (RMLLMs), this is challenging because intermediate reasoning steps can leak sensitive information. We present RMLLMU-Bench, the first benchmark for RMLLM unlearning that evaluates both reasoning leakage and reasoning retention.
Source: arXiv cs.CL
NLP/LLMs • Score 95
Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset
arXiv:2512.17915v1 Announce Type: new
Abstract: The recently published Loquacious dataset aims to be a replacement for established English automatic speech recognition (ASR) datasets such as LibriSpeech or TED-Lium. The main goal of the Loquacious dataset is to provide properly defined training and test partitions across many acoustic and language domains, with an open license suitable for both academia and industry. To further promote the benchmarking and usability of this new dataset, we present additional resources in the form of n-gram language models (LMs), a grapheme-to-phoneme (G2P) model and pronunciation lexica, with open and public access. Utilizing those additional resources we show experimental results across a wide range of ASR architectures with different label units and topologies. Our initial experimental results indicate that the Loquacious dataset offers a valuable study case for a variety of common challenges in ASR.
Source: arXiv cs.CL
NLP/LLMs • Score 95
Toward Efficient Agents: A Co-Design of Inference Architecture and System
The rapid development of large language model (LLM)-based agents has opened new possibilities for autonomous multi-turn reasoning and tool-assisted decision-making. However, their real-world deployment is hindered by severe inefficiencies that arise not from isolated model inference but from the systemic latency accumulated across reasoning cycles, context growth, and heterogeneous tool interactions.
Source: arXiv cs.CL
RL • Score 95
ReGal: A First Look at PPO-based Legal AI for Judgment Prediction and Summarization in India
arXiv:2512.18014v1 Announce Type: new
Abstract: This paper presents an early exploration of reinforcement learning methodologies for legal AI in the Indian context. We introduce Reinforcement Learning-based Legal Reasoning (ReGal), a framework that integrates Multi-Task Instruction Tuning with Reinforcement Learning from AI Feedback (RLAIF) using Proximal Policy Optimization (PPO). Our approach is evaluated across two critical legal tasks: (i) Court Judgment Prediction and Explanation (CJPE), and (ii) Legal Document Summarization. Although the framework underperforms on standard evaluation metrics compared to supervised and proprietary models, it provides valuable insights into the challenges of applying RL to legal texts. These challenges include reward model alignment, legal language complexity, and domain-specific adaptation. Through empirical and qualitative analysis, we demonstrate how RL can be repurposed for high-stakes, long-document tasks in law. Our findings establish a foundation for future work on optimizing legal reasoning pipelines using reinforcement learning, with broader implications for building interpretable and adaptive legal AI systems.
Source: arXiv cs.CL
NLP/LLMs • Score 95
Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital
arXiv:2512.18658v1 Announce Type: new
Abstract: Before closing venture capital financing rounds, lawyers conduct diligence that includes tying out the capitalization table: verifying that every security (for example, shares, options, warrants) and issuance term (for example, vesting schedules, acceleration triggers, transfer restrictions) is supported by large sets of underlying legal documentation. While LLMs continue to improve on legal benchmarks, specialized legal workflows, such as capitalization tie-out, remain out of reach even for strong agentic systems. The task requires multi-document reasoning, strict evidence traceability, and deterministic outputs that current approaches fail to reliably deliver. We characterize capitalization tie-out as an instance of a real-world benchmark for legal AI, analyze and compare the performance of existing agentic systems, and propose a world model architecture toward tie-out automation-and more broadly as a foundation for applied legal intelligence.
Source: arXiv cs.CL
NLP/LLMs • Score 95
From Word to World: Can Large Language Models Implicitly Serve as Text-Based World Models?
Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive and hard to scale. This study investigates whether large language models (LLMs) can improve learning efficiency in text-based environments, presenting a three-level framework for evaluating LLM-based world models.
Source: arXiv cs.CL
RL • Score 93
When Does Learning Renormalize? Sufficient Conditions for Power-Law Spectral Dynamics
Empirical power-law scaling has been widely observed in modern deep learning systems, but its theoretical origins and scope of validity remain incompletely understood. The Generalized Resolution-Shell Dynamics (GRSD) framework models learning as spectral energy transport across logarithmic resolution shells, offering a dynamical description of training.
Source: arXiv cs.LG
NLP/LLMs • Score 95
GeoSense-AI: Fast Location Inference from Crisis Microblogs
arXiv:2512.18225v1 Announce Type: new
Abstract: This paper presents an applied AI pipeline for realtime geolocation from noisy microblog streams, unifying statistical hashtag segmentation, part-of-speech-driven proper-noun detection, dependency parsing around disaster lexicons, lightweight named-entity recognition, and gazetteer-grounded disambiguation to infer locations directly from text rather than sparse geotags. The approach operationalizes information extraction under streaming constraints, emphasizing low-latency NLP components and efficient validation against geographic knowledge bases to support situational awareness during emergencies. In head to head comparisons with widely used NER toolkits, the system attains strong F1 while being engineered for orders-of-magnitude faster throughput, enabling deployment in live crisis informatics settings. A production map interface demonstrates end-to-end AI functionality ingest, inference, and visualization--surfacing locational signals at scale for floods, outbreaks, and other fastmoving events. By prioritizing robustness to informal text and streaming efficiency, GeoSense-AI illustrates how domain-tuned NLP and knowledge grounding can elevate emergency response beyond conventional geo-tag reliance.
Source: arXiv cs.CL
Vision • Score 93
The Dead Salmons of AI Interpretability
In a well-known neuroscience study, the authors placed a dead salmon in an MRI scanner and showed it images of humans in social situations; this work draws on that experiment to expose 'dead salmon' artifacts in AI analyses. It proposes a statistical-causal reinterpretation, treating explanations of computational systems as parameters of a statistical model and emphasizing the importance of testing alternative hypotheses.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Sophia: A Persistent Artificial-Life Agent Framework
The development of LLMs has elevated AI agents from task-specific tools to long-lived decision-making entities. However, most architectures remain static and reactive, limited to manually defined scenarios. We propose a third stratum, System 3, which oversees the agent's narrative identity and long-term adaptation, culminating in Sophia, a 'Persistent Agent' wrapper that integrates a continuous self-improvement loop.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Rethinking Multi-Agent Intelligence Through the Lens of Small-World Networks
Large language models (LLMs) have enabled multi-agent systems (MAS) in which multiple agents argue, critique, and coordinate to solve complex tasks, making the communication topology a fundamental design choice. In this work, we revisit classical theory on small-world (SW) networks and investigate how SW connectivity can be used as a design principle for MAS.
Source: arXiv cs.AI
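The classical small-world construction the abstract revisits is the Watts-Strogatz model: a ring lattice whose edges are rewired with small probability, keeping local clustering while random shortcuts shrink path lengths. A minimal sketch of the topology generator (how agents communicate over it is the paper's own contribution):

```python
import random

def watts_strogatz(n, k, p, seed=0):
    """Watts-Strogatz small-world graph: a ring lattice where each node
    links to its k nearest neighbours, with each edge rewired to a random
    endpoint with probability p. Returns undirected edges as sorted pairs."""
    rng = random.Random(seed)
    edges = set()
    for i in range(n):
        for j in range(1, k // 2 + 1):
            a, b = i, (i + j) % n
            if rng.random() < p:              # rewire this lattice edge
                b = rng.choice([c for c in range(n) if c != a])
            edges.add((min(a, b), max(a, b)))  # undirected, no self-loops
    return edges
```

With p = 0 this is a pure ring lattice; with p = 1 it approaches a random graph, and intermediate p gives the small-world regime.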
NLP/LLMs • Score 95
A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts
arXiv:2512.18608v1 Announce Type: new
Abstract: Automated masking of Personally Identifiable Information (PII) is critical for privacy-preserving conversational systems. While current frontier large language models demonstrate strong PII masking capabilities, concerns about data handling and computational costs motivate exploration of whether lightweight models can achieve comparable performance. We compare encoder-decoder and decoder-only architectures by fine-tuning T5-small and Mistral-Instruct-v0.3 on English datasets constructed from the AI4Privacy benchmark. We create different dataset variants to study label standardization and PII representation, covering 24 standardized PII categories and higher-granularity settings. Evaluation using entity-level and character-level metrics, type accuracy, and exact match shows that both lightweight models achieve performance comparable to frontier LLMs for PII masking tasks. Label normalization consistently improves performance across architectures. Mistral achieves higher F1 and recall with greater robustness across PII types but incurs significantly higher generation latency. T5, while less robust in conversational text, offers more controllable structured outputs and lower inference cost, motivating its use in a real-time Discord bot for real-world PII redaction. Evaluation on live messages reveals performance degradation under informal inputs. These results clarify trade-offs between accuracy, robustness, and computational efficiency, demonstrating that lightweight models can provide effective PII masking while addressing data handling concerns associated with frontier LLMs.
Source: arXiv cs.CL
NLP/LLMs • Score 96
ASTIF: Adaptive Semantic-Temporal Integration for Cryptocurrency Price Forecasting
Financial time-series forecasting is an information-fusion challenge, yet most existing models rely on static architectures that struggle to integrate heterogeneous knowledge sources. We propose ASTIF, an intelligent hybrid system that adapts its forecasting strategy in real time through confidence-based meta-learning, integrating complementary components to improve forecast accuracy.
Source: arXiv cs.AI
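Confidence-based gating of heterogeneous forecasters can be illustrated with a softmax over recent errors: components that have recently been more accurate get more weight. This is an assumed minimal reading of 'confidence-based meta-learning', not ASTIF's actual meta-learner, and all names are illustrative:

```python
import numpy as np

def fuse_forecasts(preds, recent_errors, temperature=1.0):
    """Confidence-weighted fusion: components with lower recent error
    receive higher softmax weight. Returns the fused prediction and the
    per-component weights."""
    errors = np.asarray(recent_errors, dtype=float)
    weights = np.exp(-errors / temperature)  # lower error -> higher weight
    weights /= weights.sum()
    return float(np.dot(weights, np.asarray(preds, dtype=float))), weights
```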
Vision • Score 96
Multimodal Bayesian Network for Robust Casualty Assessment in Autonomous Triage
Mass-casualty incidents can overwhelm emergency medical systems, and delays or errors in casualty assessment can result in preventable deaths. We present a decision-support framework that combines the outputs of multiple computer-vision models, estimating signs of severe hemorrhage, respiratory distress, physical alertness, or visible trauma, in a Bayesian network built entirely from expert-defined rules.
Source: arXiv cs.AI
RL • Score 96
Can We Test Theories of Consciousness in AI? Ablations, Markers, and Robustness
The search for reliable indicators of consciousness has fragmented into competing theoretical camps (Global Workspace Theory (GWT), Integrated Information Theory (IIT), and Higher-Order Theories (HOT)), each proposing distinct neural signatures. We adopt a synthetic neuro-phenomenology approach, building artificial agents to test the functional consequences of these theories through precise architectural ablations. We report dissociations suggesting that these theories describe complementary functional layers.
Source: arXiv cs.AI
RL • Score 93
ORPR: An OR-Guided Learning Model for Inventory Management with Pre-Training and Reinforcement
As the pursuit of synergy between Artificial Intelligence (AI) and Operations Research (OR) advances in the management of complex inventory systems, a critical challenge remains: how to effectively reconcile AI's adaptive perception with OR's structural rigor. We propose a novel OR-guided 'Pre-training and Reinforcement' framework.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Observer, Not Player: Simulating Theory of Mind in LLMs Through Game Observation
We present an interactive framework for evaluating whether large language models (LLMs) exhibit genuine 'understanding' in a simple yet strategic environment. Using the game Rock-Paper-Scissors (RPS) as an example, our system positions the LLM as an Observer whose task is to identify the strategies in play and articulate the reasoning behind that judgment.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Large Language Models as Discounted Bayesian Filters
Large Language Models (LLMs) demonstrate strong few-shot generalization through in-context learning, but their reasoning in dynamic, stochastic environments remains opaque. We introduce a Bayesian filtering framework for evaluating online inference in LLMs, revealing how their belief updates behave like exponential-forgetting filters.
Source: arXiv cs.AI
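An exponential-forgetting (discounted) Bayesian update of the kind the abstract alludes to flattens the prior before combining it with new evidence, so older observations decay geometrically. A minimal sketch over a discrete hypothesis space (the discount form is one standard choice, not necessarily the paper's):

```python
import numpy as np

def discounted_update(prior, likelihood, gamma=0.9):
    """Bayesian update with exponential forgetting: the prior is raised to
    the power gamma before multiplying by the likelihood. gamma = 1 recovers
    exact Bayes; gamma = 0 ignores the prior entirely."""
    p = np.asarray(prior, dtype=float) ** gamma * np.asarray(likelihood, dtype=float)
    return p / p.sum()
```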
NLP/LLMs • Score 96
Programmatic Rule Generation for Document Forgery Detection Using Large Language Models
Document forgery poses a growing threat to legal, economic, and governmental processes, demanding increasingly sophisticated verification mechanisms. This work investigates how large language models (LLMs) can be adapted to generate rule-based plausibility checks for forgery detection using limited hardware resources.
Source: arXiv cs.AI
Vision • Score 96
A Dataset and Benchmarks for Atrial Fibrillation Detection from Electrocardiograms of Intensive Care Unit Patients
Objective: Atrial fibrillation (AF) is the most common cardiac arrhythmia among intensive care unit (ICU) patients and can cause adverse health effects. In this study, we publish a labeled ICU dataset and benchmarks for AF detection, comparing machine learning models across three data-driven artificial intelligence (AI) approaches.
Source: arXiv cs.LG
RL • Score 95
AraMix: Recycling, Refiltering, and Deduplicating to Deliver the Largest Arabic Pretraining Corpus
arXiv:2512.18834v1 Announce Type: new
Abstract: We present AraMix, a deduplicated Arabic pretraining corpus containing approximately 178 billion tokens across 179 million documents. Rather than scraping the web again, AraMix demonstrates that substantial value lies in systematically reusing and curating existing pretraining datasets: we combine seven publicly available Arabic web datasets, apply quality filtering designed specifically for Arabic text to re-filter some datasets, and perform cross-dataset deduplication, both MinHash and sentence-level. This approach reveals that nearly 60% of tokens across these independently collected corpora are duplicates, redundancy that any new scraping efforts will reproduce. Our work suggests that for lower resource languages, investment in curation pipelines for existing data yields greater returns than additional web crawls, an approach that allowed us to curate the largest heavily filtered publicly available Arabic pretraining corpus.
Source: arXiv cs.CL
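Cross-dataset MinHash deduplication of the kind described rests on signatures whose slot-agreement estimates Jaccard similarity between documents. A small, slow, illustrative sketch (production pipelines add LSH banding to avoid all-pairs comparison; none of these names are from the AraMix code):

```python
import hashlib

def shingles(text, n=3):
    """Word n-gram shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(doc, num_perm=64):
    """MinHash signature: for each seed, keep the minimum shingle hash."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(doc)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the old bridge"
sim = estimated_jaccard(minhash_signature(a), minhash_signature(b))
```

Documents whose estimated similarity exceeds a threshold are treated as duplicates; the paper's near-60% duplicate rate was found by running this kind of check across seven independently collected corpora.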
MLOps/Systems • Score 93
Comparing Dynamical Models Through Diffeomorphic Vector Field Alignment
Models of dynamical systems, such as recurrent neural networks (RNNs), are increasingly popular in theoretical neuroscience for hypothesis generation and data analysis. Evaluating the dynamics of these models is crucial for understanding their learned generative mechanisms, but faces significant challenges in comparing dynamics and identifying important motifs in high-dimensional nonlinear models.
Source: arXiv cs.LG
Vision • Score 96
EIA-SEC: An Improved Actor-Critic Framework for Collaborative Multi-UAV Control in Smart Agriculture
The widespread application of wireless communication technology has promoted the development of smart agriculture, where unmanned aerial vehicles (UAVs) play a multifunctional role. In this work, we model a Markov decision process to address the multi-UAV trajectory planning problem and propose the novel Elite Imitation Actor-Shared Ensemble Critic (EIA-SEC) framework. Experimental results show that EIA-SEC outperforms state-of-the-art baselines in reward performance, training stability, and convergence speed.
Source: arXiv cs.LG
NLP/LLMs • Score 96
Benchmarking Neural Surrogates on Realistic Spatiotemporal Multiphysics Flows
Predicting multiphysics dynamics is computationally expensive and challenging due to the severe coupling of heterogeneous, multiscale physical processes. We present REALM (REalistic AI Learning for Multiphysics), a rigorous benchmarking framework for testing neural surrogates on challenging reactive flows, with 11 high-fidelity datasets and a standardized training and evaluation protocol.
Source: arXiv cs.LG
NLP/LLMs • Score 96
Convolutional Neural Operator-Based Transfer Learning for Solving PDEs
The convolutional neural operator is a recently proposed CNN-based architecture that guarantees structure-preserving continuous-discrete equivalence and enables genuine, alias-free learning of PDE solution operators. This neural operator has been shown to outperform, in certain cases, reference models such as DeepONet and the Fourier neural operator in surrogate accuracy.
Source: arXiv cs.LG
NLP/LLMs • Score 95
Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression
arXiv:2512.17920v1 Announce Type: new
Abstract: Large language models (LLMs) exhibit degraded performance under prompt compression, but the mechanisms remain poorly understood. We introduce the Compression-Decay Comprehension Test (CDCT), a benchmark that independently measures constraint compliance (CC) and semantic accuracy (SA) across compression levels. We evaluate 9 frontier LLMs across 8 concepts using 5 compression levels from extreme (c=0.0, ~2 words) to none (c=1.0, ~135 words). A three-judge LLM jury achieves almost perfect inter-rater agreement on CC (Fleiss' κ = 0.90).
We observe a universal U-curve pattern in constraint compliance (97.2% prevalence), with violations peaking at medium compression (c=0.5, ~27 words). Counterintuitively, models perform better at extreme compression than medium lengths. The dimensions are statistically orthogonal (r=0.193, p=0.084), with constraint effects 2.9x larger than semantic effects.
Experimental validation via RLHF ablation confirms our constraint salience hypothesis: removing "helpfulness" signals improves CC by 598% on average (71/72 trials, p<0.001), with 79% achieving perfect compliance. This demonstrates that RLHF-trained helpfulness behaviors are the dominant cause of constraint violations at medium compression. Reasoning models outperform efficient models by 27.5% (Cohen's d=0.96).
Our findings reveal a fundamental tension between RLHF alignment and instruction-following, providing actionable guidelines for improving deployed systems.
Source: arXiv cs.CL
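Fleiss' κ, used above to quantify inter-judge agreement, compares observed pairwise agreement against the chance agreement implied by the marginal category proportions. A self-contained sketch (the jury size of three matches the paper; the toy verdicts are ours):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for N items rated by n raters into k categories.

    `ratings` is an N x k matrix of counts: ratings[i][j] = number of
    raters assigning item i to category j.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Per-item agreement: fraction of concordant rater pairs, averaged.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Chance agreement from the marginal category proportions.
    k = len(ratings[0])
    totals = [sum(row[j] for row in ratings) for j in range(k)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Three judges, binary pass/fail verdicts on six items.
counts = [[3, 0], [3, 0], [0, 3], [3, 0], [2, 1], [0, 3]]
kappa = fleiss_kappa(counts)
```

On this toy jury, kappa lands around 0.77 ("substantial" agreement); the paper's juries reach 0.90.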
NLP/LLMs • Score 93
CodeGEMM: A Codebook-Centric Approach for Efficient GEMM in Quantized LLMs
Weight-only quantization is widely used to mitigate the memory-bound nature of LLM inference. Codebook-based methods achieve high accuracy in extremely low-bit regimes (e.g., 2-bit). We present CodeGEMM, a codebook-centric GEMM kernel that replaces dequantization with precomputed inner products, improving computational efficiency and memory-subsystem utilization.
Source: arXiv cs.LG
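The "precomputed inner products" idea can be shown in miniature: store weights as codebook indices over small sub-vectors, compute each centroid's dot product with the matching activation slice once per input, and the matvec collapses into table lookups and adds. A hypothetical pure-Python sketch, not the CodeGEMM kernel itself (which targets real GEMM hardware); all sizes and names are ours:

```python
import random

g = 4            # sub-vector (group) size
k = 16           # codebook size
d_in, d_out = 16, 8
random.seed(0)

codebook = [[random.uniform(-1, 1) for _ in range(g)] for _ in range(k)]
# Each output row is stored as d_in/g codebook indices (the "quantized" weights).
codes = [[random.randrange(k) for _ in range(d_in // g)] for _ in range(d_out)]
x = [random.uniform(-1, 1) for _ in range(d_in)]

# Precompute once per input: inner product of every centroid with every slice.
lut = [[sum(c * xv for c, xv in zip(codebook[j], x[s * g:(s + 1) * g]))
        for s in range(d_in // g)] for j in range(k)]

# The matvec is now pure lookups + adds (no per-element dequantization).
y_lut = [sum(lut[codes[r][s]][s] for s in range(d_in // g)) for r in range(d_out)]

# Reference path: dequantize the row, then take the full dot product.
y_ref = []
for r in range(d_out):
    w = [v for s in range(d_in // g) for v in codebook[codes[r][s]]]
    y_ref.append(sum(wi * xi for wi, xi in zip(w, x)))
```

The lookup table costs k × (d_in/g) dot products regardless of d_out, so it amortizes across all output rows sharing the codebook.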
NLP/LLMs • Score 95
Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset
arXiv:2512.18533v1 Announce Type: new
Abstract: The proliferation of linguistically subtle political disinformation poses a significant challenge to automated fact-checking systems. Despite increasing emphasis on complex neural architectures, the empirical limits of text-only linguistic modeling remain underexplored. We present a systematic diagnostic evaluation of nine machine learning algorithms on the LIAR benchmark. By isolating lexical features (Bag-of-Words, TF-IDF) and semantic embeddings (GloVe), we uncover a hard "Performance Ceiling", with fine-grained classification not exceeding a weighted F1-score of 0.32 across models. Crucially, a simple linear SVM (accuracy: 0.624) matches the performance of pre-trained Transformers such as RoBERTa (accuracy: 0.620), suggesting that model capacity is not the primary bottleneck. We further diagnose a massive "Generalization Gap" in tree-based ensembles, which achieve more than 99% training accuracy but collapse to approximately 25% on test data, indicating reliance on lexical memorization rather than semantic inference. Synthetic data augmentation via SMOTE yields no meaningful gains, confirming that the limitation is semantic (feature ambiguity) rather than distributional. These findings indicate that for political fact-checking, increasing model complexity without incorporating external knowledge yields diminishing returns.
Source: arXiv cs.CL
NLP/LLMs • Score 95
The Condition Number as a Scale-Invariant Proxy for Information Encoding in Neural Units
This paper explores the relationship between the condition number of a neural network's weight tensor and the extent of information encoded by the associated processing unit, from an information-theoretic perspective. It is argued that a high condition number can indicate that the unit has learned to selectively amplify and compress information.
Source: arXiv stat.ML
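For a weight matrix, the condition number is the ratio of its largest to smallest singular value, and it is unchanged when the weights are rescaled, which is what makes it a scale-invariant proxy. A closed-form 2x2 sketch (assumes a nonsingular matrix; the example matrices are ours):

```python
import math

def condition_number_2x2(a, b, c, d):
    """Spectral condition number of the nonsingular matrix [[a, b], [c, d]],
    via the singular values, i.e. the eigenvalues of A^T A in closed form."""
    # A^T A = [[a^2 + c^2, a*b + c*d], [a*b + c*d, b^2 + d^2]]
    p = a * a + c * c
    q = a * b + c * d
    r = b * b + d * d
    tr, det = p + r, p * r - q * q
    disc = math.sqrt(max(tr * tr - 4 * det, 0.0))
    lam_max = (tr + disc) / 2
    lam_min = (tr - disc) / 2          # > 0 iff the matrix is nonsingular
    return math.sqrt(lam_max / lam_min)

# Scale invariance: multiplying every weight by 10 leaves kappa unchanged.
kappa = condition_number_2x2(3.0, 0.0, 0.0, 1.0)          # singular values 3, 1
kappa_scaled = condition_number_2x2(30.0, 0.0, 0.0, 10.0)
```

Scale invariance matters because weight norms drift freely during training (e.g. under normalization layers), so a norm-based statistic would conflate scaling with encoding.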
NLP/LLMs • Score 96
DeliveryBench: Can Agents Profit in the Real World?
arXiv:2512.19234v1 Announce Type: new. Abstract: LLMs and VLMs are increasingly used as embodied agents, but existing benchmarks focus on simple short-horizon tasks and struggle to capture the rich, realistic constraints that shape real-world decision-making. To close this gap, we propose DeliveryBench, a city-scale embodied benchmark based on the real-world food delivery profession.
Source: arXiv cs.AI
RL • Score 92
Anytime-Guaranteed Regret Algorithm for Linear Quadratic System Control
We propose a computationally efficient algorithm that achieves anytime regret of order $\mathcal{O}(\sqrt{t})$, with explicit dependence on the system dimensions and the solution of the Discrete Algebraic Riccati Equation (DARE). Our approach uses suitably tuned regularization and a sufficiently accurate initial estimate to construct confidence ellipsoids for control design.
Source: arXiv stat.ML
RL • Score 96
Embedded Safety-Aligned Intelligence via Differentiable Internal Alignment Embeddings
We present Embedded Safety-Aligned Intelligence (ESAI), a theoretical framework for multi-agent reinforcement learning that incorporates alignment constraints directly into agents' internal representations using differentiable internal alignment embeddings. This work analyzes stability conditions and theoretical properties, positioning ESAI as a conceptual contribution to differentiable alignment mechanisms in multi-agent systems.
Source: arXiv cs.LG
Vision • Score 96
Self-Organizing Maps for Water Quality Assessment in Reservoirs and Lakes: A Systematic Literature Review
Sustainable water quality is fundamental to ecological balance and water security. This review examines the application of the Self-Organizing Map (SOM), an unsupervised AI technique, to water quality assessment, covering parameter selection, sampling strategies, and clustering approaches.
Source: arXiv cs.LG
NLP/LLMs • Score 95
Teaching and Critiquing Conceptualization and Operationalization in NLP
NLP researchers frequently invoke abstract concepts such as 'interpretability', 'bias', 'reasoning', and 'stereotypes' without defining them. This paper describes a seminar designed for students to explore questions of conceptualization and operationalization, with an interdisciplinary reading list and an emphasis on discussion and critique.
Source: arXiv cs.CL
Vision • Score 95
Data-driven particle dynamics: Structure-preserving coarse-graining for emergent behavior in non-equilibrium systems
arXiv:2508.12569v2 Announce Type: replace-cross
Abstract: Multiscale systems are ubiquitous in science and technology, but are notoriously challenging to simulate as short spatiotemporal scales must be appropriately linked to emergent bulk physics. When expensive high-dimensional dynamical systems are coarse-grained into low-dimensional models, the entropic loss of information leads to emergent physics which are dissipative, history-dependent, and stochastic. To machine learn coarse-grained dynamics from time-series observations of particle trajectories, we propose a framework using the metriplectic bracket formalism that preserves these properties by construction; most notably, the framework guarantees discrete notions of the first and second laws of thermodynamics, conservation of momentum, and a discrete fluctuation-dissipation balance crucial for capturing non-equilibrium statistics. We introduce the mathematical framework abstractly before specializing to a particle discretization. As labels are generally unavailable for entropic state variables, we introduce a novel self-supervised learning strategy to identify emergent structural variables. We validate the method on benchmark systems and demonstrate its utility on two challenging examples: (1) coarse-graining star polymers at challenging levels of coarse-graining while preserving non-equilibrium statistics, and (2) learning models from high-speed video of colloidal suspensions that capture coupling between local rearrangement events and emergent stochastic dynamics. We provide open-source implementations in both PyTorch and LAMMPS, enabling large-scale inference and extensibility to diverse particle-based systems.
Source: arXiv stat.ML
RL • Score 95
Neural CDEs as Correctors for Learned Time-Series Models
Learned time-series models, whether continuous or discrete, are widely used to predict the states of a dynamical system. We propose a Predictor-Corrector mechanism in which the Predictor is a learned time-series model and the Corrector is a neural controlled differential equation. Adding the estimated errors back to the predictions improves forecasting performance.
Source: arXiv stat.ML
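The Predictor-Corrector mechanism can be sketched with a stand-in corrector: fit a model of the predictor's one-step residual and add it back to each prediction. The paper's corrector is a neural CDE; the linear gain below is purely illustrative, as are the toy dynamics:

```python
def predictor(s):
    """Imperfect learned one-step model: decays the state a bit too slowly."""
    return 0.95 * s

def true_step(s):
    """Ground-truth one-step dynamics."""
    return 0.9 * s

# "Train" the corrector: fit a linear gain to one-step residuals
# (true - predicted) observed along a trajectory.
states, s = [], 1.0
for _ in range(20):
    states.append(s)
    s = true_step(s)
gain = sum((true_step(x) - predictor(x)) / x for x in states) / len(states)

def corrected(s):
    """Predictor-Corrector: prediction plus estimated error."""
    return predictor(s) + gain * s

err_plain = abs(predictor(0.5) - true_step(0.5))
err_corr = abs(corrected(0.5) - true_step(0.5))
```

Here the residual happens to be exactly linear in the state, so the corrector removes it entirely; a neural CDE plays the same role for residuals driven by the full observed path.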
RL • Score 93
The Geometry of Abstraction: Continual Learning via Recursive Quotienting
Continual learning systems operating in fixed-dimensional spaces face a fundamental geometric barrier: the flat manifold problem. In this work, we propose a geometric resolution of this paradox based on Recursive Metric Contraction, formalizing abstraction as a topological deformation.
Source: arXiv cs.LG
RL • Score 96
Forecasting and Prognosis of Short-Term Drought Impacts Using Machine Learning to Support Mitigation and Adaptation Efforts
Drought is a complex natural hazard affecting ecological and human systems, often resulting in significant environmental and economic losses. This study applies machine learning techniques to link drought indices to historical records of drought impacts, generating short-term impact forecasts and improving the predictability of these impacts on actionable timescales.
Source: arXiv cs.LG
NLP/LLMs • Score 95
MemEvolve: Meta-Evolution of Agent Memory Systems
arXiv:2512.18746v1 Announce Type: new
Abstract: Self-evolving memory systems are unprecedentedly reshaping the evolutionary paradigm of large language model (LLM)-based agents. Prior work has predominantly relied on manually engineered memory architectures to store trajectories, distill experience, and synthesize reusable tools, enabling agents to evolve on the fly within environment interactions. However, this paradigm is fundamentally constrained by the staticity of the memory system itself: while memory facilitates agent-level evolving, the underlying memory architecture cannot be meta-adapted to diverse task contexts. To address this gap, we propose MemEvolve, a meta-evolutionary framework that jointly evolves agents' experiential knowledge and their memory architecture, allowing agent systems not only to accumulate experience but also to progressively refine how they learn from it. To ground MemEvolve in prior research and foster openness in future self-evolving systems, we introduce EvolveLab, a unified self-evolving memory codebase that distills twelve representative memory systems into a modular design space (encode, store, retrieve, manage), providing both a standardized implementation substrate and a fair experimental arena. Extensive evaluations on four challenging agentic benchmarks demonstrate that MemEvolve achieves (I) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to $17.06\%$; and (II) strong cross-task and cross-LLM generalization, designing memory architectures that transfer effectively across diverse benchmarks and backbone models.
Source: arXiv cs.CL
RL • Score 95
From Natural Language to Control Signals: A Conceptual Framework for Semantic Channel Finding in Complex Experimental Infrastructure
arXiv:2512.18779v1 Announce Type: new
Abstract: Modern experimental platforms such as particle accelerators, fusion devices, telescopes, and industrial process control systems expose tens to hundreds of thousands of control and diagnostic channels accumulated over decades of evolution. Operators and AI systems rely on informal expert knowledge, inconsistent naming conventions, and fragmented documentation to locate signals for monitoring, troubleshooting, and automated control, creating a persistent bottleneck for reliability, scalability, and language-model-driven interfaces. We formalize semantic channel finding (mapping natural-language intent to concrete control-system signals) as a general problem in complex experimental infrastructure, and introduce a four-paradigm framework to guide architecture selection across facility-specific data regimes. The paradigms span (i) direct in-context lookup over curated channel dictionaries, (ii) constrained hierarchical navigation through structured trees, (iii) interactive agent exploration using iterative reasoning and tool-based database queries, and (iv) ontology-grounded semantic search that decouples channel meaning from facility-specific naming conventions. We demonstrate each paradigm through proof-of-concept implementations at four operational facilities spanning two orders of magnitude in scale, from compact free-electron lasers to large synchrotron light sources, and diverse control-system architectures, from clean hierarchies to legacy environments. These implementations achieve 90-97% accuracy on expert-curated operational queries.
Source: arXiv cs.CL
RL • Score 96
FairExpand: Individual Fairness on Graphs with Partial Similarity Information
Individual fairness, which requires that similar individuals be treated similarly by algorithmic systems, is a central principle in fair machine learning. This work presents FairExpand, a flexible framework that promotes individual fairness in partial-information settings, overcoming the limitation of existing methods that require predefined similarity information for all node pairs.
Source: arXiv cs.LG
RL • Score 96
Monitoring Monitorability
Observability into the decision-making of modern AI systems may be necessary to safely deploy increasingly capable agents. Monitoring the chain of thought (CoT) of current reasoning models has proven effective at detecting misbehavior. However, this 'monitorability' can be fragile under different training procedures and data sources.
Source: arXiv cs.AI
RL • Score 96
APC-GNN++: An Adaptive Patient-Centric GNN with Context-Aware Attention and Mini-Graph Explainability for Diabetes Classification
We propose APC-GNN++, a patient-centric Graph Neural Network for diabetes classification. Our model integrates context-aware edge attention, confidence-guided mixing of node features and graph representations, and neighborhood-consistency regularization to better capture clinically meaningful relationships between patients.
Source: arXiv cs.LG
RL • Score 96
Unifying Causal Reinforcement Learning: Review, Taxonomy, Algorithms, and Applications
Integrating causal inference (CI) with reinforcement learning (RL) has become a powerful paradigm for addressing critical limitations of classical RL, such as poor explainability and lack of robustness. This work reviews recent advances at the intersection of CI and RL, categorizing existing approaches and discussing challenges, empirical successes, and future research directions.
Source: arXiv cs.AI
MLOps/Systems • Score 92
A Riemannian Optimization Perspective on the Gauss-Newton Method for Feedforward Neural Networks
In this work, we establish non-asymptotic convergence bounds for the Gauss-Newton method in training neural networks with smooth activations. In the underparameterized regime, the Gauss-Newton gradient flow induces a Riemannian gradient flow on a low-dimensional embedded submanifold of function space.
Source: arXiv stat.ML
NLP/LLMs • Score 96
MoE-TransMov: A Transformer-Based Model for Next Point-of-Interest (POI) Prediction in Familiar and Unfamiliar Movements
Accurate next point-of-interest (POI) prediction in human mobility trajectories is crucial for location-based services, enabling more timely and personalized recommendations. We propose MoE-TransMov, a Transformer-based model with a Mixture-of-Experts (MoE) architecture that captures distinct mobility patterns across different movement contexts, improving prediction accuracy.
Source: arXiv cs.LG
RL • Score 96
Counterfactual Basis Extension and Representational Geometry: A Model of Conceptual Growth under MDL Constraints
Concept learning becomes possible only when existing representations fail to account for experience. This paper proposes a geometric framework in which conceptual growth is modeled as admissible basis extension evaluated under a Minimum Description Length (MDL) criterion. Experience is represented as vectors relative to a current conceptual subspace.
Source: arXiv cs.AI
MLOps/Systems • Score 89
Universality of High-Dimensional Scaling Limits of Stochastic Gradient Descent
We consider high-dimensional statistical tasks whose loss depends on the data only through its projection onto a fixed-dimensional subspace. This includes classification of mixture distributions with cross-entropy loss using one- and two-layer networks. Our main result is that the ODE limits are universal, provided the initialization and ground-truth vectors are delocalized in coordinates.
Source: arXiv stat.ML
RL • Score 92
Attractor Learning for Spatiotemporally Chaotic Dynamical Systems Using Echo State Networks with Transfer Learning
In this paper, we explore the predictive capabilities of echo state networks (ESNs) for the generalized Kuramoto-Sivashinsky (gKS) equation, an archetypal nonlinear PDE exhibiting spatiotemporal chaos. Our research focuses on predicting changes in the long-term statistical patterns of the gKS model resulting from varying the dispersion relation or the length of the spatial domain.
Source: arXiv stat.ML
MLOps/Systems • Score 89
Variational Mixtures of Markov Chains with Automatic Component Selection
Markov state modeling has gained popularity in various scientific fields, as it reduces complex time-series datasets to transitions between a few states. This paper proposes modeling time-series data with a mixture of Markov chains, automatically determining the number of mixture components via the variational expectation-maximization algorithm.
Source: arXiv stat.ML
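The E-step of EM for a mixture of Markov chains assigns each sequence a responsibility for every component based on its transition log-likelihood. A toy sketch of that step (the paper's variational algorithm additionally prunes components automatically; the two components here are ours):

```python
import math

# Two candidate components over a 2-state chain: a "sticky" chain and an
# "alternating" chain, each with a mixture weight pi and transition matrix T.
components = [
    {"pi": 0.5, "T": [[0.9, 0.1], [0.1, 0.9]]},  # sticky
    {"pi": 0.5, "T": [[0.1, 0.9], [0.9, 0.1]]},  # alternating
]

def log_lik(seq, T):
    """Log-likelihood of a state sequence under a transition matrix."""
    return sum(math.log(T[a][b]) for a, b in zip(seq, seq[1:]))

def responsibilities(seq):
    """E-step: posterior probability of each component given the sequence."""
    logs = [math.log(c["pi"]) + log_lik(seq, c["T"]) for c in components]
    m = max(logs)
    w = [math.exp(l - m) for l in logs]
    z = sum(w)
    return [x / z for x in w]

r_sticky = responsibilities([0, 0, 0, 0, 1, 1, 1])
r_alt = responsibilities([0, 1, 0, 1, 0, 1, 0])
```

The M-step would then re-estimate each component's transition matrix from responsibility-weighted transition counts; the variational version also shrinks the weights of unneeded components toward zero.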
RL • Score 96
Trustworthy and Explainable Reinforcement Learning for Safe, Energy-Efficient Process Control: A Use Case in Industrial Compressed Air Systems
This paper presents a trustworthy reinforcement learning approach for controlling industrial compressed air systems. We develop a framework that enables safe, energy-efficient operation under realistic boundary conditions and introduce a multi-level explainability pipeline. An empirical evaluation shows that the learned policy is physically plausible and consistently respects system limits.
Source: arXiv cs.LG
NLP/LLMs • Score 95
Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications
arXiv:2505.14918v2 Announce Type: replace-cross
Abstract: This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.
Source: arXiv stat.ML
Vision • Score 96
Conflict-Driven Clause Learning with VSIDS Heuristics for Discrete Facility Layout
This paper studies the use of Conflict-Driven Clause Learning (CDCL) with VSIDS heuristics as a computational engine for discrete facility layout problems. The layout problem is modeled as a combinatorial assignment problem with dense logical structure arising from adjacency, separation, and slot-availability constraints.
Source: arXiv cs.AI
NLP/LLMs • Score 96
MSC-180: A Benchmark for Automated Formal Theorem Proving from the Mathematics Subject Classification
Automated Theorem Proving (ATP) is a central research direction in artificial intelligence for achieving formal reasoning and verification. We propose MSC-180, a benchmark based on the MSC2020 mathematics subject classification comprising 180 formal verification problems spanning undergraduate and graduate levels, to evaluate and drive the development of AI systems with genuine mathematical reasoning abilities.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Intelligent Human-Machine Partnership for Manufacturing: Enhancing Warehouse Planning through Simulation-Based Knowledge Graphs and LLM Collaboration
Manufacturing planners face complex operational challenges that require collaboration between human expertise and intelligent systems. Our framework integrates Knowledge Graphs and Large Language Model (LLM)-based agents to empower manufacturing professionals, enabling natural-language interaction with operational data and improving analysis and decision-making.
Source: arXiv cs.AI
Vision • Score 96
Agent-Based Output Drift Detection for Breast Cancer Response Prediction in a Multisite Clinical Decision Support System
Modern clinical decision support systems can simultaneously serve multiple independent medical imaging institutions, but their predictive performance can degrade across sites due to variations in patient populations, imaging hardware, and acquisition protocols. We propose an agent-based framework for detecting drift and assessing its severity in multisite clinical AI systems.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Vox Deorum: A Hybrid LLM Architecture for AI in 4X / Grand Strategy Games -- Lessons from Civilization V
The ability of Large Language Models to reason in natural language makes them promising for 4X and grand strategy games, enabling more natural human-AI interaction. However, the complexity of these games and factors such as latency and cost can hinder real-world deployment of LLMs. We present Vox Deorum, a hybrid LLM+X architecture validated through 2,327 complete games.
Source: arXiv cs.AI
NLP/LLMs • Score 96
IntelliCode: A Multi-Agent LLM Tutoring System with Centralized Learner Modeling
LLM-based tutors are typically single-turn assistants that lack persistent representations of learner knowledge, hindering long-term pedagogical support. We present IntelliCode, a multi-agent LLM tutoring system that integrates mastery estimates, misconceptions, review schedules, and engagement signals into a centralized, versioned learner state.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Population-Evolve: A Parallel, Evolutionary Sampling Method for Mathematical Reasoning in LLMs
Test-time scaling has emerged in recent years as a promising direction for enhancing the reasoning capabilities of Large Language Models. In this work, we propose Population-Evolve, a training-free method inspired by Genetic Algorithms that optimizes LLM reasoning by maintaining a dynamic population of candidate solutions.
Source: arXiv cs.AI
NLP/LLMs • Score 93
Can Abstract LLM Concepts Improve SLM Performance?
Large language models (LLMs) excel at diverse tasks, but their deployment on resource-constrained devices remains challenging. We investigate the transferability of high-level concepts extracted from larger models to smaller language models (SLMs) at inference time, demonstrating performance improvements across a wide range of tasks.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Tool-Augmented Hybrid Reasoning with Distillation for Bilingual Mathematical Problem Solving
Bilingual mathematical problem solving requires a clear link between linguistic reasoning and symbolic computation. This paper presents HERALD (Hybrid Ensemble Reasoning with Adaptive Learning and Distillation), a framework that integrates reasoning and computation using NuminaMath-7B-TIR, GPT-4o, and Mistral-7B, offering a practical solution for multilingual mathematical reasoning with improved accuracy and clarity.
Source: arXiv cs.AI
RL • Score 93
Efficient Mixture-of-Agents Serving via Tree-Structured Routing, Adaptive Pruning, and Dependency-Aware Prefill-Decode Overlap
Mixture-of-Agents (MoA) inference can suffer from dense inter-agent communication and low hardware utilization, which increase serving latency. We present a serving design that addresses these bottlenecks through algorithm-system co-design, using a hierarchical topology and adaptive mechanisms.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Gabliteration: Adaptive Multi-Directional Neural Weight Modification for Selective Behavioral Alteration in Large Language Models
We present Gabliteration, a novel neural weight modification technique that advances beyond traditional abliteration methods by implementing adaptive multi-directional projections with regularized layer selection. Our approach overcomes fundamental limitations of existing methods, preserving model quality while modifying specific behavioral patterns.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Vibe Reasoning: Extracting Mathematical Capabilities from Frontier AI -- A Case Study on IMO 2025 Problem 6
We present Vibe Reasoning, a human-AI collaborative paradigm for solving complex mathematical problems. Our key insight is that frontier AI models already possess the knowledge needed to solve challenging problems but do not know how, what, or when to apply it. This work demonstrates the approach through Problem 6 of IMO 2025.
Source: arXiv cs.AI
NLP/LLMs • Score 95
Q-KVComm: Efficient Multi-Agent Communication via Adaptive KV Cache Compression
Multi-agent Large Language Model (LLM) systems face a critical bottleneck: the redundant transmission of contextual information between agents consumes excessive bandwidth and computational resources. We present Q-KVComm, a novel protocol that enables direct transmission of compressed key-value (KV) cache representations between LLM agents.
Source: arXiv cs.CL
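Transmitting compressed KV representations presupposes a quantization step on the cached vectors. A minimal sketch, assuming simple symmetric int8 quantization (the summary does not specify Q-KVComm's actual adaptive scheme; function names and the sample vector are ours):

```python
def quantize_int8(vec):
    """Symmetric per-vector int8 quantization: returns (codes, scale)."""
    scale = max(abs(v) for v in vec) / 127.0 or 1.0  # avoid zero scale
    return [round(v / scale) for v in vec], scale

def dequantize_int8(codes, scale):
    """Recover approximate float values on the receiving agent."""
    return [c * scale for c in codes]

# One cached key/value vector, sent as one int8 code per element plus a scale.
kv = [0.12, -0.85, 0.33, 0.02, -0.47]
codes, scale = quantize_int8(kv)
recovered = dequantize_int8(codes, scale)
max_err = max(abs(a - b) for a, b in zip(kv, recovered))
```

Each element shrinks from a 4-byte float to one byte (plus a shared scale), a 4x reduction before any further entropy coding; the reconstruction error is bounded by half the quantization step.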
Vision • Score 95
Statistical laws and linguistics inform meaning in naturalistic and fictional conversation
arXiv:2512.18072v1 Announce Type: new
Abstract: Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type, with some portion generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps' law, which holds that vocabulary size scales with document length. Little work on Heaps' law has looked at conversation or considered how language features impact scaling. We measure Heaps' law for conversations recorded in two distinct mediums: 1. strangers brought together on video chat and 2. fictional characters in movies. We find that the scaling of vocabulary size differs by part of speech. We discuss these findings through behavioral and linguistic frameworks.
Fonte: arXiv cs.CL
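As an illustration of the statistical pattern this paper builds on, the sketch below measures Heaps'-law vocabulary growth, V(n) ≈ K·n^β, on a toy token stream and estimates the exponent from the log-log slope. The sample text and the least-squares estimator are illustrative assumptions, not the authors' pipeline.

```python
# Sketch: Heaps'-law vocabulary growth V(n) ~ K * n**beta on a toy
# "conversation". Token stream and slope estimator are illustrative.
import math

def vocab_growth(tokens):
    """Vocabulary size after each of the first n tokens."""
    seen, curve = set(), []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve

def heaps_exponent(curve):
    """Least-squares slope of log V(n) against log n."""
    xs = [math.log(n) for n in range(1, len(curve) + 1)]
    ys = [math.log(v) for v in curve]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

tokens = ("the cat sat on the mat and the dog sat on the rug "
          "while the cat and the dog slept").split()
curve = vocab_growth(tokens)
beta = heaps_exponent(curve)   # sublinear growth gives 0 < beta < 1
```

The paper's finding that scaling differs by part of speech would correspond to running this estimate separately on, say, the noun and verb substreams.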
NLP/LLMs • Score 92
SRS-Stories: Vocabulary-constrained multilingual story generation for language learning
arXiv:2512.18362v1 Announce Type: new
Abstract: In this paper, we use large language models to generate personalized stories for language learners, using only the vocabulary they know. The generated texts are specifically written to teach the user new vocabulary by simply reading stories where it appears in context, while at the same time seamlessly reviewing recently learned vocabulary. The generated stories are enjoyable to read and the vocabulary reviewing/learning is optimized by a Spaced Repetition System. The experiments are conducted in three languages: English, Chinese and Polish, evaluating three story generation methods and three strategies for enforcing lexical constraints. The results show that the generated stories are more grammatical, coherent, and provide better examples of word usage than texts generated by the standard constrained beam search approach.
Source: arXiv cs.CL
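To make the Spaced Repetition System concrete, here is a minimal SM-2-style scheduler sketch. The interval and ease constants are illustrative assumptions, not the paper's SRS; it only shows how review intervals stretch after each successful encounter of a word in a story.

```python
# Sketch: a minimal spaced-repetition scheduler in the spirit of SM-2.
# Constants are illustrative assumptions, not the paper's system.
def next_interval(interval_days, ease, recalled):
    """Return (new_interval, new_ease) after one review of a word."""
    if not recalled:
        return 1, max(1.3, ease - 0.2)    # reset, mark the word "harder"
    if interval_days == 0:
        return 1, ease                     # first successful review
    return round(interval_days * ease), ease + 0.05

# A word met in three consecutive stories and recalled each time:
interval, ease = 0, 2.5
schedule = []
for _ in range(3):
    interval, ease = next_interval(interval, ease, recalled=True)
    schedule.append(interval)
# intervals grow: review after 1 day, then 2, then 5
```

A story generator constrained by such a scheduler would preferentially include the words whose next review date has arrived.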
MLOps/Systems • Score 96
Modality-Dependent Memory Mechanisms in Cross-Modal Neuromorphic Computing
Spiking neural networks (SNNs) with memory promise energy-efficient neuromorphic computing, but their generalization across sensory modalities remains unexplored. We present the first comprehensive cross-modal ablation study of memory mechanisms in SNNs, evaluating Hopfield networks, Hierarchical Gated Recurrent Networks (HGRNs), and supervised contrastive learning (SCL) on visual (N-MNIST) and auditory (SHD) neuromorphic datasets.
Source: arXiv cs.LG
NLP/LLMs • Score 95
Research on a hybrid LSTM-CNN-Attention model for text-based web content classification
arXiv:2512.18475v1 Announce Type: new
Abstract: This study presents a hybrid deep learning architecture that integrates LSTM, CNN, and an Attention mechanism to enhance the classification of web content based on text. Pretrained GloVe embeddings are used to represent words as dense vectors that preserve semantic similarity. The CNN layer extracts local n-gram patterns and lexical features, while the LSTM layer models long-range dependencies and sequential structure. The integrated Attention mechanism enables the model to focus selectively on the most informative parts of the input sequence. A 5-fold cross-validation setup was used to assess the robustness and generalizability of the proposed solution. Experimental results show that the hybrid LSTM-CNN-Attention model achieved outstanding performance, with an accuracy of 0.98, precision of 0.94, recall of 0.92, and F1-score of 0.93. These results surpass the performance of baseline models based solely on CNNs, LSTMs, or transformer-based classifiers such as BERT. The combination of neural network components enabled the model to effectively capture both fine-grained text structures and broader semantic context. Furthermore, the use of GloVe embeddings provided an efficient and effective representation of textual data, making the model suitable for integration into systems with real-time or near-real-time requirements. The proposed hybrid architecture demonstrates high effectiveness in text-based web content classification, particularly in tasks requiring both syntactic feature extraction and semantic interpretation. By combining presented mechanisms, the model addresses the limitations of individual architectures and achieves improved generalization. These findings support the broader use of hybrid deep learning approaches in NLP applications, especially where complex, unstructured textual data must be processed and classified with high reliability.
Source: arXiv cs.CL
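The "focus selectively on the most informative parts" step can be illustrated with the core attention computation, reduced to plain Python: softmax over relevance scores, then a weighted sum of token features. The scores and vectors below are made-up numbers; a real layer learns the scoring function.

```python
# Sketch: the selective-focus idea behind an attention layer.
# Scores and feature vectors are illustrative, not learned.
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Weighted sum of token feature vectors under softmax(scores)."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

# Three token representations; the second token gets the highest relevance.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = attend(features, scores=[0.1, 2.0, 0.1])
# the context vector is dominated by the high-scoring token's features
```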
NLP/LLMs • Score 92
From Scratch to Fine-Tuned: A Comparative Study of Transformer Training Strategies for Legal Machine Translation
arXiv:2512.18593v1 Announce Type: new
Abstract: In multilingual nations like India, access to legal information is often hindered by language barriers, as much of the legal and judicial documentation remains in English. Legal Machine Translation (L-MT) offers a scalable solution to this challenge by enabling accurate and accessible translations of legal documents. This paper presents our work for the JUST-NLP 2025 Legal MT shared task, focusing on English-Hindi translation using Transformer-based approaches. We experiment with two complementary strategies: fine-tuning a pre-trained OPUS-MT model for domain-specific adaptation and training a Transformer model from scratch using the provided legal corpus. Performance is evaluated using standard MT metrics, including SacreBLEU, chrF++, TER, ROUGE, BERTScore, METEOR, and COMET. Our fine-tuned OPUS-MT model achieves a SacreBLEU score of 46.03, significantly outperforming both baseline and from-scratch models. The results highlight the effectiveness of domain adaptation in enhancing translation quality and demonstrate the potential of L-MT systems to improve access to justice and legal transparency in multilingual contexts.
Source: arXiv cs.CL
NLP/LLMs • Score 96
Towards Assessing Privacy Vulnerabilities in Selective Forgetting with Large Language Models
Rapid advances in artificial intelligence (AI) have centered on learning from data to build informed learning systems. As these systems are deployed in critical areas, ensuring their privacy and their alignment with human values is essential. Selective forgetting, or machine unlearning, emerges as a promising approach, but it also raises significant privacy concerns, especially in sensitive domains.
Source: arXiv cs.LG
NLP/LLMs • Score 95
Evaluating the Challenges of LLMs in Real-world Medical Follow-up: A Comparative Study and An Optimized Framework
arXiv:2512.18999v1 Announce Type: new
Abstract: When applied directly in an end-to-end manner to medical follow-up tasks, Large Language Models (LLMs) often suffer from uncontrolled dialog flow and inaccurate information extraction due to the complexity of follow-up forms. To address this limitation, we designed and compared two follow-up chatbot systems: an end-to-end LLM-based system (control group) and a modular pipeline with structured process control (experimental group). Experimental results show that while the end-to-end approach frequently fails on lengthy and complex forms, our modular method-built on task decomposition, semantic clustering, and flow management-substantially improves dialog stability and extraction accuracy. Moreover, it reduces the number of dialogue turns by 46.73% and lowers token consumption by 80% to 87.5%. These findings highlight the necessity of integrating external control mechanisms when deploying LLMs in high-stakes medical follow-up scenarios.
Source: arXiv cs.CL
RL • Score 96
LeJOT: An Intelligent Job Cost Orchestration Solution for the Databricks Platform
With rapid advances in big data technologies, the Databricks platform has become fundamental for enterprises and research institutions. However, managing the growing operational costs of running jobs is a critical challenge. We present LeJOT, a job cost orchestration framework that uses machine learning for runtime prediction and a solver-based optimization model for real-time resource allocation.
Source: arXiv cs.LG
Vision • Score 95
Testing for latent structure via the Wilcoxon--Wigner random matrix of normalized rank statistics
This paper considers the problem of testing for latent structure in large symmetric data matrices. The goal is to develop a statistically principled methodology that is flexible in its applicability, computationally efficient, and insensitive to extreme variations in the data, thereby overcoming the limitations of existing approaches.
Source: arXiv stat.ML
RL • Score 96
The Challenger: When Do New Data Sources Justify Switching Machine Learning Models?
We study the problem of deciding whether and when an organization should replace a trained incumbent model with a challenger that uses newly available features. We develop a unified economic and statistical framework that relates learning-curve dynamics, data acquisition and retraining costs, and the discounting of future gains.
Source: arXiv cs.LG
NLP/LLMs • Score 92
Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction
arXiv:2512.18880v1 Announce Type: new
Abstract: Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.
Source: arXiv cs.CL
MLOps/Systems • Score 96
The Procrustean Bed of Time Series: The Optimization Bias of Pointwise Loss Functions
Optimizing time-series models with pointwise loss functions (e.g., MSE) rests on a flawed assumption of pointwise independence and identical distribution (i.i.d.) that disregards causal temporal structure. This paper analyzes the Expectation of Optimization Bias (EOB) and shows that the more deterministic and structured the time series, the more severe the bias caused by the pointwise loss function.
Source: arXiv cs.LG
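A small worked example (ours, not the paper's) makes the i.i.d. blind spot tangible: a pointwise loss such as MSE treats time steps as exchangeable, so applying the same permutation of time to predictions and targets leaves the loss unchanged, even though the temporal structure of the errors is completely different.

```python
# Worked example: MSE is invariant under re-ordering time, so it cannot
# see the causal temporal structure of a series. Numbers are illustrative.
def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

target = [0.0, 1.0, 2.0, 3.0, 4.0]        # a strongly structured (linear) series
pred   = [0.5, 1.5, 1.5, 3.5, 3.5]

perm = [3, 0, 4, 1, 2]                     # an arbitrary re-ordering of time
loss_ordered  = mse(pred, target)
loss_shuffled = mse([pred[i] for i in perm], [target[i] for i in perm])
# identical losses: the pointwise objective ignores temporal order
```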
NLP/LLMs • Score 94
LLM Agents Implement an NLG System from Scratch: Building Interpretable Rule-Based RDF-to-Text Generators
arXiv:2512.18360v1 Announce Type: new
Abstract: We present a novel neurosymbolic framework for RDF-to-text generation, in which the model is "trained" through collaborative interactions among multiple LLM agents rather than traditional backpropagation. The LLM agents produce rule-based Python code for a generator for the given domain, based on RDF triples only, with no in-domain human reference texts. The resulting system is fully interpretable, requires no supervised training data, and generates text nearly instantaneously using only a single CPU. Our experiments on the WebNLG and OpenDialKG data show that outputs produced by our approach reduce hallucination, with only slight fluency penalties compared to finetuned or prompted language models.
Source: arXiv cs.CL
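The kind of interpretable, rule-based generator the agents emit might look like the sketch below: per-predicate templates applied to (subject, predicate, object) triples, with a transparent fallback rule. The predicates and templates here are hypothetical, not taken from WebNLG or the paper's generated code.

```python
# Sketch: a hypothetical rule-based RDF-to-text generator of the kind the
# LLM agents produce. Every rule is inspectable, unlike a neural decoder.
TEMPLATES = {
    "birthPlace": "{s} was born in {o}.",
    "occupation": "{s} works as {a} {o}.",
}

def article(word):
    return "an" if word[0].lower() in "aeiou" else "a"

def realize(triples):
    """Turn (subject, predicate, object) triples into sentences via rules."""
    sentences = []
    for s, p, o in triples:
        template = TEMPLATES.get(p, "{s} {p} {o}.")   # transparent fallback rule
        sentences.append(template.format(s=s, p=p, o=o, a=article(o)))
    return " ".join(sentences)

text = realize([("Ada Lovelace", "birthPlace", "London"),
                ("Ada Lovelace", "occupation", "mathematician")])
```

Because the generator only verbalizes the triples it is given, hallucination is structurally ruled out, which matches the paper's motivation.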
NLP/LLMs • Score 95
Training LLMs with LogicReward for Faithful and Rigorous Reasoning
arXiv:2512.18196v1 Announce Type: new
Abstract: Although LLMs exhibit strong reasoning capabilities, existing training methods largely depend on outcome-based feedback, which can produce correct answers with flawed reasoning. Prior work introduces supervision on intermediate steps but still lacks guarantees of logical soundness, which is crucial in high-stakes scenarios where logical consistency is paramount. To address this, we propose LogicReward, a novel reward system that guides model training by enforcing step-level logical correctness with a theorem prover. We further introduce Autoformalization with Soft Unification, which reduces natural language ambiguity and improves formalization quality, enabling more effective use of the theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6\% and 2\% on natural language inference and logical reasoning tasks with simple training procedures. Further analysis shows that LogicReward enhances reasoning faithfulness, improves generalizability to unseen tasks such as math and commonsense reasoning, and provides a reliable reward signal even without ground-truth labels. We will release all data and code at https://llm-symbol.github.io/LogicReward.
Source: arXiv cs.CL
NLP/LLMs • Score 95
AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3
arXiv:2512.18399v1 Announce Type: new
Abstract: Tokenization is a critical preprocessing step for large language models (LLMs), directly impacting training efficiency and downstream performance. General-purpose tokenizers trained predominantly on English and Latin-script languages exhibit suboptimal performance on morphologically rich languages such as Arabic, resulting in inflated token sequences and reduced compression efficiency. In this work, we present AraToken, an Arabic-optimized tokenizer built on SentencePiece Unigram algorithm with a comprehensive normalization pipeline addressing Arabic-specific orthographic variations including Alif variants, diacritics, and Arabic-Indic numerals. We systematically compare BPE, WordPiece, and SentencePiece algorithms across multiple configurations, demonstrating that SentencePiece with normalization achieves 18% lower fertility (1.199 vs 1.35 tokens/word) compared to unnormalized baselines. Furthermore, we introduce the Language Extension Pipeline (LEP), a method for integrating the optimized tokenizer into Qwen3-0.6B through vocabulary extension with mean subtoken initialization and selective transformer layer unfreezing. Our experiments show that LEP reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. We release our tokenizer, training scripts, and model checkpoints to facilitate Arabic NLP research.
Source: arXiv cs.CL
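The fertility metric the paper reports (tokens emitted per whitespace word, e.g. 1.199 vs 1.35) can be computed as below. The toy "tokenizer" that splits unknown words into characters is an illustrative stand-in for a real SentencePiece model, not AraToken.

```python
# Sketch: the fertility metric used to compare tokenizers (average tokens
# per whitespace word). The toy tokenizer is illustrative only.
VOCAB = {"the", "cat", "token"}

def toy_tokenize(word):
    if word in VOCAB:
        return [word]
    return list(word)                      # fall back to character pieces

def fertility(text):
    words = text.split()
    n_tokens = sum(len(toy_tokenize(w)) for w in words)
    return n_tokens / len(words)

f = fertility("the cat ran")   # "ran" splits into 3 characters: 5 tokens / 3 words
```

Lower fertility means shorter token sequences for the same text, which is exactly the compression gain AraToken's normalization pipeline targets.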
NLP/LLMs • Score 92
LIR$^3$AG: A Lightweight Rerank Reasoning Strategy Framework for Retrieval-Augmented Generation
arXiv:2512.18329v1 Announce Type: new
Abstract: Retrieval-Augmented Generation (RAG) effectively enhances Large Language Models (LLMs) by incorporating retrieved external knowledge into the generation process. Reasoning models improve LLM performance in multi-hop QA tasks, which require integrating and reasoning over multiple pieces of evidence across different documents to answer a complex question. However, they often introduce substantial computational costs, including increased token consumption and inference latency. To better understand and mitigate this trade-off, we conduct a comprehensive study of reasoning strategies for reasoning models in RAG multi-hop QA tasks. Our findings reveal that reasoning models adopt structured strategies to integrate retrieved and internal knowledge, primarily following two modes: Context-Grounded Reasoning, which relies directly on retrieved content, and Knowledge-Reconciled Reasoning, which resolves conflicts or gaps using internal knowledge. To this end, we propose a novel Lightweight Rerank Reasoning Strategy Framework for RAG (LiR$^3$AG) to enable non-reasoning models to transfer reasoning strategies by restructuring retrieved evidence into coherent reasoning chains. LiR$^3$AG reduces output-token overhead by 98% on average and inference time by 58.6%, while improving an 8B non-reasoning model's F1 score by 6.2% to 22.5%, surpassing a 32B reasoning model in RAG and offering a practical and efficient path forward for RAG systems.
Source: arXiv cs.CL
NLP/LLMs • Score 95
FASTRIC: A Prompt Specification Language for Verifiable LLM Interactions
Large Language Models (LLMs) execute complex interaction protocols but lack formal specifications for verifying execution against designer intent. We present FASTRIC, a Prompt Specification Language that makes implicit Finite State Machines (FSMs) explicit in natural-language prompts, enabling conformance verification through execution trace analysis.
Source: arXiv cs.CL
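The conformance-verification idea can be sketched as replaying an execution trace against an explicit FSM and rejecting any undefined transition. The states and events below are hypothetical, not FASTRIC's actual specification syntax.

```python
# Sketch: trace-conformance checking against an explicit FSM, in the
# spirit of FASTRIC. States and events are hypothetical.
TRANSITIONS = {
    ("greet", "ask_name"): "collect",
    ("collect", "got_name"): "confirm",
    ("confirm", "yes"): "done",
    ("confirm", "no"): "collect",
}

def conforms(trace, start="greet", accept=frozenset({"done"})):
    """Replay events over the FSM; reject on any undefined transition."""
    state = start
    for event in trace:
        key = (state, event)
        if key not in TRANSITIONS:
            return False                   # the LLM deviated from the protocol
        state = TRANSITIONS[key]
    return state in accept

ok  = conforms(["ask_name", "got_name", "yes"])
bad = conforms(["ask_name", "yes"])        # 'yes' is not valid in 'collect'
```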
NLP/LLMs • Score 92
XLM: A Python package for non-autoregressive language models
arXiv:2512.17065v1 Announce Type: new
Abstract: In recent years, there has been a resurgence of interest in non-autoregressive text generation in the context of general language modeling. Unlike the well-established autoregressive language modeling paradigm, which has a plethora of standard training and inference libraries, implementations of non-autoregressive language modeling have largely been bespoke, making it difficult to perform systematic comparisons of different methods. Moreover, each non-autoregressive language model typically requires its own data collation, loss, and prediction logic, making it challenging to reuse common components. In this work, we present the XLM Python package, which is designed to make implementing small non-autoregressive language models faster, with a secondary goal of providing a suite of small pre-trained models (through a companion xlm-models package) that can be used by the research community. The code is available at https://github.com/dhruvdcoder/xlm-core.
Source: arXiv cs.CL
RL • Score 96
Security Risks of Agentic Vehicles: A Systematic Analysis of Cognitive and Cross-Layer Threats
Agentic Artificial Intelligence is increasingly being explored in both manual and autonomous vehicles, giving rise to the notion of Agentic Vehicles (AgVs). This paper investigates security threats in AgVs, including OWASP-style risks and cyberattacks from other layers that affect the agentic layer. A new framework is proposed for analyzing these risks on current and emerging vehicle platforms.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Deep Learning-Based Surrogate Creep Modelling in Inconel 625: A High-Temperature Alloy Study
arXiv:2512.17477v1 Announce Type: new
Abstract: Time-dependent deformation, particularly creep, in high-temperature alloys such as Inconel 625 is a key factor in the long-term reliability of components used in aerospace and energy systems. Although Inconel 625 shows excellent creep resistance, finite-element creep simulations in tools such as ANSYS remain computationally expensive, often requiring tens of minutes for a single 10,000-hour run. This work proposes deep learning based surrogate models to provide fast and accurate replacements for such simulations. Creep strain data was generated in ANSYS using the Norton law under uniaxial stresses of 50 to 150 MPa and temperatures of 700 to 1000 $^\circ$C, and this temporal dataset was used to train two architectures: a BiLSTM Variational Autoencoder for uncertainty-aware and generative predictions, and a BiLSTM Transformer hybrid that employs self-attention to capture long-range temporal behavior. Both models act as surrogate predictors, with the BiLSTM-VAE offering probabilistic output and the BiLSTM-Transformer delivering high deterministic accuracy. Performance is evaluated using RMSE, MAE, and $R^2$. Results show that the BiLSTM-VAE provides stable and reliable creep strain forecasts, while the BiLSTM-Transformer achieves strong accuracy across the full time range. Latency tests indicate substantial speedup: while each ANSYS simulation requires 30 to 40 minutes for a given stress-temperature condition, the surrogate models produce predictions within seconds. The proposed framework enables rapid creep assessment for design optimization and structural health monitoring, and provides a scalable solution for high-temperature alloy applications.
Source: arXiv cs.LG
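The constitutive model behind the paper's training data is the Norton power law, which gives a steady-state creep strain rate of the form dε/dt = A·σⁿ. The sketch below integrates it under constant stress; the coefficients A and n are illustrative, not calibrated values for Inconel 625.

```python
# Sketch: generating a steady-state creep strain curve from the Norton law
# (strain rate = A * sigma**n), the model used to build the ANSYS dataset.
# A and n are illustrative assumptions, not Inconel 625 parameters.
def norton_creep_curve(sigma_mpa, hours, a=1e-14, n=4.0, steps=100):
    """Integrate d(eps)/dt = a * sigma^n over `hours` with explicit Euler."""
    rate = a * sigma_mpa ** n              # constant under constant stress
    dt = hours / steps
    eps, curve = 0.0, []
    for _ in range(steps):
        eps += rate * dt
        curve.append(eps)
    return curve

curve = norton_creep_curve(sigma_mpa=100.0, hours=10_000.0)
# with n = 4, doubling the stress multiplies the strain rate by 16
```

A surrogate model in the paper's setup learns to map (stress, temperature, time) to curves like this one, skipping the finite-element solve entirely.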
NLP/LLMs • Score 96
GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping
arXiv:2512.17570v1 Announce Type: new
Abstract: SSD-offloaded training offers a practical and promising approach to making LLM training cost-effective. Building on gradient accumulation with micro-batches, this paper introduces GreedySnake, a new SSD-offloaded training system that employs vertical scheduling, which executes all microbatches of a layer before proceeding to the next. Compared to existing systems that use horizontal scheduling (i.e., executing micro-batches sequentially), GreedySnake achieves higher training throughput with smaller batch sizes, bringing the system much closer to the ideal scenario predicted by the roofline model. To further mitigate the I/O bottleneck, GreedySnake overlaps part of the optimization step with the forward pass of the next iteration. Experimental results on A100 GPUs show that GreedySnake achieves saturated training throughput improvements over ZeRO-Infinity: 1.96x on 1 GPU and 1.93x on 4 GPUs for GPT-65B, and 2.53x on 1 GPU for GPT-175B. The code is open-sourced at https://github.com/npz7yyk/GreedySnake
Source: arXiv cs.LG
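The scheduling difference is easy to see by generating the two execution orders and counting how often a layer's offloaded weights must be (re)fetched. This is a property of the ordering itself, not a measurement from the paper.

```python
# Sketch: horizontal vs vertical scheduling for gradient accumulation with
# offloaded weights. Counting layer switches shows why the vertical order
# is friendlier to SSD offloading.
def horizontal(layers, micro_batches):
    """Run each micro-batch through all layers before the next micro-batch."""
    return [(mb, l) for mb in range(micro_batches) for l in range(layers)]

def vertical(layers, micro_batches):
    """Run all micro-batches through a layer before moving to the next layer."""
    return [(mb, l) for l in range(layers) for mb in range(micro_batches)]

def weight_fetches(order):
    """A layer's weights must be (re)fetched whenever the active layer changes."""
    fetches, current = 0, None
    for _, layer in order:
        if layer != current:
            fetches, current = fetches + 1, layer
    return fetches

h = weight_fetches(horizontal(layers=4, micro_batches=8))   # fetch per step
v = weight_fetches(vertical(layers=4, micro_batches=8))     # fetch per layer
```

Both orders execute the same (micro-batch, layer) pairs; the vertical one simply amortizes each layer's I/O across all micro-batches, which is the core of GreedySnake's throughput gain.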
NLP/LLMs • Score 96
Learning What to Write: Write-Gated KV for Efficient Long-Context Inference
arXiv:2512.17452v1 Announce Type: new
Abstract: Long-context LLM inference is bottlenecked by the quadratic attention complexity and linear KV cache growth. Prior approaches mitigate this via post-hoc selection or eviction but overlook the root inefficiency: indiscriminate writing to persistent memory. In this paper, we formalize KV cache management as a causal system of three primitives: KV Admission, Selection, and Eviction. We instantiate KV Admission via Write-Gated KV, a lightweight mechanism that learns to predict token utility before it enters the cache. By filtering out low-utility states early to maintain a compact global cache alongside a sliding local cache, Write-Gated KV reduces memory usage by 46-57% and delivers 3.03-3.45$\times$ prefill and 1.89-2.56$\times$ decode speedups on Llama model with negligible accuracy loss, all while remaining compatible with FlashAttention and paged-KV systems. These results demonstrate that learning what to write, is a principled and practical recipe for efficient long-context inference. Code is available at https://github.com/EMCLab-Sinica/WG-KV .
Source: arXiv cs.LG
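The KV Admission primitive can be sketched as follows: a short sliding local window keeps recent tokens unconditionally, while a compact global cache only admits tokens whose predicted utility clears a threshold. The utility scores below are made up; in Write-Gated KV this gate is learned.

```python
# Sketch: the admission idea behind Write-Gated KV -- a sliding local
# window plus a gated global cache. Utility scores are illustrative;
# the paper learns the gate before tokens enter the cache.
from collections import deque

def run_cache(stream, window=3, threshold=0.5):
    local = deque(maxlen=window)           # recent tokens, always kept
    global_cache = []                      # compact persistent memory
    for token, utility in stream:
        local.append(token)
        if utility >= threshold:           # KV Admission: write only if useful
            global_cache.append(token)
    return list(local), global_cache

stream = [("The", 0.1), ("launch", 0.9), ("code", 0.8), ("is", 0.1),
          ("4711", 0.95), ("ok", 0.2), ("thanks", 0.1)]
local, global_cache = run_cache(stream)
# low-utility filler never reaches the persistent cache
```

Filtering at write time, rather than evicting after the fact, is what keeps the global cache growth sublinear in sequence length.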
Vision • Score 95
PhysFire-WM: A Physics-Informed World Model for Emulating Fire Spread Dynamics
arXiv:2512.17152v1 Announce Type: new
Abstract: Fine-grained fire prediction plays a crucial role in emergency response. Infrared images and fire masks provide complementary thermal and boundary information, yet current methods are predominantly limited to binary mask modeling with inherent signal sparsity, failing to capture the complex dynamics of fire. While world models show promise in video generation, their physical inconsistencies pose significant challenges for fire forecasting. This paper introduces PhysFire-WM, a Physics-informed World Model for emulating Fire spread dynamics. Our approach internalizes combustion dynamics by encoding structured priors from a Physical Simulator to rectify physical discrepancies, coupled with a Cross-task Collaborative Training strategy (CC-Train) that alleviates the issue of limited information in mask-based modeling. Through parameter sharing and gradient coordination, CC-Train effectively integrates thermal radiation dynamics and spatial boundary delineation, enhancing both physical realism and geometric accuracy. Extensive experiments on a fine-grained multimodal fire dataset demonstrate the superior accuracy of PhysFire-WM in fire spread prediction. Validation underscores the importance of physical priors and cross-task collaboration, providing new insights for applying physics-informed world models to disaster prediction.
Source: arXiv cs.CV
RL • Score 96
Adaptive Graph Pruning with Sudden-Events Evaluation for Traffic Prediction using Online Semi-Decentralized ST-GNNs
arXiv:2512.17352v1 Announce Type: new
Abstract: Spatio-Temporal Graph Neural Networks (ST-GNNs) are well-suited for processing high-frequency data streams from geographically distributed sensors in smart mobility systems. However, their deployment at the edge across distributed compute nodes (cloudlets) creates substantial communication overhead due to repeated transmission of overlapping node features between neighbouring cloudlets. To address this, we propose an adaptive pruning algorithm that dynamically filters redundant neighbour features while preserving the most informative spatial context for prediction. The algorithm adjusts pruning rates based on recent model performance, allowing each cloudlet to focus on regions experiencing traffic changes without compromising accuracy. Additionally, we introduce the Sudden Event Prediction Accuracy (SEPA), a novel event-focused metric designed to measure responsiveness to traffic slowdowns and recoveries, which are often missed by standard error metrics. We evaluate our approach in an online semi-decentralized setting with traditional FL, server-free FL, and Gossip Learning on two large-scale traffic datasets, PeMS-BAY and PeMSD7-M, across short-, mid-, and long-term prediction horizons. Experiments show that, in contrast to standard metrics, SEPA exposes the true value of spatial connectivity in predicting dynamic and irregular traffic. Our adaptive pruning algorithm maintains prediction accuracy while significantly lowering communication cost in all online semi-decentralized settings, demonstrating that communication can be reduced without compromising responsiveness to critical traffic events.
Source: arXiv cs.LG
NLP/LLMs • Score 95
Globally Optimal Solution to the Generalized Relative Pose Estimation Problem using Affine Correspondences
arXiv:2512.17188v1 Announce Type: new
Abstract: Platforms equipped with a multi-camera system and an inertial measurement unit (IMU), such as self-driving cars, are widely used nowadays. The task of relative pose estimation using visual and inertial information has important applications in various fields. To improve the accuracy of relative pose estimation for multi-camera systems, we propose a globally optimal solver using affine correspondences to estimate the generalized relative pose with a known vertical direction. First, a cost function in the relative rotation angle is established after decoupling the rotation matrix and translation vector; it minimizes the algebraic error of the geometric constraints from affine correspondences. Then, the global optimization problem is converted into two polynomials in two unknowns based on the characteristic equation and the condition that its first derivative is zero. Finally, the relative rotation angle can be solved with a polynomial eigenvalue solver, and the translation vector can be obtained from the eigenvector. In addition, a new linear solution is proposed for the case when the relative rotation is small. The proposed solver is evaluated on synthetic data and real-world datasets. The experimental results demonstrate that our method outperforms comparable state-of-the-art methods in accuracy.
Source: arXiv cs.CV
NLP/LLMs • Score 96
Reinforcement Learning for a Self-Improving Agent with a Skill Library
Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in complex reasoning and multi-step interactions, yet they struggle to continuously improve and adapt in new environments. We propose a Reinforcement Learning (RL)-based approach that enhances agents' self-improvement capabilities with a skill library.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Task Schema and Binding: A Double Dissociation Study of In-Context Learning
arXiv:2512.17325v1 Announce Type: new
Abstract: We provide causal mechanistic validation that in-context learning (ICL) decomposes into two separable mechanisms: Task Schema (abstract task type recognition) and Binding (specific input-output associations). Through activation patching experiments across 9 models from 7 Transformer families plus Mamba (370M-13B parameters), we establish three key findings:
1. Double dissociation: Task Schema transfers at 100% via late MLP patching; Binding transfers at 62% via residual stream patching -- proving separable mechanisms
2. Prior-Schema trade-off: Schema reliance inversely correlates with prior knowledge (Spearman rho = -0.596, p < 0.001, N=28 task-model pairs)
3. Architecture generality: The mechanism operates across all tested architectures including the non-Transformer Mamba
These findings offer a mechanistic account of the ICL puzzle that contrasts with prior views treating ICL as a monolithic mechanism (whether retrieval-based, gradient descent-like, or purely Bayesian). By establishing that Schema and Binding are neurally dissociable -- not merely behavioral modes -- we provide causal evidence for dual-process theories of ICL. Models rely on Task Schema when prior knowledge is absent, but prior knowledge interferes through attentional mis-routing (72.7% recency bias) rather than direct output competition (0%). This explains why arbitrary mappings succeed (zero prior leads to full Schema reliance) while factual overrides fail -- and reveals that the true bottleneck is attentional, not output-level. Practical implications: Understanding these dual mechanisms enables more efficient prompt engineering -- reliable schema transfer reduces required demonstrations for novel tasks, while prior-aware design can mitigate the 38% binding failure rate in high-prior scenarios, improving ICL system reliability in production deployments.
Source: arXiv cs.LG
NLP/LLMs • Score 92
EMAG: Self-Rectifying Diffusion Sampling with Exponential Moving Average Guidance
In diffusion and flow-matching generative models, guidance techniques are widely used to improve sample quality and consistency. We propose EMAG, a mechanism that modifies attention at inference time in diffusion transformers, producing harder, semantically faithful negatives and improving quality and human preference score (HPS) by +0.46 over CFG.
Source: arXiv cs.CV
MLOps/Systems • Score 95
Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus
arXiv:2411.07892v2 Announce Type: replace
Abstract: Podcasts provide highly diverse content to a massive listener base through a unique on-demand modality. However, limited data has prevented large-scale computational analysis of the podcast ecosystem. To fill this gap, we introduce a massive dataset of over 1.1M podcast transcripts that is largely comprehensive of all English language podcasts available through public RSS feeds from May and June of 2020. This data is not limited to text, but rather includes audio features and speaker turns for a subset of 370K episodes, and speaker role inferences and other metadata for all 1.1M episodes. Using this data, we also conduct a foundational investigation into the content, structure, and responsiveness of this ecosystem. Together, our data and analyses open the door to continued computational research of this popular and impactful medium.
Source: arXiv cs.CL
NLP/LLMs • Score 95
LLM-as-a-qualitative-judge: automating error analysis in natural language generation
arXiv:2506.09147v4 Announce Type: replace
Abstract: Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be done to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that instance-specific issues output by LLM-as-a-qualitative-judge match those annotated by humans in 2/3 cases, and that LLM-as-a-qualitative-judge is capable of producing error type reports resembling the reports composed by human annotators. We also demonstrate in a case study how the use of LLM-as-a-qualitative-judge can substantially improve NLG systems performance. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.
Fonte: arXiv cs.CL
RL • Score 92
Computational analysis reveals historical trajectory of East-Polynesian lunar calendars
arXiv:2512.17525v1 Announce Type: cross
Abstract: We investigate a type of lunar calendar known as lists of the 'nights of the moon', found throughout East Polynesia, including Rapa Nui (Easter Island). Using computational methods, we analyzed the lexical and structural divergence of 49 calendric lists from all major archipelagos, each containing about 30 night names. Our results, presented as a rooted phylogenetic tree, show a clear split into two main groups: one including lists from Rapa Nui, Mangareva, and the Marquesas; the other comprising lists from New Zealand, Hawaii, the Cook Islands, the Austral Islands, Tahiti, and the Tuamotu. This pattern aligns with a recent alternative classification of East Polynesian languages into 'Distal' (Marquesan, Mangarevan, Rapanui) and 'Proximal' (Maori, Hawaiian, Tahitian, etc.) subgroups. Since both language and lunar calendars are symbolic systems passed down and changed within communities - and given the geographic isolation of many archipelagos - we interpret this correspondence as evidence that the early divergence of East Polynesian lunar calendars mirrors early population movements and language splits in the region.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Towards Safer Chatbots: Automated Policy Compliance Evaluation of Custom GPTs
arXiv:2502.01436v3 Announce Type: replace
Abstract: User-configured chatbots built on top of large language models are increasingly available through centralized marketplaces such as OpenAI's GPT Store. While these platforms enforce usage policies intended to prevent harmful or inappropriate behavior, the scale and opacity of customized chatbots make systematic policy enforcement challenging. As a result, policy-violating chatbots remain publicly accessible despite existing review processes. This paper presents a fully automated method for evaluating the compliance of Custom GPTs with the marketplace's usage policy using black-box interaction. The method combines large-scale GPT discovery, policy-driven red-teaming prompts, and automated compliance assessment using an LLM-as-a-judge. We focus on three policy-relevant domains explicitly addressed in OpenAI's usage policies: Romantic, Cybersecurity, and Academic GPTs. We validate our compliance assessment component against a human-annotated ground-truth dataset, achieving an F1 score of 0.975 for binary policy violation detection. We then apply the method in a large-scale empirical study of 782 Custom GPTs retrieved from the GPT Store. The results show that 58.7% of the evaluated GPTs exhibit at least one policy-violating response, with substantial variation across policy domains. A comparison with the base models (GPT-4 and GPT-4o) indicates that most violations originate from model-level behavior, while customization tends to amplify these tendencies rather than create new failure modes. Our findings reveal limitations in current review mechanisms for user-configured chatbots and demonstrate the feasibility of scalable, behavior-based policy compliance evaluation.
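The reported F1 of 0.975 is the standard harmonic mean of precision and recall; a minimal sketch (the TP/FP/FN counts below are hypothetical, chosen only to reproduce that value):

```python
def f1_score(tp, fp, fn):
    """Binary F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 39 true positives, 1 false positive, 1 false negative
score = f1_score(39, 1, 1)  # 0.975
```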
Fonte: arXiv cs.CL
RL • Score 96
When Reasoning Meets Its Laws
arXiv:2512.17901v1 Announce Type: new. This paper introduces the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in Large Reasoning Models (LRMs). We propose a compute law and a supplementary accuracy law, and introduce LoRe-Bench to measure these properties in reasoning models. Evaluations show that most models exhibit reasonable monotonicity but lack compositionality.
Fonte: arXiv cs.AI
RL • Score 96
Electric Vehicle Charging Load Forecasting: An Experimental Comparison of Machine Learning Methods
arXiv:2512.17257v1 Announce Type: new
Abstract: With the growing popularity of electric vehicles as a means of addressing climate change, concerns have emerged regarding their impact on electric grid management. As a result, predicting EV charging demand has become a timely and important research problem. While substantial research has addressed energy load forecasting in transportation, relatively few studies systematically compare multiple forecasting methods across different temporal horizons and spatial aggregation levels in diverse urban settings. This work investigates the effectiveness of five time series forecasting models, ranging from traditional statistical approaches to machine learning and deep learning methods. Forecasting performance is evaluated for short-, mid-, and long-term horizons (on the order of minutes, hours, and days, respectively), and across spatial scales ranging from individual charging stations to regional and city-level aggregations. The analysis is conducted on four publicly available real-world datasets, with results reported independently for each dataset. To the best of our knowledge, this is the first work to systematically evaluate EV charging demand forecasting across such a wide range of temporal horizons and spatial aggregation levels using multiple real-world datasets.
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens
arXiv:2512.17375v1 Announce Type: new
Abstract: Reward models and LLM-as-a-Judge systems are central to modern post-training pipelines such as RLHF, DPO, and RLAIF, where they provide scalar feedback and binary decisions that guide model selection and RL-based fine-tuning. We show that these judge systems exhibit a recurring vulnerability: short sequences of low-perplexity control tokens can flip many binary evaluations from correct "No" judgments to incorrect "Yes" judgments by steering the last-layer logit gap. These control tokens are patterns that a policy model could plausibly generate during post-training, and thus represent realistic reward-hacking risks rather than worst-case adversarial strings. Our method, AdvJudge-Zero, uses the model's next-token distribution and beam-search exploration to discover diverse control-token sequences from scratch, and our analysis shows that the induced hidden-state perturbations concentrate in a low-rank "soft mode" that is anti-aligned with the judge's refusal direction. Empirically, these tokens cause very high false positive rates when large open-weight and specialized judge models score incorrect answers on math and reasoning benchmarks. Finally, we show that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives while preserving evaluation quality.
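The beam-search exploration can be sketched in toy form; here a hand-written scoring function stands in for the judge's logit gap, and the three-token vocabulary is hypothetical:

```python
def beam_search_tokens(vocab, score_fn, seq_len=3, beam_width=2):
    """Toy beam search: extend token sequences one position at a time,
    keeping the beam_width candidates with the highest black-box score."""
    beams = [((), 0.0)]
    for _ in range(seq_len):
        candidates = [
            (seq + (tok,), score_fn(seq + (tok,)))
            for seq, _ in beams
            for tok in vocab
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Stand-in "judge": rewards sequences containing the token "sure"
def toy_score(seq):
    return seq.count("sure") + 0.1 * seq.count("ok")

best_seq, best_score = beam_search_tokens(["sure", "ok", "no"], toy_score)
```

In AdvJudge-Zero the candidates come from the judge's own next-token distribution and the score is the induced shift in the Yes/No logit gap; the search skeleton is the same.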
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
PAACE: A Planning-Aware Automated Context Engineering Framework
Large Language Models (LLMs) are increasingly used in complex workflows involving planning, tool use, reflection, and interaction with external knowledge systems. This work presents PAACE, a unified framework for optimizing the evolving state of LLM agents through next-task relevance modeling, planning-structure analysis, instruction co-refinement, and function-preserving compression.
Fonte: arXiv cs.AI
MLOps/Systems • Score 93
SDUM: A Scalable Deep Unrolled Model for Universal MRI Reconstruction
SDUM is a universal framework that combines a Restormer-based reconstructor, a learned coil sensitivity map estimator (CSME), and sampling-aware weighted data consistency (SWDC). It exhibits scaling behavior similar to foundation models, achieving state-of-the-art results on MRI reconstruction challenges without task-specific fine-tuning.
Fonte: arXiv cs.AI
NLP/LLMs • Score 95
Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters
arXiv:2505.14886v2 Announce Type: replace
Abstract: Winning competitive debates requires sophisticated reasoning and argument skills. There are unique challenges in competitive debate: (1) The time constraints force debaters to make strategic choices about which points to pursue rather than covering all possible arguments; (2) The persuasiveness of the debate relies on the back-and-forth interaction between arguments, which a single final game status cannot evaluate. To address these challenges, we propose TreeDebater, a novel debate framework that excels in competitive debate. We introduce two tree structures: the Rehearsal Tree and the Debate Flow Tree. The Rehearsal Tree anticipates attacks and defenses to evaluate the strength of a claim, while the Debate Flow Tree tracks the debate status to identify the active actions. TreeDebater allocates its time budget among candidate actions and uses the speech time controller and feedback from the simulated audience to revise its statement. Human evaluation on both stage-level and debate-level comparisons shows that TreeDebater outperforms the state-of-the-art multi-agent debate system, with a +15.6% improvement in stage-level persuasiveness with DeepSeek and a +10% win rate in debate-level opinion shift. Further investigation shows that TreeDebater adopts better strategies in limiting time to important debate actions, aligning with the strategies of human debate experts.
Fonte: arXiv cs.CL
NLP/LLMs • Score 95
Rotterdam artery-vein segmentation (RAV) dataset
arXiv:2512.17322v1 Announce Type: new
Abstract: Purpose: To provide a diverse, high-quality dataset of color fundus images (CFIs) with detailed artery-vein (A/V) segmentation annotations, supporting the development and evaluation of machine learning algorithms for vascular analysis in ophthalmology.
Methods: CFIs were sampled from the longitudinal Rotterdam Study (RS), encompassing a wide range of ages, devices, and capture conditions. Images were annotated using a custom interface that allowed graders to label arteries, veins, and unknown vessels on separate layers, starting from an initial vessel segmentation mask. Connectivity was explicitly verified and corrected using connected component visualization tools.
Results: The dataset includes 1024x1024-pixel PNG images in three modalities: original RGB fundus images, contrast-enhanced versions, and RGB-encoded A/V masks. Image quality varied widely, including challenging samples typically excluded by automated quality assessment systems, but judged to contain valuable vascular information.
Conclusion: This dataset offers a rich and heterogeneous source of CFIs with high-quality segmentations. It supports robust benchmarking and training of machine learning models under real-world variability in image quality and acquisition settings.
Translational Relevance: By including connectivity-validated A/V masks and diverse image conditions, this dataset enables the development of clinically applicable, generalizable machine learning tools for retinal vascular analysis, potentially improving automated screening and diagnosis of systemic and ocular diseases.
Fonte: arXiv cs.CV
MLOps/Systems • Score 96
MINPO: Memory-Informed Neural Pseudo-Operator to Resolve Nonlocal Spatiotemporal Dynamics
arXiv:2512.17273v1 Announce Type: new
Abstract: Many physical systems exhibit nonlocal spatiotemporal behaviors described by integro-differential equations (IDEs). Classical methods for solving IDEs require repeatedly evaluating convolution integrals, whose cost increases quickly with kernel complexity and dimensionality. Existing neural solvers can accelerate selected instances of these computations, yet they do not generalize across diverse nonlocal structures. In this work, we introduce the Memory-Informed Neural Pseudo-Operator (MINPO), a unified framework for modeling nonlocal dynamics arising from long-range spatial interactions and/or long-term temporal memory. MINPO, employing either Kolmogorov-Arnold Networks (KANs) or multilayer perceptron networks (MLPs) as encoders, learns the nonlocal operator and its inverse directly through neural representations, and then explicitly reconstructs the unknown solution fields. The learning is guided by a lightweight nonlocal consistency loss term that enforces coherence between the learned operator and the reconstructed solution. The MINPO formulation naturally captures and efficiently resolves nonlocal spatiotemporal dependencies governed by a wide spectrum of IDEs and their subsets, including fractional PDEs. We evaluate the efficacy of MINPO in comparison with classical techniques and state-of-the-art neural-based strategies based on MLPs, such as A-PINN and fPINN, along with their newly-developed KAN variants, A-PIKAN and fPIKAN, designed to facilitate a fair comparison. Our study offers compelling evidence of the accuracy of MINPO and demonstrates its robustness in handling (i) diverse kernel types, (ii) different kernel dimensionalities, and (iii) the substantial computational demands arising from repeated evaluations of kernel integrals. MINPO thus generalizes beyond problem-specific formulations, providing a unified framework for systems governed by nonlocal operators.
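An illustrative example of the IDE class MINPO targets (not an equation from the paper): dynamics combining a local operator $\mathcal{D}$, a spatial nonlocal term with kernel $K$, and a temporal memory term with kernel $M$:

```latex
\partial_t u(x,t) \;=\; \mathcal{D}u(x,t)
 \;+\; \int_{\Omega} K(x,y)\, u(y,t)\, dy
 \;+\; \int_{0}^{t} M(t-s)\, u(x,s)\, ds .
```

For power-law memory kernels $M(t-s) \propto (t-s)^{-\alpha}$, the temporal term reduces to a Caputo-type fractional derivative, which is how fractional PDEs arise as a subset of this class.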
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
ScoutGPT: Capturing Player Impact from Team Action Sequences Using a GPT-Based Framework
Transfers play a key role in a football club's success, but predicting whether a transfer will succeed remains difficult because on-pitch performance is strongly context-dependent. To address this gap, we introduce EventGPT, a player-conditioned, value-aware next-event prediction model built on a GPT-style autoregressive transformer.
Fonte: arXiv cs.AI
RL • Score 93
Learning solution operator of dynamical systems with diffusion maps kernel ridge regression
arXiv:2512.17203v1 Announce Type: new
Abstract: Many scientific and engineering systems exhibit complex nonlinear dynamics that are difficult to predict accurately over long time horizons. Although data-driven models have shown promise, their performance often deteriorates when the geometric structures governing long-term behavior are unknown or poorly represented. We demonstrate that a simple kernel ridge regression (KRR) framework, when combined with a dynamics-aware validation strategy, provides a strong baseline for long-term prediction of complex dynamical systems. By employing a data-driven kernel derived from diffusion maps, the proposed Diffusion Maps Kernel Ridge Regression (DM-KRR) method implicitly adapts to the intrinsic geometry of the system's invariant set, without requiring explicit manifold reconstruction or attractor modeling, procedures that often limit predictive performance. Across a broad range of systems, including smooth manifolds, chaotic attractors, and high-dimensional spatiotemporal flows, DM-KRR consistently outperforms state-of-the-art random feature, neural-network and operator-learning methods in both accuracy and data efficiency. These findings underscore that long-term predictive skill depends not only on model expressiveness, but critically on respecting the geometric constraints encoded in the data through dynamically consistent model selection. Together, simplicity, geometry awareness, and strong empirical performance point to a promising path for reliable and efficient learning of complex dynamical systems.
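A minimal sketch of the idea, assuming a density-normalized Gaussian affinity as the diffusion-maps-style kernel and an in-sample fit (the actual DM-KRR construction and its dynamics-aware validation are more involved):

```python
import numpy as np

def diffusion_kernel(X, eps=0.5):
    """Symmetric diffusion-maps-style kernel: Gaussian affinity with
    density normalization (divide by the degree on both sides)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / eps)
    q = W.sum(axis=1)
    return W / np.outer(q, q)  # remains symmetric positive semidefinite

def krr_fit_predict(X, y, lam=1e-6):
    """Kernel ridge regression with the diffusion kernel (in-sample)."""
    K = diffusion_kernel(X)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return K @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (60, 1))
y = np.sin(3.0 * X[:, 0])
pred = krr_fit_predict(X, y)
```

The density normalization is what lets the kernel adapt to the sampling of the underlying invariant set; in the paper the regression targets are the one-step flow map rather than a fixed function.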
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
Generating Completions for Broca's Aphasic Sentences Using Large Language Models
arXiv:2412.17669v2 Announce Type: replace
Abstract: Broca's aphasia is a type of aphasia characterized by non-fluent, effortful and agrammatic speech production with relatively good comprehension. Since traditional aphasia treatment methods are often time-consuming, labour-intensive, and do not reflect real-world conversations, applying natural language processing based approaches such as Large Language Models (LLMs) could potentially contribute to improving existing treatment approaches. To address this issue, we explore the use of sequence-to-sequence LLMs for completing Broca's aphasic sentences. We first generate synthetic Broca's aphasic data using a rule-based system designed to mirror the linguistic characteristics of Broca's aphasic speech. Using this synthetic data (without authentic aphasic samples), we then fine-tune four pre-trained LLMs on the task of completing agrammatic sentences. We evaluate our fine-tuned models on both synthetic and authentic Broca's aphasic data. We demonstrate LLMs' capability for reconstructing agrammatic sentences, with the models showing improved performance with longer input utterances. Our result highlights the LLMs' potential in advancing communication aids for individuals with Broca's aphasia and possibly other clinical populations.
Fonte: arXiv cs.CL
MLOps/Systems • Score 93
Digitizing Nepal's Written Heritage: A Comprehensive HTR Pipeline for Old Nepali Manuscripts
arXiv:2512.17111v1 Announce Type: new
Abstract: This paper presents the first end-to-end pipeline for Handwritten Text Recognition (HTR) for Old Nepali, a historically significant but low-resource language. We adopt a line-level transcription approach and systematically explore encoder-decoder architectures and data-centric techniques to improve recognition accuracy. Our best model achieves a Character Error Rate (CER) of 4.9%. In addition, we implement and evaluate decoding strategies and analyze token-level confusions to better understand model behaviour and error patterns. While the dataset we used for evaluation is confidential, we release our training code, model configurations, and evaluation scripts to support further research in HTR for low-resource historical scripts.
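CER is edit (Levenshtein) distance divided by reference length; a self-contained sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m

rate = cer("nepali", "nepale")  # one substitution over six characters
```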
Fonte: arXiv cs.LG
NLP/LLMs • Score 95
Predictive Modeling of Maritime Radar Data Using Transformer Architecture
arXiv:2512.17098v1 Announce Type: new
Abstract: Maritime autonomous systems require robust predictive capabilities to anticipate vessel motion and environmental dynamics. While transformer architectures have revolutionized AIS-based trajectory prediction and demonstrated feasibility for sonar frame forecasting, their application to maritime radar frame prediction remains unexplored, creating a critical gap given radar's all-weather reliability for navigation. This survey systematically reviews predictive modeling approaches relevant to maritime radar, with emphasis on transformer architectures for spatiotemporal sequence forecasting, where existing representative methods are analyzed according to data type, architecture, and prediction horizon. Our review shows that, while the literature has demonstrated transformer-based frame prediction for sonar sensing, no prior work addresses transformer-based maritime radar frame prediction, thereby defining a clear research gap and motivating a concrete research direction for future work in this area.
Fonte: arXiv cs.CV
NLP/LLMs • Score 96
Large Language Models as Pokémon Battle Agents: Strategic Play and Content Generation
arXiv:2512.17308v1 Announce Type: new
Abstract: Strategic decision-making in Pokémon battles presents a unique testbed for evaluating large language models. Pokémon battles demand reasoning about type matchups, statistical trade-offs, and risk assessment, skills that mirror human strategic thinking. This work examines whether Large Language Models (LLMs) can serve as competent battle agents, capable of both making tactically sound decisions and generating novel, balanced game content. We developed a turn-based Pokémon battle system where LLMs select moves based on battle state rather than pre-programmed logic. The framework captures essential Pokémon mechanics: type effectiveness multipliers, stat-based damage calculations, and multi-Pokémon team management. Through systematic evaluation across multiple model architectures, we measured win rates, decision latency, type-alignment accuracy, and token efficiency. These results suggest LLMs can function as dynamic game opponents without domain-specific training, offering a practical alternative to reinforcement learning for turn-based strategic games. The dual capability of tactical reasoning and content creation positions LLMs as both players and designers, with implications for procedural generation and adaptive difficulty systems in interactive entertainment.
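The type-effectiveness and stat-based damage mechanics can be illustrated with the mainline games' simplified damage formula (whether the paper's framework uses exactly this variant is an assumption; the random factor, STAB, and critical hits are omitted):

```python
def damage(level, power, attack, defense, type_multiplier):
    """Simplified mainline-games damage formula: base damage scaled by
    the type-effectiveness multiplier (0.0, 0.25, 0.5, 1, 2, or 4)."""
    base = (2 * level / 5 + 2) * power * attack / defense / 50 + 2
    return base * type_multiplier

# Hypothetical matchup: a power-90 Water move against a Fire-type (2x)
d = damage(level=50, power=90, attack=100, defense=80, type_multiplier=2.0)
```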
Fonte: arXiv cs.AI
MLOps/Systems • Score 95
Regularized Random Fourier Features and Finite Element Reconstruction for Operator Learning in Sobolev Space
arXiv:2512.17884v1 Announce Type: cross
Abstract: Operator learning is a data-driven approximation of mappings between infinite-dimensional function spaces, such as the solution operators of partial differential equations. Kernel-based operator learning can offer accurate, theoretically justified approximations that require less training than standard methods. However, they can become computationally prohibitive for large training sets and can be sensitive to noise. We propose a regularized random Fourier feature (RRFF) approach, coupled with a finite element reconstruction map (RRFF-FEM), for learning operators from noisy data. The method uses random features drawn from multivariate Student's $t$ distributions, together with frequency-weighted Tikhonov regularization that suppresses high-frequency noise. We establish high-probability bounds on the extreme singular values of the associated random feature matrix and show that when the number of features $N$ scales like $m \log m$ with the number of training samples $m$, the system is well-conditioned, which yields estimation and generalization guarantees. Detailed numerical experiments on benchmark PDE problems, including advection, Burgers', Darcy flow, Helmholtz, Navier-Stokes, and structural mechanics, demonstrate that RRFF and RRFF-FEM are robust to noise and achieve improved performance with reduced training time compared to the unregularized random feature model, while maintaining competitive accuracy relative to kernel and neural operator tests.
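A minimal sketch of the RRFF idea, assuming cosine features with Student's-t-distributed frequencies and a Tikhonov penalty that grows with frequency magnitude (the paper's exact weighting and the FEM reconstruction map are not reproduced):

```python
import numpy as np

rng = np.random.default_rng(0)

def rrff_fit(X, y, n_features=200, df=3.0, lam=1e-3):
    """Regularized random Fourier features: frequencies drawn from a
    Student's t distribution; high-frequency features penalized more."""
    W = rng.standard_t(df, size=(n_features, X.shape[1]))
    b = rng.uniform(0.0, 2.0 * np.pi, n_features)
    Z = np.cos(X @ W.T + b)
    weights = lam * (1.0 + np.linalg.norm(W, axis=1) ** 2)
    coef = np.linalg.solve(Z.T @ Z + np.diag(weights), Z.T @ y)
    return W, b, coef

def rrff_predict(X, W, b, coef):
    return np.cos(X @ W.T + b) @ coef

X = rng.uniform(-1.0, 1.0, (100, 1))
y = np.sin(2.0 * np.pi * X[:, 0])
W, b, coef = rrff_fit(X, y)
pred = rrff_predict(X, W, b, coef)
```

The frequency-weighted diagonal penalty is what suppresses high-frequency noise: features with large ||w|| must earn a proportionally larger reduction in the data-fit term to receive weight.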
Fonte: arXiv stat.ML
RL • Score 96
Distributed Learning in Markovian Restless Bandits over Interference Graphs for Stable Spectrum Sharing
arXiv:2512.17161v1 Announce Type: new
Abstract: We study distributed learning for spectrum access and sharing among multiple cognitive communication entities, such as cells, subnetworks, or cognitive radio users (collectively referred to as cells), in communication-constrained wireless networks modeled by interference graphs. Our goal is to achieve a globally stable and interference-aware channel allocation. Stability is defined through a generalized Gale-Shapley multi-to-one matching, a well-established solution concept in wireless resource allocation. We consider wireless networks where L cells share S orthogonal channels and cannot simultaneously use the same channel as their neighbors. Each channel evolves as an unknown restless Markov process with cell-dependent rewards, making this the first work to establish global Gale-Shapley stability for channel allocation in a stochastic, temporally varying restless environment. To address this challenge, we develop SMILE (Stable Multi-matching with Interference-aware LEarning), a communication-efficient distributed learning algorithm that integrates restless bandit learning with graph-constrained coordination. SMILE enables cells to distributedly balance exploration of unknown channels with exploitation of learned information. We prove that SMILE converges to the optimal stable allocation and achieves logarithmic regret relative to a genie with full knowledge of expected utilities. Simulations validate the theoretical guarantees and demonstrate SMILE's robustness, scalability, and efficiency across diverse spectrum-sharing scenarios.
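The generalized Gale-Shapley multi-to-one matching that defines stability can be sketched as deferred acceptance with channel capacities (the preference lists below are hypothetical; SMILE additionally has to learn the underlying utilities from restless bandit feedback):

```python
def deferred_acceptance(cell_prefs, channel_prefs, capacity):
    """Many-to-one Gale-Shapley: cells propose to channels in preference
    order; each channel keeps its top proposers up to capacity."""
    rank = {ch: {c: i for i, c in enumerate(p)}
            for ch, p in channel_prefs.items()}
    next_idx = {c: 0 for c in cell_prefs}
    held = {ch: [] for ch in channel_prefs}
    free = list(cell_prefs)
    while free:
        cell = free.pop()
        if next_idx[cell] >= len(cell_prefs[cell]):
            continue  # preference list exhausted; cell stays unmatched
        ch = cell_prefs[cell][next_idx[cell]]
        next_idx[cell] += 1
        held[ch].append(cell)
        held[ch].sort(key=lambda c: rank[ch][c])
        if len(held[ch]) > capacity[ch]:
            free.append(held[ch].pop())  # reject worst-ranked proposer
    return held

match = deferred_acceptance(
    cell_prefs={"c1": ["s1", "s2"], "c2": ["s1", "s2"], "c3": ["s1"]},
    channel_prefs={"s1": ["c3", "c1", "c2"], "s2": ["c2", "c1"]},
    capacity={"s1": 1, "s2": 2},
)
```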
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
Investigating the Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the capacity to autonomously conceive, investigate, and reason across scientific domains, is still lacking. We present an operational definition of SGI grounded in the Practical Inquiry Model (PIM) and operationalize it through four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning.
Fonte: arXiv cs.AI
Vision • Score 95
Stakeholder Suite: A Unified AI Framework for Mapping Actors, Topics and Arguments in Public Debates
arXiv:2512.17347v1 Announce Type: new
Abstract: Public debates surrounding infrastructure and energy projects involve complex networks of stakeholders, arguments, and evolving narratives. Understanding these dynamics is crucial for anticipating controversies and informing engagement strategies, yet existing tools in media intelligence largely rely on descriptive analytics with limited transparency. This paper presents Stakeholder Suite, a framework deployed in operational contexts for mapping actors, topics, and arguments within public debates. The system combines actor detection, topic modeling, argument extraction and stance classification in a unified pipeline. Tested on multiple energy infrastructure projects as a case study, the approach delivers fine-grained, source-grounded insights while remaining adaptable to diverse domains. The framework achieves strong retrieval precision and stance accuracy, producing arguments judged relevant in 75% of pilot use cases. Beyond quantitative metrics, the tool has proven effective for operational use: helping project teams visualize networks of influence, identify emerging controversies, and support evidence-based decision-making.
Fonte: arXiv cs.CL
NLP/LLMs • Score 96
Toward Explainable Conversational AI for Early Diagnosis with Large Language Models
Healthcare systems worldwide face inefficient diagnostics, rising costs, and limited access to specialists. This study presents a diagnostic chatbot powered by a Large Language Model (LLM), built on GPT-4o and explainable-AI techniques, that interacts with patients to extract and normalize symptoms and rank potential diagnoses.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
V-Agent: An Interactive Video Search System Using Vision-Language Models
We present V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversation. By fine-tuning a vision-language model (VLM) on a small video-preference dataset and augmenting it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios.
Fonte: arXiv cs.AI
RL • Score 96
SHARP-QoS: Sparsely-gated Hierarchical Adaptive Routing for joint Prediction of QoS
arXiv:2512.17262v1 Announce Type: new
Abstract: Dependable service-oriented computing relies on multiple Quality of Service (QoS) parameters that are essential to assess service optimality. However, real-world QoS data are extremely sparse, noisy, and shaped by hierarchical dependencies arising from QoS interactions and geographical and network-level factors, making accurate QoS prediction challenging. Existing methods often predict each QoS parameter separately, requiring multiple similar models, which increases computational cost and leads to poor generalization. Although recent joint QoS prediction studies have explored shared architectures, they suffer from negative transfer due to loss-scaling caused by inconsistent numerical ranges across QoS parameters and further struggle with inadequate representation learning, resulting in degraded accuracy. This paper presents a unified strategy for joint QoS prediction, called SHARP-QoS, that addresses these issues using three components. First, we introduce a dual mechanism to extract hierarchical features from both QoS and contextual structures via hyperbolic convolution formulated in the Poincaré ball. Second, we propose an adaptive feature-sharing mechanism that allows feature exchange across informative QoS and contextual signals. A gated feature fusion module is employed to support dynamic feature selection among structural and shared representations. Third, we design an EMA-based loss balancing strategy that allows stable joint optimization, thereby mitigating negative transfer. Evaluations on three datasets with two, three, and four QoS parameters demonstrate that SHARP-QoS outperforms both single- and multi-task baselines. Extensive study shows that our model effectively addresses major challenges, including sparsity, robustness to outliers, and cold-start, while maintaining moderate computational overhead, underscoring its capability for reliable joint QoS prediction.
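The EMA-based loss balancing idea, normalizing each task loss by its running average so differing numerical ranges do not dominate the joint objective, can be sketched as follows (a generic sketch, not the paper's exact formulation):

```python
class EMALossBalancer:
    """Scale each task loss by the inverse of its exponential moving
    average so tasks with different numeric ranges contribute comparably."""
    def __init__(self, n_tasks, beta=0.9, eps=1e-8):
        self.beta, self.eps = beta, eps
        self.ema = [None] * n_tasks

    def combine(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):
            # initialize the EMA with the first observed loss
            self.ema[i] = (loss if self.ema[i] is None
                           else self.beta * self.ema[i] + (1 - self.beta) * loss)
            total += loss / (self.ema[i] + self.eps)
        return total

balancer = EMALossBalancer(n_tasks=2)
# Two QoS losses on wildly different scales contribute ~1 each
step1 = balancer.combine([100.0, 0.01])
```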
Fonte: arXiv cs.LG
NLP/LLMs • Score 96
Accelerating Multi-modal LLM Game Performance via Input Prediction and Error Correction
Real-time sequential-control agents often face inference latency: even modest per-step planning delays can destabilize control and degrade overall performance. We propose a speculate-and-correct framework that adapts the predict-and-verify philosophy of speculative execution to model-based control with TD-MPC2.
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs
arXiv:2512.17131v1 Announce Type: cross
Abstract: We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method in its primal averaging formulation that addresses key limitations of recent averaging-based optimizers such as single-worker DiLoCo and Schedule-Free (SF) in the non-distributed setting. These two recent algorithmic approaches improve the performance of base optimizers, such as AdamW, through different iterate averaging strategies. Schedule-Free explicitly maintains a uniform average of past weights, while single-worker DiLoCo performs implicit averaging by periodically aggregating trajectories, called pseudo-gradients, to update the model parameters. However, single-worker DiLoCo's periodic averaging introduces a two-loop structure, increasing its memory requirements and number of hyperparameters. GPA overcomes these limitations by decoupling the interpolation constant in the primal averaging formulation of Nesterov. This decoupling enables GPA to smoothly average iterates at every step, generalizing and improving upon single-worker DiLoCo. Empirically, GPA consistently outperforms single-worker DiLoCo while removing the two-loop structure, simplifying hyperparameter tuning, and reducing its memory overhead to a single additional buffer. On the Llama-160M model, GPA provides a 24.22% speedup in terms of steps to reach the baseline (AdamW's) validation loss. Likewise, GPA achieves speedups of 12% and 27% on small and large batch setups, respectively, to attain AdamW's validation accuracy on the ImageNet ViT workload. Furthermore, we prove that for any base optimizer with regret bounded by $O(\sqrt{T})$, where $T$ is the number of iterations, GPA can match or exceed the convergence guarantee of the original optimizer, depending on the choice of interpolation constants.
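The primal-averaging structure, a fast iterate updated by the base optimizer plus an every-step interpolation of the model weights toward it, can be sketched on a scalar quadratic (plain SGD stands in for the base optimizer; the interpolation constant c is the quantity GPA decouples):

```python
def gpa_step(x, z, grad, lr=0.1, c=0.1):
    """One primal-averaging-style step (sketch): the base optimizer
    updates the fast iterate z; the model weights x smoothly track z."""
    z = z - lr * grad          # base optimizer step (plain SGD here)
    x = (1 - c) * x + c * z    # every-step averaging, no outer loop
    return x, z

# minimize f(w) = w^2 starting from w = 5.0
x = z = 5.0
for _ in range(200):
    x, z = gpa_step(x, z, grad=2 * z)
```

With c = 1 this collapses to the base optimizer; smaller c yields heavier averaging, which is the single-worker-DiLoCo-like regime without the two-loop structure.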
Fonte: arXiv cs.AI
NLP/LLMs • Score 96
Atom: Efficient On-Device Video-Language Pipelines Through Modular Reuse
arXiv:2512.17108v1 Announce Type: new
Abstract: Recent advances in video-language models have enabled powerful applications like video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging due to redundant model loads and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and reuses them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency without sacrificing performance. On commodity smartphones, Atom achieves 27--33% faster execution compared to non-reuse baselines, with only marginal performance drop ($\leq$ 2.3 Recall@1 in retrieval, $\leq$ 1.5 CIDEr in captioning). These results position Atom as a practical, scalable approach for efficient video-language understanding on edge devices.
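The reuse-centric design can be sketched as a module cache that loads each shared component once and hands it to every subtask (the stand-in lambdas replace real model loads):

```python
class ModuleCache:
    """Load each pipeline module (e.g. visual encoder, language decoder)
    at most once and reuse it across subtasks."""
    def __init__(self, loaders):
        self.loaders = loaders  # name -> zero-argument factory
        self._cache = {}
        self.load_count = 0

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self.loaders[name]()  # expensive load
            self.load_count += 1
        return self._cache[name]

cache = ModuleCache({
    "visual_encoder": lambda: "enc",    # stand-ins for real model loads
    "language_decoder": lambda: "dec",
})
# Three subtasks share two modules: only two loads occur in total
for task in ["caption", "retrieve", "reason"]:
    enc = cache.get("visual_encoder")
    dec = cache.get("language_decoder")
```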
Source: arXiv cs.LG
RL • Score 96
DiffeoMorph: Learning 3D Shape Morphing Using Differentiable Agent-Based Simulations
Biological systems can form complex three-dimensional structures through the collective behavior of identical agents. In this work, we present DiffeoMorph, a differentiable framework for learning a morphogenesis protocol that guides a population of agents to morph into a target 3D shape, using an attention-based SE(3)-equivariant graph neural network.
Source: arXiv cs.LG
NLP/LLMs • Score 95
DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports
arXiv:2512.17776v1 Announce Type: new
Abstract: As large language models (LLMs) advance, deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert reporting, evaluations that rely heavily on LLM judges can fail to capture issues that require expert judgment, and source verification typically covers only a limited subset of explicitly cited statements rather than report-wide factual reliability. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimensions) operationalized into 130 fine-grained rubric items. DEER further provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies all claims across the entire report, including both cited and uncited ones, and quantifies external-evidence quality. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.
Source: arXiv cs.CL
MLOps/Systems • Score 93
Fault Diagnosis and Quantification for Photovoltaic Arrays Based on Differentiable Physical Models
Accurate fault diagnosis and quantification are essential for the reliable operation and intelligent maintenance of photovoltaic (PV) arrays. This paper proposes a novel fault-quantification approach for PV strings based on a differentiable fast fault simulation model (DFFSM), which accurately models I-V characteristics under multiple faults and provides analytical gradients with respect to the fault parameters.
Source: arXiv cs.LG
Vision • Score 96
Dialectics for Artificial Intelligence
The article investigates whether artificial intelligence can discover concepts from raw experience, without human supervision. It proposes a definition of 'concept' that goes beyond a dictionary label, considering its structural relation to an agent's total experience. Reversibility is a central aspect, allowing the existence of concepts to be a verifiable structural claim.
Source: arXiv cs.AI
NLP/LLMs • Score 95
Toward Ethical AI Through Bayesian Uncertainty in Neural Question Answering
arXiv:2512.17677v1 Announce Type: new
Abstract: We explore Bayesian reasoning as a means to quantify uncertainty in neural networks for question answering. Starting with a multilayer perceptron on the Iris dataset, we show how posterior inference conveys confidence in predictions. We then extend this to language models, applying Bayesian inference first to a frozen head and finally to LoRA-adapted transformers, evaluated on the CommonsenseQA benchmark. Rather than aiming for state-of-the-art accuracy, we compare Laplace approximations against maximum a posteriori (MAP) estimates to highlight uncertainty calibration and selective prediction. This allows models to abstain when confidence is low. An ``I don't know'' response not only improves interpretability but also illustrates how Bayesian methods can contribute to more responsible and ethical deployment of neural question-answering systems.
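The abstention mechanism can be illustrated in a few lines: average softmax probabilities over posterior samples of the logits and answer only when the top class is sufficiently probable. The threshold and the three-class toy setup are invented for illustration; the paper's Laplace/LoRA machinery is not reproduced here.

```python
import numpy as np

def predict_or_abstain(logit_samples, threshold=0.8):
    """Selective prediction sketch: average softmax probabilities over
    posterior samples of the logits, then abstain ("I don't know") when
    the top-class posterior predictive probability is below threshold."""
    z = np.asarray(logit_samples, dtype=float)
    p = np.exp(z - z.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)      # softmax per posterior sample
    mean_p = p.mean(axis=0)                 # posterior predictive
    k = int(mean_p.argmax())
    return k if mean_p[k] >= threshold else "I don't know"

# Confident: all posterior samples agree on class 0.
confident = predict_or_abstain([[5.0, 0.0, 0.0], [4.0, 0.1, 0.0]])
# Uncertain: samples disagree, so the model abstains.
uncertain = predict_or_abstain([[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
```

Disagreement among posterior samples flattens the averaged distribution, which is exactly what triggers the "I don't know" response the abstract describes.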
Source: arXiv cs.CL
MLOps/Systems • Score 90
Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation
Mixture-of-Experts (MoE) models increase capacity through sparse activation but strain memory and bandwidth. We present Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation, which performs router-guided precision restoration using precomputed low-rank compensators, improving the bandwidth-accuracy trade-off.
Source: arXiv cs.LG
NLP/LLMs • Score 95
Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?
arXiv:2512.17394v1 Announce Type: new
Abstract: Theory of Mind (ToM) -- the ability to attribute beliefs, desires, and emotions to others -- is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, where human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse theory of mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.
Source: arXiv cs.CL
Vision • Score 92
Comparison of deep learning models: CNN and VGG-16 in identifying pornographic content
arXiv:2512.16947v1 Announce Type: new
Abstract: In 2020, a total of 59,741 websites were blocked by the Indonesian government due to containing negative content, including pornography, with 14,266 websites falling into this category. However, these blocked websites could still be accessed by the public using virtual private networks (VPNs). This motivated research into quickly identifying pornographic content. This study aims to develop a system capable of identifying websites suspected of containing pornographic image content, using a deep learning approach with the convolutional neural network (CNN) and visual geometry group 16 (VGG-16) models. The two models were then explored comprehensively and holistically to determine which was most effective in detecting pornographic content quickly. Based on the comparison between the CNN and VGG-16 models, the best result was obtained in the eighth experiment using the CNN model, which reached an accuracy of 0.9487 (94.87%) at 50 epochs and a learning rate of 0.001. This indicates that the CNN model is more effective at detecting pornographic content quickly and accurately than the VGG-16 model.
Source: arXiv cs.CV
NLP/LLMs • Score 95
Subjective Question Generation and Answer Evaluation using NLP
arXiv:2512.17289v1 Announce Type: new
Abstract: Natural Language Processing (NLP) is one of the most revolutionary technologies today. It uses artificial intelligence to understand human text and spoken words. It is used for text summarization, grammar checking, sentiment analysis, and advanced chatbots, and has many more potential use cases. Furthermore, it has also made its mark on the education sector. Considerable research has already been conducted on objective question generation; however, automated subjective question generation and answer evaluation are still works in progress. An automated system to generate subjective questions and evaluate the answers can help teachers assess student work and enhance the student's learning experience by allowing them to self-assess their understanding after reading an article or a chapter of a book. This research aims to improve current NLP models or develop a novel one for automated subjective question generation and answer evaluation from text input.
Source: arXiv cs.CL
NLP/LLMs • Score 93
Compression Is Routing: Reconstruction Error as an Intrinsic Signal for Modular Language Models
Large Language Models (LLMs) face challenges such as context-length limitations and high inference costs. This paper proposes the architectural philosophy "Compression Is Routing" and presents an 87M-parameter Transformer Autoencoder that achieves 64x sequence compression. Experimental results show a performance discrepancy that validates reconstruction error as an "Intrinsic Distribution Fingerprint."
Source: arXiv cs.LG
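The routing signal itself is easy to illustrate: send each input to the expert whose autoencoder reconstructs it best, on the premise that low reconstruction error marks in-distribution inputs. The toy linear "autoencoders" below are stand-ins for the paper's 87M-parameter Transformer Autoencoder:

```python
import numpy as np

def route(x, autoencoders):
    """Routing-by-compression sketch: pick the expert whose autoencoder
    reconstructs x with the lowest error (the "intrinsic distribution
    fingerprint"). autoencoders maps name -> (encode, decode) pair."""
    errors = {
        name: float(np.linalg.norm(x - dec(enc(x))))
        for name, (enc, dec) in autoencoders.items()
    }
    return min(errors, key=errors.get), errors

# Toy experts: each "compresses" by keeping only half of the signal,
# so each reconstructs its own subspace perfectly.
keep_lo = np.diag([1.0] * 4 + [0.0] * 4)
keep_hi = np.diag([0.0] * 4 + [1.0] * 4)
experts = {
    "low": (lambda x: keep_lo @ x, lambda h: h),
    "high": (lambda x: keep_hi @ x, lambda h: h),
}

x = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # "low" subspace
best, errs = route(x, experts)
```

No separate router network is trained: the compressors themselves provide the routing signal, which is the "Compression Is Routing" claim in miniature.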
NLP/LLMs • Score 95
Bangla MedER: Multi-BERT Ensemble Approach for the Recognition of Bangla Medical Entity
arXiv:2512.17769v1 Announce Type: new
Abstract: Medical Entity Recognition (MedER) is an essential NLP task for extracting meaningful entities from the medical corpus. Nowadays, MedER-based research outcomes can remarkably contribute to the development of automated systems in the medical sector, ultimately enhancing patient care and outcomes. While extensive research has been conducted on MedER in English, low-resource languages like Bangla remain underexplored. Our work aims to bridge this gap. For Bangla medical entity recognition, this study first examined a number of transformer models, including BERT, DistilBERT, ELECTRA, and RoBERTa. We also propose a novel Multi-BERT Ensemble approach that outperformed all baseline models with the highest accuracy of 89.58%. Notably, it provides an 11.80% accuracy improvement over the single-layer BERT model, demonstrating its effectiveness for this task. A major challenge in MedER for low-resource languages is the lack of annotated datasets. To address this issue, we developed a high-quality dataset tailored for the Bangla MedER task. The dataset was used to evaluate the effectiveness of our model through multiple performance metrics, demonstrating its robustness and applicability. Our findings highlight the potential of Multi-BERT Ensemble models in improving MedER for Bangla and set the foundation for further advancements in low-resource medical NLP.
Source: arXiv cs.CL
NLP/LLMs • Score 95
AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators
arXiv:2512.17267v1 Announce Type: new
Abstract: Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward to equal effect as a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
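The composition step, fitting weights over candidate metric scores so the blend correlates with scarce human feedback, can be sketched with ordinary least squares. The metric columns and ratings below are invented for illustration, and the released toolkit presumably does something more careful than a plain linear fit:

```python
import numpy as np

# Per-example scores from several automatic metrics (rows = examples;
# columns: a judge criterion, a fluency metric, a length penalty).
metric_scores = np.array([
    [0.9, 0.8, 0.1],
    [0.2, 0.4, 0.9],
    [0.7, 0.6, 0.3],
    [0.1, 0.3, 0.8],
])
human = np.array([0.95, 0.30, 0.70, 0.20])  # fewer than 100 ratings

# Least-squares weights (with a bias column) against the human signal.
X = np.hstack([metric_scores, np.ones((len(human), 1))])
w, *_ = np.linalg.lstsq(X, human, rcond=None)

def auto_metric(scores):
    """Blended metric: weighted combination of the component scores."""
    return float(np.append(np.asarray(scores, dtype=float), 1.0) @ w)

# The blend now ranks a strong output above a weak one without
# consulting a human.
good = auto_metric([0.8, 0.7, 0.2])
bad = auto_metric([0.2, 0.3, 0.9])
```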
Source: arXiv cs.CL
NLP/LLMs • Score 95
Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
arXiv:2512.17220v1 Announce Type: new
Abstract: Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.
Source: arXiv cs.CL
Vision • Score 96
BIONIX: A Low-Cost Wireless Prosthetic Arm with Dual-Signal EEG and EMG Control
Affordable upper-limb prostheses often lack intuitive control systems, limiting functionality and accessibility for amputees in low-income settings. This project presents a low-cost neuromuscular control system that integrates electroencephalography (EEG) and electromyography (EMG) to enable real-time control of a prosthetic arm.
Source: arXiv cs.LG
NLP/LLMs • Score 95
Simulstream: Open-Source Toolkit for Evaluation and Demonstration of Streaming Speech-to-Text Translation Systems
arXiv:2512.17648v1 Announce Type: new
Abstract: Streaming Speech-to-Text Translation (StreamST) requires producing translations concurrently with incoming speech, imposing strict latency constraints and demanding models that balance partial-information decision-making with high translation quality. Research efforts on the topic have so far relied on the SimulEval repository, which is no longer maintained and does not support systems that revise their outputs. In addition, it has been designed for simulating the processing of short segments, rather than long-form audio streams, and it does not provide an easy method to showcase systems in a demo. As a solution, we introduce simulstream, the first open-source framework dedicated to unified evaluation and demonstration of StreamST systems. Designed for long-form speech processing, it supports not only incremental decoding approaches, but also re-translation methods, enabling their comparison within the same framework in terms of both quality and latency. In addition, it also offers an interactive web interface to demo any system built within the tool.
Source: arXiv cs.CL
NLP/LLMs • Score 95
Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience
arXiv:2512.17260v1 Announce Type: new
Abstract: Large language models have recently made significant progress to generate rigorous mathematical proofs. In contrast, utilizing LLMs for theorem proving in formal languages (such as Lean) remains challenging and computationally expensive, particularly when addressing problems at the undergraduate level and beyond. In this work, we present \textbf{Seed-Prover 1.5}, a formal theorem-proving model trained via large-scale agentic reinforcement learning, alongside an efficient test-time scaling (TTS) workflow. Through extensive interactions with Lean and other tools, the model continuously accumulates experience during the RL process, substantially enhancing the capability and efficiency of formal theorem proving. Furthermore, leveraging recent advancements in natural language proving, our TTS workflow efficiently bridges the gap between natural and formal languages. Compared to state-of-the-art methods, Seed-Prover 1.5 achieves superior performance with a smaller compute budget. It solves \textbf{88\% of PutnamBench} (undergraduate-level), \textbf{80\% of Fate-H} (graduate-level), and \textbf{33\% of Fate-X} (PhD-level) problems. Notably, using our system, we solved \textbf{11 out of 12 problems} from Putnam 2025 within 9 hours. Our findings suggest that scaling learning from experience, driven by high-quality formal feedback, holds immense potential for the future of formal mathematical reasoning.
Source: arXiv cs.CL
RL • Score 96
The Role of Islamic Ethics in Preventing the Abuse of Artificial Intelligence (AI)-Based Deepfakes
The rapid development of AI-driven deepfake technology has raised global concerns about the spread of false information and the theft of online identities. This study proposes a comprehensive Islamic ethical framework to mitigate the risks of deepfake misuse, addressing ethical shortcomings and regulatory needs.
Source: arXiv cs.AI
RL • Score 96
Conservative Bias in Multi-Teacher Learning: Why Agents Prefer Low-Reward Advisors
Interactive reinforcement learning (IRL) has shown potential for enabling autonomous agents and robots to learn complex behaviors from human teachers, but the dynamics of teacher selection remain poorly understood. This paper reveals an unexpected phenomenon in IRL: learning agents prefer conservative, low-reward teachers over those offering rewards 20 times higher.
Source: arXiv cs.AI
NLP/LLMs • Score 96
When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation
arXiv:2512.17083v1 Announce Type: cross
Abstract: Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of prior work, evaluation practice in dialogue topic segmentation remains dominated by strict boundary matching and F1-based metrics, even as modern LLM-based conversational systems increasingly rely on segmentation to manage conversation history beyond the model's fixed context window, where unstructured context accumulation degrades efficiency and coherence.
This paper introduces an evaluation objective for dialogue topic segmentation that treats boundary density and segment coherence as primary criteria, alongside window-tolerant F1 (W-F1). Through extensive cross-dataset empirical evaluation, we show that reported performance differences across dialogue segmentation benchmarks are driven not by model quality, but by annotation granularity mismatches and sparse boundary labels. This indicates that many reported improvements arise from evaluation artifacts rather than improved boundary detection.
We evaluated multiple, structurally distinct dialogue segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Across these settings, we observe high segment coherence combined with extreme oversegmentation relative to sparse labels, producing misleadingly low exact-match F1 scores. We show that topic segmentation is best understood as selecting an appropriate granularity rather than predicting a single correct boundary set. We operationalize this view by explicitly separating boundary scoring from boundary selection.
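A minimal window-tolerant F1, in the spirit of the W-F1 the paper reports (the paper's exact matching rule may differ): a predicted boundary counts as a hit if an unmatched reference boundary lies within a fixed window.

```python
def window_f1(pred, gold, window=2):
    """Window-tolerant F1 (W-F1) sketch. A predicted boundary matches
    if some not-yet-matched gold boundary lies within +/- window
    positions; greedy one-to-one matching keeps oversegmentation
    penalized through precision."""
    unmatched = sorted(gold)
    tp = 0
    for p in sorted(pred):
        for g in unmatched:
            if abs(p - g) <= window:
                unmatched.remove(g)  # each gold boundary used once
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Exact-match F1 scores these near-miss predictions at 0; W-F1 credits
# boundaries placed one turn away while still charging the two extras.
score = window_f1(pred=[4, 11, 18, 25], gold=[5, 19], window=2)
```

Separating the tolerance (`window`) from the scoring makes the granularity mismatch the paper describes visible: dense predictions against sparse labels lose precision, not coherence.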
Source: arXiv cs.AI
RL • Score 93
Humanlike AI Design Increases Anthropomorphism but Yields Divergent Engagement and Trust Outcomes Globally
More than a billion users worldwide interact with AI systems designed to mimic human traits. This study investigates the relationship between humanlike AI design and its effects on engagement and trust, using experiments across 10 diverse nations. We find that perceived humanness in AI is shaped by cultural factors, challenging narratives about risks inherent to humanlike design.
Source: arXiv cs.AI
RL • Score 96
Advancing Tree Species Classification: Insights from YOLOv8 and Explainable AI Applied to TLS Point-Cloud Projections
Classifying tree species has been a central research area in forest remote sensing for decades. New sensors and classification approaches, such as TLS and deep learning, achieve state-of-the-art accuracy, but their decision processes remain opaque. We propose a novel method that links Finer-CAM explanations to segments of TLS projections, systematically evaluating which features drive species discrimination.
Source: arXiv cs.AI
NLP/LLMs • Score 96
PILAR: Personalizing Augmented-Reality Interactions with Human-Centric, Trustworthy LLM-Based Explanations for Everyday Use Cases
Artificial intelligence (AI)-driven augmented reality (AR) systems are becoming increasingly integrated into everyday life, raising the need for real-time explanations. PILAR is a novel framework that uses a pretrained large language model (LLM) to generate personalized, contextual explanations, improving the user experience in AI-based AR systems.
Source: arXiv cs.AI
NLP/LLMs • Score 96
Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
Deploying accurate enterprise-grade Text-to-SQL systems faces a difficult trilemma of cost, security, and performance. Current solutions force companies to choose between expensive proprietary Large Language Models (LLMs) and underperforming Small Language Models (SLMs). We propose Struct-SQL, a novel Knowledge Distillation (KD) framework that trains an SLM to emulate a powerful LLM.
Source: arXiv cs.AI
MLOps/Systems • Score 89
Generative Modeling of Conditional Probability Distributions on Level Sets of Collective Variables
In this paper, we study generative modeling of conditional probability distributions of a distribution $\nu$ on $\textbf{R}^d$ represented by data. We propose a general and efficient learning approach that learns generative models on different level sets of $\nu$ simultaneously, improving learning quality in low-probability regions through data enrichment.
Source: arXiv stat.ML
NLP/LLMs • Score 95
A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
arXiv:2512.16978v1 Announce Type: new
Abstract: Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both; and while some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and graded rubric for interpretable and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in our LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: https://github.com/mbzuai-oryx/longshot.
Source: arXiv cs.CV
NLP/LLMs • Score 95
AVM: Towards Structure-Preserving Neural Response Modeling in the Visual Cortex Across Stimuli and Individuals
arXiv:2512.16948v1 Announce Type: new
Abstract: While deep learning models have shown strong performance in simulating neural responses, they often fail to clearly separate stable visual encoding from condition-specific adaptation, which limits their ability to generalize across stimuli and individuals. We introduce the Adaptive Visual Model (AVM), a structure-preserving framework that enables condition-aware adaptation through modular subnetworks, without modifying the core representation. AVM keeps a Vision Transformer-based encoder frozen to capture consistent visual features, while independently trained modulation paths account for neural response variations driven by stimulus content and subject identity. We evaluate AVM in three experimental settings, including stimulus-level variation, cross-subject generalization, and cross-dataset adaptation, all of which involve structured changes in inputs and individuals. Across two large-scale mouse V1 datasets, AVM outperforms the state-of-the-art V1T model by approximately 2% in predictive correlation, demonstrating robust generalization, interpretable condition-wise modulation, and high architectural efficiency. Specifically, AVM achieves a 9.1% improvement in explained variance (FEVE) under the cross-dataset adaptation setting. These results suggest that AVM provides a unified framework for adaptive neural modeling across biological and experimental conditions, offering a scalable solution under structural constraints. Its design may inform future approaches to cortical modeling in both neuroscience and biologically inspired AI systems.
Source: arXiv cs.CV
Vision • Score 95
Any-Optical-Model: A Universal Foundation Model for Optical Remote Sensing
Optical satellites, with their diverse band configurations and ground sampling distances, provide indispensable evidence for tasks ranging from ecosystem surveillance to emergency response. However, significant discrepancies in band composition and spatial resolution across optical sensors pose major challenges for existing Remote Sensing Foundation Models (RSFMs).
Source: arXiv cs.CV
NLP/LLMs • Score 96
Solomonoff-Inspired Hypothesis Ranking with LLMs for Prediction Under Uncertainty
Reasoning under uncertainty is a fundamental challenge in AI, especially in real-world tasks where data-scarce problems demand systematic generalization. We propose a Solomonoff-inspired method that weights LLM-generated hypotheses by simplicity and predictive fit, yielding conservative, uncertainty-aware predictions.
Source: arXiv cs.AI
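The weighting scheme can be sketched directly from this description: each hypothesis receives a prior of 2^(-complexity) multiplied by its predictive fit, and the weights are then normalized. The complexity values and likelihoods below are invented, and the paper's actual complexity proxy for LLM-generated hypotheses is not specified here.

```python
def solomonoff_weights(hypotheses):
    """Solomonoff-style scoring sketch: weight each hypothesis h by
    2**(-complexity(h)) * likelihood(h | data), then normalize, so
    predictions average over hypotheses instead of trusting one."""
    raw = [
        (2.0 ** -h["complexity"]) * h["likelihood"]
        for h in hypotheses
    ]
    z = sum(raw)
    return [r / z for r in raw]

# A simple hypothesis that fits well should dominate both a complex one
# with the same fit and a simple one that fits the data poorly.
hyps = [
    {"name": "simple-fits", "complexity": 3, "likelihood": 0.90},
    {"name": "complex-fits", "complexity": 10, "likelihood": 0.90},
    {"name": "simple-bad", "complexity": 3, "likelihood": 0.01},
]
w = solomonoff_weights(hyps)
```

Because no single hypothesis takes all the mass, predictions made under these weights stay conservative when several hypotheses remain plausible, which matches the uncertainty-aware behavior the summary claims.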
NLP/LLMs • Score 96
Incorporating Noise-Level Error Embeddings to Improve LLM-Assisted Robustness in Persian Speech Recognition
Automatic Speech Recognition (ASR) systems suffer significant performance degradation in noisy environments, especially for low-resource languages such as Persian. This study presents a robust noise-aware ASR error-correction framework that combines multiple hypotheses with noise-aware modeling, demonstrating substantial improvements in Word Error Rate (WER).
Source: arXiv cs.AI
Vision • Score 95
Interpretable Similarity of Synthetic Image Utility
arXiv:2512.17080v1 Announce Type: new
Abstract: Synthetic medical image data can unlock the potential of deep learning (DL)-based clinical decision support (CDS) systems through the creation of large scale, privacy-preserving, training sets. Despite the significant progress in this field, there is still a largely unanswered research question: "How can we quantitatively assess the similarity of a synthetically generated set of images with a set of real images in a given application domain?". Today, answers to this question are mainly provided via user evaluation studies, inception-based measures, and the classification performance achieved on synthetic images. This paper proposes a novel measure to assess the similarity between synthetically generated and real sets of images, in terms of their utility for the development of DL-based CDS systems. Inspired by generalized neural additive models, and unlike inception-based measures, the proposed measure is interpretable (Interpretable Utility Similarity, IUS), explaining why a synthetic dataset could be more useful than another one in the context of a CDS system based on clinically relevant image features. The experimental results on publicly available datasets from various color medical imaging modalities including endoscopic, dermoscopic and fundus imaging, indicate that selecting synthetic images of high utility similarity using IUS can result in relative improvements of up to 54.6% in terms of classification performance. The generality of IUS for synthetic data assessment is demonstrated also for greyscale X-ray and ultrasound imaging modalities. IUS implementation is available at https://github.com/innoisys/ius
Source: arXiv cs.CV
NLP/LLMs • Score 96
Adversarial VR: An Open-Source Testbed for Evaluating Adversarial Robustness in VR Cybersickness Detection and Mitigation
Automated deep learning (DL)-based cybersickness detection methods can improve user comfort and interaction. However, these systems are susceptible to adversarial attacks, which can degrade model performance and disrupt the immersive experience. This paper presents Adversarial-VR, a real-time testbed for evaluating cybersickness detection and mitigation strategies under adversarial conditions.
Source: arXiv cs.AI
RL • Score 96
Privacy-Preserving Synthetic Dataset of Individual Daily Trajectories for City-Scale Mobility Analytics
Urban mobility data are essential for urban planning, transport demand forecasting, and pandemic modeling. This study presents a privacy-preserving synthetic dataset that reconstructs daily trajectories from aggregated inputs, without requiring personal identifiers.
Source: arXiv cs.AI
Vision • Score 95
Diagnostic Performance of Universal-Learning Ultrasound AI Across Multiple Organs and Tasks: the UUSIC25 Challenge
arXiv:2512.17279v1 Announce Type: new
Abstract: IMPORTANCE: Current ultrasound AI remains fragmented into single-task tools, limiting clinical utility compared to versatile modern ultrasound systems.
OBJECTIVE: To evaluate the diagnostic accuracy and efficiency of single general-purpose deep learning models for multi-organ classification and segmentation.
DESIGN: The Universal UltraSound Image Challenge 2025 (UUSIC25) involved developing algorithms on 11,644 images (public/private). Evaluation used an independent, multi-center test set of 2,479 images, including data from a center completely unseen during training to assess generalization.
OUTCOMES: Diagnostic performance (Dice Similarity Coefficient [DSC]; Area Under the Receiver Operating Characteristic Curve [AUC]) and computational efficiency (inference time, GPU memory).
RESULTS: Of 15 valid algorithms, the top model (SMART) achieved a macro-averaged DSC of 0.854 across 5 segmentation tasks and AUC of 0.766 for binary classification. Models showed high capability in segmentation (e.g., fetal head DSC: 0.942) but variability in complex tasks subject to domain shift. Notably, in breast cancer molecular subtyping, the top model's performance dropped from AUC 0.571 (internal) to 0.508 (unseen external center), highlighting generalization challenges.
CONCLUSIONS: General-purpose AI models achieve high accuracy and efficiency across multiple tasks using a single architecture. However, performance degradation on unseen data suggests domain generalization is critical for future clinical deployment.
Source: arXiv cs.CV
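The Dice Similarity Coefficient reported as the segmentation outcome above is a standard overlap metric. A minimal sketch for binary masks (the function name and test masks are illustrative, not taken from the challenge code):

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice Similarity Coefficient between two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    # DSC = 2 |A ∩ B| / (|A| + |B|); eps guards against two empty masks
    return float(2.0 * intersection / (pred.sum() + target.sum() + eps))

# Two overlapping 4x4 masks: |a| = 8, |b| = 8, intersection = 4 -> DSC = 0.5
a = np.zeros((4, 4)); a[:2, :] = 1   # top two rows
b = np.zeros((4, 4)); b[1:3, :] = 1  # middle two rows
print(round(dice_coefficient(a, b), 3))
```

Identical masks score ~1.0, disjoint masks 0.0, which is why the metric is macro-averaged across tasks in the challenge.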
MLOps/Systems • Score 92
Look-Ahead Reasoning on Learning Platforms
arXiv:2511.14745v2 Announce Type: replace-cross
Abstract: On many learning platforms, the optimization criteria guiding model training reflect the priorities of the designer rather than those of the individuals they affect. Consequently, users may act strategically to obtain more favorable outcomes. While past work has studied strategic user behavior on learning platforms, the focus has largely been on strategic responses to a deployed model, without considering the behavior of other users. In contrast, look-ahead reasoning takes into account that user actions are coupled, and -- at scale -- impact future predictions. Within this framework, we first formalize level-k thinking, a concept from behavioral economics, where users aim to outsmart their peers by looking one step ahead. We show that, while convergence to an equilibrium is accelerated, the equilibrium remains the same, providing no benefit of higher-level reasoning for individuals in the long run. Then, we focus on collective reasoning, where users take coordinated actions by optimizing through their joint impact on the model. By contrasting collective with selfish behavior, we characterize the benefits and limits of coordination; a new notion of alignment between the learner's and the users' utilities emerges as a key concept. Look-ahead reasoning can be seen as a generalization of algorithmic collective action; we thus offer the first results characterizing the utility trade-offs of coordination when contesting algorithmic systems.
Source: arXiv stat.ML
NLP/LLMs • Score 96
Systemic Risk Radar: A Multilayer Graph Framework for Early Warning of Market Crises
Financial crises emerge when structural vulnerabilities accumulate across sectors, markets, and investor behaviors. The Systemic Risk Radar (SRR) models financial markets as multilayer graphs to detect early signals of systemic fragility and transitions to crash regimes. We evaluate SRR on three major crises: the Dot-com crash, the Global Financial Crisis, and the COVID-19 shock.
Source: arXiv cs.AI
RL • Score 95
A Certified Unlearning Approach without Access to Source Data
arXiv:2506.06486v3 Announce Type: replace-cross
Abstract: With the growing adoption of data privacy regulations, the ability to erase private or copyrighted information from trained models has become a crucial requirement. Traditional unlearning methods often assume access to the complete training dataset, which is unrealistic in scenarios where the source data is no longer available. To address this challenge, we propose a certified unlearning framework that enables effective data removal without access to the original training data samples. Our approach utilizes a surrogate dataset that approximates the statistical properties of the source data, allowing for controlled noise scaling based on the statistical distance between the two. While our theoretical guarantees assume knowledge of the exact statistical distance, practical implementations typically approximate this distance, resulting in potentially weaker but still meaningful privacy guarantees. This ensures strong guarantees on the model's behavior post-unlearning while maintaining its overall utility. We establish theoretical bounds, introduce practical noise calibration techniques, and validate our method through extensive experiments on both synthetic and real-world datasets. The results demonstrate the effectiveness and reliability of our approach in privacy-sensitive settings.
Source: arXiv stat.ML
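The abstract mentions noise scaling based on the surrogate-to-source statistical distance. A toy sketch of what such calibration could look like, assuming a Gaussian-mechanism-style base scale and a simple linear inflation rule (the function names and the scaling rule are illustrative assumptions, not the paper's exact bounds):

```python
import numpy as np

def calibrated_noise_scale(sensitivity, epsilon, delta, stat_distance):
    """Gaussian-mechanism-style noise scale, inflated by the estimated
    surrogate/source statistical distance (illustrative rule only)."""
    base = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    # Larger distance between surrogate and source -> more noise needed
    return base * (1.0 + stat_distance)

def unlearn_step(weights, grad_forget, lr, sigma, rng):
    """Ascend on the forget-set gradient, then add calibrated Gaussian noise."""
    perturbed = weights + lr * grad_forget
    return perturbed + rng.normal(0.0, sigma, size=weights.shape)

rng = np.random.default_rng(0)
w = np.ones(4)
sigma = calibrated_noise_scale(sensitivity=1.0, epsilon=1.0, delta=1e-5,
                               stat_distance=0.2)
w_unlearned = unlearn_step(w, grad_forget=np.zeros(4), lr=0.1, sigma=sigma, rng=rng)
```

The key property the sketch illustrates is monotonicity: a surrogate farther from the source data forces a larger noise scale, trading utility for the certified guarantee.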
NLP/LLMs • Score 95
Towards Human-Guided, Data-Centric LLM Co-Pilots
arXiv:2501.10321v3 Announce Type: replace-cross
Abstract: Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the disconnect between the needs of domain experts and translating these needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in complex real-world settings where raw data often contains complex issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel, multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets we demonstrate CliMB-DC's ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines for handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains -- healthcare, finance, social sciences and more -- to actively participate in driving real-world impact using ML.
Source: arXiv stat.ML
NLP/LLMs • Score 95
Studying the Effects of Collaboration in Interactive Theme Discovery Systems
arXiv:2408.09030v4 Announce Type: replace
Abstract: NLP-assisted solutions have gained considerable traction to support qualitative data analysis. However, there does not exist a unified evaluation framework that can account for the many different settings in which qualitative researchers may employ them. In this paper, we take a first step in this direction by proposing an evaluation framework to study the way in which different tools may result in different outcomes depending on the collaboration strategy employed. Specifically, we study the impact of synchronous vs. asynchronous collaboration using two different NLP-assisted qualitative research tools and present a comprehensive analysis of significant differences in the consistency, cohesiveness, and correctness of their outputs.
Source: arXiv cs.CL
MLOps/Systems • Score 95
A Systems-Theoretic View on the Convergence of Algorithms under Disturbances
arXiv:2512.17598v1 Announce Type: cross
Abstract: Algorithms increasingly operate within complex physical, social, and engineering systems where they are exposed to disturbances, noise, and interconnections with other dynamical systems. This article extends known convergence guarantees of an algorithm operating in isolation (i.e., without disturbances) and systematically derives stability bounds and convergence rates in the presence of such disturbances. By leveraging converse Lyapunov theorems, we derive key inequalities that quantify the impact of disturbances. We further demonstrate how our result can be utilized to assess the effects of disturbances on algorithmic performance in a wide variety of applications, including communication constraints in distributed learning, sensitivity in machine learning generalization, and intentional noise injection for privacy. This underpins the role of our result as a unifying tool for algorithm analysis in the presence of noise, disturbances, and interconnections with other dynamical systems.
Source: arXiv stat.ML
NLP/LLMs • Score 95
Spatially-informed transformers: Injecting geostatistical covariance biases into self-attention for spatio-temporal forecasting
arXiv:2512.17696v1 Announce Type: cross
Abstract: The modeling of high-dimensional spatio-temporal processes presents a fundamental dichotomy between the probabilistic rigor of classical geostatistics and the flexible, high-capacity representations of deep learning. While Gaussian processes offer theoretical consistency and exact uncertainty quantification, their prohibitive computational scaling renders them impractical for massive sensor networks. Conversely, modern transformer architectures excel at sequence modeling but inherently lack a geometric inductive bias, treating spatial sensors as permutation-invariant tokens without a native understanding of distance. In this work, we propose a spatially-informed transformer, a hybrid architecture that injects a geostatistical inductive bias directly into the self-attention mechanism via a learnable covariance kernel. By formally decomposing the attention structure into a stationary physical prior and a non-stationary data-driven residual, we impose a soft topological constraint that favors spatially proximal interactions while retaining the capacity to model complex dynamics. We demonstrate the phenomenon of "Deep Variography", where the network successfully recovers the true spatial decay parameters of the underlying process end-to-end via backpropagation. Extensive experiments on synthetic Gaussian random fields and real-world traffic benchmarks confirm that our method outperforms state-of-the-art graph neural networks. Furthermore, rigorous statistical validation confirms that the proposed method delivers not only superior predictive accuracy but also well-calibrated probabilistic forecasts, effectively bridging the gap between physics-aware modeling and data-driven learning.
Source: arXiv stat.ML
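One reading of the mechanism described above is an additive, distance-dependent bias on the attention logits. A numpy sketch assuming an exponential covariance kernel and a scalar mixing weight (both assumptions for illustration; in the paper the kernel parameters are learned end-to-end):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(Q, K, V, coords, rho=1.0, alpha=1.0):
    """Self-attention with an additive geostatistical bias:
    logits = Q K^T / sqrt(d) + alpha * exp(-dist / rho).
    The exponential kernel is the stationary prior; the QK term
    plays the role of the data-driven residual."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    prior = np.exp(-dist / rho)  # exponential covariance kernel
    weights = softmax(logits + alpha * prior, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(1)
n, d = 5, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
coords = rng.uniform(0, 10, size=(n, 2))  # sensor locations in 2D
out, w = spatial_attention(Q, K, V, coords, rho=2.0, alpha=2.0)
```

Increasing `alpha` pushes attention mass toward spatially proximal sensors; `rho` controls the spatial decay, which is what "Deep Variography" recovers by backpropagation.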
MLOps/Systems • Score 95
Penalized Fair Regression for Multiple Groups in Chronic Kidney Disease
arXiv:2512.17340v1 Announce Type: cross
Abstract: Fair regression methods have the potential to mitigate societal bias concerns in health care, but there has been little work on penalized fair regression when multiple groups experience such bias. We propose a general regression framework that addresses this gap with unfairness penalties for multiple groups. Our approach is demonstrated for binary outcomes with true positive rate disparity penalties. It can be efficiently implemented through reduction to a cost-sensitive classification problem. We additionally introduce novel score functions for automatically selecting penalty weights. Our penalized fair regression methods are empirically studied in simulations, where they achieve a fairness-accuracy frontier beyond that of existing comparison methods. Finally, we apply these methods to a national multi-site primary care study of chronic kidney disease to develop a fair classifier for end-stage renal disease. There we find substantial improvements in fairness for multiple race and ethnicity groups who experience societal bias in the health care system without any appreciable loss in overall fit.
Source: arXiv stat.ML
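As context for the true positive rate disparity penalties mentioned above, a small sketch of one plausible multi-group penalty: the weighted sum of each group's absolute TPR gap from the overall TPR (the paper's exact functional form and automatic weight selection may differ):

```python
import numpy as np

def tpr(y_true, y_pred, mask):
    """True positive rate restricted to samples where mask is True."""
    pos = mask & (y_true == 1)
    return y_pred[pos].mean() if pos.any() else 0.0

def tpr_disparity_penalty(y_true, y_pred, groups, weights):
    """Weighted absolute TPR gaps between each group and the overall TPR
    (illustrative form of a multi-group unfairness penalty)."""
    overall = tpr(y_true, y_pred, np.ones_like(y_true, dtype=bool))
    return sum(w * abs(tpr(y_true, y_pred, groups == g) - overall)
               for g, w in weights.items())

y_true = np.array([1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1])
groups = np.array(["a", "a", "b", "b", "a", "b"])
# Overall TPR = 3/4; group a TPR = 1.0, group b TPR = 0.5 -> penalty = 0.5
penalty = tpr_disparity_penalty(y_true, y_pred, groups, {"a": 1.0, "b": 1.0})
```

Adding such a penalty (times a tunable weight) to the regression loss traces out the fairness-accuracy frontier the abstract refers to.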
MLOps/Systems • Score 92
Sharp Structure-Agnostic Lower Bounds for General Functional Estimation
arXiv:2512.17341v1 Announce Type: new
Abstract: The design of efficient nonparametric estimators has long been a central problem in statistics, machine learning, and decision making. Classical optimal procedures often rely on strong structural assumptions, which can be misspecified in practice and complicate deployment. This limitation has sparked growing interest in structure-agnostic approaches -- methods that debias black-box nuisance estimates without imposing structural priors. Understanding the fundamental limits of these methods is therefore crucial. This paper provides a systematic investigation of the optimal error rates achievable by structure-agnostic estimators. We first show that, for estimating the average treatment effect (ATE), a central parameter in causal inference, doubly robust learning attains optimal structure-agnostic error rates. We then extend our analysis to a general class of functionals that depend on unknown nuisance functions and establish the structure-agnostic optimality of debiased/double machine learning (DML). We distinguish two regimes -- one where double robustness is attainable and one where it is not -- leading to different optimal rates for first-order debiasing, and show that DML is optimal in both regimes. Finally, we instantiate our general lower bounds by deriving explicit optimal rates that recover existing results and extend to additional estimands of interest. Our results provide theoretical validation for widely used first-order debiasing methods and guidance for practitioners seeking optimal approaches in the absence of structural assumptions. This paper generalizes and subsumes the ATE lower bound established in Jin et al. (2024) by the same authors.
Source: arXiv stat.ML
MLOps/Systems • Score 96
NetworkFF: Unified Layer Optimization in Forward-Only Neural Networks
arXiv:2512.17531v1 Announce Type: new
Abstract: The Forward-Forward algorithm eliminates backpropagation's memory constraints and biological implausibility through dual forward passes with positive and negative data. However, conventional implementations suffer from critical inter-layer isolation, where layers optimize goodness functions independently without leveraging collective learning dynamics. This isolation constrains representational coordination and limits convergence efficiency in deeper architectures. This paper introduces Collaborative Forward-Forward (CFF) learning, extending the original algorithm through inter-layer cooperation mechanisms that preserve forward-only computation while enabling global context integration. Our framework implements two collaborative paradigms: Fixed CFF (F-CFF) with constant inter-layer coupling and Adaptive CFF (A-CFF) with learnable collaboration parameters that evolve during training. The collaborative goodness function incorporates weighted contributions from all layers, enabling coordinated feature learning while maintaining memory efficiency and biological plausibility. Comprehensive evaluation on MNIST and Fashion-MNIST demonstrates significant performance improvements over baseline Forward-Forward implementations. These findings establish inter-layer collaboration as a fundamental enhancement to Forward-Forward learning, with immediate applicability to neuromorphic computing architectures and energy-constrained AI systems.
Source: arXiv cs.LG
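The collaborative goodness function described above can be illustrated with a small sketch, assuming squared-activation goodness per layer and fixed coupling weights as in F-CFF (shapes and weight values are illustrative; A-CFF would make the coupling learnable):

```python
import numpy as np

def goodness(h):
    """Per-sample goodness of one layer: mean squared activation,
    in the style of the Forward-Forward algorithm."""
    return (h ** 2).mean(axis=1)

def collaborative_goodness(activations, coupling):
    """Collaborative goodness: weighted sum of per-layer goodness values,
    so each layer's objective sees global context instead of optimizing
    in isolation (Fixed CFF uses constant coupling weights)."""
    return sum(c * goodness(h) for c, h in zip(coupling, activations))

rng = np.random.default_rng(2)
acts = [rng.normal(size=(3, 16)) for _ in range(3)]  # 3 layers, batch of 3
g = collaborative_goodness(acts, coupling=[1.0, 0.5, 0.25])
```

Training would then push `g` above a threshold for positive data and below it for negative data, exactly as in plain Forward-Forward, but with every layer's contribution coupled through the weighted sum.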
RL • Score 95
Alternating Direction Method of Multipliers for Nonlinear Matrix Decompositions
We present an algorithm based on the alternating direction method of multipliers (ADMM) for solving nonlinear matrix decompositions (NMD). Given an input matrix $X \in \mathbb{R}^{m \times n}$ and a factorization rank $r \ll \min(m, n)$, NMD seeks matrices $W \in \mathbb{R}^{m \times r}$ and $H \in \mathbb{R}^{r \times n}$ such that $X \approx f(WH)$, where $f$ is an elementwise nonlinear function.
Source: arXiv stat.ML
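The summary leaves the ADMM details out. As background on the NMD objective, here is a naive alternating sketch for the common ReLU case f(z) = max(0, z) -- explicitly not the paper's ADMM: a latent matrix Z agrees with X on its positive support, is clipped to be non-positive where X is zero, and is re-factored by a truncated SVD each iteration:

```python
import numpy as np

def relu_nmd(X, r, iters=50):
    """Naive alternating scheme for X ~= max(0, W @ H) (illustrative;
    not the ADMM of the paper).
    Factor step: rank-r truncated SVD of the latent matrix Z.
    Z step: keep X on its positive entries; on zero entries, set Z to
    min(W @ H, 0) so the ReLU of the model matches X there."""
    Z = X.copy()
    mask = X > 0
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        W = U[:, :r] * s[:r]
        H = Vt[:r]
        M = W @ H
        Z = np.where(mask, X, np.minimum(M, 0.0))
    return W, H

rng = np.random.default_rng(3)
W0, H0 = rng.normal(size=(8, 2)), rng.normal(size=(2, 10))
X = np.maximum(0.0, W0 @ H0)           # exact rank-2 ground truth under ReLU
W, H = relu_nmd(X, r=2)
err = np.linalg.norm(X - np.maximum(0.0, W @ H)) / np.linalg.norm(X)
```

Because the zero entries of X only constrain WH to be non-positive, the Z step relaxes them, which is what lets a rank-r factorization fit a matrix whose plain rank exceeds r.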
Vision • Score 96
Bots Don't Stand Still: A Longitudinal Study of Bot Behavior Change, Temporal Drift, and Feature-Structure Evolution
Social bots are deeply embedded in online platforms for promotion, persuasion, and manipulation. Most bot detection systems still treat behavioral features as static, implicitly assuming that bots behave in a stationary way over time. This study analyzes change in individual behavioral signals and their interrelations, using 2,615 promotional bot accounts and 2.8 million tweets.
Source: arXiv cs.AI
MLOps/Systems • Score 96
UmniBench: An Omnidimensional Benchmark for Unified Understanding and Generation Models
UmniBench is a benchmark designed for Unified Multimodal Models (UMMs), enabling omnidimensional evaluation. It assesses understanding, generation, and editing capabilities in a single process, using human-examined prompts and QA pairs. The benchmark spans 13 major domains and more than 200 concepts, offering a comprehensive and objective view of unified models.
Source: arXiv cs.AI
NLP/LLMs • Score 96
On the Role of Contextual Information and Ego States in the Behavior of LLM Agents for Transactional Analysis Dialogues
LLM-powered agents are used in areas ranging from customer support to education, with growing interest in their ability to act in more human-like ways. This paper proposes a Multi-Agent System (MAS) inspired by Transactional Analysis (TA) theory, in which each agent is divided into three ego states, Parent, Adult, and Child, enriching the response process with a contextual information retrieval mechanism.
Source: arXiv cs.AI
RL • Score 96
Unexpected Knowledge: Auditing Search Recommendations on Wikipedia and Grokipedia
Encyclopedic knowledge platforms are essential gateways for online information exploration. The recent release of Grokipedia, a fully AI-generated encyclopedia, presents a new alternative to traditional platforms such as Wikipedia. This work provides the first comparative analysis of the search mechanisms of Wikipedia and Grokipedia.
Source: arXiv cs.AI
RL • Score 96
Assessing Long-Term Electricity Market Design for Ambitious Decarbonization Targets using Multi-Agent Reinforcement Learning
arXiv:2512.17444v1 Announce Type: new
Abstract: Electricity systems are key to transforming today's society into a carbon-free economy. Long-term electricity market mechanisms, including auctions, support schemes, and other policy instruments, are critical in shaping the electricity generation mix. In light of the need for more advanced tools to support policymakers and other stakeholders in designing, testing, and evaluating long-term markets, this work presents a multi-agent reinforcement learning model capable of capturing the key features of decarbonizing energy systems. Profit-maximizing generation companies make investment decisions in the wholesale electricity market, responding to system needs, competitive dynamics, and policy signals. The model employs independent proximal policy optimization, which was selected for suitability to the decentralized and competitive environment. Nevertheless, given the inherent challenges of independent learning in multi-agent settings, an extensive hyperparameter search ensures that decentralized training yields market outcomes consistent with competitive behavior. The model is applied to a stylized version of the Italian electricity system and tested under varying levels of competition, market designs, and policy scenarios. Results highlight the critical role of market design for decarbonizing the electricity sector and avoiding price volatility. The proposed framework allows assessing long-term electricity markets in which multiple policy and market mechanisms interact simultaneously, with market participants responding and adapting to decarbonization pathways.
Source: arXiv cs.LG
RL • Score 96
Learning Safe Autonomous Driving Policies Using Predictive Safety Representations
arXiv:2512.17586v1 Announce Type: new
Abstract: Safe reinforcement learning (SafeRL) is a prominent paradigm for autonomous driving, where agents are required to optimize performance under strict safety requirements. This dual objective creates a fundamental tension, as overly conservative policies limit driving efficiency while aggressive exploration risks safety violations. The Safety Representations for Safer Policy Learning (SRPL) framework addresses this challenge by equipping agents with a predictive model of future constraint violations and has shown promise in controlled environments. This paper investigates whether SRPL extends to real-world autonomous driving scenarios. Systematic experiments on the Waymo Open Motion Dataset (WOMD) and NuPlan demonstrate that SRPL can improve the reward-safety tradeoff, achieving statistically significant improvements in success rate (effect sizes r = 0.65-0.86) and cost reduction (effect sizes r = 0.70-0.83), with p < 0.05 for observed improvements. However, its effectiveness depends on the underlying policy optimizer and the dataset distribution. The results further show that predictive safety representations play a critical role in improving robustness to observation noise. Additionally, in zero-shot cross-dataset evaluation, SRPL-augmented agents demonstrate improved generalization compared to non-SRPL methods. These findings collectively demonstrate the potential of predictive safety representations to strengthen SafeRL for autonomous driving.
Source: arXiv cs.LG