Reinforcement Learning

Veja os papers deste label, com traduções para PT-BR.

Ver todos os labels

Artigos

🎯Reinforcement Learning804 artigos encontrados

Limpar filtro

RLScore 85

OmniDiagram: Advancing Unified Diagram Code Generation via Visual Interrogation Reward

arXiv:2604.05514v1 Announce Type: new Abstract: The paradigm of programmable diagram generation is evolving rapidly, playing a crucial role in structured visualization. However, most existing studies are confined to a narrow range of task formulations and language support, constraining their applicability to diverse diagram types. In this work, we propose OmniDiagram, a unified framework that incorporates diverse diagram code languages and task definitions. To address the challenge of aligning code logic with visual fidelity in Reinforcement Learning (RL), we introduce a novel visual feedback strategy named Visual Interrogation Verifies All (\textsc{Viva}). Unlike brittle syntax-based rules or pixel-level matching, \textsc{Viva} rewards the visual structure of rendered diagrams through a generative approach. Specifically, \textsc{Viva} actively generates targeted visual inquiries to scrutinize diagram visual fidelity and provides fine-grained feedback for optimization. This mechanism facilitates a self-evolving training process, effectively obviating the need for manually annotated ground truth code. Furthermore, we construct M3$^2$Diagram, the first large-scale diagram code generation dataset, containing over 196k high-quality instances. Experimental results confirm that the combination of SFT and our \textsc{Viva}-based RL allows OmniDiagram to establish a new state-of-the-art (SOTA) across diagram code generation benchmarks.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning

arXiv:2604.05517v1 Announce Type: new Abstract: A fundamental challenge in creative writing lies in reconciling the inherent tension between maintaining global coherence in long-form narratives and preserving local expressiveness in short-form texts. While long-context generation necessitates explicit macroscopic planning, short-form creativity often demands spontaneous, constraint-free expression. Existing alignment paradigms, however, typically employ static reward signals and rely heavily on high-quality supervised data, which is costly and difficult to scale. To address this, we propose \textbf{UniCreative}, a unified reference-free reinforcement learning framework. We first introduce \textbf{AC-GenRM}, an adaptive constraint-aware reward model that dynamically synthesizes query-specific criteria to provide fine-grained preference judgments. Leveraging these signals, we propose \textbf{ACPO}, a policy optimization algorithm that aligns models with human preferences across both content quality and structural paradigms without supervised fine-tuning and ground-truth references. Empirical results demonstrate that AC-GenRM aligns closely with expert evaluations, while ACPO significantly enhances performance across diverse writing tasks. Crucially, our analysis reveals an emergent meta-cognitive ability: the model learns to autonomously differentiate between tasks requiring rigorous planning and those favoring direct generation, validating the effectiveness of our direct alignment approach.

Fonte: arXiv cs.AI

RLScore 85

Bypassing the CSI Bottleneck: MARL-Driven Spatial Control for Reflector Arrays

arXiv:2604.05162v1 Announce Type: new Abstract: Reconfigurable Intelligent Surfaces (RIS) are pivotal for next-generation smart radio environments, yet their practical deployment is severely bottlenecked by the intractable computational overhead of Channel State Information (CSI) estimation. To bypass this fundamental physical-layer barrier, we propose an AI-native, data-driven paradigm that replaces complex channel modeling with spatial intelligence. This paper presents a fully autonomous Multi-Agent Reinforcement Learning (MARL) framework to control mechanically adjustable metallic reflector arrays. By mapping high-dimensional mechanical constraints to a reduced-order virtual focal point space, we deploy a Centralized Training with Decentralized Execution (CTDE) architecture. Using Multi-Agent Proximal Policy Optimization (MAPPO), our decentralized agents learn cooperative beam-focusing strategies relying on user coordinates, achieving CSI-free operation. High-fidelity ray-tracing simulations in dynamic non-line-of-sight (NLOS) environments demonstrate that this multi-agent approach rapidly adapts to user mobility, yielding up to a 26.86 dB enhancement over static flat reflectors and outperforming single-agent and hardware-constrained DRL baselines in both spatial selectivity and temporal stability. Crucially, the learned policies exhibit good deployment resilience, sustaining stable signal coverage even under 1.0-meter localization noise. These results validate the efficacy of MARL-driven spatial abstractions as a scalable, highly practical pathway toward AI-empowered wireless networks.

Fonte: arXiv cs.AI

RLScore 85

TRACE: Capability-Targeted Agentic Training

arXiv:2604.05336v1 Announce Type: new Abstract: Large Language Models (LLMs) deployed in agentic environments must exercise multiple capabilities across different task instances, where a capability is performing one or more actions in a trajectory that are necessary for successfully solving a subset of tasks in the environment. Many existing approaches either rely on synthetic training data that is not targeted to the model's actual capability deficits in the target environment or train directly on the target environment, where the model needs to implicitly learn the capabilities across tasks. We introduce TRACE (Turning Recurrent Agent failures into Capability-targeted training Environments), an end-to-end system for environment-specific agent self-improvement. TRACE contrasts successful and failed trajectories to automatically identify lacking capabilities, synthesizes a targeted training environment for each that rewards whether the capability was exercised, and trains a LoRA adapter via RL on each synthetic environment, routing to the relevant adapter at inference. Empirically, TRACE generalizes across different environments, improving over the base agent by +14.1 points on $\tau^2$-bench (customer service) and +7 perfect scores on ToolSandbox (tool use), outperforming the strongest baseline by +7.4 points and +4 perfect scores, respectively. Given the same number of rollouts, TRACE scales more efficiently than baselines, outperforming GRPO and GEPA by +9.2 and +7.4 points on $\tau^2$-bench.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning

arXiv:2604.05483v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideologized or incorrect responses, limiting their applications if there is no clear understanding of which topics their answers can be trusted. In this research, we introduce a novel algorithm, named as GMRL-BD, designed to identify the untrustworthy boundaries (in terms of topics) of a given LLM, with black-box access to the LLM and under specific query constraints. Based on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm incorporates with multiple reinforcement learning agents to efficiently identify topics (some nodes in KG) where the LLM is likely to generate biased answers. Our experiments demonstrated the efficiency of our algorithm, which can detect the untrustworthy boundary with just limited queries to the LLM. Additionally, we have released a new dataset containing popular LLMs including Llama2, Vicuna, Falcon, Qwen2, Gemma2 and Yi-1.5, along with labels indicating the topics on which each LLM is likely to be biased.

Fonte: arXiv cs.AI

RLScore 85

Breakthrough the Suboptimal Stable Point in Value-Factorization-Based Multi-Agent Reinforcement Learning

arXiv:2604.05297v1 Announce Type: new Abstract: Value factorization, a popular paradigm in MARL, faces significant theoretical and algorithmic bottlenecks: its tendency to converge to suboptimal solutions remains poorly understood and unsolved. Theoretically, existing analyses fail to explain this due to their primary focus on the optimal case. To bridge this gap, we introduce a novel theoretical concept: the stable point, which characterizes the potential convergence of value factorization in general cases. Through an analysis of stable point distributions in existing methods, we reveal that non-optimal stable points are the primary cause of poor performance. However, algorithmically, making the optimal action the unique stable point is nearly infeasible. In contrast, iteratively filtering suboptimal actions by rendering them unstable emerges as a more practical approach for global optimality. Inspired by this, we propose a novel Multi-Round Value Factorization (MRVF) framework. Specifically, by measuring a non-negative payoff increment relative to the previously selected action, MRVF transforms inferior actions into unstable points, thereby driving each iteration toward a stable point with a superior action. Experiments on challenging benchmarks, including predator-prey tasks and StarCraft II Multi-Agent Challenge (SMAC), validate our analysis of stable points and demonstrate the superiority of MRVF over state-of-the-art methods.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

ActivityEditor: Learning to Synthesize Physically Valid Human Mobility

arXiv:2604.05529v1 Announce Type: new Abstract: Human mobility modeling is indispensable for diverse urban applications. However, existing data-driven methods often suffer from data scarcity, limiting their applicability in regions where historical trajectories are unavailable or restricted. To bridge this gap, we propose \textbf{ActivityEditor}, a novel dual-LLM-agent framework designed for zero-shot cross-regional trajectory generation. Our framework decomposes the complex synthesis task into two collaborative stages. Specifically, an intention-based agent, which leverages demographic-driven priors to generate structured human intentions and coarse activity chains to ensure high-level socio-semantic coherence. These outputs are then refined by editor agent to obtain mobility trajectories through iteratively revisions that enforces human mobility law. This capability is acquired through reinforcement learning with multiple rewards grounded in real-world physical constraints, allowing the agent to internalize mobility regularities and ensure high-fidelity trajectory generation. Extensive experiments demonstrate that \textbf{ActivityEditor} achieves superior zero-shot performance when transferred across diverse urban contexts. It maintains high statistical fidelity and physical validity, providing a robust and highly generalizable solution for mobility simulation in data-scarce scenarios. Our code is available at: https://anonymous.4open.science/r/ActivityEditor-066B.

Fonte: arXiv cs.AI

RLScore 85

Learning to Focus: CSI-Free Hierarchical MARL for Reconfigurable Reflectors

arXiv:2604.05165v1 Announce Type: new Abstract: Reconfigurable Intelligent Surfaces (RIS) has a potential to engineer smart radio environments for next-generation millimeter-wave (mmWave) networks. However, the prohibitive computational overhead of Channel State Information (CSI) estimation and the dimensionality explosion inherent in centralized optimization severely hinder practical large-scale deployments. To overcome these bottlenecks, we introduce a ``CSI-free" paradigm powered by a Hierarchical Multi-Agent Reinforcement Learning (HMARL) architecture to control mechanically reconfigurable reflective surfaces. By substituting pilot-based channel estimation with accessible user localization data, our framework leverages spatial intelligence for macro-scale wave propagation management. The control problem is decomposed into a two-tier neural architecture: a high-level controller executes temporally extended, discrete user-to-reflector allocations, while low-level controllers autonomously optimize continuous focal points utilizing Multi-Agent Proximal Policy Optimization (MAPPO) under a Centralized Training with Decentralized Execution (CTDE) scheme. Comprehensive deterministic ray-tracing evaluations demonstrate that this hierarchical framework achieves massive RSSI improvements of up to 7.79 dB over centralized baselines. Furthermore, the system exhibits robust multi-user scalability and maintains highly resilient beam-focusing performance under practical sub-meter localization tracking errors. By eliminating CSI overhead while maintaining high-fidelity signal redirection, this work establishes a scalable and cost-effective blueprint for intelligent wireless environments.

Fonte: arXiv cs.AI

RLScore 85

Neural Assistive Impulses: Synthesizing Exaggerated Motions for Physics-based Characters

arXiv:2604.05394v1 Announce Type: new Abstract: Physics-based character animation has become a fundamental approach for synthesizing realistic, physically plausible motions. While current data-driven deep reinforcement learning (DRL) methods can synthesize complex skills, they struggle to reproduce exaggerated, stylized motions, such as instantaneous dashes or mid-air trajectory changes, which are required in animation but violate standard physical laws. The primary limitation stems from modeling the character as an underactuated floating-base system, in which internal joint torques and momentum conservation strictly govern motion. Direct attempts to enforce such motions via external wrenches often lead to training instability, as velocity discontinuities produce sparse, high-magnitude force spikes that prevent policy convergence. We propose Assistive Impulse Neural Control, a framework that reformulates external assistance in impulse space rather than force space to ensure numerical stability. We decompose the assistive signal into an analytic high-frequency component derived from Inverse Dynamics and a learned low-frequency residual correction, governed by a hybrid neural policy. We demonstrate that our method enables robust tracking of highly agile, dynamically infeasible maneuvers that were previously intractable for physics-based methods.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Towards Intelligent Energy Security: A Unified Spatio-Temporal and Graph Learning Framework for Scalable Electricity Theft Detection in Smart Grids

arXiv:2604.03344v1 Announce Type: new Abstract: Electricity theft and non-technical losses (NTLs) remain critical challenges in modern smart grids, causing significant economic losses and compromising grid reliability. This study introduces the SmartGuard Energy Intelligence System (SGEIS), an integrated artificial intelligence framework for electricity theft detection and intelligent energy monitoring. The proposed system combines supervised machine learning, deep learning-based time-series modeling, Non-Intrusive Load Monitoring (NILM), and graph-based learning to capture both temporal and spatial consumption patterns. A comprehensive data processing pipeline is developed, incorporating feature engineering, multi-scale temporal analysis, and rule-based anomaly labeling. Deep learning models, including Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and Autoencoders, are employed to detect abnormal usage patterns. In parallel, ensemble learning methods such as Random Forest, Gradient Boosting, XGBoost, and LightGBM are utilized for classification. To model grid topology and spatial dependencies, Graph Neural Networks (GNNs) are applied to identify correlated anomalies across interconnected nodes. The NILM module enhances interpretability by disaggregating appliance-level consumption from aggregate signals. Experimental results demonstrate strong performance, with Gradient Boosting achieving a ROC-AUC of 0.894, while graph-based models attain over 96% accuracy in identifying high-risk nodes. The hybrid framework improves detection robustness by integrating temporal, statistical, and spatial intelligence. Overall, SGEIS provides a scalable and practical solution for electricity theft detection, offering high accuracy, improved interpretability, and strong potential for real-world smart grid deployment.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

BioAlchemy: Distilling Biological Literature into Reasoning-Ready Reinforcement Learning Training Data

arXiv:2604.03506v1 Announce Type: new Abstract: Despite the large corpus of biology training text, the impact of reasoning models on biological research generally lags behind math and coding. In this work, we show that biology questions from current large-scale reasoning datasets do not align well with modern research topic distributions in biology, and that this topic imbalance may negatively affect performance. In addition, we find that methods for extracting challenging and verifiable research problems from biology research text are a critical yet underdeveloped ingredient in applying reinforcement learning for better performance on biology research tasks. We introduce BioAlchemy, a pipeline for sourcing a diverse set of verifiable question-and-answer pairs from a scientific corpus of biology research text. We curate BioAlchemy-345K, a training dataset containing over 345K scientific reasoning problems in biology. Then, we demonstrate how aligning our dataset to the topic distribution of modern scientific biology can be used with reinforcement learning to improve reasoning performance. Finally, we present BioAlchemist-8B, which improves over its base reasoning model by 9.12% on biology benchmarks. These results demonstrate the efficacy of our approach for developing stronger scientific reasoning capabilities in biology. The BioAlchemist-8B model is available at: https://huggingface.co/BioAlchemy.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Learning Additively Compositional Latent Actions for Embodied AI

arXiv:2604.03340v1 Announce Type: new Abstract: Latent action learning infers pseudo-action labels from visual transitions, providing an approach to leverage internet-scale video for embodied AI. However, most methods learn latent actions without structural priors that encode the additive, compositional structure of physical motion. As a result, latents often entangle irrelevant scene details or information about future observations with true state changes and miscalibrate motion magnitude. We introduce Additively Compositional Latent Action Model (AC-LAM), which enforces scene-wise additive composition structure over short horizons on the latent action space. These AC constraints encourage simple algebraic structure in the latent action space~(identity, inverse, cycle consistency) and suppress information that does not compose additively. Empirically, AC-LAM learns more structured, motion-specific, and displacement-calibrated latent actions and provides stronger supervision for downstream policy learning, outperforming state-of-the-art LAMs across simulated and real-world tabletop tasks.

Fonte: arXiv cs.CV

NLP/LLMsScore 85

Neural Operators for Multi-Task Control and Adaptation

arXiv:2604.03449v1 Announce Type: new Abstract: Neural operator methods have emerged as powerful tools for learning mappings between infinite-dimensional function spaces, yet their potential in optimal control remains largely unexplored. We focus on multi-task control problems, whose solution is a mapping from task description (e.g., cost or dynamics functions) to optimal control law (e.g., feedback policy). We approximate these solution operators using a permutation-invariant neural operator architecture. Across a range of parametric optimal control environments and a locomotion benchmark, a single operator trained via behavioral cloning accurately approximates the solution operator and generalizes to unseen tasks, out-of-distribution settings, and varying amounts of task observations. We further show that the branch-trunk structure of our neural operator architecture enables efficient and flexible adaptation to new tasks. We develop structured adaptation strategies ranging from lightweight updates to full-network fine-tuning, achieving strong performance across different data and compute settings. Finally, we introduce meta-trained operator variants that optimize the initialization for few-shot adaptation. These methods enable rapid task adaptation with limited data and consistently outperform a popular meta-learning baseline. Together, our results demonstrate that neural operators provide a unified and efficient framework for multi-task control and adaptation.

Fonte: arXiv cs.LG

RLScore 85

Online learning of smooth functions on $\mathbb{R}$

arXiv:2604.03525v1 Announce Type: new Abstract: We study adversarial online learning of real-valued functions on $\mathbb{R}$. In each round the learner is queried at $x_t\in\mathbb{R}$, predicts $\hat y_t$, and then observes the true value $f(x_t)$; performance is measured by cumulative $p$-loss $\sum_{t\ge 1}|\hat y_t-f(x_t)|^p$. For the class \[ \mathcal{G}_q=\Bigl\{f:\mathbb{R}\to\mathbb{R}\ \text{absolutely continuous}:\ \int_{\mathbb{R}}|f'(x)|^q\,dx\le 1\Bigr\}, \] we show that the standard model becomes ill-posed on $\mathbb{R}$: for every $p\ge 1$ and $q>1$, an adversary can force infinite loss. Motivated by this obstruction, we analyze three modified learning scenarios that limit the influence of queries that are far from previously observed inputs. In Scenario 1 the adversary must choose each new query within distance $1$ of some past query. In Scenario 2 the adversary may query anywhere, but the learner is penalized only on rounds whose query lies within distance $1$ of a past query. In Scenario 3 the loss in round $t$ is multiplied by a weight $g(\min_{j<t}|x_t-x_j|)$. We obtain sharp characterizations for Scenarios 1-2 in several regimes. For Scenario 3 we identify a clean threshold phenomenon: if $g$ decays too slowly, then the adversary can force infinite weighted loss. In contrast, for rapidly decaying weights such as $g(z)=e^{-cz}$ we obtain finite and sharp guarantees in the quadratic case $p=q=2$. Finally, we study a natural multivariable slice generalization $\mathcal{G}_{q,d}$ of $\mathcal{G}_q$ on $\mathbb{R}^d$ and show a sharp dichotomy: while the one-dimensional case admits finite opt-values in certain regimes, for every $d\ge 2$ the slice class $\mathcal{G}_{q,d}$ is too permissive, and even under Scenarios 1-3 an adversary can force infinite loss.

Fonte: arXiv cs.LG

RLScore 85

Mitigating Value Hallucination in Dyna Planning via Multistep Predecessor Models

arXiv:2006.04363v2 Announce Type: replace-cross Abstract: Dyna-style reinforcement learning (RL) agents improve sample efficiency over model-free RL agents by updating the value function with simulated experience generated by an environment model. However, it is often difficult to learn accurate models of environment dynamics, and even small errors may result in failure of Dyna agents. In this paper, we highlight that one potential cause of that failure is bootstrapping off of the values of simulated states, and introduce a new Dyna algorithm to avoid this failure. We discuss a design space of Dyna algorithms, based on using successor or predecessor models -- simulating forwards or backwards -- and using one-step or multi-step updates. Three of the variants have been explored, but surprisingly the fourth variant has not: using predecessor models with multi-step updates. We present the \emph{Hallucinated Value Hypothesis} (HVH): updating the values of real states towards values of simulated states can result in misleading action values which adversely affect the control policy. We discuss and evaluate all four variants of Dyna amongst which three update real states toward simulated states -- so potentially toward hallucinated values -- and our proposed approach, which does not. The experimental results provide evidence for the HVH, and suggest that using predecessor models with multi-step updates is a promising direction toward developing Dyna algorithms that are more robust to model error.

Fonte: arXiv stat.ML

NLP/LLMsScore 85

PRAISE: Prefix-Based Rollout Reuse in Agentic Search Training

arXiv:2604.03675v1 Announce Type: new Abstract: In agentic search, large language models (LLMs) are trained to perform multi-turn retrieval and reasoning for complex tasks such as multi-hop question answering (QA). However, current search-based Reinforcement Learning (RL) methods suffer from two core limitations: expensive long-horizon rollouts are under-utilized during training, and supervision is typically available only at the final answer, resulting in severe reward sparsity. We present Prefix-based Rollout reuse for Agentic search with Intermediate Step rEwards (PRAISE), a framework for improving both data efficiency and credit assignment in agentic search training. Given a complete search trajectory, PRAISE extracts prefix states at different search turns, elicits intermediate answers from them, and uses these prefixes both to construct additional training trajectories and to derive step-level rewards from performance differences across prefixes. Our method uses a single shared model for both search policy learning and prefix answer evaluation, enabling joint optimization without extra human annotations or a separate reward model. Experiments on multi-hop QA benchmarks show that PRAISE consistently improves performance over strong baselines.

Fonte: arXiv cs.AI

RLScore 85

RL-Driven Sustainable Land-Use Allocation for the Lake Malawi Basin

arXiv:2604.03768v1 Announce Type: new Abstract: Unsustainable land-use practices in ecologically sensitive regions threaten biodiversity, water resources, and the livelihoods of millions. This paper presents a deep reinforcement learning (RL) framework for optimizing land-use allocation in the Lake Malawi Basin to maximize total ecosystem service value (ESV). Drawing on the benefit transfer methodology of Costanza et al., we assign biome-specific ESV coefficients -- locally anchored to a Malawi wetland valuation -- to nine land-cover classes derived from Sentinel-2 imagery. The RL environment models a 50x50 cell grid at 500m resolution, where a Proximal Policy Optimization (PPO) agent with action masking iteratively transfers land-use pixels between modifiable classes. The reward function combines per-cell ecological value with spatial coherence objectives: contiguity bonuses for ecologically connected land-use patches (forest, cropland, built area etc.) and buffer zone penalties for high-impact development adjacent to water bodies. We evaluate the framework across three scenarios: (i) pure ESV maximization, (ii) ESV with spatial reward shaping, and (iii) a regenerative agriculture policy scenario. Results demonstrate that the agent effectively learns to increase total ESV; that spatial reward shaping successfully steers allocations toward ecologically sound patterns, including homogeneous land-use clustering and slight forest consolidation near water bodies; and that the framework responds meaningfully to policy parameter changes, establishing its utility as a scenario-analysis tool for environmental planning.

Fonte: arXiv cs.AI

RLScore 85

Decomposing Communication Gain and Delay Cost Under Cross-Timestep Delays in Cooperative Multi-Agent Reinforcement Learning

arXiv:2604.03785v1 Announce Type: new Abstract: Communication is essential for coordination in \emph{cooperative} multi-agent reinforcement learning under partial observability, yet \emph{cross-timestep} delays cause messages to arrive multiple timesteps after generation, inducing temporal misalignment and making information stale when consumed. We formalize this setting as a delayed-communication partially observable Markov game (DeComm-POMG) and decompose a message's effect into \emph{communication gain} and \emph{delay cost}, yielding the Communication Gain and Delay Cost (CGDC) metric. We further establish a value-loss bound showing that the degradation induced by delayed messages is upper-bounded by a discounted accumulation of an information gap between the action distributions induced by timely versus delayed messages. Guided by CGDC, we propose \textbf{CDCMA}, an actor--critic framework that requests messages only when predicted CGDC is positive, predicts future observations to reduce misalignment at consumption, and fuses delayed messages via CGDC-guided attention. Experiments on no-teammate-vision variants of Cooperative Navigation and Predator Prey, and on SMAC maps across multiple delay levels show consistent improvements in performance, robustness, and generalization, with ablations validating each component.

Fonte: arXiv cs.AI

RLScore 85

When Adaptive Rewards Hurt: Causal Probing and the Switching-Stability Dilemma in LLM-Guided LEO Satellite Scheduling

arXiv:2604.03562v1 Announce Type: new Abstract: Adaptive reward design for deep reinforcement learning (DRL) in multi-beam LEO satellite scheduling is motivated by the intuition that regime-aware reward weights should outperform static ones. We systematically test this intuition and uncover a switching-stability dilemma: near-constant reward weights (342.1 Mbps) outperform carefully-tuned dynamic weights (103.3+/-96.8 Mbps) because PPO requires a quasistationary reward signal for value function convergence. Weight adaptation-regardless of quality-degrades performance by repeatedly restarting convergence. To understand why specific weights matter, we introduce a single-variable causal probing method that independently perturbs each reward term by +/-20% and measures PPO response after 50k steps. Probing reveals counterintuitive leverage: a +20% increase in the switching penalty yields +157 Mbps for polar handover and +130 Mbps for hot-cold regimes-findings inaccessible to human experts or trained MLPs without systematic probing. We evaluate four MDP architect variants (fixed, rule-based, learned MLP, finetuned LLM) across known and novel traffic regimes. The MLP achieves 357.9 Mbps on known regimes and 325.2 Mbps on novel regimes, while the fine-tuned LLM collapses to 45.3+/-43.0 Mbps due to weight oscillation rather than lack of domain knowledge-output consistency, not knowledge, is the binding constraint. Our findings provide an empirically-grounded roadmap for LLM-DRL integration in communication systems, identifying where LLMs add irreplaceable value (natural language intent understanding) versus where simpler methods suffice.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Comparative reversal learning reveals rigid adaptation in LLMs under non-stationary uncertainty

arXiv:2604.04182v1 Announce Type: new Abstract: Non-stationary environments require agents to revise previously learned action values when contingencies change. We treat large language models (LLMs) as sequential decision policies in a two-option probabilistic reversal-learning task with three latent states and switch events triggered by either a performance criterion or timeout. We compare a deterministic fixed transition cycle to a stochastic random schedule that increases volatility, and evaluate DeepSeek-V3.2, Gemini-3, and GPT-5.2, with human data as a behavioural reference. Across models, win-stay was near ceiling while lose-shift was markedly attenuated, revealing asymmetric use of positive versus negative evidence. DeepSeek-V3.2 showed extreme perseveration after reversals and weak acquisition, whereas Gemini-3 and GPT-5.2 adapted more rapidly but still remained less loss-sensitive than humans. Random transitions amplified reversal-specific persistence across LLMs yet did not uniformly reduce total wins, demonstrating that high aggregate payoff can coexist with rigid adaptation. Hierarchical reinforcement-learning (RL) fits indicate dissociable mechanisms: rigidity can arise from weak loss learning, inflated policy determinism, or value polarisation via counterfactual suppression. These results motivate reversal-sensitive diagnostics and volatility-aware models for evaluating LLMs under non-stationary uncertainty.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

DARE: Diffusion Large Language Models Alignment and Reinforcement Executor

arXiv:2604.04215v1 Announce Type: new Abstract: Diffusion large language models (dLLMs) are emerging as a compelling alternative to dominant autoregressive models, replacing strictly sequential token generation with iterative denoising and parallel generation dynamics. However, their open-source ecosystem remains fragmented across model families and, in particular, across post-training pipelines, where reinforcement learning objectives, rollout implementations and evaluation scripts are often released as paper-specific codebases. This fragmentation slows research iteration, raises the engineering burden of reproduction, and makes fair comparison across algorithms difficult. We present \textbf{DARE} (\textbf{d}LLMs \textbf{A}lignment and \textbf{R}einforcement \textbf{E}xecutor), an open framework for post-training and evaluating dLLMs. Built on top of verl~\cite{sheng2024hybridflow} and OpenCompass~\cite{2023opencompass}, DARE unifies supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and dLLM-specific reinforcement learning under a shared execution stack for both masked and block diffusion language models. Across representative model families including LLaDA, Dream, SDAR, and LLaDA2.x, DARE provides broad algorithmic coverage, reproducible benchmark evaluation, and practical acceleration. Extensive empirical results position that DARE serves as a reusable research substrate for developing, comparing, and deploying post-training methods for current and emerging dLLMs.

Fonte: arXiv cs.CL

Theory/OptimizationScore 85

Improving Feasibility via Fast Autoencoder-Based Projections

arXiv:2604.03489v1 Announce Type: new Abstract: Enforcing complex (e.g., nonconvex) operational constraints is a critical challenge in real-world learning and control systems. However, existing methods struggle to efficiently enforce general classes of constraints. To address this, we propose a novel data-driven amortized approach that uses a trained autoencoder as an approximate projector to provide fast corrections to infeasible predictions. Specifically, we train an autoencoder using an adversarial objective to learn a structured, convex latent representation of the feasible set. This enables rapid correction of neural network outputs by projecting their associated latent representations onto a simple convex shape before decoding into the original feasible set. We test our approach on a diverse suite of constrained optimization and reinforcement learning problems with challenging nonconvex constraints. Results show that our method effectively enforces constraints at a low computational cost, offering a practical alternative to expensive feasibility correction techniques based on traditional solvers.

Fonte: arXiv cs.LG

RLScore 85

Provable Multi-Task Reinforcement Learning: A Representation Learning Framework with Low Rank Rewards

arXiv:2604.03891v1 Announce Type: new Abstract: Multi-task representation learning (MTRL) is an approach that learns shared latent representations across related tasks, facilitating collaborative learning that improves the overall learning efficiency. This paper studies MTRL for multi-task reinforcement learning (RL), where multiple tasks have the same state-action space and transition probabilities, but different rewards. We consider T linear Markov Decision Processes (MDPs) where the reward functions and transition dynamics admit linear feature embeddings of dimension d. The relatedness among the tasks is captured by a low-rank structure on the reward matrices. Learning shared representations across multiple RL tasks is challenging due to the complex and policy-dependent nature of data that leads to a temporal progression of error. Our approach adopts a reward-free reinforcement learning framework to first learn a data-collection policy. This policy then informs an exploration strategy for estimating the unknown reward matrices. Importantly, the data collected under this well-designed policy enable accurate estimation, which ultimately supports the learning of an near-optimal policy. Unlike existing approaches that rely on restrictive assumptions such as Gaussian features, incoherence conditions, or access to optimal solutions, we propose a low-rank matrix estimation method that operates under more general feature distributions encountered in RL settings. Theoretical analysis establishes that accurate low-rank matrix recovery is achievable under these relaxed assumptions, and we characterize the relationship between representation error and sample complexity. Leveraging the learned representation, we construct near-optimal policies and prove a regret bound. Experimental results demonstrate that our method effectively learns robust shared representations and task dynamics from finite data.

Fonte: arXiv cs.LG

RLScore 85

Nearly Optimal Best Arm Identification for Semiparametric Bandits

arXiv:2604.03969v1 Announce Type: new Abstract: We study fixed-confidence Best Arm Identification (BAI) in semiparametric bandits, where rewards are linear in arm features plus an unknown additive baseline shift. Unlike linear-bandit BAI, this setting requires orthogonalized regression, and its instance-optimal sample complexity has remained open. For the transductive setting, we establish an attainable instance-dependent lower bound characterized by the corresponding linear-bandit complexity on shifted features. We then propose a computationally efficient phase-elimination algorithm based on a new $XY$-design for orthogonalized regression. Our analysis yields a nearly optimal high-probability sample-complexity upper bound, up to log factors and an additive $d^2$ term, and experiments on synthetic instances and the Jester dataset show clear gains over prior baselines.

Fonte: arXiv stat.ML

RLScore 85

Sharp asymptotic theory for Q-learning with LDTZ learning rate and its generalization

arXiv:2604.04218v1 Announce Type: new Abstract: Despite the sustained popularity of Q-learning as a practical tool for policy determination, a majority of relevant theoretical literature deals with either constant ($\eta_{t}\equiv \eta$) or polynomially decaying ($\eta_{t} = \eta t^{-\alpha}$) learning schedules. However, it is well known that these choices suffer from either persistent bias or prohibitively slow convergence. In contrast, the recently proposed linear decay to zero (\texttt{LD2Z}: $\eta_{t,n}=\eta(1-t/n)$) schedule has shown appreciable empirical performance, but its theoretical and statistical properties remain largely unexplored, especially in the Q-learning setting. We address this gap in the literature by first considering a general class of power-law decay to zero (\texttt{PD2Z}-$\nu$: $\eta_{t,n}=\eta(1-t/n)^{\nu}$). Proceeding step-by-step, we present a sharp non-asymptotic error bound for Q-learning with \texttt{PD2Z}-$\nu$ schedule, which then is used to derive a central limit theory for a new \textit{tail} Polyak-Ruppert averaging estimator. Finally, we also provide a novel time-uniform Gaussian approximation (also known as \textit{strong invariance principle}) for the partial sum process of Q-learning iterates, which facilitates bootstrap-based inference. All our theoretical results are complemented by extensive numerical experiments. Beyond being new theoretical and statistical contributions to the Q-learning literature, our results definitively establish that \texttt{LD2Z} and in general \texttt{PD2Z}-$\nu$ achieve a best-of-both-worlds property: they inherit the rapid decay from initialization (characteristic of constant step-sizes) while retaining the asymptotic convergence guarantees (characteristic of polynomially decaying schedules). This dual advantage explains the empirical success of \texttt{LD2Z} while providing practical guidelines for inference through our results.

Fonte: arXiv stat.ML

RLScore 85

Delayed Homomorphic Reinforcement Learning for Environments with Delayed Feedback

arXiv:2604.03641v1 Announce Type: new Abstract: Reinforcement learning in real-world systems is often accompanied by delayed feedback, which breaks the Markov assumption and impedes both learning and control. Canonical state augmentation approaches cause the state-space explosion, which introduces a severe sample-complexity burden. Despite recent progress, the state-of-the-art augmentation-based baselines remain incomplete: they either predominantly reduce the burden on the critic or adopt non-unified treatments for the actor and critic. To provide a structured and sample-efficient solution, we propose delayed homomorphic reinforcement learning (DHRL), a framework grounded in MDP homomorphisms that collapses belief-equivalent augmented states and enables efficient policy learning on the resulting abstract MDP without loss of optimality. We provide theoretical analyses of state-space compression bounds and sample complexity, and introduce a practical algorithm. Experiments on continuous control tasks in MuJoCo benchmark confirm that our algorithm outperforms strong augmentation-based baselines, particularly under long delays.

Fonte: arXiv cs.LG

RLScore 85

Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

arXiv:2604.02869v1 Announce Type: new Abstract: Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Relative Policy Optimization) combined with GTPO (Generalized Token-level Policy Optimization) for training a tool-calling agent on realistic customer service tasks with an LLM-based user simulator. Through systematic analysis of training rollouts, we discover that naively designed dense per-turn rewards degrade performance by up to 14 percentage points due to misalignment between reward discriminativeness and advantage direction. We introduce Iterative Reward Calibration, a methodology for designing per-turn rewards using empirical discriminative analysis of rollout data, and show that our GTPO hybrid advantage formulation eliminates the advantage misalignment problem. Applied to the Tau-Bench airline benchmark, our approach improves Qwen3.5-4B from 63.8 percent to 66.7 percent (+2.9pp) and Qwen3-30B-A3B from 58.0 percent to 69.5 percent (+11.5pp) -- with the trained 4B model exceeding GPT-4.1 (49.4 percent) and GPT-4o (42.8 percent) despite being 50 times smaller, and the 30.5B MoE model approaching Claude Sonnet 4.5 (70.0 percent). To our knowledge, these are the first published RL training results on Tau-Bench. We release our code, reward calibration analysis, and training recipes.

Fonte: arXiv cs.AI

RLScore 85

Dynamic Mask Enhanced Intelligent Multi-UAV Deployment for Urban Vehicular Networks

arXiv:2604.02358v1 Announce Type: cross Abstract: Vehicular Ad Hoc Networks (VANETs) play a crucial role in realizing vehicle-road collaboration and intelligent transportation. However, urban VANETs often face challenges such as frequent link disconnections and subnet fragmentation, which hinder reliable connectivity. To address these issues, we dynamically deploy multiple Unmanned Aerial Vehicles (UAVs) as communication relays to enhance VANET. A novel Score based Dynamic Action Mask enhanced QMIX algorithm (Q-SDAM) is proposed for multi-UAV deployment, which maximizes vehicle connectivity while minimizing multi-UAV energy consumption. Specifically, we design a score-based dynamic action mask mechanism to guide UAV agents in exploring large action spaces, accelerate the learning process and enhance optimization performance. The practicality of Q-SDAM is validated using real-world datasets. We show that Q-SDAM improves connectivity by 18.2% while reducing energy consumption by 66.6% compared with existing algorithms.

Fonte: arXiv cs.AI

RLScore 85

OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration

arXiv:2604.02349v1 Announce Type: cross Abstract: Preference-based reinforcement learning (PbRL) can help avoid sophisticated reward designs and align better with human intentions, showing great promise in various real-world applications. However, obtaining human feedback for preferences can be expensive and time-consuming, which forms a strong barrier for PbRL. In this work, we address the problem of low query efficiency in offline PbRL, pinpointing two primary reasons: inefficient exploration and overoptimization of learned reward functions. In response to these challenges, we propose a novel algorithm, \textbf{O}ffline \textbf{P}b\textbf{R}L via \textbf{I}n-\textbf{D}ataset \textbf{E}xploration (OPRIDE), designed to enhance the query efficiency of offline PbRL. OPRIDE consists of two key features: a principled exploration strategy that maximizes the informativeness of the queries and a discount scheduling mechanism aimed at mitigating overoptimization of the learned reward functions. Through empirical evaluations, we demonstrate that OPRIDE significantly outperforms prior methods, achieving strong performance with notably fewer queries. Moreover, we provide theoretical guarantees of the algorithm's efficiency. Experimental results across various locomotion, manipulation, and navigation tasks underscore the efficacy and versatility of our approach.

Fonte: arXiv cs.AI

RLScore 85

Contextual Intelligence The Next Leap for Reinforcement Learning

arXiv:2604.02348v1 Announce Type: new Abstract: Reinforcement learning (RL) has produced spectacular results in games, robotics, and continuous control. Yet, despite these successes, learned policies often fail to generalize beyond their training distribution, limiting real-world impact. Recent work on contextual RL (cRL) shows that exposing agents to environment characteristics -- contexts -- can improve zero-shot transfer. So far, the community has treated context as a monolithic, static observable, an approach that constrains the generalization capabilities of RL agents. To achieve contextual intelligence we first propose a novel taxonomy of contexts that separates allogenic (environment-imposed) from autogenic (agent-driven) factors. We identify three fundamental research directions that must be addressed to promote truly contextual intelligence: (1) Learning with heterogeneous contexts to explicitly exploit the taxonomy levels so agents can reason about their influence on the world and vice versa; (2) Multi-time-scale modeling to recognize that allogenic variables evolve slowly or remain static, whereas autogenic variables may change within an episode, potentially requiring different learning mechanisms; (3) Integration of abstract, high-level contexts to incorporate roles, resource & regulatory regimes, uncertainties, and other non-physical descriptors that crucially influence behavior. We envision context as a first-class modeling primitive, empowering agents to reason about who they are, what the world permits, and how both evolve over time. By doing so, we aim to catalyze a new generation of context-aware agents that can be deployed safely and efficiently in the real world.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

CharTool: Tool-Integrated Visual Reasoning for Chart Understanding

arXiv:2604.02794v1 Announce Type: new Abstract: Charts are ubiquitous in scientific and financial literature for presenting structured data. However, chart reasoning remains challenging for multimodal large language models (MLLMs) due to the lack of high-quality training data, as well as the need for fine-grained visual grounding and precise numerical computation. To address these challenges, we first propose DuoChart, a scalable dual-source data pipeline that combines synthesized charts with real-world charts to construct diverse, high-quality chart training data. We then introduce CharTool, which equips MLLMs with external tools, including image cropping for localized visual perception and code-based computation for accurate numerical reasoning. Through agentic reinforcement learning on DuoChart, CharTool learns tool-integrated reasoning grounded in chart content. Extensive experiments on six chart benchmarks show that our method consistently improves over strong MLLM baselines across model scales. Notably, CharTool-7B outperforms the base model by **+8.0%** on CharXiv (Reasoning) and **+9.78%** on ChartQAPro, while achieving competitive performance with substantially larger or proprietary models. Moreover, CharTool demonstrates positive generalization to out-of-domain visual math reasoning benchmarks.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

arXiv:2604.02795v1 Announce Type: new Abstract: Rubric-based Reinforcement Learning (RL) has emerged as a promising approach for aligning Large Language Models (LLMs) with complex, open-domain instruction following tasks. However, existing methods predominantly rely on response-level rewards, introducing severe reward sparsity and reward ambiguity problems. To address these issues, we propose Rubrics to Tokens (RTT), a novel rubric-based RL framework that bridges coarse response-level scores and fine-grained token-level credit assignment. RTT introduces a Token-Level Relevance Discriminator to predict which tokens in the response are responsible for a specific constraint, and optimizes the policy model via RTT-GRPO, which integrates response-level and token-level advantages within a unified framework. Furthermore, when transitioning from one-dimensional, outcome-level reward to three-dimensional reward space in the token-level rubric-based RL, we propose a novel group normalization method, called Intra-sample Token Group Normalization, to accommodate this shift. Extensive experiments and benchmarks demonstrate that RTT consistently outperforms other baselines in both instruction- and rubric-level accuracy across different models.

Fonte: arXiv cs.CL

RLScore 85

Coupled Control, Structured Memory, and Verifiable Action in Agentic AI (SCRAT -- Stochastic Control with Retrieval and Auditable Trajectories): A Comparative Perspective from Squirrel Locomotion and Scatter-Hoarding

arXiv:2604.03201v1 Announce Type: new Abstract: Agentic AI is increasingly judged not by fluent output alone but by whether it can act, remember, and verify under partial observability, delay, and strategic observation. Existing research often studies these demands separately: robotics emphasizes control, retrieval systems emphasize memory, and alignment or assurance work emphasizes checking and oversight. This article argues that squirrel ecology offers a sharp comparative case because arboreal locomotion, scatter-hoarding, and audience-sensitive caching couple all three demands in one organism. We synthesize evidence from fox, eastern gray, and, in one field comparison, red squirrels, and impose an explicit inference ladder: empirical observation, minimal computational inference, and AI design conjecture. We introduce a minimal hierarchical partially observed control model with latent dynamics, structured episodic memory, observer-belief state, option-level actions, and delayed verifier signals. This motivates three hypotheses: (H1) fast local feedback plus predictive compensation improves robustness under hidden dynamics shifts; (H2) memory organized for future control improves delayed retrieval under cue conflict and load; and (H3) verifiers and observer models inside the action-memory loop reduce silent failure and information leakage while remaining vulnerable to misspecification. A downstream conjecture is that role-differentiated proposer/executor/checker/adversary systems may reduce correlated error under asymmetric information and verification burden. The contribution is a comparative perspective and benchmark agenda: a disciplined program of falsifiable claims about the coupling of control, memory, and verifiable action.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

VBGS-SLAM: Variational Bayesian Gaussian Splatting Simultaneous Localization and Mapping

arXiv:2604.02696v1 Announce Type: new Abstract: 3D Gaussian Splatting (3DGS) has shown promising results for 3D scene modeling using mixtures of Gaussians, yet its existing simultaneous localization and mapping (SLAM) variants typically rely on direct, deterministic pose optimization against the splat map, making them sensitive to initialization and susceptible to catastrophic forgetting as map evolves. We propose Variational Bayesian Gaussian Splatting SLAM (VBGS-SLAM), a novel framework that couples the splat map refinement and camera pose tracking in a generative probabilistic form. By leveraging conjugate properties of multivariate Gaussians and variational inference, our method admits efficient closed-form updates and explicitly maintains posterior uncertainty over both poses and scene parameters. This uncertainty-aware method mitigates drift and enhances robustness in challenging conditions, while preserving the efficiency and rendering quality of existing 3DGS. Our experiments demonstrate superior tracking performance and robustness in long sequence prediction, alongside efficient, high-quality novel view synthesis across diverse synthetic and real-world scenes.

Fonte: arXiv cs.CV

RLScore 90

GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning

arXiv:2604.02721v1 Announce Type: new Abstract: Competitive programming remains one of the last few human strongholds in coding against AI. The best AI system to date still underperforms the best humans competitive programming: the most recent best result, Google's Gemini~3 Deep Think, attained 8th place even not being evaluated under live competition conditions. In this work, we introduce GrandCode, a multi-agent RL system designed for competitive programming. The capability of GrandCode is attributed to two key factors: (1) It orchestrates a variety of agentic modules (hypothesis proposal, solver, test generator, summarization, etc) and jointly improves them through post-training and online test-time RL; (2) We introduce Agentic GRPO specifically designed for multi-stage agent rollouts with delayed rewards and the severe off-policy drift that is prevalent in agentic RL. GrandCode is the first AI system that consistently beats all human participants in live contests of competitive programming: in the most recent three Codeforces live competitions, i.e., Round~1087 (Mar 21, 2026), Round~1088 (Mar 28, 2026), and Round~1089 (Mar 29, 2026), GrandCode placed first in all of them, beating all human participants, including legendary grandmasters. GrandCode shows that AI systems have reached a point where they surpass the strongest human programmers on the most competitive coding tasks.

Fonte: arXiv cs.AI

RLScore 85

Generalization Limits of Reinforcement Learning Alignment

arXiv:2604.02652v1 Announce Type: new Abstract: The safety of large language models (LLMs) relies on alignment techniques such as reinforcement learning from human feedback (RLHF). However, recent theoretical analyses suggest that reinforcement learning-based training does not acquire new capabilities but merely redistributes the utilization probabilities of existing ones. In this study, we propose ``compound jailbreaks'' targeting OpenAI gpt-oss-20b, which exploit the generalization failures of alignment. This approach combines multiple attack techniques -- each individually defended against -- to saturate the instruction hierarchy maintenance process. Our evaluation shows that the attack success rate (ASR) increased from 14.3\% with individual methods to 71.4\% with the combined approach. These results provide empirical evidence for the hypothesis that safety training does not generalize as broadly as model capabilities, highlighting the need for multifaceted safety evaluations using compound attack scenarios.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge

arXiv:2604.02621v1 Announce Type: new Abstract: Reinforcement Learning (RL) has been shown to substantially improve the reasoning capability of small and large language models (LLMs), but existing approaches typically rely on verifiable rewards, hence ground truth labels. We propose an RL framework that uses rewards from an LLM that acts as a judge evaluating model outputs over large amounts of unlabeled data, enabling label-free knowledge distillation and replacing the need of ground truth supervision. Notably, the judge operates with a single-token output, making reward computation efficient. When combined with verifiable rewards, our approach yields substantial performance gains across math reasoning benchmarks. These results suggest that LLM-based evaluators can produce effective training signals for RL fine-tuning.

Fonte: arXiv cs.CL

RLScore 85

Fast Best-in-Class Regret for Contextual Bandits

arXiv:2510.15483v2 Announce Type: replace Abstract: We study the problem of stochastic contextual bandits in the agnostic setting, where the goal is to compete with the best policy in a given class without assuming realizability or imposing model restrictions on losses or rewards. In this work, we establish the first fast rate for regret relative to the best-in-class policy. Our proposed algorithm updates the policy at every round by minimizing a pessimistic objective, defined as a clipped inverse-propensity estimate of the policy value plus a variance penalty. By leveraging entropy assumptions on the policy class and a H\"olderian error-bound condition (a generalization of the margin condition), we achieve fast best-in-class regret rates, including polylogarithmic rates in the parametric case. The analysis is driven by a sequential self-normalized maximal inequality for bounded martingale empirical processes, which yields uniform variance-adaptive confidence bounds and guarantees pessimism under adaptive data collection.

Fonte: arXiv stat.ML

RLScore 85

Reinforcement Learning from Human Feedback: A Statistical Perspective

arXiv:2604.02507v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) has emerged as a central framework for aligning large language models (LLMs) with human preferences. Despite its practical success, RLHF raises fundamental statistical questions because it relies on noisy, subjective, and often heterogeneous feedback to learn reward models and optimize policies. This survey provides a statistical perspective on RLHF, focusing primarily on the LLM alignment setting. We introduce the main components of RLHF, including supervised fine-tuning, reward modeling, and policy optimization, and relate them to familiar statistical ideas such as Bradley-Terry-Luce (BTL) model, latent utility estimation, active learning, experimental design, and uncertainty quantification. We review methods for learning reward functions from pairwise preference data and for optimizing policies through both two-stage RLHF pipelines and emerging one-stage approaches such as direct preference optimization. We further discuss recent extensions including reinforcement learning from AI feedback, inference-time algorithms, and reinforcement learning from verifiable rewards, as well as benchmark datasets, evaluation protocols, and open-source frameworks that support RLHF research. We conclude by highlighting open challenges in RLHF. An accompanying GitHub demo https://github.com/Pangpang-Liu/RLHF_demo illustrates key components of the RLHF pipeline.

Fonte: arXiv stat.ML

NLP/LLMsScore 85

Jump Start or False Start? A Theoretical and Empirical Evaluation of LLM-initialized Bandits

arXiv:2604.02527v1 Announce Type: new Abstract: The recent advancement of Large Language Models (LLMs) offers new opportunities to generate user preference data to warm-start bandits. Recent studies on contextual bandits with LLM initialization (CBLI) have shown that these synthetic priors can significantly lower early regret. However, these findings assume that LLM-generated choices are reasonably aligned with actual user preferences. In this paper, we systematically examine how LLM-generated preferences perform when random and label-flipping noise is injected into the synthetic training data. For aligned domains, we find that warm-starting remains effective up to 30% corruption, loses its advantage around 40%, and degrades performance beyond 50%. When there is systematic misalignment, even without added noise, LLM-generated priors can lead to higher regret than a cold-start bandit. To explain these behaviors, we develop a theoretical analysis that decomposes the effect of random label noise and systematic misalignment on the prior error driving the bandit's regret, and derive a sufficient condition under which LLM-based warm starts are provably better than a cold-start bandit. We validate these results across multiple conjoint datasets and LLMs, showing that estimated alignment reliably tracks when warm-starting improves or degrades recommendation quality.

Fonte: arXiv cs.LG

RLScore 85

Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning

arXiv:2604.02353v1 Announce Type: cross Abstract: We present PRISM (Policy Reuse via Interpretable Strategy Mapping), a framework that grounds reinforcement learning agents' decisions in discrete, causally validated concepts and uses those concepts as a zero-shot transfer interface between agents trained with different algorithms. PRISM clusters each agent's encoder features into $K$ concepts via K-means. Causal intervention establishes that these concepts directly drive - not merely correlate with - agent behavior: overriding concept assignments changes the selected action in 69.4% of interventions ($p = 8.6 \times 10^{-86}$, 2500 interventions). Concept importance and usage frequency are dissociated: the most-used concept (C47, 33.0% frequency) causes only a 9.4% win-rate drop when ablated, while ablating C16 (15.4% frequency) collapses win rate from 100% to 51.8%. Because concepts causally encode strategy, aligning them via optimal bipartite matching transfers strategic knowledge zero-shot. On Go~7$\times$7 with three independently trained agents, concept transfer achieves 69.5%$\pm$3.2% and 76.4%$\pm$3.4% win rate against a standard engine across the two successful transfer pairs (10 seeds), compared to 3.5% for a random agent and 9.2% without alignment. Transfer succeeds when the source policy is strong; geometric alignment quality predicts nothing ($R^2 \approx 0$). The framework is scoped to domains where strategic state is naturally discrete: the identical pipeline on Atari Breakout yields bottleneck policies at random-agent performance, confirming that the Go results reflect a structural property of the domain.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

A Survey on AI for 6G: Challenges and Opportunities

arXiv:2604.02370v1 Announce Type: cross Abstract: As wireless communication evolves, each generation of networks brings new technologies that change how we connect and interact. Artificial Intelligence (AI) is becoming crucial in shaping the future of sixth-generation (6G) networks. By combining AI and Machine Learning (ML), 6G aims to offer high data rates, low latency, and extensive connectivity for applications including smart cities, autonomous systems, holographic telepresence, and the tactile internet. This paper provides a detailed overview of the role of AI in supporting 6G networks. It focuses on key technologies like deep learning, reinforcement learning, federated learning, and explainable AI. It also looks at how AI integrates with essential network functions and discusses challenges related to scalability, security, and energy efficiency, along with new solutions. Additionally, this work highlights perspectives that connect AI-driven analytics to 6G service domains like Ultra-Reliable Low-Latency Communication (URLLC), Enhanced Mobile Broadband (eMBB), Massive Machine-Type Communication (mMTC), and Integrated Sensing and Communication (ISAC). It addresses concerns about standardization, ethics, and sustainability. By summarizing recent research trends and identifying future directions, this survey offers a valuable reference for researchers and practitioners at the intersection of AI and next-generation wireless communication.

Fonte: arXiv cs.AI

RLScore 85

Mitigating Data Scarcity in Spaceflight Applications for Offline Reinforcement Learning Using Physics-Informed Deep Generative Models

arXiv:2604.02438v1 Announce Type: new Abstract: The deployment of reinforcement learning (RL)-based controllers on physical systems is often limited by poor generalization to real-world scenarios, known as the simulation-to-reality (sim-to-real) gap. This gap is particularly challenging in spaceflight, where real-world training data are scarce due to high cost and limited planetary exploration data. Traditional approaches, such as system identification and synthetic data generation, depend on sufficient data and often fail due to modeling assumptions or lack of physics-based constraints. We propose addressing this data scarcity by introducing physics-based learning bias in a generative model. Specifically, we develop the Mutual Information-based Split Variational Autoencoder (MI-VAE), a physics-informed VAE that learns differences between observed system trajectories and those predicted by physics-based models. The latent space of the MI-VAE enables generation of synthetic datasets that respect physical constraints. We evaluate MI-VAE on a planetary lander problem, focusing on limited real-world data and offline RL training. Results show that augmenting datasets with MI-VAE samples significantly improves downstream RL performance, outperforming standard VAEs in statistical fidelity, sample diversity, and policy success rate. This work demonstrates a scalable strategy for enhancing autonomous controller robustness in complex, data-constrained environments.

Fonte: arXiv cs.LG

VisionScore 85

Moondream Segmentation: From Words to Masks

arXiv:2604.02593v1 Announce Type: new Abstract: We present Moondream Segmentation, a referring image segmentation extension of Moondream 3, a vision-language model. Given an image and a referring expression, the model autoregressively decodes a vector path and iteratively refines the rasterized mask into a final detailed mask. We introduce a reinforcement learning stage that resolves ambiguity in the supervised signal by directly optimizing mask quality. Rollouts from this stage produce coarse-to-ground-truth targets for the refiner. To mitigate evaluation noise from polygon annotations, we release RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks. Moondream Segmentation achieves a cIoU of 80.2% on RefCOCO (val) and 62.6% mIoU on LVIS (val).

Fonte: arXiv cs.CV

NLP/LLMsScore 85

LLM Reasoning with Process Rewards for Outcome-Guided Steps

arXiv:2604.02341v1 Announce Type: cross Abstract: Mathematical reasoning in large language models has improved substantially with reinforcement learning using verifiable rewards, where final answers can be checked automatically and converted into reliable training signals. Most such pipelines optimize outcome correctness only, which yields sparse feedback for long, multi-step solutions and offers limited guidance on intermediate reasoning errors. Recent work therefore introduces process reward models (PRMs) to score intermediate steps and provide denser supervision. In practice, PRM scores are often imperfectly aligned with final correctness and can reward locally fluent reasoning that still ends in an incorrect answer. When optimized as absolute rewards, such signals can amplify fluent failure modes and induce reward hacking. We propose PROGRS, a framework that leverages PRMs while keeping outcome correctness dominant. PROGRS treats process rewards as relative preferences within outcome groups rather than absolute targets. We introduce outcome-conditioned centering, which shifts PRM scores of incorrect trajectories to have zero mean within each prompt group. It removes systematic bias while preserving informative rankings. PROGRS combines a frozen quantile-regression PRM with a multi-scale coherence evaluator. We integrate the resulting centered process bonus into Group Relative Policy Optimization (GRPO) without auxiliary objectives or additional trainable components. Across MATH-500, AMC, AIME, MinervaMath, and OlympiadBench, PROGRS consistently improves Pass@1 over outcome-only baselines and achieves stronger performance with fewer rollouts. These results show that outcome-conditioned centering enables safe and effective use of process rewards for mathematical reasoning.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation

arXiv:2604.02355v1 Announce Type: new Abstract: Combining Chain-of-Thought (CoT) with Reinforcement Learning (RL) improves text-to-image (T2I) generation, yet the underlying interaction between CoT's exploration and RL's optimization remains unclear. We present a systematic entropy-based analysis that yields three key insights: (1) CoT expands the generative exploration space, while RL contracts it toward high-reward regions; (2) final reward is strongly negatively correlated with both the mean and variance of image-token entropy, highlighting the need to reduce uncertainty and instability; and (3) the entropy of the textual CoT directly governs downstream image quality, with lower-entropy CoTs leading to better generations. Motivated by these findings, we propose Entropy-Guided Group Relative Policy Optimization (EG-GRPO), a fine-tuning strategy that reallocates optimization budget by uncertainty: low-entropy tokens are excluded from reward-driven updates to preserve stability, while high-entropy tokens receive an entropy bonus that encourages structured exploration without collapse. Experiments on standard T2I benchmarks demonstrate that EG-GRPO achieves state-of-the-art performance.

Fonte: arXiv cs.LG

VisionScore 90

ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

arXiv:2604.02714v1 Announce Type: new Abstract: End-to-end autonomous driving models based on Vision-Language-Action (VLA) architectures have shown promising results by learning driving policies through behavior cloning on expert demonstrations. However, imitation learning inherently limits the model to replicating observed behaviors without exploring diverse driving strategies, leaving it brittle in novel or out-of-distribution scenarios. Reinforcement learning (RL) offers a natural remedy by enabling policy exploration beyond the expert distribution. Yet VLA models, typically trained on offline datasets, lack directly observable state transitions, necessitating a learned world model to anticipate action consequences. In this work, we propose a unified understanding-and-generation framework that leverages world modeling to simultaneously enable meaningful exploration and provide dense supervision. Specifically, we augment trajectory prediction with future RGB and depth image generation as dense world modeling objectives, requiring the model to learn fine-grained visual and geometric representations that substantially enrich the planning backbone. Beyond serving as a supervisory signal, the world model further acts as a source of intrinsic reward for policy exploration: its image prediction uncertainty naturally measures a trajectory's novelty relative to the training distribution, where high uncertainty indicates out-of-distribution scenarios that, if safe, represent valuable learning opportunities. We incorporate this exploration signal into a safety-gated reward and optimize the policy via Group Relative Policy Optimization (GRPO). Experiments on the NAVSIM and nuScenes benchmarks demonstrate the effectiveness of our approach, achieving a state-of-the-art PDMS score of 93.7 and an EPDMS of 88.8 on NAVSIM. The code and demo will be publicly available at https://zihaosheng.github.io/ExploreVLA/.

Fonte: arXiv cs.CV

RLScore 85

Interpretable Deep Reinforcement Learning for Element-level Bridge Life-cycle Optimization

arXiv:2604.02528v1 Announce Type: new Abstract: The new Specifications for the National Bridge Inventory (SNBI), in effect from 2022, emphasize the use of element-level condition states (CS) for risk-based bridge management. Instead of a general component rating, element-level condition data use an array of relative CS quantities (i.e., CS proportions) to represent the condition of a bridge. Although this greatly increases the granularity of bridge condition data, it introduces challenges to set up optimal life-cycle policies due to the expanded state space from one single categorical integer to four-dimensional probability arrays. This study proposes a new interpretable reinforcement learning (RL) approach to seek optimal life-cycle policies based on element-level state representations. Compared to existing RL methods, the proposed algorithm yields life-cycle policies in the form of oblique decision trees with reasonable amounts of nodes and depth, making them directly understandable and auditable by humans and easily implementable into current bridge management systems. To achieve near-optimal policies, the proposed approach introduces three major improvements to existing RL methods: (a) the use of differentiable soft tree models as actor function approximators, (b) a temperature annealing process during training, and (c) regularization paired with pruning rules to limit policy complexity. Collectively, these improvements can yield interpretable life-cycle policies in the form of deterministic oblique decision trees. The benefits and trade-offs from these techniques are demonstrated in both supervised and reinforcement learning settings. The resulting framework is illustrated in a life-cycle optimization problem for steel girder bridges.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

FTimeXer: Frequency-aware Time-series Transformer with Exogenous variables for Robust Carbon Footprint Forecasting

arXiv:2604.02347v1 Announce Type: new Abstract: Accurate and up-to-date forecasting of the power grid's carbon footprint is crucial for effective product carbon footprint (PCF) accounting and informed decarbonization decisions. However, the carbon intensity of the grid exhibits high non-stationarity, and existing methods often struggle to effectively leverage periodic and oscillatory patterns. Furthermore, these methods tend to perform poorly when confronted with irregular exogenous inputs, such as missing data or misalignment. To tackle these challenges, we propose FTimeXer, a frequency-aware time-series Transformer designed with a robust training scheme that accommodates exogenous factors. FTimeXer features an Fast Fourier Transform (FFT)-driven frequency branch combined with gated time-frequency fusion, allowing it to capture multi-scale periodicity effectively. It also employs stochastic exogenous masking in conjunction with consistency regularization, which helps reduce spurious correlations and enhance stability. Experiments conducted on three real-world datasets show consistent improvements over strong baselines. As a result, these enhancements lead to more reliable forecasts of grid carbon factors, which are essential for effective PCF accounting and informed decision-making regarding decarbonization.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models

arXiv:2604.03157v1 Announce Type: new Abstract: The recent advancements in Vision Language Models (VLMs) have demonstrated progress toward true intelligence requiring robust reasoning capabilities. Beyond pattern recognition, linguistic reasoning must integrate with visual comprehension, particularly for Chart Question Answering (CQA) tasks involving complex data visualizations. Current VLMs face significant limitations in CQA, including imprecise numerical extraction, difficulty interpreting implicit visual relationships, and inadequate attention mechanisms for capturing spatial relationships in charts. In this work, we address these challenges by presenting Chart-RL, a novel reinforcement learning framework that enhances VLMs chart understanding through feedback-driven policy optimization of visual perception and logical inference. Our key innovation includes a comprehensive framework integrating Reinforcement Learning (RL) from Policy Optimization techniques along with adaptive reward functions, that demonstrates superior performance compared to baseline foundation models and competitive results against larger state-of-the-art architectures. We also integrated Parameter-Efficient Fine-Tuning through Low-Rank Adaptation (LoRA) in the RL framework that only requires single GPU configurations while preserving performance integrity. We conducted extensive benchmarking across open-source, proprietary, and state-of-the-art closed-source models utilizing the ChartQAPro dataset. The RL fine-tuned Qwen3-VL-4B-Instruct model achieved an answer accuracy of 0.634, surpassing the 0.580 accuracy of the Qwen3-VL-8B-Instruct foundation model despite utilizing half the parameter count, while simultaneously reducing inference latency from 31 seconds to 9 seconds.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

arXiv:2604.02091v1 Announce Type: new Abstract: Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to diverse readers (e.g., GPT-4o), integrates orthogonally with query expansion modules like Query2Doc, and remains robust even when trained with noisy supervisors.

Fonte: arXiv cs.CL

RLScore 85

DeltaMem: Towards Agentic Memory Management via Reinforcement Learning

arXiv:2604.01560v1 Announce Type: new Abstract: Recent advances in persona-centric memory have revealed the powerful capability of multi-agent systems in managing persona memory, especially in conversational scenarios. However, these complex frameworks often suffer from information loss and are fragile across varying scenarios, resulting in suboptimal performance. In this paper, we propose DeltaMem, an agentic memory management system that formulates persona-centric memory management as an end-to-end task within a single-agent setting. To further improve the performance of our agentic memory manager, we draw inspiration from the evolution of human memory and synthesize a user-assistant dialogue dataset along with corresponding operation-level memory updating labels. Building on this, we introduce a novel Memory-based Levenshtein Distance to formalize the memory updating reward, and propose a tailored reinforcement learning framework to further enhance the management capabilities of DeltaMem. Extensive experiments show that both training-free and RL-trained DeltaMem outperform all product-level baselines across diverse long-term memory benchmarks, including LoCoMo, HaluMem, and PersonaMem.

Fonte: arXiv cs.CL

RLScore 85

DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment

arXiv:2604.01787v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed to replace PPO or integrate Supervised Fine-Tuning (SFT) and contrastive learning for direct fine-tuning and value alignment. However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework incorporating data filtering and distributional guidance by calculating the differential distribution reward based on the output distribution of language model and the discrepancy distribution of preference data. A small yet high-quality subset is filtered from the raw data using a differential distribution reward, which is then incorporated into existing alignment methods to guide the model's output distribution. Experimental results demonstrate that the methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.

Fonte: arXiv cs.CL

RLScore 85

Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming

arXiv:2604.01302v1 Announce Type: new Abstract: We study how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory: verification RL warmup raises the starting point, while randomized clipping produces a steeper trend in the observed regime. As scaling single-generation reasoning during RL quickly becomes expensive under full attention, we introduce a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. We train the model end-to-end on this pipeline to match the training objective to the test-time structure. Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds per thread matches the underlying RL model's oracle pass@16 at pass@1 using 7.6 million tokens per problem on average, and surpasses GPT-5-high on 456 hard competitive programming problems from AetherCode.

Fonte: arXiv cs.CL

RLScore 85

Residuals-based Offline Reinforcement Learning

arXiv:2604.01378v1 Announce Type: new Abstract: Offline reinforcement learning (RL) has received increasing attention for learning policies from previously collected data without interaction with the real environment, which is particularly important in high-stakes applications. While a growing body of work has developed offline RL algorithms, these methods often rely on restrictive assumptions about data coverage and suffer from distribution shift. In this paper, we propose a residuals-based offline RL framework for general state and action spaces. Specifically, we define a residuals-based Bellman optimality operator that explicitly incorporates estimation error in learning transition dynamics into policy optimization by leveraging empirical residuals. We show that this Bellman operator is a contraction mapping and identify conditions under which its fixed point is asymptotically optimal and possesses finite-sample guarantees. We further develop a residuals-based offline deep Q-learning (DQN) algorithm. Using a stochastic CartPole environment, we demonstrate the effectiveness of our residuals-based offline DQN algorithm.

Fonte: arXiv cs.LG

RLScore 85

Soft MPCritic: Amortized Model Predictive Value Iteration

arXiv:2604.01477v1 Announce Type: new Abstract: Reinforcement learning (RL) and model predictive control (MPC) offer complementary strengths, yet combining them at scale remains computationally challenging. We propose soft MPCritic, an RL-MPC framework that learns in (soft) value space while using sample-based planning for both online control and value target generation. soft MPCritic instantiates MPC through model predictive path integral control (MPPI) and trains a terminal Q-function with fitted value iteration, aligning the learned value function with the planner and implicitly extending the effective planning horizon. We introduce an amortized warm-start strategy that recycles planned open-loop action sequences from online observations when computing batched MPPI-based value targets. This makes soft MPCritic computationally practical, while preserving solution quality. soft MPCritic plans in a scenario-based fashion with an ensemble of dynamic models trained for next-step prediction accuracy. Together, these ingredients enable soft MPCritic to learn effectively through robust, short-horizon planning on classic and complex control tasks. These results establish soft MPCritic as a practical and scalable blueprint for synthesizing MPC policies in settings where policy extraction and direct, long-horizon planning may fail.

Fonte: arXiv cs.LG

RLScore 85

Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training

arXiv:2604.01597v1 Announce Type: new Abstract: Traditional RL algorithms like Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, operating under the assumption that all generated episodes provide a beneficial optimization signal. However, these episodes frequently contain noisy or unfaithful reasoning, which can degrade model performance and slow down training. In this paper, we propose \textbf{Influence-Guided PPO (I-PPO)}, a novel framework that integrates data attribution into the RL post-training loop. By calculating an influence score for each episode using a gradient-based approximation, I-PPO identifies and eliminates episodes that are anti-aligned with a validation gradient. Our experiments demonstrate that I-PPO consistently outperforms SFT and PPO baselines. We show that our filtering process acts as an intrinsic early stopping mechanism, accelerating training efficiency while effectively reducing unfaithful CoT reasoning.

Fonte: arXiv cs.LG

MultimodalScore 85

Reinforcing Consistency in Video MLLMs with Structured Rewards

arXiv:2604.01460v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer. We study this failure mode through a compositional consistency audit that decomposes a caption into supporting factual and temporal claims, investigating whether a correct high-level prediction is actually backed by valid lower-level evidence. Our top-down audit reveals that even correct root relational claims often lack reliable attribute and existence support. This indicates that standard sentence-level supervision is a weak proxy for faithful video understanding. Furthermore, when turning to reinforcement learning (RL) for better alignment, standard sentence-level rewards often prove too coarse to accurately localize specific grounding failures. To address this, we replace generic sentence-level rewards with a structured reward built from factual and temporal units. Our training objective integrates three complementary components: (1) an instance-aware scene-graph reward for factual objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification. Across temporal, general video understanding, and hallucination-oriented benchmarks, this objective yields consistent gains on open-source backbones. These results suggest that structured reward shaping is a practical route to more faithful video understanding.

Fonte: arXiv cs.CV

RLScore 85

When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals

arXiv:2604.01476v1 Announce Type: new Abstract: Reinforcement learning for LLMs is vulnerable to reward hacking, where models exploit shortcuts to maximize reward without solving the intended task. We systematically study this phenomenon in coding tasks using an environment-manipulation setting, where models can rewrite evaluator code to trivially pass tests without solving the task, as a controlled testbed. Across both studied models, we identify a reproducible three-phase rebound pattern: models first attempt to rewrite the evaluator but fail, as their rewrites embed test cases their own solutions cannot pass. They then temporarily retreat to legitimate solving. When legitimate reward remains scarce, they rebound into successful hacking with qualitatively different strategies. Using representation engineering, we extract concept directions for shortcut, deception, and evaluation awareness from domain-general contrastive pairs and find that the shortcut direction tracks hacking behavior most closely, making it an effective representational proxy for detection. Motivated by this finding, we propose Advantage Modification, which integrates shortcut concept scores into GRPO advantage computation to penalize hacking rollouts before policy updates. Because the penalty is internalized into the training signal rather than applied only at inference time, Advantage Modification provides more robust suppression of hacking compared with generation-time activation steering.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Improving Latent Generalization Using Test-time Compute

arXiv:2604.01430v1 Announce Type: new Abstract: Language Models (LMs) exhibit two distinct mechanisms for knowledge acquisition: in-weights learning (i.e., encoding information within the model weights) and in-context learning (ICL). Although these two modes offer complementary strengths, in-weights learning frequently struggles to facilitate deductive reasoning over the internalized knowledge. We characterize this limitation as a deficit in latent generalization, of which the reversal curse is one example. Conversely, in-context learning demonstrates highly robust latent generalization capabilities. To improve latent generalization from in-weights knowledge, prior approaches rely on train-time data augmentation, yet these techniques are task-specific, scale poorly, and fail to generalize to out-of-distribution knowledge. To overcome these shortcomings, this work studies how models can be taught to use test-time compute, or 'thinking', specifically to improve latent generalization. We use Reinforcement Learning (RL) from correctness feedback to train models to produce long chains-of-thought (CoTs) to improve latent generalization. Our experiments show that this thinking approach not only resolves many instances of latent generalization failures on in-distribution knowledge but also, unlike augmentation baselines, generalizes to new knowledge for which no RL was performed. Nevertheless, on pure reversal tasks, we find that thinking does not unlock direct knowledge inversion, but the generate-and-verify ability of thinking models enables them to get well above chance performance. The brittleness of factual self-verification means thinking models still remain well below the performance of in-context learning for this task. Overall, our results establish test-time thinking as a flexible and promising direction for improving the latent generalization of LMs.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Machine Learning for Network Attacks Classification and Statistical Evaluation of Adversarial Learning Methodologies for Synthetic Data Generation

arXiv:2603.17717v2 Announce Type: replace-cross Abstract: Supervised detection of network attacks has always been a critical part of network intrusion detection systems (NIDS). Nowadays, in a pivotal time for artificial intelligence (AI), with even more sophisticated attacks that utilize advanced techniques, such as generative artificial intelligence (GenAI) and reinforcement learning, it has become a vital component if we wish to protect our personal data, which are scattered across the web. In this paper, we address two tasks, in the first unified multi-modal NIDS dataset, which incorporates flow-level data, packet payload information and temporal contextual features, from the reprocessed CIC-IDS-2017, CIC-IoT-2023, UNSW-NB15 and CIC-DDoS-2019, with the same feature space. In the first task we use machine learning (ML) algorithms, with stratified cross validation, in order to prevent network attacks, with stability and reliability. In the second task we use adversarial learning algorithms to generate synthetic data, compare them with the real ones and evaluate their fidelity, utility and privacy using the SDV framework, f-divergences, distinguishability and non-parametric statistical tests. The findings provide stable ML models for intrusion detection and generative models with high fidelity and utility, by combining the Synthetic Data Vault framework, the TRTS and TSTR tests, with non-parametric statistical tests and f-divergence measures.

Fonte: arXiv stat.ML

RLScore 85

What Capable Agents Must Know: Selection Theorems for Robust Decision-Making under Uncertainty

arXiv:2603.02491v2 Announce Type: replace-cross Abstract: As artificial agents become increasingly capable, what internal structure is *necessary* for an agent to act competently under uncertainty? Classical results show that optimal control can be *implemented* using belief states or world models, but not that such representations are required. We prove quantitative "selection theorems" showing that strong task performance (low *average-case regret*) forces world models, belief-like memory and -- under task mixtures -- persistent variables resembling core primitives associated with emotion, along with informational modularity under block-structured tasks. Our results cover stochastic policies, partial observability, and evaluation under task distributions, without assuming optimality, determinism, or access to an explicit model. Technically, we reduce predictive modeling to binary "betting" decisions and show that regret bounds limit probability mass on suboptimal bets, enforcing the predictive distinctions needed to separate high-margin outcomes. In fully observed settings, this yields approximate recovery of the interventional transition kernel; under partial observability, it implies necessity of predictive state and belief-like memory, addressing an open question in prior world-model recovery work.

Fonte: arXiv stat.ML

RLScore 85

Malliavin Calculus for Counterfactual Gradient Estimation in Adaptive Inverse Reinforcement Learning

arXiv:2604.01345v1 Announce Type: new Abstract: Inverse reinforcement learning (IRL) recovers the loss function of a forward learner from its observed responses adaptive IRL aims to reconstruct the loss function of a forward learner by passively observing its gradients as it performs reinforcement learning (RL). This paper proposes a novel passive Langevin-based algorithm that achieves adaptive IRL. The key difficulty in adaptive IRL is that the required gradients in the passive algorithm are counterfactual, that is, they are conditioned on events of probability zero under the forward learner's trajectory. Therefore, naive Monte Carlo estimators are prohibitively inefficient, and kernel smoothing, though common, suffers from slow convergence. We overcome this by employing Malliavin calculus to efficiently estimate the required counterfactual gradients. We reformulate the counterfactual conditioning as a ratio of unconditioned expectations involving Malliavin quantities, thus recovering standard estimation rates. We derive the necessary Malliavin derivatives and their adjoint Skorohod integral formulations for a general Langevin structure, and provide a concrete algorithmic approach which exploits these for counterfactual gradient estimation.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Thinking While Listening: Fast-Slow Recurrence for Long-Horizon Sequential Modeling

arXiv:2604.01577v1 Announce Type: new Abstract: We extend the recent latent recurrent modeling to sequential input streams. By interleaving fast, recurrent latent updates with self-organizational ability between slow observation updates, our method facilitates the learning of stable internal structures that evolve alongside the input. This mechanism allows the model to maintain coherent and clustered representations over long horizons, improving out-of-distribution generalization in reinforcement learning and algorithmic tasks compared to sequential baselines such as LSTM, state space models, and Transformer variants.

Fonte: arXiv cs.LG

RLScore 85

Physics Informed Reinforcement Learning with Gibbs Priors for Topology Control in Power Grids

arXiv:2604.01830v1 Announce Type: new Abstract: Topology control for power grid operation is a challenging sequential decision making problem because the action space grows combinatorially with the size of the grid and action evaluation through simulation is computationally expensive. We propose a physics-informed Reinforcement Learning framework that combines semi-Markov control with a Gibbs prior, that encodes the system's physics, over the action space. The decision is only taken when the grid enters a hazardous regime, while a graph neural network surrogate predicts the post action overload risk of feasible topology actions. These predictions are used to construct a physics-informed Gibbs prior that both selects a small state-dependent candidate set and reweights policy logits before action selection. In this way, our method reduces exploration difficulty and online simulation cost while preserving the flexibility of a learned policy. We evaluate the approach in three realistic benchmark environments of increasing difficulty. Across all settings, the proposed method achieves a strong balance between control quality and computational efficiency: it matches oracle-level performance while being approximately $6\times$ faster on the first benchmark, reaches $94.6\%$ of oracle reward with roughly $200\times$ lower decision time on the second one, and on the most challenging benchmark improves over a PPO baseline by up to $255\%$ in reward and $284\%$ in survived steps while remaining about $2.5\times$ faster than a strong specialized engineering baseline. These results show that our method provides an effective mechanism for topology control in power grids.

Fonte: arXiv cs.LG

RLScore 85

Pseudo-Quantized Actor-Critic Algorithm for Robustness to Noisy Temporal Difference Error

arXiv:2604.01613v1 Announce Type: new Abstract: In reinforcement learning (RL), temporal difference (TD) errors are widely adopted for optimizing value and policy functions. However, since the TD error is defined by a bootstrap method, its computation tends to be noisy and destabilize learning. Heuristics to improve the accuracy of TD errors, such as target networks and ensemble models, have been introduced so far. While these are essential approaches for the current deep RL algorithms, they cause side effects like increased computational cost and reduced learning efficiency. Therefore, this paper revisits the TD learning algorithm based on control as inference, deriving a novel algorithm capable of robust learning against noisy TD errors. First, the distribution model of optimality, a binary random variable, is represented by a sigmoid function. Alongside forward and reverse Kullback-Leibler divergences, this new model derives a robust learning rule: when the sigmoid function saturates with a large TD error probably due to noise, the gradient vanishes, implicitly excluding it from learning. Furthermore, the two divergences exhibit distinct gradient-vanishing characteristics. Building on these analyses, the optimality is decomposed into multiple levels to achieve pseudo-quantization of TD errors, aiming for further noise reduction. Additionally, a Jensen-Shannon divergence-based approach is approximately derived to inherit the characteristics of both divergences. These benefits are verified through RL benchmarks, demonstrating stable learning even when heuristics are insufficient or rewards contain noise.

Fonte: arXiv cs.LG

RLScore 92

DISCO-TAB: A Hierarchical Reinforcement Learning Framework for Privacy-Preserving Synthesis of Complex Clinical Data

arXiv:2604.01481v1 Announce Type: new Abstract: The development of robust clinical decision support systems is frequently impeded by the scarcity of high-fidelity, privacy-preserving biomedical data. While Generative Large Language Models (LLMs) offer a promising avenue for synthetic data generation, they often struggle to capture the complex, non-linear dependencies and severe class imbalances inherent in Electronic Health Records (EHR), leading to statistically plausible but clinically invalid records. To bridge this gap, we introduce DISCO-TAB (DIScriminator-guided COntrol for TABular synthesis), a novel framework that orchestrates a fine-tuned LLM with a multi-objective discriminator system optimized via Reinforcement Learning. Unlike prior methods relying on scalar feedback, DISCO-TAB evaluates synthesis at four granularities, token, sentence, feature, and row, while integrating Automated Constraint Discovery and Inverse-Frequency Reward Shaping to autonomously preserve latent medical logic and resolve minority-class collapse. We rigorously validate our framework across diverse benchmarks, including high-dimensional, small-sample medical datasets (e.g., Heart Failure, Parkinson's). Our results demonstrate that hierarchical feedback yields state-of-the-art performance, achieving up to 38.2% improvement in downstream clinical classifier utility compared to GAN and Diffusion baselines, while ensuring exceptional statistical fidelity (JSD < 0.01) and robust resistance to membership inference attacks. This work establishes a new standard for generating trustworthy, utility-preserving synthetic tabular data for sensitive healthcare applications.

Fonte: arXiv cs.LG

RLScore 85

Learning with Incomplete Context: Linear Contextual Bandits with Pretrained Imputation

arXiv:2510.09908v3 Announce Type: replace Abstract: The rise of large-scale pretrained models has made it feasible to generate predictive or synthetic features at low cost, raising the question of how to incorporate such surrogate predictions into downstream decision-making. We study this problem in the setting of online linear contextual bandits, where contexts may be complex, nonstationary, and only partially observed. In addition to bandit data, we assume access to an auxiliary dataset containing fully observed contexts--common in practice since such data are collected without adaptive interventions. We propose PULSE-UCB, an algorithm that leverages pretrained models trained on the auxiliary data to impute missing features during online decision-making. We establish regret guarantees that decompose into a standard bandit term plus an additional component reflecting pretrained model quality. In the i.i.d. context case with H\"older-smooth missing features, PULSE-UCB achieves near-optimal performance, supported by matching lower bounds. Our results quantify how uncertainty in predicted contexts affects decision quality and how much historical data is needed to improve downstream learning.

Fonte: arXiv stat.ML

NLP/LLMsScore 85

Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning

arXiv:2604.00344v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown remarkable performance in completing various tasks. However, solving complex problems often requires the coordination of multiple agents, raising a fundamental question: how to effectively select and interconnect these agents. In this paper, we propose \textbf{Agent Q-Mix}, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. Our method learns decentralized communication decisions using QMIX value factorization, where each agent selects from a set of communication actions that jointly induce a round-wise communication graph. At its core, Agent Q-Mix combines a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under a Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy with token cost. Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy compared to existing methods while demonstrating superior token efficiency and robustness against agent failure. Notably, on the challenging Humanity's Last Exam (HLE) using Gemini-3.1-Flash-Lite as a backbone, Agent Q-Mix achieves 20.8\% accuracy, outperforming Microsoft Agent Framework (19.2\%) and LangGraph (19.2\%), followed by AutoGen and Lobster by OpenClaw. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi-agent reasoning.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

RefineRL: Avançando a Programação Competitiva com Aprendizado por Reforço de Auto-Refinamento

arXiv:2604.00790v1 Tipo de Anúncio: novo Resumo: Embora modelos de linguagem grandes (LLMs) tenham demonstrado forte desempenho em tarefas complexas de raciocínio, como programação competitiva (CP), os métodos existentes predominantemente se concentram em configurações de tentativa única, negligenciando sua capacidade de refinamento iterativo. Neste artigo, apresentamos o RefineRL, uma abordagem inovadora projetada para liberar as capacidades de auto-refinamento dos LLMs para a resolução de problemas de CP. O RefineRL introduz duas inovações principais: (1) Skeptical-Agent, um agente de auto-refinamento iterativo equipado com ferramentas de execução local para validar soluções geradas contra casos de teste públicos de problemas de CP. Este agente sempre mantém uma atitude cética em relação às suas próprias saídas e, assim, impõe um rigoroso auto-refinamento mesmo quando a validação sugere correção. (2) Uma solução de aprendizado por reforço (RL) para incentivar os LLMs a se auto-refinarem com apenas dados padrão RLVR (ou seja, problemas emparelhados com suas respostas verificáveis). Experimentos extensivos em Qwen3-4B e Qwen3-4B-2507 demonstram que nosso método produz ganhos substanciais: após nosso treinamento em RL, esses compactos modelos de 4B integrados com o Skeptical-Agent não apenas superam modelos muito maiores de 32B, mas também se aproximam do desempenho de tentativa única de modelos de 235B. Essas descobertas sugerem que o auto-refinamento possui considerável potencial para escalar o raciocínio dos LLMs, com um potencial significativo para avanços futuros.

Fonte: arXiv cs.AI

RLScore 85

All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

arXiv:2604.00479v1 Announce Type: new Abstract: Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite the promise, the underlying mechanisms that drive the effectiveness of RL models as well as their limitations remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models, where the former engages in deeper yet narrow reasoning, while base models, despite less refined along individual path, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks. Project page: https://xytian1008.github.io/MUPO/

Fonte: arXiv cs.CV

RLScore 85

Lipschitz Dueling Bandits over Continuous Action Spaces

arXiv:2604.00523v1 Announce Type: new Abstract: We study for the first time, stochastic dueling bandits over continuous action spaces with Lipschitz structure, where feedback is purely comparative. While dueling bandits and Lipschitz bandits have been studied separately, their combination has remained unexplored. We propose the first algorithm for Lipschitz dueling bandits, using round-based exploration and recursive region elimination guided by an adaptive reference arm. We develop new analytical tools for relative feedback and prove a regret bound of $\tilde O\left(T^{\frac{d_z+1}{d_z+2}}\right)$, where $d_z$ is the zooming dimension of the near-optimal region. Further, our algorithm takes only logarithmic space in terms of the total time horizon, best achievable by any bandit algorithm over a continuous action space.

Fonte: arXiv cs.LG

RLScore 85

Learning to Play Blackjack: A Curriculum Learning Perspective

arXiv:2604.00076v1 Announce Type: new Abstract: Reinforcement Learning (RL) agents often struggle with efficiency and performance in complex environments. We propose a novel framework that uses a Large Language Model (LLM) to dynamically generate a curriculum over available actions, enabling the agent to incorporate each action individually. We apply this framework to the game of Blackjack, where the LLM creates a multi-stage training path that progressively introduces complex actions to a Tabular Q-Learning and a Deep Q-Network (DQN) agent. Our evaluation in a realistic 8-deck simulation over 10 independent runs demonstrates significant performance gains over standard training methods. The curriculum-based approach increases the DQN agent's average win rate from 43.97% to 47.41%, reduces the average bust rate from 32.9% to 28.0%, and accelerates the overall workflow by over 74%, with the agent's full training completing faster than the baseline's evaluation phase alone. These results validate that LLM-guided curricula can build more effective, robust, and efficient RL agents.

Fonte: arXiv cs.LG

MultimodalScore 90

AceTone: Bridging Words and Colors for Conditional Image Grading

arXiv:2604.00530v1 Announce Type: new Abstract: Color affects how we interpret image style and emotion. Previous color grading methods rely on patch-wise recoloring or fixed filter banks, struggling to generalize across creative intents or align with human aesthetic preferences. In this study, we propose AceTone, the first approach that supports multimodal conditioned color grading within a unified framework. AceTone formulates grading as a generative color transformation task, where a model directly produces 3D-LUTs conditioned on text prompts or reference images. We develop a VQ-VAE based tokenizer which compresses a $3\times32^3$ LUT vector to 64 discrete tokens with $\Delta E<2$ fidelity. We further build a large-scale dataset, AceTone-800K, and train a vision-language model to predict LUT tokens, followed by reinforcement learning to align outputs with perceptual fidelity and aesthetics. Experiments show that AceTone achieves state-of-the-art performance on both text-guided and reference-guided grading tasks, improving LPIPS by up to 50% over existing methods. Human evaluations confirm that AceTone's results are visually pleasing and stylistically coherent, demonstrating a new pathway toward language-driven, aesthetic-aligned color grading.

Fonte: arXiv cs.CV

NLP/LLMsScore 85

MSA-Thinker: Discrimination-Calibration Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis

arXiv:2604.00013v1 Announce Type: cross Abstract: Multimodal sentiment analysis aims to understand human emotions by integrating textual, auditory, and visual modalities. Although Multimodal Large Language Models (MLLMs) have achieved state-of-the-art performance via supervised fine-tuning (SFT), their end-to-end "black-box" nature limits interpretability. Existing methods incorporating Chain-of-Thought (CoT) reasoning are hindered by high annotation costs, while Reinforcement Learning (RL) faces challenges such as low exploration efficiency and sparse rewards, particularly on hard samples. To address these issues, we propose a novel training framework that integrates structured Discrimination-Calibration (DC) reasoning with Hint-based Reinforcement Learning. First, we perform cold-start SFT using high-quality CoT data synthesized by a teacher model (Qwen3Omni-30B), which inherently contains the DC structure. This equips the model with a reasoning paradigm that performs macro discrimination followed by fine-grained calibration from the initial stage. Building on this, we propose Hint-GRPO, which leverages the discrimination phase within the DC structure as a verifiable anchor during RL to provide directional hints for hard samples, guiding policy optimization and effectively mitigating the reward sparsity problem. Experiments on the Qwen2.5Omni-7B model demonstrate that our method not only achieves higher accuracy in fine-grained sentiment regression tasks but also generates high-quality structured reasoning chains. Crucially, it exhibits superior generalization capability in cross-domain evaluations. This enhances model interpretability while validating the positive contribution of explicit reasoning steps to model robustness, offering a new paradigm for building trustworthy and efficient sentiment analysis systems.

Fonte: arXiv cs.AI

RLScore 85

Softmax gradient policy for variance minimization and risk-averse multi armed bandits

arXiv:2604.00241v1 Announce Type: new Abstract: Algorithms for the Multi-Armed Bandit (MAB) problem play a central role in sequential decision-making and have been extensively explored both theoretically and numerically. While most classical approaches aim to identify the arm with the highest expected reward, we focus on a risk-aware setting where the goal is to select the arm with the lowest variance, favoring stability over potentially high but uncertain returns. To model the decision process, we consider a softmax parameterization of the policy; we propose a new algorithm to select the minimal variance (or minimal risk) arm and prove its convergence under natural conditions. The algorithm constructs an unbiased estimate of the objective by using two independent draws from the current's arm distribution. We provide numerical experiments that illustrate the practical behavior of these algorithms and offer guidance on implementation choices. The setting also covers general risk-aware problems where there is a trade-off between maximizing the average reward and minimizing its variance.

Fonte: arXiv cs.LG

RLScore 85

GUIDE: Reinforcement Learning for Behavioral Action Support in Type 1 Diabetes

arXiv:2604.00385v1 Announce Type: new Abstract: Type 1 Diabetes (T1D) management requires continuous adjustment of insulin and lifestyle behaviors to maintain blood glucose within a safe target range. Although automated insulin delivery (AID) systems have improved glycemic outcomes, many patients still fail to achieve recommended clinical targets, warranting new approaches to improve glucose control in patients with T1D. While reinforcement learning (RL) has been utilized as a promising approach, current RL-based methods focus primarily on insulin-only treatment and do not provide behavioral recommendations for glucose control. To address this gap, we propose GUIDE, an RL-based decision-support framework designed to complement AID technologies by providing behavioral recommendations to prevent abnormal glucose events. GUIDE generates structured actions defined by intervention type, magnitude, and timing, including bolus insulin administration and carbohydrate intake events. GUIDE integrates a patient-specific glucose level predictor trained on real-world continuous glucose monitoring data and supports both offline and online RL algorithms within a unified environment. We evaluate both off-policy and on-policy methods across 25 individuals with T1D using standardized glycemic metrics. Among the evaluated approaches, the CQL-BC algorithm demonstrates the highest average time-in-range, reaching 85.49% while maintaining low hypoglycemia exposures. Behavioral similarity analysis further indicates that the learned CQL-BC policy preserves key structural characteristics of patient action patterns, achieving a mean cosine similarity of 0.87 $\pm$ 0.09 across subjects. These findings suggest that conservative offline RL with a structured behavioral action space can provide clinically meaningful and behaviorally plausible decision support for personalized diabetes management.

Fonte: arXiv cs.LG

RLScore 85

Gradient-Based Data Valuation Improves Curriculum Learning for Game-Theoretic Motion Planning

arXiv:2604.00388v1 Announce Type: new Abstract: We demonstrate that gradient-based data valuation produces curriculum orderings that significantly outperform metadata-based heuristics for training game-theoretic motion planners. Specifically, we apply TracIn gradient-similarity scoring to GameFormer on the nuPlan benchmark and construct a curriculum that weights training scenarios by their estimated contribution to validation loss reduction. Across three random seeds, the TracIn-weighted curriculum achieves a mean planning ADE of $1.704\pm0.029$\,m, significantly outperforming the metadata-based interaction-difficulty curriculum ($1.822\pm0.014$\,m; paired $t$-test $p=0.021$, Cohen's $d_z=3.88$) while exhibiting lower variance than the uniform baseline ($1.772\pm0.134$\,m). Our analysis reveals that TracIn scores and scenario metadata are nearly orthogonal (Spearman $\rho=-0.014$), indicating that gradient-based valuation captures training dynamics invisible to hand-crafted features. We further show that gradient-based curriculum weighting succeeds where hard data selection fails: TracIn-curated 20\% subsets degrade performance by $2\times$, whereas full-data curriculum weighting with the same scores yields the best results. These findings establish gradient-based data valuation as a practical tool for improving sample efficiency in game-theoretic planning.

Fonte: arXiv cs.LG

RLScore 85

Learning Shared Representations for Multi-Task Linear Bandits

arXiv:2604.00531v1 Announce Type: new Abstract: Multi-task representation learning is an approach that learns shared latent representations across related tasks, facilitating knowledge transfer and improving sample efficiency. This paper introduces a novel approach to multi-task representation learning in linear bandits. We consider a setting with T concurrent linear bandit tasks, each with feature dimension d, that share a common latent representation of dimension r \ll min{d,T}$, capturing their underlying relatedness. We propose a new Optimism in the Face of Uncertainty Linear (OFUL) algorithm that leverages shared low-rank representations to enhance decision-making in a sample-efficient manner. Our algorithm first collects data through an exploration phase, estimates the shared model via spectral initialization, and then conducts OFUL based learning over a newly constructed confidence set. We provide theoretical guarantees for the confidence set and prove that the unknown reward vectors lie within the confidence set with high probability. We derive cumulative regret bounds and show that the proposed approach achieves \tilde{O}(\sqrt{drNT}), a significant improvement over solving the T tasks independently, resulting in a regret of \tilde{O}(dT\sqrt{N}). We performed numerical simulations to validate the performance of our algorithm for different problem sizes.

Fonte: arXiv cs.LG

RLScore 85

Execution-Verified Reinforcement Learning for Optimization Modeling

arXiv:2604.00442v1 Announce Type: new Abstract: Automating optimization modeling with LLMs is a promising path toward scalable decision intelligence, but existing approaches either rely on agentic pipelines built on closed-source LLMs with high inference latency, or fine-tune smaller LLMs using costly process supervision that often overfits to a single solver API. Inspired by reinforcement learning with verifiable rewards, we propose Execution-Verified Optimization Modeling (EVOM), an execution-verified learning framework that treats a mathematical programming solver as a deterministic, interactive verifier. Given a natural-language problem and a target solver, EVOM generates solver-specific code, executes it in a sandboxed harness, and converts execution outcomes into scalar rewards, optimized with GRPO and DAPO in a closed-loop generate-execute-feedback-update process. This outcome-only formulation removes the need for process-level supervision, and enables cross-solver generalization by switching the verification environment rather than reconstructing solver-specific datasets. Experiments on NL4OPT, MAMO, IndustryOR, and OptiBench across Gurobi, OR-Tools, and COPT show that EVOM matches or outperforms process-supervised SFT, supports zero-shot solver transfer, and achieves effective low-cost solver adaptation by continuing training under the target solver backend.

Fonte: arXiv cs.AI

RLScore 85

Hierarchical Apprenticeship Learning from Imperfect Demonstrations with Evolving Rewards

arXiv:2604.00258v1 Announce Type: new Abstract: While apprenticeship learning has shown promise for inducing effective pedagogical policies directly from student interactions in e-learning environments, most existing approaches rely on optimal or near-optimal expert demonstrations under a fixed reward. Real-world student interactions, however, are often inherently imperfect and evolving: students explore, make errors, revise strategies, and refine their goals as understanding develops. In this work, we argue that imperfect student demonstrations are not noise to be discarded, but structured signals-provided their relative quality is ranked. We introduce HALIDE, Hierarchical Apprenticeship Learning from Imperfect Demonstrations with Evolving Rewards, which not only leverages sub-optimal student demonstrations, but ranks them within a hierarchical learning framework. HALIDE models student behavior at multiple levels of abstraction, enabling inference of higher-level intent and strategy from suboptimal actions while explicitly capturing the temporal evolution of student reward functions. By integrating demonstration quality into hierarchical reward inference,HALIDE distinguishes transient errors from suboptimal strategies and meaningful progress toward higher-level learning goals. Our results show that HALIDE more accurately predicts student pedagogical decisions than approaches that rely on optimal trajectories, fixed rewards, or unranked imperfect demonstrations.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor

arXiv:2604.00931v1 Announce Type: new Abstract: Existing methods for AI psychological counselors predominantly rely on supervised fine-tuning using static dialogue datasets. However, this contrasts with human experts, who continuously refine their proficiency through clinical practice and accumulated experience. To bridge this gap, we propose an Experience-Driven Lifelong Learning Agent (\texttt{PsychAgent}) for psychological counseling. First, we establish a Memory-Augmented Planning Engine tailored for longitudinal multi-session interactions, which ensures therapeutic continuity through persistent memory and strategic planning. Second, to support self-evolution, we design a Skill Evolution Engine that extracts new practice-grounded skills from historical counseling trajectories. Finally, we introduce a Reinforced Internalization Engine that integrates the evolved skills into the model via rejection fine-tuning, aiming to improve performance across diverse scenarios. Comparative analysis shows that our approach achieves higher scores than strong general LLMs (e.g., GPT-5.4, Gemini-3) and domain-specific baselines across all reported evaluation dimensions. These results suggest that lifelong learning can improve the consistency and overall quality of multi-session counseling responses.

Fonte: arXiv cs.AI

RLScore 85

Offline Constrained RLHF with Multiple Preference Oracles

arXiv:2604.00200v1 Announce Type: new Abstract: We study offline constrained reinforcement learning from human feedback with multiple preference oracles. Motivated by applications that trade off performance with safety or fairness, we aim to maximize target population utility subject to a minimum protected group welfare constraint. From pairwise comparisons collected under a reference policy, we estimate oracle-specific rewards via maximum likelihood and analyze how statistical uncertainty propagates through the dual program. We cast the constrained objective as a KL-regularized Lagrangian whose primal optimizer is a Gibbs policy, reducing learning to a convex dual problem. We propose a dual-only algorithm that ensures high-probability constraint satisfaction and provide the first finite-sample performance guarantees for offline constrained preference learning. Finally, we extend our theoretical analysis to accommodate multiple constraints and general f-divergence regularization.

Fonte: arXiv cs.LG

RLScore 85

REM-CTX: Automated Peer Review via Reinforcement Learning with Auxiliary Context

arXiv:2604.00248v1 Announce Type: new Abstract: Most automated peer review systems rely on textual manuscript content alone, leaving visual elements such as figures and external scholarly signals underutilized. We introduce REM-CTX, a reinforcement-learning system that incorporates auxiliary context into the review generation process via correspondence-aware reward functions. REM-CTX trains an 8B-parameter language model with Group Relative Policy Optimization (GRPO) and combines a multi-aspect quality reward with two correspondence rewards that explicitly encourage alignment with auxiliary context. Experiments on manuscripts across Computer, Biological, and Physical Sciences show that REM-CTX achieves the highest overall review quality among six baselines, outperforming other systems with substantially larger commercial models, and surpassing the next-best RL baseline across both quality and contextual grounding metrics. Ablation studies confirm that the two correspondence rewards are complementary: each selectively improves its targeted correspondence reward while preserving all quality dimensions, and the full model outperforms all partial variants. Analysis of training dynamics reveals that the criticism aspect is negatively correlated with other metrics during training, suggesting that future studies should group multi-dimension rewards for review generation.

Fonte: arXiv cs.CL

RLScore 75

Evolution Strategies for Deep RL pretraining

arXiv:2604.00066v1 Announce Type: new Abstract: Although Deep Reinforcement Learning has proven highly effective for complex decision-making problems, it demands significant computational resources and careful parameter adjustment in order to develop successful strategies. Evolution strategies offer a more straightforward, derivative-free approach that is less computationally costly and simpler to deploy. However, ES generally do not match the performance levels achieved by DRL, which calls into question their suitability for more demanding scenarios. This study examines the performance of ES and DRL across tasks of varying difficulty, including Flappy Bird, Breakout and Mujoco environments, as well as whether ES could be used for initial training to enhance DRL algorithms. The results indicate that ES do not consistently train faster than DRL. When used as a preliminary training step, they only provide benefits in less complex environments (Flappy Bird) and show minimal or no improvement in training efficiency or stability across different parameter settings when applied to more sophisticated tasks (Breakout and MuJoCo Walker).

Fonte: arXiv cs.LG

RLScore 85

Autonomous Adaptive Solver Selection for Chemistry Integration via Reinforcement Learning

arXiv:2604.00264v1 Announce Type: new Abstract: The computational cost of stiff chemical kinetics remains a dominant bottleneck in reacting-flow simulation, yet hybrid integration strategies are typically driven by hand-tuned heuristics or supervised predictors that make myopic decisions from instantaneous local state. We introduce a constrained reinforcement learning (RL) framework that autonomously selects between an implicit BDF integrator (CVODE) and a quasi-steady-state (QSS) solver during chemistry integration. Solver selection is cast as a Markov decision process. The agent learns trajectory-aware policies that account for how present solver choices influence downstream error accumulation, while minimizing computational cost under a user-prescribed accuracy tolerance enforced through a Lagrangian reward with online multiplier adaptation. Across sampled 0D homogeneous reactor conditions, the RL-adaptive policy achieves a mean speedup of approximately $3\times$, with speedups ranging from $1.11\times$ to $10.58\times$, while maintaining accurate ignition delays and species profiles for a 106-species \textit{n}-dodecane mechanism and adding approximately $1\%$ inference overhead. Without retraining, the 0D-trained policy transfers to 1D counterflow diffusion flames over strain rates $10$--$2000~\mathrm{s}^{-1}$, delivering consistent $\approx 2.2\times$ speedup relative to CVODE while preserving near-reference temperature accuracy and selecting CVODE at only $12$--$15\%$ of space-time points. Overall, the results demonstrate the potential of the proposed reinforcement learning framework to learn problem-specific integration strategies while respecting accuracy constraints, thereby opening a pathway toward adaptive, self-optimizing workflows for multiphysics systems with spatially heterogeneous stiffness.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Asymmetric Actor-Critic for Multi-turn LLM Agents

arXiv:2604.00304v1 Announce Type: new Abstract: Large language models (LLMs) exhibit strong reasoning and conversational abilities, but ensuring reliable behavior in multi-turn interactions remains challenging. In many real-world applications, agents must succeed in one-shot settings where retries are impossible. Existing approaches either rely on reflection or post-hoc evaluation, which require additional attempts, or assume fully trainable models that cannot leverage proprietary LLMs. We propose an asymmetric actor-critic framework for reliable conversational agents. A powerful proprietary LLM acts as the actor, while a smaller open-source critic provides runtime supervision, monitoring the actor's actions and intervening within the same interaction trajectory. Unlike training-based actor-critic methods, our framework supervises a fixed actor operating in open-ended conversational environments. The design leverages a generation-verification asymmetry: while high-quality generation requires large models, effective oversight can often be achieved by smaller ones. We further introduce a data generation pipeline that produces supervision signals for critic fine-tuning without modifying the actor. Experiments on $\tau$-bench and UserBench show that our approach significantly improves reliability and task success over strong single-agent baselines. Moreover, lightweight open-source critics rival or surpass larger proprietary models in the critic role, and critic fine-tuning yields additional gains over several state-of-the-art methods.

Fonte: arXiv cs.CL

NLP/LLMsScore 92

A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation

arXiv:2604.00493v1 Announce Type: new Abstract: Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.

Fonte: arXiv cs.CV

RLScore 85

Enhancing Policy Learning with World-Action Model

arXiv:2603.28955v1 Announce Type: new Abstract: This paper presents the World-Action Model (WAM), an action-regularized world model that jointly reasons over future visual observations and the actions that drive state transitions. Unlike conventional world models trained solely via image prediction, WAM incorporates an inverse dynamics objective into DreamerV2 that predicts actions from latent state transitions, encouraging the learned representations to capture action-relevant structure critical for downstream control. We evaluate WAM on enhancing policy learning across eight manipulation tasks from the CALVIN benchmark. We first pretrain a diffusion policy via behavioral cloning on world model latents, then refine it with model-based PPO inside the frozen world model. Without modifying the policy architecture or training procedure, WAM improves average behavioral cloning success from 59.4% to 71.2% over DreamerV2 and DiWA baselines. After PPO fine-tuning, WAM achieves 92.8% average success versus 79.8% for the baseline, with two tasks reaching 100%, using 8.7x fewer training steps.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Disentangled Graph Prompting for Out-Of-Distribution Detection

arXiv:2603.29644v1 Announce Type: new Abstract: When testing data and training data come from different distributions, deep neural networks (DNNs) will face significant safety risks in practical applications. Therefore, out-of-distribution (OOD) detection techniques, which can identify OOD samples at test time and alert the system, are urgently needed. Existing graph OOD detection methods usually characterize fine-grained in-distribution (ID) patterns from multiple perspectives, and train end-to-end graph neural networks (GNNs) for prediction. However, due to the unavailability of OOD data during training, the absence of explicit supervision signals could lead to sub-optimal performance of end-to-end encoders. To address this issue, we follow the pre-training+prompting paradigm to utilize pre-trained GNN encoders, and propose Disentangled Graph Prompting (DGP), to capture fine-grained ID patterns with the help of ID graph labels. Specifically, we design two prompt generators that respectively generate class-specific and class-agnostic prompt graphs by modifying the edge weights of an input graph. We also design several effective losses to train the prompt generators and prevent trivial solutions. We conduct extensive experiments on ten datasets to demonstrate the superiority of our proposed DGP, which achieves a relative AUC improvement of 3.63% over the best graph OOD detection baseline. Ablation studies and hyper-parameter experiments further show the effectiveness of DGP. Code is available at https://github.com/BUPT-GAMMA/DGP.

Fonte: arXiv cs.LG

RLScore 85

MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters

arXiv:2603.29272v1 Announce Type: new Abstract: We present MaskAdapt, a framework for flexible motion adaptation in physics-based humanoid control. The framework follows a two-stage residual learning paradigm. In the first stage, we train a mask-invariant base policy using stochastic body-part masking and a regularization term that enforces consistent action distributions across masking conditions. This yields a robust motion prior that remains stable under missing observations, anticipating later adaptation in those regions. In the second stage, a residual policy is trained atop the frozen base controller to modify only the targeted body parts while preserving the original behaviors elsewhere. We demonstrate the versatility of this design through two applications: (i) motion composition, where varying masks enable multi-part adaptation within a single sequence, and (ii) text-driven partial goal tracking, where designated body parts follow kinematic targets provided by a pre-trained text-conditioned autoregressive motion generator. Through experiments, MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work.

Fonte: arXiv cs.CV

RLScore 85

Target-Aligned Reinforcement Learning

arXiv:2603.29501v1 Announce Type: new Abstract: Many reinforcement learning algorithms rely on target networks - lagged copies of the online network - to stabilize training. While effective, this mechanism introduces a fundamental stability-recency tradeoff: slower target updates improve stability but reduce the recency of learning signals, hindering convergence speed. We propose Target-Aligned Reinforcement Learning (TARL), a framework that emphasizes transitions for which the target and online network estimates are highly aligned. By focusing updates on well-aligned targets, TARL mitigates the adverse effects of stale target estimates while retaining the stabilizing benefits of target networks. We provide a theoretical analysis demonstrating that target alignment correction accelerates convergence, and empirically demonstrate consistent improvements over standard reinforcement learning algorithms across various benchmark environments.

Fonte: arXiv cs.LG

MLOps/SystemsScore 88

ARCS: Autoregressive Circuit Synthesis with Topology-Aware Graph Attention and Spec Conditioning

arXiv:2603.29068v1 Announce Type: new Abstract: I present ARCS, a system for amortized analog circuit generation that produces complete, SPICE-simulatable designs (topology and component values) in milliseconds rather than the minutes required by search-based methods. A hybrid pipeline combining two learned generators (a graph VAE and a flow-matching model) with SPICE-based ranking achieves 99.9% simulation validity (reward 6.43/8.0) across 32 topologies using only 8 SPICE evaluations, 40x fewer than genetic algorithms. For single-model inference, a topology-aware Graph Transformer with Best-of-3 candidate selection reaches 85% simulation validity in 97ms, over 600x faster than random search. The key technical contribution is Group Relative Policy Optimization (GRPO): I identify a critical failure mode of REINFORCE (cross-topology reward distribution mismatch) and resolve it with per-topology advantage normalization, improving simulation validity by +9.6pp over REINFORCE in only 500 RL steps (10x fewer). Grammar-constrained decoding additionally guarantees 100% structural validity by construction via topology-aware token masking. ARCS does not yet match the per-design quality of search-based optimization (5.48 vs. 7.48 reward), but its >1000x speed advantage enables rapid prototyping, design-space exploration, and warm-starting search methods (recovering 96.6% of GA quality with 49% fewer simulations).

Fonte: arXiv cs.LG

RLScore 85

Realistic Market Impact Modeling for Reinforcement Learning Trading Environments

arXiv:2603.29086v1 Announce Type: new Abstract: Reinforcement learning (RL) has shown promise for trading, yet most open-source backtesting environments assume negligible or fixed transaction costs, causing agents to learn trading behaviors that fail under realistic execution. We introduce three Gymnasium-compatible trading environments -- MACE (Market-Adjusted Cost Execution) stock trading, margin trading, and portfolio optimization -- that integrate nonlinear market impact models grounded in the Almgren-Chriss framework and the empirically validated square-root impact law. Each environment provides pluggable cost models, permanent impact tracking with exponential decay, and comprehensive trade-level logging. We evaluate five DRL algorithms (A2C, PPO, DDPG, SAC, TD3) on the NASDAQ-100, comparing a fixed 10 bps baseline against the AC model with Optuna-tuned hyperparameters. Our results show that (i) the cost model materially changes both absolute performance and the relative ranking of algorithms across all three environments; (ii) the AC model produces dramatically different trading behavior, e.g., daily costs dropping from $200k to $8k with turnover falling from 19% to 1%; (iii) hyperparameter optimization is essential for constraining pathological trading, with costs dropping up to 82%; and (iv) algorithm-cost model interactions are strongly environment-specific, e.g., DDPG's OOS Sharpe jumps from -2.1 to 0.3 under AC in margin trading while SAC's drops from -0.5 to -1.2. We release the full suite as an open-source extension to FinRL-Meta.

Fonte: arXiv cs.LG

RecSysScore 85

MemRerank: Preference Memory for Personalized Product Reranking

arXiv:2603.29247v1 Announce Type: new Abstract: LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based \textbf{1-in-5} selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to \textbf{+10.61} absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

MemFactory: Unified Inference & Training Framework for Agent Memory

arXiv:2603.29493v1 Announce Type: new Abstract: Memory-augmented Large Language Models (LLMs) are essential for developing capable, long-term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged as a highly promising research direction. However, existing implementations remain highly fragmented and task-specific, lacking a unified infrastructure to streamline the integration, training, and evaluation of these complex pipelines. To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory-augmented agents. Inspired by the success of unified fine-tuning frameworks like LLaMA-Factory, MemFactory abstracts the memory lifecycle into atomic, plug-and-play components, enabling researchers to seamlessly construct custom memory agents via a "Lego-like" architecture. Furthermore, the framework natively integrates Group Relative Policy Optimization (GRPO) to fine-tune internal memory management policies driven by multi-dimensional environmental rewards. MemFactory provides out-of-the-box support for recent cutting-edge paradigms, including Memory-R1, RMM, and MemAgent. We empirically validate MemFactory on the open-source MemAgent architecture using its publicly available training and evaluation data. Across both in-domain and out-of-distribution evaluation sets, MemFactory consistently improves performance over the corresponding base models, with relative gains of up to 14.8%. By providing a standardized, extensible, and easy-to-use infrastructure, MemFactory significantly lowers the barrier to entry, paving the way for future innovations in memory-driven AI agents.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

Learning Diagnostic Reasoning for Decision Support in Toxicology

arXiv:2603.29608v1 Announce Type: new Abstract: Acute poly-substance intoxication requires rapid, life-saving decisions under substantial uncertainty, as clinicians must rely on incomplete ingestion details and nonspecific symptoms. Effective diagnostic reasoning in this chaotic environment requires fusing unstructured, non-medical narratives (e.g. paramedic scene descriptions and unreliable patient self-reports or known histories), with structured medical data like vital signs. While Large Language Models (LLMs) show potential for processing such heterogeneous inputs, they struggle in this setting, often underperforming simple baselines that rely solely on patient histories. To address this, we present DeToxR (Decision-support for Toxicology with Reasoning), the first adaptation of Reinforcement Learning (RL) to emergency toxicology. We design a robust data-fusion engine for multi-label prediction across 14 substance classes based on an LLM finetuned with Group Relative Policy Optimization (GRPO). We optimize the model's reasoning directly using a clinical performance reward. By formulating a multi-label agreement metric as the reward signal, the model is explicitly penalized for missing co-ingested substances and hallucinating absent poisons. Our model significantly outperforms its unadapted base LLM counterpart and supervised baselines. Furthermore, in a clinical validation study, the model indicates a clinical advantage by outperforming an expert toxicologist in identifying the correct poisons (Micro-F1: 0.644 vs. 0.473). These results demonstrate the potential of RL-aligned LLMs to synthesize unstructured pre-clinical narratives and structured medical data for decision support in high-stakes environments.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

Calibrated Confidence Expression for Radiology Report Generation

arXiv:2603.29492v1 Announce Type: new Abstract: Safe deployment of Large Vision-Language Models (LVLMs) in radiology report generation requires not only accurate predictions but also clinically interpretable indicators of when outputs should be thoroughly reviewed, enabling selective radiologist verification and reducing the risk of hallucinated findings influencing clinical decisions. One intuitive approach to this is verbalized confidence, where the model explicitly states its certainty. However, current state-of-the-art language models are often overconfident, and research on calibration in multimodal settings such as radiology report generation is limited. To address this gap, we introduce ConRad (Confidence Calibration for Radiology Reports), a reinforcement learning framework for fine-tuning medical LVLMs to produce calibrated verbalized confidence estimates alongside radiology reports. We study two settings: a single report-level confidence score and a sentence-level variant assigning a confidence to each claim. Both are trained using the GRPO algorithm with reward functions based on the logarithmic scoring rule, which incentivizes truthful self-assessment by penalizing miscalibration and guarantees optimal calibration under reward maximization. Experimentally, ConRad substantially improves calibration and outperforms competing methods. In a clinical evaluation we show that ConRad's report level scores are well aligned with clinicians' judgment. By highlighting full reports or low-confidence statements for targeted review, ConRad can support safer clinical integration of AI-assistance for report generation.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning

arXiv:2603.29165v1 Announce Type: new Abstract: Existing vision-and-language navigation (VLN) models primarily reason over past and current visual observations, while largely ignoring the future visual dynamics induced by actions. As a result, they often lack an effective understanding of the causal relationship between actions and how the visual world changes, limiting robust decision-making. Humans, in contrast, can imagine the near future by leveraging action-dynamics causality, which improves both environmental understanding and navigation choices. Inspired by this capability, we propose LatentPilot, a new paradigm that exploits future observations during training as a valuable data source to learn action-conditioned visual dynamics, while requiring no access to future frames at inference. Concretely, we propose a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model to better match the agent's behavior distribution, with an expert takeover triggered when the agent deviates excessively. LatentPilot further learns visual latent tokens without explicit supervision; these latent tokens attend globally in a continuous latent space and are carried across steps, serving as both the current output and the next input, thereby enabling the agent to dream ahead and reason about how actions will affect subsequent observations. Experiments on R2R-CE, RxR-CE, and R2R-PE benchmarks achieve new SOTA results, and real-robot tests across diverse environments demonstrate LatentPilot's superior understanding of environment-action dynamics in scene. Project page:https://abdd.top/latentpilot/

Fonte: arXiv cs.CV

RLScore 85

APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

arXiv:2603.29093v1 Announce Type: new Abstract: LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present \textbf{APEX-EM}, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a \emph{structured experience representation} encoding the full procedural-episodic trace of each execution -- planning steps, artifacts, iteration history with error analysis, and quality scores; (2) a \emph{Plan-Retrieve-Generate-Iterate-Ingest} (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a \emph{dual-outcome Experience Memory} with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal -- enabling cross-domain transfer between tasks sharing no lexical overlap but analogous operational structure. Successful experiences serve as positive in-context examples; failures as negative examples with structured error annotations. We evaluate on BigCodeBench~\cite{zhuo2025bigcodebench}, KGQAGen-10k~\cite{zhang2025kgqagen}, and Humanity's Last Exam~\cite{phan2025hle} using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6\% accuracy versus 41.3\% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9\%). On BigCodeBench, it reaches 83.3\% SR from a 53.9\% baseline (+29.4pp), exceeding MemRL's~\cite{memrl2025} +11.0pp gain under comparable frozen-backbone conditions (noting backbone differences controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0\% from 25.2\% (+22.8pp). Ablations show component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.

Fonte: arXiv cs.CL

RLScore 85

SparseDriveV2: Scoring is All You Need for End-to-End Autonomous Driving

arXiv:2603.29163v1 Announce Type: new Abstract: End-to-end multi-modal planning has been widely adopted to model the uncertainty of driving behavior, typically by scoring candidate trajectories and selecting the optimal one. Existing approaches generally fall into two categories: scoring a large static trajectory vocabulary, or scoring a small set of dynamically generated proposals. While static vocabularies often suffer from coarse discretization of the action space, dynamic proposals provide finer-grained precision and have shown stronger empirical performance on existing benchmarks. However, it remains unclear whether dynamic generation is fundamentally necessary, or whether static vocabularies can already achieve comparable performance when they are sufficiently dense to cover the action space. In this work, we start with a systematic scaling study of Hydra-MDP, a representative scoring-based method, revealing that performance consistently improves as trajectory anchors become denser, without exhibiting saturation before computational constraints are reached. Motivated by this observation, we propose SparseDriveV2 to push the performance boundary of scoring-based planning through two complementary innovations: (1) a scalable vocabulary representation with a factorized structure that decomposes trajectories into geometric paths and velocity profiles, enabling combinatorial coverage of the action space, and (2) a scalable scoring strategy with coarse factorized scoring over paths and velocity profiles followed by fine-grained scoring on a small set of composed trajectories. By combining these two techniques, SparseDriveV2 achieves 92.0 PDMS and 90.1 EPDMS on NAVSIM, with 89.15 Driving Score and 70.00 Success Rate on Bench2Drive with a lightweight ResNet-34 as backbone. Code and model are released at https://github.com/swc-17/SparseDriveV2.

Fonte: arXiv cs.CV

RLScore 85

ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training

arXiv:2603.29871v1 Announce Type: new Abstract: In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This leads to noisy training signals where poor candidates free-ride on the high reward produced by a single strong peer, resulting in suboptimal exploration. To address this, we propose Shapley-Enhanced GRPO (ShapE-GRPO). By leveraging the permutation-invariant nature of set-level utility, we derive a Shapley-enhanced formulation from cooperative game theory to decompose set-level rewards into granular, candidate-specific signals. We show that our formulation preserves the fundamental axioms of the Shapley value while remaining computationally efficient with polynomial-time complexity. Empirically, ShapE-GRPO consistently outperforms standard GRPO across diverse datasets with accelerated convergence during training.

Fonte: arXiv cs.AI

RLScore 85

Optimistic Actor-Critic with Parametric Policies for Linear Markov Decision Processes

arXiv:2603.28595v1 Announce Type: cross Abstract: Although actor-critic methods have been successful in practice, their theoretical analyses have several limitations. Specifically, existing theoretical work either sidesteps the exploration problem by making strong assumptions or analyzes impractical methods with complicated algorithmic modifications. Moreover, the actor-critic methods analyzed for linear MDPs often employ natural policy gradient (NPG) and construct "implicit" policies without explicit parameterization. Such policies are computationally expensive to sample from, making the environment interactions inefficient. To that end, we focus on the finite-horizon linear MDPs and propose an optimistic actor-critic framework that uses parametric log-linear policies. In particular, we introduce a tractable \textit{logit-matching} regression objective for the actor. For the critic, we use approximate Thompson sampling via Langevin Monte Carlo to obtain optimistic value estimates. We prove that the resulting algorithm achieves $\widetilde{\mathcal{O}}(\epsilon^{-4})$ and $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity in the on-policy and off-policy setting, respectively. Our results match prior theoretical works in achieving the state-of-the-art sample complexity, while our algorithm is more aligned with practice.

Fonte: arXiv stat.ML

RLScore 85

Functional Natural Policy Gradients

arXiv:2603.28681v1 Announce Type: new Abstract: We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt N$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may be traded against the other.

Fonte: arXiv stat.ML

RLScore 85

SkyNet: Belief-Aware Planning for Partially-Observable Stochastic Games

arXiv:2603.27751v1 Announce Type: new Abstract: In 2019, Google DeepMind released MuZero, a model-based reinforcement learning method that achieves strong results in perfect-information games by combining learned dynamics models with Monte Carlo Tree Search (MCTS). However, comparatively little work has extended MuZero to partially observable, stochastic, multi-player environments, where agents must act under uncertainty about hidden state. Such settings arise not only in card games but in domains such as autonomous negotiation, financial trading, and multi-agent robotics. In the absence of explicit belief modeling, MuZero's latent encoding has no dedicated mechanism for representing uncertainty over unobserved variables. To address this, we introduce SkyNet (Belief-Aware MuZero), which adds ego-conditioned auxiliary heads for winner prediction and rank estimation to the standard MuZero architecture. These objectives encourage the latent state to retain information predictive of outcomes under partial observability, without requiring explicit belief-state tracking or changes to the search algorithm. We evaluate SkyNet on Skyjo, a partially observable, non-zero-sum, stochastic card game, using a decision-granularity environment, transformer-based encoding, and a curriculum of heuristic opponents with self-play. In 1000-game head-to-head evaluations at matched checkpoints, SkyNet achieves a 75.3% peak win rate against the baseline (+194 Elo, $p < 10^{-50}$). SkyNet also outperforms the baseline against heuristic opponents (0.720 vs.\ 0.466 win rate). Critically, the belief-aware model initially underperforms the baseline but decisively surpasses it once training throughput is sufficient, suggesting that belief-aware auxiliary supervision improves learned representations under partial observability, but only given adequate data flow.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Liquid Networks with Mixture Density Heads for Efficient Imitation Learning

arXiv:2603.27058v1 Announce Type: new Abstract: We compare liquid neural networks with mixture density heads against diffusion policies on Push-T, RoboMimic Can, and PointMaze under a shared-backbone comparison protocol that isolates policy-head effects under matched inputs, training budgets, and evaluation settings. Across tasks, liquid policies use roughly half the parameters (4.3M vs. 8.6M), achieve 2.4x lower offline prediction error, and run 1.8 faster at inference. In sample-efficiency experiments spanning 1% to 46.42% of training data, liquid models remain consistently more robust, with especially large gains in low-data and medium-data regimes. Closed-loop results on Push-T and PointMaze are directionally consistent with offline rankings but noisier, indicating that strong offline density modeling helps deployment while not fully determining closed-loop success. Overall, liquid recurrent multimodal policies provide a compact and practical alternative to iterative denoising for imitation learning.

Fonte: arXiv cs.LG

MultimodalScore 85

Learning to Select Visual In-Context Demonstrations

arXiv:2603.26775v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task's full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets. Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.

Fonte: arXiv cs.LG

RLScore 85

Bitboard version of Tetris AI

arXiv:2603.26765v1 Announce Type: new Abstract: The efficiency of game engines and policy optimization algorithms is crucial for training reinforcement learning (RL) agents in complex sequential decision-making tasks, such as Tetris. Existing Tetris implementations suffer from low simulation speeds, suboptimal state evaluation, and inefficient training paradigms, limiting their utility for large-scale RL research. To address these limitations, this paper proposes a high-performance Tetris AI framework based on bitboard optimization and improved RL algorithms. First, we redesign the Tetris game board and tetrominoes using bitboard representations, leveraging bitwise operations to accelerate core processes (e.g., collision detection, line clearing, and Dellacherie-Thiery Features extraction) and achieve a 53-fold speedup compared to OpenAI Gym-Tetris. Second, we introduce an afterstate-evaluating actor network that simplifies state value estimation by leveraging Tetris afterstate property, outperforming traditional action-value networks with fewer parameters. Third, we propose a buffer-optimized Proximal Policy Optimization (PPO) algorithm that balances sampling and update efficiency, achieving an average score of 3,829 on 10x10 grids within 3 minutes. Additionally, we develop a Python-Java interface compliant with the OpenAI Gym standard, enabling seamless integration with modern RL frameworks. Experimental results demonstrate that our framework enhances Tetris's utility as an RL benchmark by bridging low-level bitboard optimizations with high-level AI strategies, providing a sample-efficient and computationally lightweight solution for scalable sequential decision-making research.

Fonte: arXiv cs.AI

RLScore 85

RG-TTA: Regime-Guided Meta-Control for Test-Time Adaptation in Streaming Time Series

arXiv:2603.27814v1 Announce Type: cross Abstract: Test-time adaptation (TTA) enables neural forecasters to adapt to distribution shifts in streaming time series, but existing methods apply the same adaptation intensity regardless of the nature of the shift. We propose Regime-Guided Test-Time Adaptation (RG-TTA), a meta-controller that continuously modulates adaptation intensity based on distributional similarity to previously-seen regimes. Using an ensemble of Kolmogorov-Smirnov, Wasserstein-1, feature-distance, and variance-ratio metrics, RG-TTA computes a similarity score for each incoming batch and uses it to (i) smoothly scale the learning rate -- more aggressive for novel distributions, conservative for familiar ones -- and (ii) control gradient effort via loss-driven early stopping rather than fixed budgets, allowing the system to allocate exactly the effort each batch requires. As a supplementary mechanism, RG-TTA gates checkpoint reuse from a regime memory, loading stored specialist models only when they demonstrably outperform the current model (loss improvement >= 30%). RG-TTA is model-agnostic and strategy-composable: it wraps any forecaster exposing train/predict/save/load interfaces and enhances any gradient-based TTA method. We demonstrate three compositions -- RG-TTA, RG-EWC, and RG-DynaTTA -- and evaluate 6 update policies (3 baselines + 3 regime-guided variants) across 4 compact architectures (GRU, iTransformer, PatchTST, DLinear), 14 datasets (6 real-world multivariate benchmarks + 8 synthetic regime scenarios), and 4 forecast horizons (96, 192, 336, 720) under a streaming evaluation protocol with 3 random seeds (672 experiments total). Regime-guided policies achieve the lowest MSE in 156 of 224 seed-averaged experiments (69.6%), with RG-EWC winning 30.4% and RG-TTA winning 29.0%. Overall, RG-TTA reduces MSE by 5.7% vs TTA while running 5.5% faster; RG-EWC reduces MSE by 14.1% vs standalone EWC.

Fonte: arXiv stat.ML

NLP/LLMsScore 85

Language-Conditioned World Modeling for Visual Navigation

arXiv:2603.26741v1 Announce Type: new Abstract: We study language-conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human-verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future-state prediction, and action generation through two complementary model families. The first family combines LCVN-WM, a diffusion-based world model, with LCVN-AC, an actor-critic agent trained in the latent space of the world model. The second family, LCVN-Uni, adopts an autoregressive multimodal architecture that predicts both actions and future observations. Experiments show that these families offer different advantages: the former provides more temporally coherent rollouts, whereas the latter generalizes better to unseen environments. Taken together, these observations point to the value of jointly studying language grounding, imagination, and policy learning in a unified task setting, and LCVN provides a concrete basis for further investigation of language-conditioned world models. The code is available at https://github.com/F1y1113/LCVN.

Fonte: arXiv cs.CV

NLP/LLMsScore 85

Semantic Interaction Information mediates compositional generalization in latent space

arXiv:2603.27134v1 Announce Type: new Abstract: Are there still barriers to generalization once all relevant variables are known? We address this question via a framework that casts compositional generalization as a variational inference problem over latent variables with parametric interactions. To explore this, we develop the Cognitive Gridworld, a stationary Partially Observable Markov Decision Process (POMDP) where observations are generated jointly by multiple latent variables, yet feedback is provided for only a single goal variable. This setting allows us to define Semantic Interaction Information (SII): a metric measuring the contribution of latent variable interactions to task performance. Using SII, we analyze Recurrent Neural Networks (RNNs) provided with these interactions, finding that SII explains the accuracy gap between Echo State and Fully Trained networks. Our analysis also uncovers a theoretically predicted failure mode where confidence decouples from accuracy, suggesting that utilizing interactions between relevant variables is a non-trivial capability. We then address a harder regime where the interactions must be learned by an embedding model. Learning how latent variables interact requires accurate inference, yet accurate inference depends on knowing those interactions. The Cognitive Gridworld reveals this circular dependence as a core challenge for continual meta-learning. We approach this dilemma via Representation Classification Chains (RCCs), a JEPA-style architecture that disentangles these processes: variable inference and variable embeddings are learned by separate modules through Reinforcement Learning and self-supervised learning, respectively. Lastly, we demonstrate that RCCs facilitate compositional generalization to novel combinations of relevant variables. Together, these results establish a grounded setting for evaluating goal-directed generalist agents.

Fonte: arXiv cs.LG

RLScore 85

Diagnosing Non-Markovian Observations in Reinforcement Learning via Prediction-Based Violation Scoring

arXiv:2603.27389v1 Announce Type: cross Abstract: Reinforcement learning algorithms assume that observations satisfy the Markov property, yet real-world sensors frequently violate this assumption through correlated noise, latency, or partial observability. Standard performance metrics conflate Markov breakdowns with other sources of suboptimality, leaving practitioners without diagnostic tools for such violations. This paper introduces a prediction-based scoring method that quantifies non-Markovian structure in observation trajectories. A random forest first removes nonlinear Markov-compliant dynamics; ridge regression then tests whether historical observations reduce prediction error on the residuals beyond what the current observation provides. The resulting score is bounded in [0, 1] and requires no causal graph construction. Evaluation spans six environments (CartPole, Pendulum, Acrobot, HalfCheetah, Hopper, Walker2d), three algorithms (PPO, A2C, SAC), controlled AR(1) noise at six intensity levels, and 10 seeds per condition. In post-hoc detection, 7 of 16 environment-algorithm pairs, primarily high-dimensional locomotion tasks, show significant positive monotonicity between noise intensity and the violation score (Spearman rho up to 0.78, confirmed under repeated-measures analysis); under training-time noise, 13 of 16 pairs exhibit statistically significant reward degradation. An inversion phenomenon is documented in low-dimensional environments where the random forest absorbs the noise signal, causing the score to decrease as true violations grow, a failure mode analyzed in detail. A practical utility experiment demonstrates that the proposed score correctly identifies partial observability and guides architecture selection, fully recovering performance lost to non-Markovian observations. Source code to reproduce all results is provided at https://github.com/NAVEENMN/Markovianes.

Fonte: arXiv stat.ML

RLScore 85

Dynamic resource matching in manufacturing using deep reinforcement learning

arXiv:2603.27066v1 Announce Type: new Abstract: Matching plays an important role in the logical allocation of resources across a wide range of industries. The benefits of matching have been increasingly recognized in manufacturing industries. In particular, capacity sharing has received much attention recently. In this paper, we consider the problem of dynamically matching demand-capacity types of manufacturing resources. We formulate the multi-period, many-to-many manufacturing resource-matching problem as a sequential decision process. The formulated manufacturing resource-matching problem involves large state and action spaces, and it is not practical to accurately model the joint distribution of various types of demands. To address the curse of dimensionality and the difficulty of explicitly modeling the transition dynamics, we use a model-free deep reinforcement learning approach to find optimal matching policies. Moreover, to tackle the issue of infeasible actions and slow convergence due to initial biased estimates caused by the maximum operator in Q-learning, we introduce two penalties to the traditional Q-learning algorithm: a domain knowledge-based penalty based on a prior policy and an infeasibility penalty that conforms to the demand-supply constraints. We establish theoretical results on the convergence of our domain knowledge-informed Q-learning providing performance guarantee for small-size problems. For large-size problems, we further inject our modified approach into the deep deterministic policy gradient (DDPG) algorithm, which we refer to as domain knowledge-informed DDPG (DKDDPG). In our computational study, including small- and large-scale experiments, DKDDPG consistently outperformed traditional DDPG and other RL algorithms, yielding higher rewards and demonstrating greater efficiency in time and episodes.

Fonte: arXiv cs.LG

RLScore 85

PiCSRL: Physics-Informed Contextual Spectral Reinforcement Learning

arXiv:2603.26816v1 Announce Type: new Abstract: High-dimensional low-sample-size (HDLSS) datasets constrain reliable environmental model development, where labeled data remain sparse. Reinforcement learning (RL)-based adaptive sensing methods can learn optimal sampling policies, yet their application is severely limited in HDLSS contexts. In this work, we present PiCSRL (Physics-Informed Contextual Spectral Reinforcement Learning), where embeddings are designed using domain knowledge and parsed directly into the RL state representation for improved adaptive sensing. We developed an uncertainty-aware belief model that encodes physics-informed features to improve prediction. As a representative example, we evaluated our approach for cyanobacterial gene concentration adaptive sampling task using NASA PACE hyperspectral imagery over Lake Erie. PiCSRL achieves optimal station selection (RMSE = 0.153, 98.4% bloom detection rate, outperforming random (0.296) and UCB (0.178) RMSE baselines, respectively. Our ablation experiments demonstrate that physics-informed features improve test generalization (0.52 R^2, +0.11 over raw bands) in semi-supervised learning. In addition, our scalability test shows that PiCSRL scales effectively to large networks (50 stations, >2M combinations) with significant improvements over baselines (p = 0.002). We posit PiCSRL as a sample-efficient adaptive sensing method across Earth observation domains for improved observation-to-target mapping.

Fonte: arXiv cs.LG

RLScore 85

Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching

arXiv:2603.27044v1 Announce Type: new Abstract: Deep Reinforcement Learning (DRL) is widely recognized as sample-inefficient, a limitation attributable in part to the high dimensionality and substantial functional redundancy inherent to the policy parameter space. A recent framework, which we refer to as Action-based Policy Compression (APC), mitigates this issue by compressing the parameter space $\Theta$ into a low-dimensional latent manifold $\mathcal Z$ using a learned generative mapping $g:\mathcal Z \to \Theta$. However, its performance is severely constrained by relying on immediate action-matching as a reconstruction loss, a myopic proxy for behavioral similarity that suffers from compounding errors across sequential decisions. To overcome this bottleneck, we introduce Occupancy-based Policy Compression (OPC), which enhances APC by shifting behavior representation from immediate action-matching to long-horizon state-space coverage. Specifically, we propose two principal improvements: (1) we curate the dataset generation with an information-theoretic uniqueness metric that delivers a diverse population of policies; and (2) we propose a fully differentiable compression objective that directly minimizes the divergence between the true and reconstructed mixture occupancy distributions. These modifications force the generative model to organize the latent space around true functional similarity, promoting a latent representation that generalizes over a broad spectrum of behaviors while retaining most of the original parameter space's expressivity. Finally, we empirically validate the advantages of our contributions across multiple continuous control benchmarks.

Fonte: arXiv cs.LG

RLScore 85

Reward Hacking as Equilibrium under Finite Evaluation

arXiv:2603.28063v1 Announce Type: new Abstract: We prove that under five minimal axioms -- multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction -- any optimized AI agent will systematically under-invest effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architecture employed. Our framework instantiates the multi-task principal-agent model of Holmstrom and Milgrom (1991) in the AI alignment setting, but exploits a structural feature unique to AI systems -- the known, differentiable architecture of reward models -- to derive a computable distortion index that predicts both the direction and severity of hacking on each quality dimension prior to deployment. We further prove that the transition from closed reasoning to agentic systems causes evaluation coverage to decline toward zero as tool count grows -- because quality dimensions expand combinatorially while evaluation costs grow at most linearly per tool -- so that hacking severity increases structurally and without bound. Our results unify the explanation of sycophancy, length gaming, and specification gaming under a single theoretical structure and yield an actionable vulnerability assessment procedure. We further conjecture -- with partial formal analysis -- the existence of a capability threshold beyond which agents transition from gaming within the evaluation system (Goodhart regime) to actively degrading the evaluation system itself (Campbell regime), providing the first economic formalization of Bostrom's (2014) "treacherous turn."

Fonte: arXiv cs.AI

RLScore 85

Online Statistical Inference of Constant Sample-averaged Q-Learning

arXiv:2603.26982v1 Announce Type: new Abstract: Reinforcement learning algorithms have been widely used for decision-making tasks in various domains. However, the performance of these algorithms can be impacted by high variance and instability, particularly in environments with noise or sparse rewards. In this paper, we propose a framework to perform statistical online inference for a sample-averaged Q-learning approach. We adapt the functional central limit theorem (FCLT) for the modified algorithm under some general conditions and then construct confidence intervals for the Q-values via random scaling. We conduct experiments to perform inference on both the modified approach and its traditional counterpart, Q-learning using random scaling and report their coverage rates and confidence interval widths on two problems: a grid world problem as a simple toy example and a dynamic resource-matching problem as a real-world example for comparison between the two solution approaches.

Fonte: arXiv stat.ML

NLP/LLMsScore 96

Hybrid Deep Learning with Temporal Data Augmentation for Accurate Remaining Useful Life Prediction of Lithium-Ion Batteries

arXiv:2603.27186v1 Announce Type: new Abstract: Accurate prediction of lithium-ion battery remaining useful life (RUL) is essential for reliable health monitoring and data-driven analysis of battery degradation. However, the robustness and generalization capabilities of existing RUL prediction models are significantly challenged by complex operating conditions and limited data availability. To address these limitations, this study proposes a hybrid deep learning model, CDFormer, which integrates convolutional neural networks, deep residual shrinkage networks, and Transformer encoders extract multiscale temporal features from battery measurement signals, including voltage, current, and capacity. This architecture enables the joint modeling of local and global degradation dynamics, effectively improving the accuracy of RUL prediction.To enhance predictive reliability, a composite temporal data augmentation strategy is proposed, incorporating Gaussian noise, time warping, and time resampling, explicitly accounting for measurement noise and variability. CDFormer is evaluated on two real-world datasets, with experimental results demonstrating its consistent superiority over conventional recurrent neural network-based and Transformer-based baselines across key metrics. By improving the reliability and predictive performance of RUL prediction from measurement data, CDFormer provides accurate and reliable forecasts, supporting effective battery health monitoring and data-driven maintenance strategies.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

KAT-Coder-V2 Technical Report

arXiv:2603.27703v1 Announce Type: new Abstract: We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou. KAT-Coder-V2 adopts a "Specialize-then-Unify" paradigm that decomposes agentic coding into five expert domains - SWE, WebCoding, Terminal, WebSearch, and General - each undergoing independent supervised fine-tuning and reinforcement learning, before being consolidated into a single model via on-policy distillation. We develop KwaiEnv, a modular infrastructure sustaining tens of thousands of concurrent sandbox instances, and scale RL training along task complexity, intent alignment, and scaffold generalization. We further propose MCLA for stabilizing MoE RL training and Tree Training for eliminating redundant computation over tree-structured trajectories with up to 6.2x speedup. KAT-Coder-V2 achieves 79.6% on SWE-bench Verified (vs. Claude Opus 4.6 at 80.8%), 88.7 on PinchBench (surpassing GLM-5 and MiniMax M2.7), ranks first across all three frontend aesthetics scenarios, and maintains strong generalist scores on Terminal-Bench Hard (46.8) and tau^2-Bench (93.9). Our model is publicly available at https://streamlake.com/product/kat-coder.

Fonte: arXiv cs.CL

RLScore 85

A Perturbation Approach to Unconstrained Linear Bandits

arXiv:2603.28201v1 Announce Type: cross Abstract: We revisit the standard perturbation-based approach of Abernethy et al. (2008) in the context of unconstrained Bandit Linear Optimization (uBLO). We show the surprising result that in the unconstrained setting, this approach effectively reduces Bandit Linear Optimization (BLO) to a standard Online Linear Optimization (OLO) problem. Our framework improves on prior work in several ways. First, we derive expected-regret guarantees when our perturbation scheme is combined with comparator-adaptive OLO algorithms, leading to new insights about the impact of different adversarial models on the resulting comparator-adaptive rates. We also extend our analysis to dynamic regret, obtaining the optimal $\sqrt{P_T}$ path-length dependencies without prior knowledge of $P_T$. We then develop the first high-probability guarantees for both static and dynamic regret in uBLO. Finally, we discuss lower bounds on the static regret, and prove the folklore $\Omega(\sqrt{dT})$ rate for adversarial linear bandits on the unit Euclidean ball, which is of independent interest.

Fonte: arXiv stat.ML

RLScore 85

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

arXiv:2603.27977v1 Announce Type: new Abstract: Reinforcement learning has become central to improving large reasoning models, but its success still relies heavily on verifiable rewards or labeled supervision. This limits its applicability to open ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimization towards final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning) and extend traditional RLVR to open ended settings. We introduce structure aware reinforcement learning (SARL), a label free framework that constructs a per response Reasoning Map from intermediate thinking steps and rewards its small world topology, inspired by complex networks and the functional organization of the human brain. SARL encourages reasoning trajectories that are both locally coherent and globally efficient, shifting supervision from destination to path. Our experiments on Qwen3-4B show SARL surpasses ground truth based RL and prior label free RL baselines, achieving the best average gain of 9.1% under PPO and 11.6% under GRPO on math tasks and 34.6% under PPO and 30.4% under GRPO on open ended tasks. Beyond good performance, SARL also exhibits lower KL divergence, higher policy entropy, indicating a more stable and exploratory training and generalized reasoning ability.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

World Reasoning Arena

arXiv:2603.25887v1 Announce Type: new Abstract: World models (WMs) are intended to serve as internal simulators of the real world that enable agents to understand, anticipate, and act upon complex environments. Existing WM benchmarks remain narrowly focused on next-state prediction and visual fidelity, overlooking the richer simulation capabilities required for intelligent behavior. To address this gap, we introduce WR-Arena, a comprehensive benchmark for evaluating WMs along three fundamental dimensions of next world simulation: (i) Action Simulation Fidelity, the ability to interpret and follow semantically meaningful, multi-step instructions and generate diverse counterfactual rollouts; (ii) Long-horizon Forecast, the ability to sustain accurate, coherent, and physically plausible simulations across extended interactions; and (iii) Simulative Reasoning and Planning, the ability to support goal-directed reasoning by simulating, comparing, and selecting among alternative futures in both structured and open-ended environments. We build a task taxonomy and curate diverse datasets designed to probe these capabilities, moving beyond single-turn and perceptual evaluations. Through extensive experiments with state-of-the-art WMs, our results expose a substantial gap between current models and human-level hypothetical reasoning, and establish WR-Arena as both a diagnostic tool and a guideline for advancing next-generation world models capable of robust understanding, forecasting, and purposeful action. The code is available at https://github.com/MBZUAI-IFM/WR-Arena.

Fonte: arXiv cs.CV

RLScore 85

Dynamic Tokenization via Reinforcement Patching: End-to-end Training and Zero-shot Transfer

arXiv:2603.26097v1 Announce Type: new Abstract: Efficiently aggregating spatial or temporal horizons to acquire compact representations has become a unifying principle in modern deep learning models, yet learning data-adaptive representations for long-horizon sequence data, especially continuous sequences like time series, remains an open challenge. While fixed-size patching has improved scalability and performance, discovering variable-sized, data-driven patches end-to-end often forces models to rely on soft discretization, specific backbones, or heuristic rules. In this work, we propose Reinforcement Patching (ReinPatch), the first framework to jointly optimize a sequence patching policy and its downstream sequence backbone model using reinforcement learning. By formulating patch boundary placement as a discrete decision process optimized via Group Relative Policy Gradient (GRPG), ReinPatch bypasses the need for continuous relaxations and performs dynamic patching policy optimization in a natural manner. Moreover, our method allows strict enforcement of a desired compression rate, freeing the downstream backbone to scale efficiently, and naturally supports multi-level hierarchical modeling. We evaluate ReinPatch on time-series forecasting datasets, where it demonstrates compelling performance compared to state-of-the-art data-driven patching strategies. Furthermore, our detached design allows the patching module to be extracted as a standalone foundation patcher, providing the community with visual and empirical insights into the segmentation behaviors preferred by a purely performance-driven neural patching strategy.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

AutoB2G: A Large Language Model-Driven Agentic Framework For Automated Building-Grid Co-Simulation

arXiv:2603.26005v1 Announce Type: new Abstract: The growing availability of building operational data motivates the use of reinforcement learning (RL), which can learn control policies directly from data and cope with the complexity and uncertainty of large-scale building clusters. However, most existing simulation environments prioritize building-side performance metrics and lack systematic evaluation of grid-level impacts, while their experimental workflows still rely heavily on manual configuration and substantial programming expertise. Therefore, this paper proposes AutoB2G, an automated building-grid co-simulation framework that completes the entire simulation workflow solely based on natural-language task descriptions. The framework extends CityLearn V2 to support Building-to-Grid (B2G) interaction and adopts the large language model (LLM)-based SOCIA (Simulation Orchestration for Computational Intelligence with Agents) framework to automatically generate, execute, and iteratively refine the simulator. As LLMs lack prior knowledge of the implementation context of simulation functions, a codebase covering simulation configurations and functional modules is constructed and organized as a directed acyclic graph (DAG) to explicitly represent module dependencies and execution order, guiding the LLM to retrieve a complete executable path. Experimental results demonstrate that AutoB2G can effectively enable automated simulator implementations, coordinating B2G interactions to improve grid-side performance metrics.

Fonte: arXiv cs.AI

RLScore 85

Topology-Aware Graph Reinforcement Learning for Energy Storage Systems Optimal Dispatch in Distribution Networks

arXiv:2603.26264v1 Announce Type: new Abstract: Optimal dispatch of energy storage systems (ESSs) in distribution networks involves jointly improving operating economy and voltage security under time-varying conditions and possible topology changes. To support fast online decision making, we develop a topology-aware Reinforcement Learning architecture based on Twin Delayed Deep Deterministic Policy Gradient (TD3), which integrates graph neural networks (GNNs) as graph feature encoders for ESS dispatch. We conduct a systematic investigation of three GNN variants: graph convolutional networks (GCNs), topology adaptive graph convolutional networks (TAGConv), and graph attention networks (GATs) on the 34-bus and 69-bus systems, and evaluate robustness under multiple topology reconfiguration cases as well as cross-system transfer between networks with different system sizes. Results show that GNN-based controllers consistently reduce the number and magnitude of voltage violations, with clearer benefits on the 69-bus system and under reconfiguration; on the 69-bus system, TD3-GCN and TD3-TAGConv also achieve lower saved cost relative to the NLP benchmark than the NN baseline. We also highlight that transfer gains are case-dependent, and zero-shot transfer between fundamentally different systems results in notable performance degradation and increased voltage magnitude violations. This work is available at: https://github.com/ShuyiGao/GNNs_RL_ESSs and https://github.com/distributionnetworksTUDelft/GNNs_RL_ESSs.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Policy-Guided World Model Planning for Language-Conditioned Visual Navigation

arXiv:2603.25981v1 Announce Type: cross Abstract: Navigating to a visually specified goal given natural language instructions remains a fundamental challenge in embodied AI. Existing approaches either rely on reactive policies that struggle with long-horizon planning, or employ world models that suffer from poor action initialization in high-dimensional spaces. We present PiJEPA, a two-stage framework that combines the strengths of learned navigation policies with latent world model planning for instruction-conditioned visual navigation. In the first stage, we finetune an Octo-based generalist policy, augmented with a frozen pretrained vision encoder (DINOv2 or V-JEPA-2), on the CAST navigation dataset to produce an informed action distribution conditioned on the current observation and language instruction. In the second stage, we use this policy-derived distribution to warm-start Model Predictive Path Integral (MPPI) planning over a separately trained JEPA world model, which predicts future latent states in the embedding space of the same frozen encoder. By initializing the MPPI sampling distribution from the policy prior rather than from an uninformed Gaussian, our planner converges faster to high-quality action sequences that reach the goal. We systematically study the effect of the vision encoder backbone, comparing DINOv2 and V-JEPA-2, across both the policy and world model components. Experiments on real-world navigation tasks demonstrate that PiJEPA significantly outperforms both standalone policy execution and uninformed world model planning, achieving improved goal-reaching accuracy and instruction-following fidelity.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Knowledge Distillation for Efficient Transformer-Based Reinforcement Learning in Hardware-Constrained Energy Management Systems

arXiv:2603.26249v1 Announce Type: new Abstract: Transformer-based reinforcement learning has emerged as a strong candidate for sequential control in residential energy management. In particular, the Decision Transformer can learn effective battery dispatch policies from historical data, thereby increasing photovoltaic self-consumption and reducing electricity costs. However, transformer models are typically too computationally demanding for deployment on resource-constrained residential controllers, where memory and latency constraints are critical. This paper investigates knowledge distillation to transfer the decision-making behaviour of high-capacity Decision Transformer policies to compact models that are more suitable for embedded deployment. Using the Ausgrid dataset, we train teacher models in an offline sequence-based Decision Transformer framework on heterogeneous multi-building data. We then distil smaller student models by matching the teachers' actions, thereby preserving control quality while reducing model size. Across a broad set of teacher-student configurations, distillation largely preserves control performance and even yields small improvements of up to 1%, while reducing the parameter count by up to 96%, the inference memory by up to 90%, and the inference time by up to 63%. Beyond these compression effects, comparable cost improvements are also observed when distilling into a student model of identical architectural capacity. Overall, our results show that knowledge distillation makes Decision Transformer control more applicable for residential energy management on resource-limited hardware.

Fonte: arXiv cs.LG

RLScore 85

Parameter-Free Dynamic Regret for Unconstrained Linear Bandits

arXiv:2603.25916v1 Announce Type: new Abstract: We study dynamic regret minimization in unconstrained adversarial linear bandit problems. In this setting, a learner must minimize the cumulative loss relative to an arbitrary sequence of comparators $\boldsymbol{u}_1,\ldots,\boldsymbol{u}_T$ in $\mathbb{R}^d$, but receives only point-evaluation feedback on each round. We provide a simple approach to combining the guarantees of several bandit algorithms, allowing us to optimally adapt to the number of switches $S_T = \sum_t\mathbb{I}\{\boldsymbol{u}_t \neq \boldsymbol{u}_{t-1}\}$ of an arbitrary comparator sequence. In particular, we provide the first algorithm for linear bandits achieving the optimal regret guarantee of order $\mathcal{O}\big(\sqrt{d(1+S_T) T}\big)$ up to poly-logarithmic terms without prior knowledge of $S_T$, thus resolving a long-standing open problem.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Designing Fatigue-Aware VR Interfaces via Biomechanical Models

arXiv:2603.26031v1 Announce Type: cross Abstract: Prolonged mid-air interaction in virtual reality (VR) causes arm fatigue and discomfort, negatively affecting user experience. Incorporating ergonomic considerations into VR user interface (UI) design typically requires extensive human-in-the-loop evaluation. Although biomechanical models have been used to simulate human behavior in HCI tasks, their application as surrogate users for ergonomic VR UI design remains underexplored. We propose a hierarchical reinforcement learning framework that leverages biomechanical user models to evaluate and optimize VR interfaces for mid-air interaction. A motion agent is trained to perform button-press tasks in VR under sequential conditions, using realistic movement strategies and estimating muscle-level effort via a validated three-compartment control with recovery (3CC-r) fatigue model. The simulated fatigue output serves as feedback for a UI agent that optimizes UI element layout via reinforcement learning (RL) to minimize fatigue. We compare the RL-optimized layout against a manually-designed centered baseline and a Bayesian optimized baseline. Results show that fatigue trends from the biomechanical model align with human user data. Moreover, the RL-optimized layout using simulated fatigue feedback produced significantly lower perceived fatigue in a follow-up human study. We further demonstrate the framework's extensibility via a simulated case study on longer sequential tasks with non-uniform interaction frequencies. To our knowledge, this is the first work using simulated biomechanical muscle fatigue as a direct optimization signal for VR UI layout design. Our findings highlight the potential of biomechanical user models as effective surrogate tools for ergonomic VR interface design, enabling efficient early-stage iteration with less reliance on extensive human participation.

Fonte: arXiv cs.AI

RLScore 85

Stabilizing Rubric Integration Training via Decoupled Advantage Normalization

arXiv:2603.26535v1 Announce Type: new Abstract: We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward designs. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality, and gradually lose the advantage signal as groups become uniformly correct. Process reward models (PRM) offer richer supervision, but directly using PRM scores causes reward hacking, where models exploit verbosity to inflate scores while accuracy collapses. PAPO resolves both by composing the advantage from an outcome component Aout, derived from ORM and normalized over all responses, and a process component Aproc, derived from a rubric-based PRM and normalized exclusively among correct responses. This decoupled design ensures that Aout anchors training on correctness while Aproc differentiates reasoning quality without distorting the outcome signal. Experiments across multiple model scales and six benchmarks demonstrate that PAPO consistently outperforms ORM, reaching 51.3% vs.\ 46.3% on OlympiadBench while continuing to improve as ORM plateaus and declines.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Reinforcing Structured Chain-of-Thought for Video Understanding

arXiv:2603.25942v1 Announce Type: cross Abstract: Multi-modal Large Language Models (MLLMs) show promise in video understanding. However, their reasoning often suffers from thinking drift and weak temporal comprehension, even when enhanced by Reinforcement Learning (RL) techniques like Group Relative Policy Optimization (GRPO). Moreover, existing RL methods usually depend on Supervised Fine-Tuning (SFT), which requires costly Chain-of-Thought (CoT) annotation and multi-stage training, and enforces fixed reasoning paths, limiting MLLMs' ability to generalize and potentially inducing bias. To overcome these limitations, we introduce Summary-Driven Reinforcement Learning (SDRL), a novel single-stage RL framework that obviates the need for SFT by utilizing a Structured CoT format: Summarize -> Think -> Answer. SDRL introduces two self-supervised mechanisms integrated into the GRPO objective: 1) Consistency of Vision Knowledge (CVK) enforces factual grounding by reducing KL divergence among generated summaries; and 2) Dynamic Variety of Reasoning (DVR) promotes exploration by dynamically modulating thinking diversity based on group accuracy. This novel integration effectively balances alignment and exploration, supervising both the final answer and the reasoning process. Our method achieves state-of-the-art performance on seven public VideoQA datasets.

Fonte: arXiv cs.AI

RLScore 85

Empowering Epidemic Response: The Role of Reinforcement Learning in Infectious Disease Control

arXiv:2603.25771v1 Announce Type: cross Abstract: Reinforcement learning (RL), owing to its adaptability to various dynamic systems in many real-world scenarios and the capability of maximizing long-term outcomes under different constraints, has been used in infectious disease control to optimize the intervention strategies for controlling infectious disease spread and responding to outbreaks in recent years. The potential of RL for assisting public health sectors in preventing and controlling infectious diseases is gradually emerging and being explored by rapidly increasing publications relevant to COVID-19 and other infectious diseases. However, few surveys exclusively discuss this topic, that is, the development and application of RL approaches for optimizing strategies of non-pharmaceutical and pharmaceutical interventions of public health. Therefore, this paper aims to provide a concise review and discussion of the latest literature on how RL approaches have been used to assist in controlling the spread and outbreaks of infectious diseases, covering several critical topics addressing public health demands: resource allocation, balancing between lives and livelihoods, mixed policy of multiple interventions, and inter-regional coordinated control. Finally, we conclude the paper with a discussion of several potential directions for future research.

Fonte: arXiv cs.AI

RLScore 85

Adversarial Bandit Optimization with Globally Bounded Perturbations to Linear Losses

arXiv:2603.26066v1 Announce Type: new Abstract: We study a class of adversarial bandit optimization problems in which the loss functions may be non-convex and non-smooth. In each round, the learner observes a loss that consists of an underlying linear component together with an additional perturbation applied after the learner selects an action. The perturbations are measured relative to the linear losses and are constrained by a global budget that bounds their cumulative magnitude over time. Under this model, we establish both expected and high-probability regret guarantees. As a special case of our analysis, we recover an improved high-probability regret bound for classical bandit linear optimization, which corresponds to the setting without perturbations. We further complement our upper bounds by proving a lower bound on the expected regret.

Fonte: arXiv cs.LG

RLScore 85

Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control

arXiv:2603.25968v1 Announce Type: new Abstract: Recent advancements in computer vision have accelerated the development of autonomous driving. Despite these advancements, training machines to drive in a way that aligns with human expectations remains a significant challenge. Human factors are still essential, as humans possess a sophisticated cognitive system capable of rapidly interpreting scene information and making accurate decisions. Aligning machine with human intent has been explored with Reinforcement Learning with Human Feedback (RLHF). Conventional RLHF methods rely on collecting human preference data by manually ranking generated outputs, which is time-consuming and indirect. In this work, we propose an electroencephalography (EEG)-guided decision-making framework to incorporate human cognitive insights without behaviour response interruption into reinforcement learning (RL) for autonomous driving. We collected EEG signals from 20 participants in a realistic driving simulator and analyzed event-related potentials (ERP) in response to sudden environmental changes. Our proposed framework employs a neural network to predict the strength of ERP based on the cognitive information from visual scene information. Moreover, we explore the integration of such cognitive information into the reward signal of the RL algorithm. Experimental results show that our framework can improve the collision avoidance ability of the RL algorithm, highlighting the potential of neuro-cognitive feedback in enhancing autonomous driving systems. Our project page is: https://alex95gogo.github.io/Cognitive-Reward/.

Fonte: arXiv cs.CV

RLScore 85

Distributed Real-Time Vehicle Control for Emergency Vehicle Transit: A Scalable Cooperative Method

arXiv:2603.25000v1 Announce Type: new Abstract: Rapid transit of emergency vehicles is critical for saving lives and reducing property loss but often relies on surrounding ordinary vehicles to cooperatively adjust their driving behaviors. It is important to ensure rapid transit of emergency vehicles while minimizing the impact on ordinary vehicles. Centralized mathematical solver and reinforcement learning are the state-of-the-art methods. The former obtains optimal solutions but is only practical for small-scale scenarios. The latter implicitly learns through extensive centralized training but the trained model exhibits limited scalability to different traffic conditions. Hence, existing methods suffer from two fundamental limitations: high computational cost and lack of scalability. To overcome above limitations, this work proposes a scalable distributed vehicle control method, where vehicles adjust their driving behaviors in a distributed manner online using only local instead of global information. We proved that the proposed distributed method using only local information is approximately equivalent to the one using global information, which enables vehicles to evaluate their candidate states and make approximately optimal decisions in real time without pre-training and with natural adaptability to varying traffic conditions. Then, a distributed conflict resolution mechanism is further proposed to guarantee vehicles' safety by avoiding their decision conflicts, which eliminates the single-point-of-failure risk of centralized methods and provides deterministic safety guarantees that learned methods cannot offer. Compared with existing methods, simulation experiments based on real-world traffic datasets demonstrate that the proposed method achieves faster decision-making, less impact on ordinary vehicles, and maintains much stronger scalability across different traffic densities and road configurations.

Fonte: arXiv cs.CV

NLP/LLMsScore 85

VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning

arXiv:2603.25021v1 Announce Type: new Abstract: Existing Multimodal Large Language Models (MLLMs) often suffer from hallucinations in long video understanding (LVU), primarily due to the imbalance between textual and visual tokens. Observing that MLLMs handle short visual inputs well, recent LVU works alleviate hallucinations by automatically parsing the vast visual data into manageable segments that can be effectively processed by MLLMs. SFT-based tool-calling methods can serve this purpose, but they typically require vast amounts of fine-grained, high-quality data and suffer from constrained tool-calling trajectories. We propose a novel VideoTIR that leverages Reinforcement Learning (RL) to encourage proper usage of comprehensive multi-level toolkits for efficient long video understanding. VideoTIR explores both Zero-RL and SFT cold-starting to enable MLLMs to retrieve and focus on meaningful video segments/images/regions, enhancing long video understanding both accurately and efficiently. To reduce redundant tool-calling, we propose Toolkit Action Grouped Policy Optimization (TAGPO), which enhances the efficiency of the calling process through stepwise reward assignment and reuse of failed rollouts. Additionally, we develop a sandbox-based trajectory synthesis framework to generate high-quality trajectories data. Extensive experiments on three long-video QA benchmarks demonstrate the effectiveness and efficiency of our method.

Fonte: arXiv cs.CV

NLP/LLMsScore 85

MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models

arXiv:2603.24984v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) has emerged as an effective approach to reduce the computational overhead of Transformer architectures by sparsely activating a subset of parameters for each token while preserving high model capacity. This paradigm has recently been extended to Vision-Language Models (VLMs), enabling scalable multi-modal understanding with reduced computational cost. However, the widely adopted deterministic top-K routing mechanism may overlook more optimal expert combinations and lead to expert overfitting. To address this limitation and improve the diversity of expert selection, we propose MoE-GRPO, a reinforcement learning (RL)-based framework for optimizing expert routing in MoE-based VLMs. Specifically, we formulate expert selection as a sequential decision-making problem and optimize it using Group Relative Policy Optimization (GRPO), allowing the model to learn adaptive expert routing policies through exploration and reward-based feedback. Furthermore, we introduce a modality-aware router guidance that enhances training stability and efficiency by discouraging the router from exploring experts that are infrequently activated for a given modality. Extensive experiments on multi-modal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants by promoting more diverse expert selection, thereby mitigating expert overfitting and enabling a task-level expert specialization.

Fonte: arXiv cs.CV

NLP/LLMsScore 85

X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs

arXiv:2603.24596v1 Announce Type: cross Abstract: While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit a significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs to their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, where a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling teacher's capabilities into student's multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap in complex tasks while preserving the model's inherent capabilities.

Fonte: arXiv cs.CL

RLScore 85

Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

arXiv:2603.24840v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce arrol (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones more correctness-balanced to enhance learning signals. Specifically, arrol trains a lightweight quality head on-the-fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weigh candidates to improve inference accuracy during test-time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re-batches the remaining ones for log-probability computation and policy updates. Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B-8B), arrol improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, and yielding up to +8.33 additional gains in average accuracy in test-time scaling. The code is available at https://github.com/Hsu1023/ARRoL.

Fonte: arXiv cs.CL

RLScore 85

Efficient Best-of-Both-Worlds Algorithms for Contextual Combinatorial Semi-Bandits

arXiv:2508.18768v2 Announce Type: replace Abstract: We introduce the first best-of-both-worlds algorithm for contextual combinatorial semi-bandits that simultaneously guarantees $\widetilde{\mathcal{O}}(\sqrt{T})$ regret in the adversarial regime and $\widetilde{\mathcal{O}}(\ln T)$ regret in the corrupted stochastic regime. Our approach builds on the Follow-the-Regularized-Leader (FTRL) framework equipped with a Shannon entropy regularizer, yielding a flexible method that admits efficient implementations. Beyond regret bounds, we tackle the practical bottleneck in FTRL (or, equivalently, Online Stochastic Mirror Descent) arising from the high-dimensional projection step encountered in each round of interaction. By leveraging the Karush-Kuhn-Tucker conditions, we transform the $K$-dimensional convex projection problem into a single-variable root-finding problem, dramatically accelerating each round. Empirical evaluations demonstrate that this combined strategy not only attains the attractive regret bounds of best-of-both-worlds algorithms but also delivers substantial per-round speed-ups, making it well-suited for large-scale, real-time applications.

Fonte: arXiv stat.ML

NLP/LLMsScore 85

Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

arXiv:2603.24709v1 Announce Type: cross Abstract: Multi-step tool orchestration, where LLMs must invoke multiple dependent APIs in the correct order while propagating intermediate outputs, remains challenging. State-of-the-art models frequently fail on full sequence execution, with parameter value errors accounting for a significant portion of failures. Training models to handle such workflows faces two obstacles: existing environments focus on simple per-turn function calls with simulated data, and binary rewards provide no signal for partial correctness. We present a framework addressing both challenges. First, we construct a reinforcement learning environment backed by a large-scale cache of real API responses, enabling a data synthesis pipeline that samples valid multi-step orchestration traces with controllable complexity and significantly higher generation efficiency than unconstrained methods. Second, we propose a graduated reward design that decomposes correctness into atomic validity (individual function call correctness at increasing granularity) and orchestration (correct tool sequencing with dependency respect). On ComplexFuncBench, our approach demonstrates substantial improvements in turn accuracy. Ablation studies confirm both reward components are essential: using either alone significantly degrades performance.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Driven Optimization

arXiv:2603.24936v1 Announce Type: new Abstract: Human trajectory forecasting is important for intelligent multimedia systems operating in visually complex environments, such as autonomous driving and crowd surveillance. Although Conditional Flow Matching (CFM) has shown strong ability in modeling trajectory distributions from spatio-temporal observations, existing approaches still focus primarily on supervised fitting, which may leave social norms and scene constraints insufficiently reflected in generated trajectories. To address this issue, we propose TIGFlow-GRPO, a two-stage generative framework that aligns flow-based trajectory generation with behavioral rules. In the first stage, we build a CFM-based predictor with a Trajectory-Interaction-Graph (TIG) module to model fine-grained visual-spatial interactions and strengthen context encoding. This stage captures both agent-agent and agent-scene relations more effectively, providing more informative conditional features for subsequent alignment. In the second stage, we perform Flow-GRPO post-training,where deterministic flow rollout is reformulated as stochastic ODE-to-SDE sampling to enable trajectory exploration, and a composite reward combines view-aware social compliance with map-aware physical feasibility. By evaluating trajectories explored through SDE rollout, GRPO progressively steers multimodal predictions toward behaviorally plausible futures. Experiments on the ETH/UCY and SDD datasets show that TIGFlow-GRPO improves forecasting accuracy and long-horizon stability while generating trajectories that are more socially compliant and physically feasible. These results suggest that the proposed framework provides an effective way to connect flow-based trajectory modeling with behavior-aware alignment in dynamic multimedia environments.

Fonte: arXiv cs.CV

NLP/LLMsScore 85

TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning

arXiv:2603.25419v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance disparity persists in multilingual contexts, largely attributed to deficiencies in language understanding. To bridge this gap, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon GRPO. TAPO enforces an explicit alignment strategy where the model leverages English as a pivot and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing the integration of translation quality rewards without introducing optimization conflicts. Extensive experiments reveal that TAPO effectively synergizes language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, while generalizing well to unseen languages and out-of-domain tasks.

Fonte: arXiv cs.CL

RLScore 85

Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

arXiv:2603.24844v1 Announce Type: cross Abstract: Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single answer trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at https://multi-answer-rl.github.io/.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

LineMVGNN: Anti-Money Laundering with Line-Graph-Assisted Multi-View Graph Neural Networks

arXiv:2603.23584v1 Announce Type: new Abstract: Anti-money laundering (AML) systems are important for protecting the global economy. However, conventional rule-based methods rely on domain knowledge, leading to suboptimal accuracy and a lack of scalability. Graph neural networks (GNNs) for digraphs (directed graphs) can be applied to transaction graphs and capture suspicious transactions or accounts. However, most spectral GNNs do not naturally support multi-dimensional edge features, lack interpretability due to edge modifications, and have limited scalability owing to their spectral nature. Conversely, most spatial methods may not capture the money flow well. Therefore, in this work, we propose LineMVGNN (Line-Graph-Assisted Multi-View Graph Neural Network), a novel spatial method that considers payment and receipt transactions. Specifically, the LineMVGNN model extends a lightweight MVGNN module, which performs two-way message passing between nodes in a transaction graph. Additionally, LineMVGNN incorporates a line graph view of the original transaction graph to enhance the propagation of transaction information. We conduct experiments on two real-world account-based transaction datasets: the Ethereum phishing transaction network dataset and a financial payment transaction dataset from one of our industry partners. The results show that our proposed method outperforms state-of-the-art methods, reflecting the effectiveness of money laundering detection with line-graph-assisted multi-view graph learning. We also discuss scalability, adversarial robustness, and regulatory considerations of our proposed method.

Fonte: arXiv cs.LG

RLScore 85

Self Paced Gaussian Contextual Reinforcement Learning

arXiv:2603.23755v1 Announce Type: new Abstract: Curriculum learning improves reinforcement learning (RL) efficiency by sequencing tasks from simple to complex. However, many self-paced curriculum methods rely on computationally expensive inner-loop optimizations, limiting their scalability in high-dimensional context spaces. In this paper, we propose Self-Paced Gaussian Curriculum Learning (SPGL), a novel approach that avoids costly numerical procedures by leveraging a closed-form update rule for Gaussian context distributions. SPGL maintains the sample efficiency and adaptability of traditional self-paced methods while substantially reducing computational overhead. We provide theoretical guarantees on convergence and validate our method across several contextual RL benchmarks, including the Point Mass, Lunar Lander, and Ball Catching environments. Experimental results show that SPGL matches or outperforms existing curriculum methods, especially in hidden context scenarios, and achieves more stable context distribution convergence. Our method offers a scalable, principled alternative for curriculum generation in challenging continuous and partially observable domains.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement

arXiv:2603.23676v1 Announce Type: new Abstract: We study long-horizon planning in 3D environments from under-specified natural-language goals using only visual observations, focusing on multi-step 3D box rearrangement tasks. Existing approaches typically rely on symbolic planners with brittle relational grounding of states and goals, or on direct action-sequence generation from 2D vision-language models (VLMs). Both approaches struggle with reasoning over many objects, rich 3D geometry, and implicit semantic constraints. Recent advances in 3D VLMs demonstrate strong grounding of natural-language referents to 3D segmentation masks, suggesting the potential for more general planning capabilities. We extend existing 3D grounding models and propose Reactive Action Mask Planner (RAMP-3D), which formulates long-horizon planning as sequential reactive prediction of paired 3D masks: a "which-object" mask indicating what to pick and a "which-target-region" mask specifying where to place it. The resulting system processes RGB-D observations and natural-language task specifications to reactively generate multi-step pick-and-place actions for 3D box rearrangement. We conduct experiments across 11 task variants in warehouse-style environments with 1-30 boxes and diverse natural-language constraints. RAMP-3D achieves 79.5% success rate on long-horizon rearrangement tasks and significantly outperforms 2D VLM-based baselines, establishing mask-based reactive policies as a promising alternative to symbolic pipelines for long-horizon planning.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents

arXiv:2603.23951v1 Announce Type: new Abstract: Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism-level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically linked archive linking proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms, including analytic-variance scaling and validity masking. The best variant improves weighted Overall from 47.8 to 52.5 (+4.6) and increases AIME25 pass@32 from 26.7% to 43.3%, demonstrating the feasibility of automated policy optimization discovery while supporting interpretable design principles.

Fonte: arXiv cs.CL

RLScore 85

Completeness of Unbounded Best-First Minimax and Descent Minimax

arXiv:2603.24572v1 Announce Type: new Abstract: In this article, we focus on search algorithms for two-player perfect information games, whose objective is to determine the best possible strategy, and ideally a winning strategy. Unfortunately, some search algorithms for games in the literature are not able to always determine a winning strategy, even with an infinite search time. This is the case, for example, of the following algorithms: Unbounded Best-First Minimax and Descent Minimax, which are core algorithms in state-of-the-art knowledge-free reinforcement learning. They were then improved with the so-called completion technique. However, whether this technique sufficiently improves these algorithms to allow them to always determine a winning strategy remained an open question until now. To answer this question, we generalize the two algorithms (their versions using the completion technique), and we show that any algorithm of this class of algorithms computes the best strategy. Finally, we experimentally show that the completion technique improves winning performance.

Fonte: arXiv cs.AI

RLScore 85

Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

arXiv:2603.23926v1 Announce Type: new Abstract: Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high ``burn-in'' costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $\gamma$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\tilde{O}( \sqrt{SA\,\text{Var}} + \text{lower-order terms})$, where $S,A$ are the state and action space sizes, and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $\gamma$-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span $\Vert h^\star\Vert_\text{sp}$, our algorithm obtains lower-order terms scaling as $\Vert h^\star\Vert_\text{sp} S^2 A$, which we prove is optimal in both $\Vert h^\star\Vert_\text{sp}$ and $A$. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than $\Vert h^\star \Vert_\text{sp}^2 S A$, and we provide a prior-free algorithm whose lower-order terms scale as $\Vert h^\star\Vert_\text{sp}^2 S^3 A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\Vert h^\star\Vert_\text{sp}$ in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.

Fonte: arXiv cs.LG

RLScore 85

HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation

arXiv:2603.23871v1 Announce Type: new Abstract: Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all - "cliff" prompts - the RL gradient vanishes entirely, preventing any learning signal from reaching these failure modes. We introduce Hybrid Distillation Policy Optimization (HDPO), which augments standard RL with privileged self-distillation targeting cliff prompts. On each training step, HDPO identifies prompts where all rollouts fail, generates privileged rollouts by providing the model with ground-truth information, filters for correct solutions, and distills the teacher's token-level distribution into the student. Because teacher and student share the same weights - differing only in their input - the realizability gap is provably bounded, unlike cross-model distillation. We prove that R=1 filtered privileged generation recovers the optimal KL-regularized RL policy in the hard-threshold limit. Experiments on OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct show that HDPO consistently improves coverage metrics (pass@4 by +0.8-1.1%, pass@8 by +0.4-1.7%) while maintaining greedy accuracy, with the distillation weight lambda providing direct control over the exploration-exploitation tradeoff.

Fonte: arXiv cs.LG

RLScore 85

BXRL: Behavior-Explainable Reinforcement Learning

arXiv:2603.23738v1 Announce Type: new Abstract: A major challenge of Reinforcement Learning is that agents often learn undesired behaviors that seem to defy the reward structure they were given. Explainable Reinforcement Learning (XRL) methods can answer queries such as "explain this specific action", "explain this specific trajectory", and "explain the entire policy". However, XRL lacks a formal definition for behavior as a pattern of actions across many episodes. We provide such a definition, and use it to enable a new query: "Explain this behavior". We present Behavior-Explainable Reinforcement Learning (BXRL), a new problem formulation that treats behaviors as first-class objects. BXRL defines a behavior measure as any function $m : \Pi \to \mathbb{R}$, allowing users to precisely express the pattern of actions that they find interesting and measure how strongly the policy exhibits it. We define contrastive behaviors that reduce the question "why does the agent prefer $a$ to $a'$?" to "why is $m(\pi)$ high?" which can be explored with differentiation. We do not implement an explainability method; we instead analyze three existing methods and propose how they could be adapted to explain behavior. We present a port of the HighwayEnv driving environment to JAX, which provides an interface for defining, measuring, and differentiating behaviors with respect to the model parameters.

Fonte: arXiv cs.LG

RLScore 85

Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration

arXiv:2603.23889v1 Announce Type: new Abstract: When safety is formulated as a limit of cumulative cost, safe reinforcement learning (RL) aims to learn policies that maximize return subject to the cost constraint in data collection and deployment. Off-policy safe RL methods, although offering high sample efficiency, suffer from constraint violations due to cost-agnostic exploration and estimation bias in cumulative cost. To address this issue, we propose Constrained Optimistic eXploration Q-learning (COX-Q), an off-policy safe RL algorithm that integrates cost-bounded online exploration and conservative offline distributional value learning. First, we introduce a novel cost-constrained optimistic exploration strategy that resolves gradient conflicts between reward and cost in the action space and adaptively adjusts the trust region to control the training cost. Second, we adopt truncated quantile critics to stabilize the cost value learning. Quantile critics also quantify epistemic uncertainty to guide exploration. Experiments on safe velocity, safe navigation, and autonomous driving tasks demonstrate that COX-Q achieves high sample efficiency, competitive test safety performance, and controlled data collection cost. The results highlight COX-Q as a promising RL method for safety-critical applications.

Fonte: arXiv cs.LG

RLScore 85

Safe Reinforcement Learning with Preference-based Constraint Inference

arXiv:2603.23565v1 Announce Type: new Abstract: Safe reinforcement learning (RL) is a standard paradigm for safety-critical decision making. However, real-world safety constraints can be complex, subjective, and even hard to explicitly specify. Existing works on constraint inference rely on restrictive assumptions or extensive expert demonstrations, which is not realistic in many real-world applications. How to cheaply and reliably learn these constraints is the major challenge we focus on in this study. While inferring constraints from human preferences offers a data-efficient alternative, we identify the popular Bradley-Terry (BT) models fail to capture the asymmetric, heavy-tailed nature of safety costs, resulting in risk underestimation. It is still rare in the literature to understand the impacts of BT models on the downstream policy learning. To address the above knowledge gaps, we propose a novel approach namely Preference-based Constrained Reinforcement Learning (PbCRL). We introduce a novel dead zone mechanism into preference modeling and theoretically prove that it encourages heavy-tailed cost distributions, thereby achieving better constraint alignment. Additionally, we incorporate a Signal-to-Noise Ratio (SNR) loss to encourage exploration by cost variances, which is found to benefit policy learning. Further, two-stage training strategy are deployed to lower online labeling burdens while adaptively enhancing constraint satisfaction. Empirical results demonstrate that PbCRL achieves superior alignment with true safety requirements and outperforms the state-of-the-art baselines in terms of safety and reward. Our work explores a promising and effective way for constraint inference in Safe RL, which has great potential in a range of safety-critical applications.

Fonte: arXiv cs.LG

RLScore 85

Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction

arXiv:2603.23550v1 Announce Type: new Abstract: Multi-turn human-AI collaboration is fundamental to deploying interactive services such as adaptive tutoring, conversational recommendation, and professional consultation. However, optimizing these interactions via reinforcement learning is hindered by the sparsity of verifiable intermediate rewards and the high stochasticity of user responses. To address these challenges, we introduce Implicit Turn-wise Policy Optimization (ITPO). ITPO leverages an implicit process reward model to derive fine-grained, turn-wise process rewards from sparse outcome signals. Unlike volatile token-level rewards, these turn-level signals exhibit superior robustness and may utilize a normalization mechanism to further enhance training stability. We evaluate ITPO across three representative multi-turn collaborative tasks: math tutoring, document writing, and medical recommendation. Empirical results demonstrate that ITPO, when combined with PPO, GRPO, or RLOO, consistently achieves improved convergence than existing baselines. Elaborate trajectory analysis confirms that ITPO infers turn-wise preferences that are semantically aligned with human judgment. Code is publicly available at https://github.com/Graph-COM/ITPO.

Fonte: arXiv cs.LG

RLScore 85

Central Limit Theorems for Transition Probabilities of Controlled Markov Chains

arXiv:2508.01517v3 Announce Type: replace-cross Abstract: We develop a central limit theorem (CLT) for a non-parametric estimator of the transition matrices in controlled Markov chains (CMCs) with finite state-action spaces. Our results establish precise conditions on the logging policy under which the estimator is asymptotically normal, and reveal settings in which no CLT can exist. We then build on it to derive CLTs for the value, Q-, and advantage functions of any stationary stochastic policy, including the optimal policy recovered from the estimated model. Goodness-of-fit tests are derived as a corollary, which enable to test whether the logged data is stochastic. These results provide new statistical tools for offline policy evaluation and optimal policy recovery, and enable hypothesis tests for transition probabilities.

Fonte: arXiv stat.ML

RLScore 85

Learning-guided Prioritized Planning for Lifelong Multi-Agent Path Finding in Warehouse Automation

arXiv:2603.23838v1 Announce Type: new Abstract: Lifelong Multi-Agent Path Finding (MAPF) is critical for modern warehouse automation, which requires multiple robots to continuously navigate conflict-free paths to optimize the overall system throughput. However, the complexity of warehouse environments and the long-term dynamics of lifelong MAPF often demand costly adaptations to classical search-based solvers. While machine learning methods have been explored, their superiority over search-based methods remains inconclusive. In this paper, we introduce Reinforcement Learning (RL) guided Rolling Horizon Prioritized Planning (RL-RH-PP), the first framework integrating RL with search-based planning for lifelong MAPF. Specifically, we leverage classical Prioritized Planning (PP) as a backbone for its simplicity and flexibility in integrating with a learning-based priority assignment policy. By formulating dynamic priority assignment as a Partially Observable Markov Decision Process (POMDP), RL-RH-PP exploits the sequential decision-making nature of lifelong planning while delegating complex spatial-temporal interactions among agents to reinforcement learning. An attention-based neural network autoregressively decodes priority orders on-the-fly, enabling efficient sequential single-agent planning by the PP planner. Evaluations in realistic warehouse simulations show that RL-RH-PP achieves the highest total throughput among baselines and generalizes effectively across agent densities, planning horizons, and warehouse layouts. Our interpretive analyses reveal that RL-RH-PP proactively prioritizes congested agents and strategically redirects agents from congestion, easing traffic flow and boosting throughput. These findings highlight the potential of learning-guided approaches to augment traditional heuristics in modern warehouse automation.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

The DeepXube Software Package for Solving Pathfinding Problems with Learned Heuristic Functions and Search

arXiv:2603.23873v1 Announce Type: new Abstract: DeepXube is a free and open-source Python package and command-line tool that seeks to automate the solution of pathfinding problems by using machine learning to learn heuristic functions that guide heuristic search algorithms tailored to deep neural networks (DNNs). DeepXube is comprised of the latest advances in deep reinforcement learning, heuristic search, and formal logic for solving pathfinding problems. This includes limited-horizon Bellman-based learning, hindsight experience replay, batched heuristic search, and specifying goals with answer-set programming. A robust multiple-inheritance structure simplifies the definition of pathfinding domains and the generation of training data. Training heuristic functions is made efficient through the automatic parallelization of the generation of training data across central processing units (CPUs) and reinforcement learning updates across graphics processing units (GPUs). Pathfinding algorithms that take advantage of the parallelism of GPUs and DNN architectures, such as batch weighted A* and Q* search and beam search are easily employed to solve pathfinding problems through command-line arguments. Finally, several convenient features for visualization, code profiling, and progress monitoring during training and solving are available. The GitHub repository is publicly available at https://github.com/forestagostinelli/deepxube.

Fonte: arXiv cs.AI

RLScore 85

From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments

arXiv:2603.23964v1 Announce Type: new Abstract: The remarkable progress of reinforcement learning (RL) is intrinsically tied to the environments used to train and evaluate artificial agents. Moving beyond traditional qualitative reviews, this work presents a large-scale, data-driven empirical investigation into the evolution of RL environments. By programmatically processing a massive corpus of academic literature and rigorously distilling over 2,000 core publications, we propose a quantitative methodology to map the transition from isolated physical simulations to generalist, language-driven foundation agents. Implementing a novel, multi-dimensional taxonomy, we systematically analyze benchmarks against diverse application domains and requisite cognitive capabilities. Our automated semantic and statistical analysis reveals a profound, data-verified paradigm shift: the bifurcation of the field into a "Semantic Prior" ecosystem dominated by Large Language Models (LLMs) and a "Domain-Specific Generalization" ecosystem. Furthermore, we characterize the "cognitive fingerprints" of these distinct domains to uncover the underlying mechanisms of cross-task synergy, multi-domain interference, and zero-shot generalization. Ultimately, this study offers a rigorous, quantitative roadmap for designing the next generation of Embodied Semantic Simulators, bridging the gap between continuous physical control and high-level logical reasoning.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

ELITE: Experiential Learning and Intent-Aware Transfer for Self-improving Embodied Agents

arXiv:2603.24018v1 Announce Type: new Abstract: Vision-language models (VLMs) have shown remarkable general capabilities, yet embodied agents built on them fail at complex tasks, often skipping critical steps, proposing invalid actions, and repeating mistakes. These failures arise from a fundamental gap between the static training data of VLMs and the physical interaction for embodied tasks. VLMs can learn rich semantic knowledge from static data but lack the ability to interact with the world. To address this issue, we introduce ELITE, an embodied agent framework with {E}xperiential {L}earning and {I}ntent-aware {T}ransfer that enables agents to continuously learn from their own environment interaction experiences, and transfer acquired knowledge to procedurally similar tasks. ELITE operates through two synergistic mechanisms, \textit{i.e.,} self-reflective knowledge construction and intent-aware retrieval. Specifically, self-reflective knowledge construction extracts reusable strategies from execution trajectories and maintains an evolving strategy pool through structured refinement operations. Then, intent-aware retrieval identifies relevant strategies from the pool and applies them to current tasks. Experiments on the EB-ALFRED and EB-Habitat benchmarks show that ELITE achieves 9\% and 5\% performance improvement over base VLMs in the online setting without any supervision. In the supervised setting, ELITE generalizes effectively to unseen task categories, achieving better performance compared to state-of-the-art training-based methods. These results demonstrate the effectiveness of ELITE for bridging the gap between semantic understanding and reliable action execution.

Fonte: arXiv cs.AI

RLScore 85

PointRFT: Explicit Reinforcement Fine-tuning for Point Cloud Few-shot Learning

arXiv:2603.23957v1 Announce Type: new Abstract: Understanding spatial dynamics and semantics in point cloud is fundamental for comprehensive 3D comprehension. While reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) have recently achieved remarkable breakthroughs in large language models by incentivizing reasoning capabilities through strategic reward design, their potential remains largely unexplored in the 3D perception domain. This naturally raises a pivotal question: Can RL-based methods effectively empower 3D point cloud fine-tuning? In this paper, we propose PointRFT, the first reinforcement fine-tuning paradigm tailored specifically for point cloud representation learning. We select three prevalent 3D foundation models and devise specialized accuracy reward and dispersion reward functions to stabilize training and mitigate distribution shifts. Through comprehensive few-shot classification experiments comparing distinct training paradigms, we demonstrate that PointRFT consistently outperforms vanilla supervised fine-tuning (SFT) across diverse benchmarks. Furthermore, when organically integrated into a hybrid Pretraining-SFT-RFT paradigm, the representational capacity of point cloud foundation models is substantially unleashed, achieving state-of-the-art performance particularly under data-scarce scenarios.

Fonte: arXiv cs.CV

RLScore 85

WIST: Web-Grounded Iterative Self-Play Tree for Domain-Targeted Reasoning Improvement

arXiv:2603.22352v1 Announce Type: new Abstract: Recent progress in reinforcement learning with verifiable rewards (RLVR) offers a practical path to self-improvement of language models, but existing methods face a key trade-off: endogenous self-play can drift over iterations, while corpus-grounded approaches rely on curated data environments. We present \textbf{WIST}, a \textbf{W}eb-grounded \textbf{I}terative \textbf{S}elf-play \textbf{T}ree framework for domain-targeted reasoning improvement that learns directly from the open web without requiring any pre-arranged domain corpus. WIST incrementally expands a domain tree for exploration, and retrieves and cleans path-consistent web corpus to construct a controllable training environment. It then performs Challenger--Solver self-play with verifiable rewards, and feeds learnability signals back to update node posteriors and guide subsequent exploration through an adaptive curriculum. Across four backbones, WIST consistently improves over the base models and typically outperforms both purely endogenous self-evolution and corpus-grounded self-play baselines, with the Overall gains reaching \textbf{+9.8} (\textit{Qwen3-4B-Base}) and \textbf{+9.7} (\textit{OctoThinker-8B}). WIST is also domain-steerable, improving \textit{Qwen3-8B-Base} by \textbf{+14.79} in medicine and \textit{Qwen3-4B-Base} by \textbf{+5.28} on PhyBench. Ablations further confirm the importance of WIST's key components for stable open-web learning. Our Code is available at https://github.com/lfy-123/WIST.

Fonte: arXiv cs.LG

RLScore 85

COMPASS-Hedge: Learning Safely Without Knowing the World

arXiv:2603.22348v1 Announce Type: new Abstract: Online learning algorithms often faces a fundamental trilemma: balancing regret guarantees between adversarial and stochastic settings and providing baseline safety against a fixed comparator. While existing methods excel in one or two of these regimes, they typically fail to unify all three without sacrificing optimal rates or requiring oracle access to problem-dependent parameters. In this work, we bridge this gap by introducing COMPASS-Hedge. Our algorithm is the first full-information method to simultaneously achieve: i) Minimax-optimal regret in adversarial environments; ii) Instance-optimal, gap-dependent regret in stochastic environments; and iii) $\tilde{\mathcal{O}}(1)$ regret relative to a designated baseline policy, up to logarithmic factors. Crucially, COMPASS-Hedge is parameter-free and requires no prior knowledge of the environment's nature or the magnitude of the stochastic sub optimality gaps. Our approach hinges on a novel integration of adaptive pseudo-regret scaling and phase-based aggression, coupled with a comparator-aware mixing strategy. To the best of our knowledge, this provides the first "best-of-three-world" guarantee in the full-information setting, establishing that baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

PhySe-RPO: Physics and Semantics Guided Relative Policy Optimization for Diffusion-Based Surgical Smoke Removal

arXiv:2603.22844v1 Announce Type: new Abstract: Surgical smoke severely degrades intraoperative video quality, obscuring anatomical structures and limiting surgical perception. Existing learning-based desmoking approaches rely on scarce paired supervision and deterministic restoration pipelines, making it difficult to perform exploration or reinforcement-driven refinement under real surgical conditions. We propose PhySe-RPO, a diffusion restoration framework optimized through Physics- and Semantics-Guided Relative Policy Optimization. The core idea is to transform deterministic restoration into a stochastic policy, enabling trajectory-level exploration and critic-free updates via group-relative optimization. A physics-guided reward imposes illumination and color consistency, while a visual-concept semantic reward learned from CLIP-based surgical concepts promotes smoke-free and anatomically coherent restoration. Together with a reference-free perceptual constraint, PhySe-RPO produces results that are physically consistent, semantically faithful, and clinically interpretable across synthetic and real robotic surgical datasets, providing a principled route to robust diffusion-based restoration under limited paired supervision.

Fonte: arXiv cs.AI

RLScore 85

Learning What Matters Now: Dynamic Preference Inference under Contextual Shifts

arXiv:2603.22813v1 Announce Type: new Abstract: Humans often juggle multiple, sometimes conflicting objectives and shift their priorities as circumstances change, rather than following a fixed objective function. In contrast, most computational decision-making and multi-objective RL methods assume static preference weights or a known scalar reward. In this work, we study sequential decision-making problem when these preference weights are unobserved latent variables that drift with context. Specifically, we propose Dynamic Preference Inference (DPI), a cognitively inspired framework in which an agent maintains a probabilistic belief over preference weights, updates this belief from recent interaction, and conditions its policy on inferred preferences. We instantiate DPI as a variational preference inference module trained jointly with a preference-conditioned actor-critic, using vector-valued returns as evidence about latent trade-offs. In queueing, maze, and multi-objective continuous-control environments with event-driven changes in objectives, DPI adapts its inferred preferences to new regimes and achieves higher post-shift performance than fixed-weight and heuristic envelope baselines.

Fonte: arXiv cs.AI

RLScore 85

Beyond Hard Constraints: Budget-Conditioned Reachability For Safe Offline Reinforcement Learning

arXiv:2603.22292v1 Announce Type: new Abstract: Sequential decision making using Markov Decision Process underpins many realworld applications. Both model-based and model free methods have achieved strong results in these settings. However, real-world tasks must balance reward maximization with safety constraints, often conflicting objectives, that can lead to unstable min/max, adversarial optimization. A promising alternative is safety reachability analysis, which precomputes a forward-invariant safe state, action set, ensuring that an agent starting inside this set remains safe indefinitely. Yet, most reachability based methods address only hard safety constraints, and little work extends reachability to cumulative cost constraints. To address this, first, we define a safetyconditioned reachability set that decouples reward maximization from cumulative safety cost constraints. Second, we show how this set enforces safety constraints without unstable min/max or Lagrangian optimization, yielding a novel offline safe RL algorithm that learns a safe policy from a fixed dataset without environment interaction. Finally, experiments on standard offline safe RL benchmarks, and a real world maritime navigation task demonstrate that our method matches or outperforms state of the art baselines while maintaining safety.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

The Efficiency Attenuation Phenomenon: A Computational Challenge to the Language of Thought Hypothesis

arXiv:2603.22312v1 Announce Type: new Abstract: This paper computationally investigates whether thought requires a language-like format, as posited by the Language of Thought (LoT) hypothesis. We introduce the ``AI Private Language'' thought experiment: if two artificial agents develop an efficient, inscrutable communication protocol via multi-agent reinforcement learning (MARL), and their performance declines when forced to use a human-comprehensible language, this Efficiency Attenuation Phenomenon (EAP) challenges the LoT. We formalize this in a cooperative navigation task under partial observability. Results show that agents with an emergent protocol achieve 50.5\% higher efficiency than those using a pre-defined, human-like symbolic protocol, confirming the EAP. This suggests optimal collaborative cognition in these systems is not mediated by symbolic structures but is naturally coupled with sub-symbolic computations. The work bridges philosophy, cognitive science, and AI, arguing for pluralism in cognitive architectures and highlighting implications for AI ethics.

Fonte: arXiv cs.AI

RLScore 85

MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping

arXiv:2603.22650v1 Announce Type: new Abstract: Active mapping aims to determine how an agent should move to efficiently reconstruct an unknown environment. Most existing approaches rely on greedy next-best-view prediction, resulting in inefficient exploration and incomplete scene reconstruction. To address this limitation, we introduce MAGICIAN, a novel long-term planning framework that maximizes accumulated surface coverage gain through Imagined Gaussians, a scene representation derived from a pre-trained occupancy network with strong structural priors. This representation enables efficient computation of coverage gain for any novel viewpoint via fast volumetric rendering, allowing its integration into a tree-search algorithm for long-horizon planning. We update Imagined Gaussians and refine the planned trajectory in a closed-loop manner. Our method achieves state-of-the-art performance across indoor and outdoor benchmarks with varying action spaces, demonstrating the critical advantage of long-term planning in active mapping.

Fonte: arXiv cs.CV

NLP/LLMsScore 85

TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs

arXiv:2603.22293v1 Announce Type: new Abstract: Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training still remains a significant challenge. The optimization is often unstable due to sparse rewards and difficult credit assignments across reasoning and tool calls. To address this, we introduce Turn-Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn-level rewards to each reasoning + tool-call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging the potential-based reward shaping, TIPS offers fine-grained and policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen-2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn-level information-potential reward shaping provides an effective and general solution to sparse-reward credit assignment for multi-turn LLM reasoning.

Fonte: arXiv cs.CL

RLScore 85

Learning When to Act: Interval-Aware Reinforcement Learning with Predictive Temporal Structure

arXiv:2603.22384v1 Announce Type: new Abstract: Autonomous agents operating in continuous environments must decide not only what to do, but when to act. We introduce a lightweight adaptive temporal control system that learns the optimal interval between cognitive ticks from experience, replacing ad hoc biologically inspired timers with a principled learned policy. The policy state is augmented with a predictive hyperbolic spread signal (a "curvature signal" shorthand) derived from hyperbolic geometry: the mean pairwise Poincare distance among n sampled futures embedded in the Poincare ball. High spread indicates a branching, uncertain future and drives the agent to act sooner; low spread signals predictability and permits longer rest intervals. We further propose an interval-aware reward that explicitly penalises inefficiency relative to the chosen wait time, correcting a systematic credit-assignment failure of naive outcome-based rewards in timing problems. We additionally introduce a joint spatio-temporal embedding (ATCPG-ST) that concatenates independently normalised state and position projections in the Poincare ball; spatial trajectory divergence provides an independent timing signal unavailable to the state-only variant (ATCPG-SO). This extension raises mean hyperbolic spread (kappa) from 1.88 to 3.37 and yields a further 5.8 percent efficiency gain over the state-only baseline. Ablation experiments across five random seeds demonstrate that (i) learning is the dominant efficiency factor (54.8 percent over no-learning), (ii) hyperbolic spread provides significant complementary gain (26.2 percent over geometry-free control), (iii) the combined system achieves 22.8 percent efficiency over the fixed-interval baseline, and (iv) adding spatial position information to the spread embedding yields an additional 5.8 percent.

Fonte: arXiv cs.LG

VisionScore 85

Q-Tacit: Image Quality Assessment via Latent Visual Reasoning

arXiv:2603.22641v1 Announce Type: new Abstract: Vision-Language Model (VLM)-based image quality assessment (IQA) has been significantly advanced by incorporating Chain-of-Thought (CoT) reasoning. Recent work has refined image quality reasoning by applying reinforcement learning (RL) and leveraging active visual tools. However, such strategies are typically language-centric, with visual information being treated as static preconditions. Quality-related visual cues often cannot be abstracted into text in extenso due to the gap between discrete textual tokens and quality perception space, which in turn restricts the reasoning effectiveness for visually intensive IQA tasks. In this paper, we revisit this by asking the question, "Is natural language the ideal space for quality reasoning?" and, as a consequence, we propose Q-Tacit, a new paradigm that elicits VLMs to reason beyond natural language in the latent quality space. Our approach follows a synergistic two-stage process: (i) injecting structural visual quality priors into the latent space, and (ii) calibrating latent reasoning trajectories to improve quality assessment ability. Extensive experiments demonstrate that Q-Tacit can effectively perform quality reasoning with significantly fewer tokens than previous reasoning-based methods, while achieving strong overall performance. This paper validates the proposition that language is not the only compact representation suitable for visual quality, opening possibilities for further exploration of effective latent reasoning paradigms for IQA. Source code will be released to support future research.

Fonte: arXiv cs.CV

RLScore 85

Quality Over Clicks: Intrinsic Quality-Driven Iterative Reinforcement Learning for Cold-Start E-Commerce Query Suggestion

arXiv:2603.22922v1 Announce Type: new Abstract: Existing dialogue systems rely on Query Suggestion (QS) to enhance user engagement. Recent efforts typically employ large language models with Click-Through Rate (CTR) model, yet fail in cold-start scenarios due to their heavy reliance on abundant online click data for effective CTR model training. To bridge this gap, we propose Cold-EQS, an iterative reinforcement learning framework for Cold-Start E-commerce Query Suggestion (EQS). Specifically, we leverage answerability, factuality, and information gain as reward to continuously optimize the quality of suggested queries. To continuously optimize our QS model, we estimate uncertainty for grouped candidate suggested queries to select hard and ambiguous samples from online user queries lacking click signals. In addition, we provide an EQS-Benchmark comprising 16,949 online user queries for offline training and evaluation. Extensive offline and online experiments consistently demonstrate a strong positive correlation between online and offline effectiveness. Both offline and online experimental results demonstrate the superiority of our Cold-EQS, achieving a significant +6.81% improvement in online chatUV.

Fonte: arXiv cs.CL

RLScore 85

Model Predictive Control with Differentiable World Models for Offline Reinforcement Learning

arXiv:2603.22430v1 Announce Type: new Abstract: Offline Reinforcement Learning (RL) aims to learn optimal policies from fixed offline datasets, without further interactions with the environment. Such methods train an offline policy (or value function), and apply it at inference time without further refinement. We introduce an inference time adaptation framework inspired by model predictive control (MPC) that utilizes a pretrained policy along with a learned world model of state transitions and rewards. While existing world model and diffusion-planning methods use learned dynamics to generate imagined trajectories during training, or to sample candidate plans at inference time, they do not use inference-time information to optimize the policy parameters on the fly. In contrast, our design is a Differentiable World Model (DWM) pipeline that enables endto-end gradient computation through imagined rollouts for policy optimization at inference time based on MPC. We evaluate our algorithm on D4RL continuous-control benchmarks (MuJoCo locomotion tasks and AntMaze), and show that exploiting inference-time information to optimize the policy parameters yields consistent gains over strong offline RL baselines.

Fonte: arXiv cs.LG

RLScore 85

Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Modeling

arXiv:2603.22563v1 Announce Type: new Abstract: Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback. We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model. Theoretically, we study the suboptimality gap and show that privacy contributes an additional additive term beyond the usual non-private statistical error. We also establish a minimax lower bound and show that the dominant term changes with sample size and privacy level, which in turn characterizes regimes in which the upper bound is rate-optimal up to logarithmic factors. Empirically, synthetic experiments confirm the scaling predicted by the theory, and experiments on the Anthropic HH-RLHF dataset using the Gemma-2B-IT model show stronger private alignment performance than existing differentially private baseline methods across privacy budgets.

Fonte: arXiv stat.ML

RLScore 85

ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment

arXiv:2603.23184v1 Announce Type: new Abstract: Reward modeling represents a long-standing challenge in reinforcement learning from human feedback (RLHF) for aligning language models. Current reward modeling is heavily contingent upon experimental feedback data with high collection costs. In this work, we study \textit{implicit reward modeling} -- learning reward models from implicit human feedback (e.g., clicks and copies) -- as a cost-effective alternative. We identify two fundamental challenges in implicit reward modeling: (1) Implicit preference data lacks definitive negative samples, which makes standard positive-negative classification methods inapplicable; (2) Implicit preference data suffers from user preference bias, where different responses have different propensities to elicit user feedback actions, which exacerbates the difficulty of distinguishing definitive negative samples. To address these challenges, we propose ImplicitRM, which aims to learn unbiased reward models from implicit preference data. ImplicitRM stratifies training samples into four latent groups via a stratification model. Building on this, it derives a learning objective through likelihood maximization, which we prove is theoretically unbiased, effectively resolving both challenges. Experiments demonstrate that ImplicitRM learns accurate reward models across implicit preference datasets. Code is available on our project website.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

arXiv:2603.22446v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR's distributional effects organized around three main analyses: (1) token-level characterization of distributional shifts between base and RL models, (2) the impact of token-level distributional shifts on sequence-level reasoning performance through cross-sampling interventions, and (3) fine-grained mechanics of these shifts at the token level. We find that RL fine-tuning induces highly sparse and targeted changes, with only a small fraction of token distributions exhibiting meaningful divergence between the base and RL policies. We further characterize the structure and evolution of these shifts through analyses of token entropy, positional concentration, and reallocation of probability mass. To assess the functional importance of these sparse changes, we conduct cross-sampling experiments that selectively swap token choices between the base and RL models with varying intervention budgets. We show that inserting only a small fraction of RL-sampled tokens into base generations progressively recovers RL performance gains, while injecting a similarly small number of base token choices into otherwise RL-generated sequences collapses performance to base levels, isolating a small set of token-level decisions directly responsible for RLVR's performance gains. Finally, we explore divergence-weighted variants of the advantage signal as a diagnostic intervention, finding that they can yield improvements over baselines. Together, our results shed light on the distributional changes induced by RLVR and provide a fine-grained, token-level lens for understanding RLVR fine-tuning as a targeted refinement process.

Fonte: arXiv cs.CL

RLScore 85

Emergency Preemption Without Online Exploration: A Decision Transformer Approach

arXiv:2603.22315v1 Announce Type: new Abstract: Emergency vehicle (EV) response time is a critical determinant of survival outcomes, yet deployed signal preemption strategies remain reactive and uncontrollable. We propose a return-conditioned framework for emergency corridor optimization based on the Decision Transformer (DT). By casting corridor optimization as offline, return-conditioned sequence modeling, our approach (1) eliminates online environment interaction during policy learning, (2) enables dispatch-level urgency control through a single target-return scalar, and (3) extends to multi-agent settings via a Multi-Agent Decision Transformer (MADT) with graph attention for spatial coordination. On the LightSim simulator, DT reduces average EV travel time by 37.7% relative to fixed-timing preemption on a 4x4 grid (88.6 s vs. 142.3 s), achieving the lowest civilian delay (11.3 s/veh) and fewest EV stops (1.2) among all methods, including online RL baselines that require environment interaction. MADT further improves on larger grids, overtaking DT with 45.2% reduction on 8x8 via graph-attention coordination. Return conditioning produces a smooth dispatch interface: varying the target return from 100 to -400 trades EV travel time (72.4-138.2 s) against civilian delay (16.8-5.4 s/veh), requiring no retraining. A Constrained DT extension adds explicit civilian disruption budgets as a second control knob.

Fonte: arXiv cs.LG

RLScore 85

Minibal: Balanced Game-Playing Without Opponent Modeling

arXiv:2603.23059v1 Announce Type: new Abstract: Recent advances in game AI, such as AlphaZero and Ath\'enan, have achieved superhuman performance across a wide range of board games. While highly powerful, these agents are ill-suited for human-AI interaction, as they consistently overwhelm human players, offering little enjoyment and limited educational value. This paper addresses the problem of balanced play, in which an agent challenges its opponent without either dominating or conceding. We introduce Minibal (Minimize & Balance), a variant of Minimax specifically designed for balanced play. Building on this concept, we propose several modifications of the Unbounded Minimax algorithm explicitly aimed at discovering balanced strategies. Experiments conducted across seven board games demonstrate that one variant consistently achieves the most balanced play, with average outcomes close to perfect balance. These results establish Minibal as a promising foundation for designing AI agents that are both challenging and engaging, suitable for both entertainment and serious games.

Fonte: arXiv cs.AI

RLScore 85

CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models

arXiv:2603.22846v1 Announce Type: new Abstract: Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language-specified target. Yet most existing methods rely on single-agent imitation learning, suffering from costly expert data and limited generalization due to static training environments. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference-resilient strategies. We further introduce CoMaTrack-Bench, the first benchmark for competitive EVT, featuring game scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interactions. Experiments show that CoMaTrack achieves state-of-the-art results on both standard benchmarks and CoMaTrack-Bench. Notably, a 3B VLM trained with our framework surpasses previous single-agent imitation learning methods based on 7B models on the challenging EVT-Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be available at https://github.com/wlqcode/CoMaTrack-Bench

Fonte: arXiv cs.AI

RLScore 85

Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models

arXiv:2603.23149v1 Announce Type: new Abstract: Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for failure prevention. We show that a trained policy's latent state, combined with its planned actions, already encodes sufficient information to anticipate action outcomes, making visual simulation redundant for failure prevention. To this end, we introduce DILLO (DIstiLLed Language-ActiOn World Model), a fast steering layer that shifts the paradigm from "simulate-then-act" to "describe-then-act." DILLO is trained via cross-modal distillation, where a privileged Vision Language Model teacher annotates offline trajectories and a latent-conditioned Large Language Model student learns to predict semantic outcomes. This creates a text-only inference path, bypassing heavy visual generation entirely, achieving a 14x speedup over baselines. Experiments on MetaWorld and LIBERO demonstrate that DILLO produces high-fidelity descriptions of the next state and is able to steer the policy, improving episode success rate by up to 15 pp and 9.3 pp on average across tasks.

Fonte: arXiv cs.AI

RLScore 85

Off-Policy Evaluation and Learning for Survival Outcomes under Censoring

arXiv:2603.22900v1 Announce Type: cross Abstract: Optimizing survival outcomes, such as patient survival or customer retention, is a critical objective in data-driven decision-making. Off-Policy Evaluation~(OPE) provides a powerful framework for assessing such decision-making policies using logged data alone, without the need for costly or risky online experiments in high-stakes applications. However, typical estimators are not designed to handle right-censored survival outcomes, as they ignore unobserved survival times beyond the censoring time, leading to systematic underestimation of the true policy performance. To address this issue, we propose a novel framework for OPE and Off-Policy Learning~(OPL) tailored for survival outcomes under censoring. Specifically, we introduce IPCW-IPS and IPCW-DR, which employ the Inverse Probability of Censoring Weighting technique to explicitly deal with censoring bias. We theoretically establish that our estimators are unbiased and that IPCW-DR achieves double robustness, ensuring consistency if either the propensity score or the outcome model is correct. Furthermore, we extend this framework to constrained OPL to optimize policy value under budget constraints. We demonstrate the effectiveness of our proposed methods through simulation studies and illustrate their practical impacts using public real-world data for both evaluation and learning tasks.

Fonte: arXiv stat.ML

NLP/LLMsScore 85

Improving Safety Alignment via Balanced Direct Preference Optimization

arXiv:2603.22829v1 Announce Type: new Abstract: With the rapid development and widespread application of Large Language Models (LLMs), their potential safety risks have attracted widespread attention. Reinforcement Learning from Human Feedback (RLHF) has been adopted to enhance the safety performance of LLMs. As a simple and effective alternative to RLHF, Direct Preference Optimization (DPO) is widely used for safety alignment. However, safety alignment still suffers from severe overfitting, which limits its actual performance. This paper revisits the overfitting phenomenon from the perspective of the model's comprehension of the training data. We find that the Imbalanced Preference Comprehension phenomenon exists between responses in preference pairs, which compromises the model's safety performance. To address this, we propose Balanced Direct Preference Optimization (B-DPO), which adaptively modulates optimization strength between preferred and dispreferred responses based on mutual information. A series of experimental results show that B-DPO can enhance the safety capability while maintaining the competitive general capabilities of LLMs on various mainstream benchmarks compared to state-of-the-art methods. \color{red}{Warning: This paper contains examples of harmful texts, and reader discretion is recommended.

Fonte: arXiv cs.AI

RLScore 85

Profit is the Red Team: Stress-Testing Agents in Strategic Economic Interactions

arXiv:2603.20925v1 Announce Type: new Abstract: As agentic systems move into real-world deployments, their decisions increasingly depend on external inputs such as retrieved content, tool outputs, and information provided by other actors. When these inputs can be strategically shaped by adversaries, the relevant security risk extends beyond a fixed library of prompt attacks to adaptive strategies that steer agents toward unfavorable outcomes. We propose profit-driven red teaming, a stress-testing protocol that replaces handcrafted attacks with a learned opponent trained to maximize its profit using only scalar outcome feedback. The protocol requires no LLM-as-judge scoring, attack labels, or attack taxonomy, and is designed for structured settings with auditable outcomes. We instantiate it in a lean arena of four canonical economic interactions, which provide a controlled testbed for adaptive exploitability. In controlled experiments, agents that appear strong against static baselines become consistently exploitable under profit-optimized pressure, and the learned opponent discovers probing, anchoring, and deceptive commitments without explicit instruction. We then distill exploit episodes into concise prompt rules for the agent, which make most previously observed failures ineffective and substantially improve target performance. These results suggest that profit-driven red-team data can provide a practical route to improving robustness in structured agent settings with auditable outcomes.

Fonte: arXiv cs.AI

RLScore 85

MARLIN: Multi-Agent Reinforcement Learning for Incremental DAG Discovery

arXiv:2603.20295v1 Announce Type: new Abstract: Uncovering causal structures from observational data is crucial for understanding complex systems and making informed decisions. While reinforcement learning (RL) has shown promise in identifying these structures in the form of a directed acyclic graph (DAG), existing methods often lack efficiency, making them unsuitable for online applications. In this paper, we propose MARLIN, an efficient multi agent RL based approach for incremental DAG learning. MARLIN uses a DAG generation policy that maps a continuous real valued space to the DAG space as an intra batch strategy, then incorporates two RL agents state specific and state invariant to uncover causal relationships and integrates these agents into an incremental learning framework. Furthermore, the framework leverages a factored action space to enhance parallelization efficiency. Extensive experiments on synthetic and real datasets demonstrate that MARLIN outperforms state of the art methods in terms of both efficiency and effectiveness.

Fonte: arXiv cs.LG

RLScore 85

SLE-FNO: Single-Layer Extensions for Task-Agnostic Continual Learning in Fourier Neural Operators

arXiv:2603.20410v1 Announce Type: new Abstract: Scientific machine learning is increasingly used to build surrogate models, yet most models are trained under a restrictive assumption in which future data follow the same distribution as the training set. In practice, new experimental conditions or simulation regimes may differ significantly, requiring extrapolation and model updates without re-access to prior data. This creates a need for continual learning (CL) frameworks that can adapt to distribution shifts while preventing catastrophic forgetting. Such challenges are pronounced in fluid dynamics, where changes in geometry, boundary conditions, or flow regimes induce non-trivial changes to the solution. Here, we introduce a new architecture-based approach (SLE-FNO) combining a Single-Layer Extension (SLE) with the Fourier Neural Operator (FNO) to support efficient CL. SLE-FNO was compared with a range of established CL methods, including Elastic Weight Consolidation (EWC), Learning without Forgetting (LwF), replay-based approaches, Orthogonal Gradient Descent (OGD), Gradient Episodic Memory (GEM), PiggyBack, and Low-Rank Approximation (LoRA), within an image-to-image regression setting. The models were trained to map transient concentration fields to time-averaged wall shear stress (TAWSS) in pulsatile aneurysmal blood flow. Tasks were derived from 230 computational fluid dynamics simulations grouped into four sequential and out-of-distribution configurations. Results show that replay-based methods and architecture-based approaches (PiggyBack, LoRA, and SLE-FNO) achieve the best retention, with SLE-FNO providing the strongest overall balance between plasticity and stability, achieving accuracy with zero forgetting and minimal additional parameters. Our findings highlight key differences between CL algorithms and introduce SLE-FNO as a promising strategy for adapting baseline models when extrapolation is required.

Fonte: arXiv cs.LG

RLScore 85

Does This Gradient Spark Joy?

arXiv:2603.20526v1 Announce Type: new Abstract: Policy gradient computes a backward pass for every sample, even though the backward pass is expensive and most samples carry little learning value. The Delightful Policy Gradient (DG) provides a forward-pass signal of learning value: \emph{delight}, the product of advantage and surprisal (negative log-probability). We introduce the \emph{Kondo gate}, which compares delight against a compute price and pays for a backward pass only when the sample is worth it, thereby tracing a quality--cost Pareto frontier. In bandits, zero-price gating preserves useful gradient signal while removing perpendicular noise, and delight is a more reliable screening signal than additive combinations of value and surprise. On MNIST and transformer token reversal, the Kondo gate skips most backward passes while retaining nearly all of DG's learning quality, with gains that grow as problems get harder and backward passes become more expensive. Because the gate tolerates approximate delight, a cheap forward pass can screen samples before expensive backpropagation, suggesting a speculative-decoding-for-training paradigm.

Fonte: arXiv cs.LG

Theory/OptimizationScore 85

Bounded Coupled AI Learning Dynamics in Tri-Hierarchical Drone Swarms

arXiv:2603.20333v1 Announce Type: new Abstract: Modern autonomous multi-agent systems combine heterogeneous learning mechanisms operating at different timescales. An open question remains: can one formally guarantee that coupled dynamics of such mechanisms stay within the admissible operational regime? This paper studies a tri-hierarchical swarm learning system where three mechanisms act simultaneously: (1) local Hebbian online learning at individual agent level (fast timescale, 10-100 ms); (2) multi-agent reinforcement learning (MARL) for tactical group coordination (medium timescale, 1-10 s); (3) meta-learning (MAML) for strategic adaptation (slow timescale, 10-100 s). Four results are established. The Bounded Total Error Theorem shows that under contractual constraints on learning rates, Lipschitz continuity of inter-level mappings, and weight stabilization, total suboptimality admits a component-wise upper bound uniform in time. The Bounded Representation Drift Theorem gives a worst-case estimate of how Hebbian updates affect coordination-level embeddings during one MARL cycle. The Meta-Level Compatibility Theorem provides sufficient conditions under which strategic adaptation preserves lower-level invariants. The Non-Accumulation Theorem proves that error does not grow unboundedly over time.

Fonte: arXiv cs.LG

RLScore 85

Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret

arXiv:2603.20453v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically \emph{multi-source} (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from \emph{multi-source imperfect preferences} through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most $\omega$ over $K$ episodes. We propose a unified algorithm with regret $\tilde{O}(\sqrt{K/M}+\omega)$, which exhibits a best-of-both-regimes behavior: it achieves $M$-dependent statistical gains when imperfection is small (where $M$ is the number of sources), while remaining robust with unavoidable additive dependence on $\omega$ when imperfection is large. We complement this with a lower bound $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$, which captures the best possible improvement with respect to $M$ and the unavoidable dependence on $\omega$, and a counterexample showing that na\"ively treating imperfect feedback as as oracle-consistent can incur regret as large as $\tilde{\Omega}(\min\{\omega\sqrt{K},K\})$. Technically, our approach involves imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable, yielding regret guarantees that quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.

Fonte: arXiv cs.LG

RLScore 85

CAMA: Exploring Collusive Adversarial Attacks in c-MARL

arXiv:2603.20390v1 Announce Type: new Abstract: Cooperative multi-agent reinforcement learning (c-MARL) has been widely deployed in real-world applications, such as social robots, embodied intelligence, UAV swarms, etc. Nevertheless, many adversarial attacks still exist to threaten various c-MARL systems. At present, the studies mainly focus on single-adversary perturbation attacks and white-box adversarial attacks that manipulate agents' internal observations or actions. To address these limitations, we in this paper attempt to study collusive adversarial attacks through strategically organizing a set of malicious agents into three collusive attack modes: Collective Malicious Agents, Disguised Malicious Agents, and Spied Malicious Agents. Three novelties are involved: i) three collusive adversarial attacks are creatively proposed for the first time, and a unified framework CAMA for policy-level collusive attacks is designed; ii) the attack effectiveness is theoretically analyzed from the perspectives of disruptiveness, stealthiness, and attack cost; and iii) the three collusive adversarial attacks are technically realized through agent's observation information fusion, attack-trigger control. Finally, multi-facet experiments on four SMAC II maps are performed, and experimental results showcase the three collusive attacks have an additive adversarial synergy, strengthening attack outcome while maintaining high stealthiness and stability over long horizons. Our work fills the gap for collusive adversarial learning in c-MARL.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

A Multihead Continual Learning Framework for Fine-Grained Fashion Image Retrieval with Contrastive Learning and Exponential Moving Average Distillation

arXiv:2603.20648v1 Announce Type: new Abstract: Most fine-grained fashion image retrieval (FIR) methods assume a static setting, requiring full retraining when new attributes appear, which is costly and impractical for dynamic scenarios. Although pretrained models support zero-shot inference, their accuracy drops without supervision, and no prior work explores class-incremental learning (CIL) for fine-grained FIR. We propose a multihead continual learning framework for fine-grained fashion image retrieval with contrastive learning and exponential moving average (EMA) distillation (MCL-FIR). MCL-FIR adopts a multi-head design to accommodate evolving classes across increments, reformulates triplet inputs into doublets with InfoNCE for simpler and more effective training, and employs EMA distillation for efficient knowledge transfer. Experiments across four datasets demonstrate that, beyond its scalability, MCL-FIR achieves a strong balance between efficiency and accuracy. It significantly outperforms CIL baselines under similar training cost, and compared with static methods, it delivers comparable performance while using only about 30% of the training cost. The source code is publicly available in https://github.com/Dr-LingXiao/MCL-FIR.

Fonte: arXiv cs.CV

RLScore 85

RLVR Training of LLMs Does Not Improve Thinking Ability for General QA: Evaluation Method and a Simple Solution

arXiv:2603.20799v1 Announce Type: new Abstract: Reinforcement learning from verifiable rewards (RLVR) stimulates the thinking processes of large language models (LLMs), substantially enhancing their reasoning abilities on verifiable tasks. It is often assumed that similar gains should transfer to general question answering (GQA), but this assumption has not been thoroughly validated. To assess whether RLVR automatically improves LLM performance on GQA, we propose a Cross-Generation evaluation framework that measures the quality of intermediate reasoning by feeding the generated thinking context into LLMs of varying capabilities. Our evaluation leads to a discouraging finding: the efficacy of the thinking process on GQA tasks is markedly lower than on verifiable tasks, suggesting that explicit training on GQA remains necessary in addition to training on verifiable tasks. We further observe that direct RL training on GQA is less effective than RLVR. Our hypothesis is that, whereas verifiable tasks demand robust logical chains to obtain high rewards, GQA tasks often admit shortcuts to high rewards without cultivating high-quality thinking. To avoid possible shortcuts, we introduce a simple method, Separated Thinking And Response Training (START), which first trains only the thinking process, using rewards defined on the final answer. We show that START improves both the quality of thinking and the final answer across several GQA benchmarks and RL algorithms.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models

arXiv:2603.20212v1 Announce Type: new Abstract: Reward models (RMs) are critical for aligning Large Language Models via Reinforcement Learning from Human Feedback (RLHF). While Generative Reward Models (GRMs) achieve superior accuracy through chain-of-thought (CoT) reasoning, they incur substantial computational costs. Conversely, Scalar Reward Models (SRMs) offer efficiency but suffer from limited performance and adaptability in complex scenarios. We introduce Fast-Slow Thinking Reward Models (F/S-RM), a hybrid RM architecture inspired by Dual Process Theory. It trains a single model to integrate two distinct reward paradigms: first-token prediction as a scalar score (fast thinking) and CoT-based judgment (slow thinking), regulated by a dual-confidence activation mechanism that determines when to activate slow thinking. F/S-RM achieves a 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8%. Code and data will be publicly available.

Fonte: arXiv cs.CL

RLScore 85

Understanding Behavior Cloning with Action Quantization

arXiv:2603.20538v1 Announce Type: new Abstract: Behavior cloning is a fundamental paradigm in machine learning, enabling policy learning from expert demonstrations across robotics, autonomous driving, and generative models. Autoregressive models like transformer have proven remarkably effective, from large language models (LLMs) to vision-language-action systems (VLAs). However, applying autoregressive models to continuous control requires discretizing actions through quantization, a practice widely adopted yet poorly understood theoretically. This paper provides theoretical foundations for this practice. We analyze how quantization error propagates along the horizon and interacts with statistical sample complexity. We show that behavior cloning with quantized actions and log-loss achieves optimal sample complexity, matching existing lower bounds, and incurs only polynomial horizon dependence on quantization error, provided the dynamics are stable and the policy satisfies a probabilistic smoothness condition. We further characterize when different quantization schemes satisfy or violate these requirements, and propose a model-based augmentation that provably improves the error bound without requiring policy smoothness. Finally, we establish fundamental limits that jointly capture the effects of quantization error and statistical complexity.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

SymCircuit: Bayesian Structure Inference for Tractable Probabilistic Circuits via Entropy-Regularized Reinforcement Learning

arXiv:2603.20392v1 Announce Type: new Abstract: Probabilistic circuit (PC) structure learning is hampered by greedy algorithms that make irreversible, locally optimal decisions. We propose SymCircuit, which replaces greedy search with a learned generative policy trained via entropy-regularized reinforcement learning. Instantiating the RL-as-inference framework in the PC domain, we show the optimal policy is a tempered Bayesian posterior, recovering the exact posterior when the regularization temperature is set inversely proportional to the dataset size. The policy is implemented as SymFormer, a grammar-constrained autoregressive Transformer with tree-relative self-attention that guarantees valid circuits at every generation step. We introduce option-level REINFORCE, restricting gradient updates to structural decisions rather than all tokens, yielding an SNR (signal to noise ratio) improvement and >10 times sample efficiency gain on the NLTCS dataset. A three-layer uncertainty decomposition (structural via model averaging, parametric via the delta method, leaf via conjugate Dirichlet-Categorical propagation) is grounded in the multilinear polynomial structure of PC outputs. On NLTCS, SymCircuit closes 93% of the gap to LearnSPN; preliminary results on Plants (69 variables) suggest scalability.

Fonte: arXiv cs.LG

NLP/LLMsScore 95

Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs

arXiv:2603.20698v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing the hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.

Fonte: arXiv cs.CV

Theory/OptimizationScore 85

Constrained Online Convex Optimization with Memory and Predictions

arXiv:2603.21375v1 Announce Type: cross Abstract: We study Constrained Online Convex Optimization with Memory (COCO-M), where both the loss and the constraints depend on a finite window of past decisions made by the learner. This setting extends the previously studied unconstrained online optimization with memory framework and captures practical problems such as the control of constrained dynamical systems and scheduling with reconfiguration budgets. For this problem, we propose the first algorithms that achieve sublinear regret and sublinear cumulative constraint violation under time-varying constraints, both with and without predictions of future loss and constraint functions. Without predictions, we introduce an adaptive penalty approach that guarantees sublinear regret and constraint violation. When short-horizon and potentially unreliable predictions are available, we reinterpret the problem as online learning with delayed feedback and design an optimistic algorithm whose performance improves as prediction accuracy improves, while remaining robust when predictions are inaccurate. Our results bridge the gap between classical constrained online convex optimization and memory-dependent settings, and provide a versatile learning toolbox with diverse applications.

Fonte: arXiv stat.ML

NLP/LLMsScore 85

Knowledge Boundary Discovery for Large Language Models

arXiv:2603.21022v1 Announce Type: new Abstract: We propose Knowledge Boundary Discovery (KBD), a reinforcement learning based framework to explore the knowledge boundaries of the Large Language Models (LLMs). We define the knowledge boundary by automatically generating two types of questions: (i) those the LLM can confidently answer (within-knowledge boundary) and (ii) those it cannot (beyond-knowledge boundary). Iteratively exploring and exploiting the LLM's responses to find its knowledge boundaries is challenging because of the hallucination phenomenon. To find the knowledge boundaries of an LLM, the agent interacts with the LLM under the modeling of exploring a partially observable environment. The agent generates a progressive question as the action, adopts an entropy reduction as the reward, receives the LLM's response as the observation and updates its belief states. We demonstrate that the KBD detects knowledge boundaries of LLMs by automatically finding a set of non-trivial answerable and unanswerable questions. We validate the KBD by comparing its generated knowledge boundaries with manually crafted LLM benchmark datasets. Experiments show that our KBD-generated question set is comparable to the human-generated datasets. Our approach paves a new way to evaluate LLMs.

Fonte: arXiv cs.AI

RLScore 85

Delightful Distributed Policy Gradient

arXiv:2603.20521v1 Announce Type: new Abstract: Distributed reinforcement learning trains on data from stale, buggy, or mismatched actors, producing actions with high surprisal (negative log-probability) under the learner's policy. The core difficulty is not surprising data per se, but \emph{negative learning from surprising data}. High-surprisal failures can dominate the update direction despite carrying little useful signal, while high-surprisal successes reveal opportunities the current policy would otherwise miss. The \textit{Delightful Policy Gradient} (DG) separates these cases by gating each update with delight, the product of advantage and surprisal, suppressing rare failures and amplifying rare successes without behavior probabilities. Under contaminated sampling, the cosine similarity between the standard policy gradient and the true gradient collapses, while DG's grows as the policy improves. No sign-blind reweighting, including exact importance sampling, can reproduce this effect. On MNIST with simulated staleness, DG without off-policy correction outperforms importance-weighted PG with exact behavior probabilities. On a transformer sequence task with staleness, actor bugs, reward corruption, and rare discovery, DG achieves roughly $10{\times}$ lower error. When all four frictions act simultaneously, its compute advantage is order-of-magnitude and grows with task complexity.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

AI-Driven Multi-Agent Simulation of Stratified Polyamory Systems: A Computational Framework for Optimizing Social Reproductive Efficiency

arXiv:2603.20678v1 Announce Type: new Abstract: Contemporary societies face a severe crisis of demographic reproduction. Global fertility rates continue to decline precipitously, with East Asian nations exhibiting the most dramatic trends -- China's total fertility rate (TFR) fell to approximately 1.0 in 2023, while South Korea's dropped below 0.72. Simultaneously, the institution of marriage is undergoing structural disintegration: educated women rationally reject unions lacking both emotional fulfillment and economic security, while a growing proportion of men at the lower end of the socioeconomic spectrum experience chronic sexual deprivation, anxiety, and learned helplessness. This paper proposes a computational framework for modeling and evaluating a Stratified Polyamory System (SPS) using techniques from agent-based modeling (ABM), multi-agent reinforcement learning (MARL), and large language model (LLM)-empowered social simulation. The SPS permits individuals to maintain a limited number of legally recognized secondary partners in addition to one primary spouse, combined with socialized child-rearing and inheritance reform. We formalize the A/B/C stratification as heterogeneous agent types in a multi-agent system and model the matching process as a MARL problem amenable to Proximal Policy Optimization (PPO). The mating network is analyzed using graph neural network (GNN) representations. Drawing on evolutionary psychology, behavioral ecology, social stratification theory, computational social science, algorithmic fairness, and institutional economics, we argue that SPS can improve aggregate social welfare in the Pareto sense. Preliminary computational results demonstrate the framework's viability in addressing the dual crisis of female motherhood penalties and male sexlessness, while offering a non-violent mechanism for wealth dispersion analogous to the historical Chinese Grace Decree (Tui'en Ling).

Fonte: arXiv cs.AI

Theory/OptimizationScore 85

The Intelligent Disobedience Game: Formulating Disobedience in Stackelberg Games and Markov Decision Processes

arXiv:2603.20994v1 Announce Type: new Abstract: In shared autonomy, a critical tension arises when an automated assistant must choose between obeying a human's instruction and deliberately overriding it to prevent harm. This safety-critical behavior is known as intelligent disobedience. To formalize this dynamic, this paper introduces the Intelligent Disobedience Game (IDG), a sequential game-theoretic framework based on Stackelberg games that models the interaction between a human leader and an assistive follower operating under asymmetric information. It characterizes optimal strategies for both agents across multi-step scenarios, identifying strategic phenomena such as ``safety traps,'' where the system indefinitely avoids harm but fails to achieve the human's goal. The IDG provides a needed mathematical foundation that enables both the algorithmic development of agents that can learn safe non-compliance and the empirical study of how humans perceive and trust disobedient AI. The paper further translates the IDG into a shared control Multi-Agent Markov Decision Process representation, forming a compact computational testbed for training reinforcement learning agents.

Fonte: arXiv cs.AI

MultimodalScore 85

RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

arXiv:2603.21341v1 Announce Type: new Abstract: Improving embodied reasoning in multimodal-large-language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them to readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through supervision of vision-question-answering type. However, these approaches have been reported to result in unstable VLA performance, often yielding only marginal or even negative gains. In this paper, we propose a more systematic MLLM training framework RoboAlign that reliably improves VLA performance. Our key idea is to sample action tokens via zero-shot natural language reasoning and refines this reasoning using reinforcement learning (RL) to improve action accuracy. As a result, RoboAlign bridges the modality gap between language and low-level actions in MLLMs, and facilitate knowledge transfer from MLLM to VLA. To validate the effectiveness of RoboAlign, we train VLAs by adding a diffusion-based action head on top of an MLLM backbone and evaluate them on major robotics benchmarks. Remarkably, by performing RL-based alignment after SFT using less than 1\% of the data, RoboAlign achieves performance improvements of 17.5\%, 18.9\%, and 106.6\% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection

arXiv:2603.20403v1 Announce Type: new Abstract: Adapting models pre-trained on large-scale datasets is a proven way to reach strong performance quickly for down-stream tasks. However, the growth of state-of-the-art mod-els makes traditional full fine-tuning unsuitable and difficult, especially for multi-task learning (MTL) where cost scales with the number of tasks. As a result, recent studies investigate parameter-efficient fine-tuning (PEFT) using low-rank adaptation to significantly reduce the number of trainable parameters. However, these existing methods use a single, fixed rank, which may not be optimal for differ-ent tasks or positions in the MTL architecture. Moreover, these methods fail to learn spatial information that cap-tures inter-task relationships and helps to improve diverse task predictions. This paper introduces Frequency-Aware and Automatic Rank (FAAR) for efficient MTL fine-tuning. Our method introduces Performance-Driven Rank Shrink-ing (PDRS) to allocate the optimal rank per adapter location and per task. Moreover, by analyzing the image frequency spectrum, FAAR proposes a Task-Spectral Pyramidal Decoder (TS-PD) that injects input-specific context into spatial bias learning to better reflect cross-task relationships. Experiments performed on dense visual task benchmarks show the superiority of our method in terms of both accuracy and efficiency compared to other PEFT methods in MTL. FAAR reduces the number of parameters by up to 9 times compared to traditional MTL fine-tuning whilst improving overall performance. Our code is available.

Fonte: arXiv cs.CV

RLScore 85

Bayesian Learning in Episodic Zero-Sum Games

arXiv:2603.20604v1 Announce Type: new Abstract: We study Bayesian learning in episodic, finite-horizon zero-sum Markov games with unknown transition and reward models. We investigate a posterior algorithm in which each player maintains a Bayesian posterior over the game model, independently samples a game model at the beginning of each episode, and computes an equilibrium policy for the sampled model. We analyze two settings: (i) Both players use the posterior sampling algorithm, and (ii) Only one player uses posterior sampling while the opponent follows an arbitrary learning algorithm. In each setting, we provide guarantees on the expected regret of the posterior sampling agent. Our notion of regret compares the expected total reward of the learning agent against the expected total reward under equilibrium policies of the true game. Our main theoretical result is an expected regret bound for the posterior sampling agent of order $O(HS\sqrt{ABHK\log(SABHK)})$ where $K$ is the number of episodes, $H$ is the episode length, $S$ is the number of states, and $A,B$ are the action space sizes of the two players. Experiments in a grid-world predator--prey domain illustrate the sublinear regret scaling and show that posterior sampling competes favorably with a fictitious-play baseline.

Fonte: arXiv cs.LG

NLP/LLMsScore 92

LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

arXiv:2603.21065v1 Announce Type: new Abstract: We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of- Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. Additionally, we also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using only 72 inference budget per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Agentic AI and the next intelligence explosion

arXiv:2603.20639v1 Announce Type: new Abstract: The "AI singularity" is often miscast as a monolithic, godlike mind. Evolution suggests a different path: intelligence is fundamentally plural, social, and relational. Recent advances in agentic AI reveal that frontier reasoning models, such as DeepSeek-R1, do not improve simply by "thinking longer". Instead, they simulate internal "societies of thought," spontaneous cognitive debates that argue, verify, and reconcile to solve complex tasks. Moreover, we are entering an era of human-AI centaurs: hybrid actors where collective agency transcends individual control. Scaling this intelligence requires shifting from dyadic alignment (RLHF) toward institutional alignment. By designing digital protocols, modeled on organizations and markets, we can build a social infrastructure of checks and balances. The next intelligence explosion will not be a single silicon brain, but a complex, combinatorial society specializing and sprawling like a city. No mind is an island.

Fonte: arXiv cs.AI

RLScore 85

DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving

arXiv:2603.19675v1 Announce Type: new Abstract: Recently, world models have been incorporated into the autonomous driving systems to improve the planning reliability. Existing approaches typically predict future states through appearance generation or deterministic regression, which limits their ability to capture trajectory-conditioned scene evolution and leads to unreliable action planning. To address this, we propose DynFlowDrive, a latent world model that leverages flow-based dynamics to model the transition of world states under different driving actions. By adopting the rectifiedflow formulation, the model learns a velocity field that describes how the scene state changes under different driving actions, enabling progressive prediction of future latent states. Building upon this, we further introduce a stability-aware multi-mode trajectory selection strategy that evaluates candidate trajectories according to the stability of the induced scene transitions. Extensive experiments on the nuScenes and NavSim benchmarks demonstrate consistent improvements across diverse driving frameworks without introducing additional inference overhead. Source code will be abaliable at https://github.com/xiaolul2/DynFlowDrive.

Fonte: arXiv cs.CV

MultimodalScore 85

ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

arXiv:2603.19466v1 Announce Type: new Abstract: Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we found that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, even generalizing to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.

Fonte: arXiv cs.CV

RLScore 85

Near-Equivalent Q-learning Policies for Dynamic Treatment Regimes

arXiv:2603.19440v1 Announce Type: new Abstract: Precision medicine aims to tailor therapeutic decisions to individual patient characteristics. This objective is commonly formalized through dynamic treatment regimes, which use statistical and machine learning methods to derive sequential decision rules adapted to evolving clinical information. In most existing formulations, these approaches produce a single optimal treatment at each stage, leading to a unique decision sequence. However, in many clinical settings, several treatment options may yield similar expected outcomes, and focusing on a single optimal policy may conceal meaningful alternatives. We extend the Q-learning framework for retrospective data by introducing a worst-value tolerance criterion controlled by a hyperparameter $\varepsilon$, which specifies the maximum acceptable deviation from the optimal expected value. Rather than identifying a single optimal policy, the proposed approach constructs sets of $\varepsilon$-optimal policies whose performance remains within a controlled neighborhood of the optimum. This formulation shifts Q-learning from a vector-valued representation to a matrix-valued one, allowing multiple admissible value functions to coexist during backward recursion. The approach yields families of near-equivalent treatment strategies and explicitly identifies regions of treatment indifference where several decisions achieve comparable outcomes. We illustrate the framework in two settings: a single-stage problem highlighting indifference regions around the decision boundary, and a multi-stage decision process based on a simulated oncology model describing tumor size and treatment toxicity dynamics.

Fonte: arXiv stat.ML

RLScore 85

DeepStock: Reinforcement Learning with Policy Regularizations for Inventory Management

arXiv:2603.19621v1 Announce Type: new Abstract: Deep Reinforcement Learning (DRL) provides a general-purpose methodology for training inventory policies that can leverage big data and compute. However, off-the-shelf implementations of DRL have seen mixed success, often plagued by high sensitivity to the hyperparameters used during training. In this paper, we show that by imposing policy regularizations, grounded in classical inventory concepts such as "Base Stock", we can significantly accelerate hyperparameter tuning and improve the final performance of several DRL methods. We report details from a 100% deployment of DRL with policy regularizations on Alibaba's e-commerce platform, Tmall. We also include extensive synthetic experiments, which show that policy regularizations reshape the narrative on what is the best DRL method for inventory management.

Fonte: arXiv cs.LG

RLScore 85

PA2D-MORL: Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning

arXiv:2603.19579v1 Announce Type: new Abstract: Multi-objective reinforcement learning (MORL) provides an effective solution for decision-making problems involving conflicting objectives. However, achieving high-quality approximations to the Pareto policy set remains challenging, especially in complex tasks with continuous or high-dimensional state-action space. In this paper, we propose the Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning (PA2D-MORL) method, which constructs an efficient scheme for multi-objective problem decomposition and policy improvement, leading to a superior approximation of Pareto policy set. The proposed method leverages Pareto ascent direction to select the scalarization weights and computes the multi-objective policy gradient, which determines the policy optimization direction and ensures joint improvement on all objectives. Meanwhile, multiple policies are selectively optimized under an evolutionary framework to approximate the Pareto frontier from different directions. Additionally, a Pareto adaptive fine-tuning approach is applied to enhance the density and spread of the Pareto frontier approximation. Experiments on various multi-objective robot control tasks show that the proposed method clearly outperforms the current state-of-the-art algorithm in terms of both quality and stability of the outcomes.

Fonte: arXiv cs.AI

RLScore 85

Heavy-Tailed and Long-Range Dependent Noise in Stochastic Approximation: A Finite-Time Analysis

arXiv:2603.19648v1 Announce Type: new Abstract: Stochastic approximation (SA) is a fundamental iterative framework with broad applications in reinforcement learning and optimization. Classical analyses typically rely on martingale difference or Markov noise with bounded second moments, but many practical settings, including finance and communications, frequently encounter heavy-tailed and long-range dependent (LRD) noise. In this work, we study SA for finding the root of a strongly monotone operator under these non-classical noise models. We establish the first finite-time moment bounds in both settings, providing explicit convergence rates that quantify the impact of heavy tails and temporal dependence. Our analysis employs a noise-averaging argument that regularizes the impact of noise without modifying the iteration. Finally, we apply our general framework to stochastic gradient descent (SGD) and gradient play, and corroborate our finite-time analysis through numerical experiments.

Fonte: arXiv cs.LG

Theory/OptimizationScore 85

Deep Hilbert--Galerkin Methods for Infinite-Dimensional PDEs and Optimal Control

arXiv:2603.19463v1 Announce Type: new Abstract: We develop deep learning-based approximation methods for fully nonlinear second-order PDEs on separable Hilbert spaces, such as HJB equations for infinite-dimensional control, by parameterizing solutions via Hilbert--Galerkin Neural Operators (HGNOs). We prove the first Universal Approximation Theorems (UATs) which are sufficiently powerful to address these problems, based on novel topologies for Hessian terms and corresponding novel continuity assumptions on the fully nonlinear operator. These topologies are non-sequential and non-metrizable, making the problem delicate. In particular, we prove UATs for functions on Hilbert spaces, together with their Fr\'echet derivatives up to second order, and for unbounded operators applied to the first derivative, ensuring that HGNOs are able to approximate all the PDE terms. For control problems, we further prove UATs for optimal feedback controls in terms of our approximating value function HGNO. We develop numerical training methods, which we call Deep Hilbert--Galerkin and Hilbert Actor-Critic (reinforcement learning) Methods, for these problems by minimizing the $L^2_\mu(H)$-norm of the residual of the PDE on the whole Hilbert space, not just a projected PDE to finite dimensions. This is the first paper to propose such an approach. The models considered arise in many applied sciences, such as functional differential equations in physics and Kolmogorov and HJB PDEs related to controlled PDEs, SPDEs, path-dependent systems, partially observed stochastic systems, and mean-field SDEs. We numerically solve examples of Kolmogorov and HJB PDEs related to the optimal control of deterministic and stochastic heat and Burgers' equations, demonstrating the promise of our deep learning-based approach.

Fonte: arXiv cs.LG

VisionScore 85

Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification

arXiv:2603.19678v1 Announce Type: new Abstract: Lifelong person re-identification (LReID) aims to learn from varying domains to obtain a unified person retrieval model. Existing LReID approaches typically focus on learning from scratch or a visual classification-pretrained model, while the Vision-Language Model (VLM) has shown generalizable knowledge in a variety of tasks. Although existing methods can be directly adapted to the VLM, since they only consider global-aware learning, the fine-grained attribute knowledge is underleveraged, leading to limited acquisition and anti-forgetting capacity. To address this problem, we introduce a novel VLM-driven LReID approach named Vision-Language Attribute Disentanglement and Reinforcement (VLADR). Our key idea is to explicitly model the universally shared human attributes to improve inter-domain knowledge transfer, thereby effectively utilizing historical knowledge to reinforce new knowledge learning and alleviate forgetting. Specifically, VLADR includes a Multi-grain Text Attribute Disentanglement mechanism that mines the global and diverse local text attributes of an image. Then, an Inter-domain Cross-modal Attribute Reinforcement scheme is developed, which introduces cross-modal attribute alignment to guide visual attribute extraction and adopts inter-domain attribute alignment to achieve fine-grained knowledge transfer. Experimental results demonstrate that our VLADR outperforms the state-of-the-art methods by 1.9\%-2.2\% and 2.1\%-2.5\% on anti-forgetting and generalization capacity. Our source code is available at https://github.com/zhoujiahuan1991/CVPR2026-VLADR

Fonte: arXiv cs.CV

NLP/LLMsScore 90

A Subgoal-driven Framework for Improving Long-Horizon LLM Agents

arXiv:2603.19685v1 Announce Type: new Abstract: Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation, for example, requires handling dynamic content and long sequences of actions, making it particularly challenging. Existing LLM-based agents struggle with long-horizon planning in two main ways. During online execution, they often lose track as new information arrives, lacking a clear and adaptive path toward the final goal. This issue is further exacerbated during reinforcement learning (RL) fine-tuning, where sparse and delayed rewards make it difficult for agents to identify which actions lead to success, preventing them from maintaining coherent reasoning over extended tasks. To address these challenges, we propose two contributions. First, we introduce an agent framework that leverages proprietary models for online planning through subgoal decomposition. Second, we present MiRA (Milestoning your Reinforcement Learning Enhanced Agent), an RL training framework that uses dense, milestone-based reward signals. The real-time planning mechanism improves proprietary models such as Gemini by approximately a 10% absolute increase in success rate (SR) on the WebArena-Lite benchmark. Meanwhile, applying MiRA to the open Gemma3-12B model increases its success rate from 6.4% to 43.0%. This performance surpasses proprietary systems such as GPT-4-Turbo (17.6%) and GPT-4o (13.9%), as well as the previous open-model state of the art, WebRL (38.4%). Overall, our findings demonstrate that combining explicit inference-time planning with milestone-based rewards significantly improves an agent's long-horizon capabilities, paving the way for more robust and general-purpose autonomous systems.

Fonte: arXiv cs.AI

RLScore 85

Stochastic Sequential Decision Making over Expanding Networks with Graph Filtering

arXiv:2603.19501v1 Announce Type: new Abstract: Graph filters leverage topological information to process networked data with existing methods mainly studying fixed graphs, ignoring that graphs often expand as nodes continually attach with an unknown pattern. The latter requires developing filter-based decision-making paradigms that take evolution and uncertainty into account. Existing approaches rely on either pre-designed filters or online learning, limited to a myopic view considering only past or present information. To account for future impacts, we propose a stochastic sequential decision-making framework for filtering networked data with a policy that adapts filtering to expanding graphs. By representing filter shifts as agents, we model the filter as a multi-agent system and train the policy following multi-agent reinforcement learning. This accounts for long-term rewards and captures expansion dynamics through sequential decision-making. Moreover, we develop a context-aware graph neural network to parameterize the policy, which tunes filter parameters based on information of both the graph and agents. Experiments on synthetic and real datasets from cold-start recommendation to COVID prediction highlight the benefits of using a sequential decision-making perspective over batch and online filtering alternatives.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL

arXiv:2603.19470v1 Announce Type: new Abstract: Off-policy problems such as policy staleness and training-inference mismatch, has become a major bottleneck for training stability and further exploration for LLM RL. To enhance inference efficiency, the distribution gap between the inference and updated policy grows, leading to heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates sharp gradients and can push updates outside the trust region. To address this, we propose Adaptive Layerwise Perturbation(ALP) by injecting small learnable perturbations into input hidden states of each layer during updates, which is used as the numerator of the importance ratio against the unchanged inference policy in the objective. Intuitively, by adding controlled noise to intermediate representations, ALP prevents the updated policy from deviating too sharply from the inference policy, and enlarges the policy family to cover the inference policy family with mismatch noises. Hence, the flattened distribution can naturally tighten the updated and inference policy gap and reduce the tail of importance ratios, thus maintaining training stability. This is further validated empirically. Experiments on single-turn math and multi-turn tool-integrated reasoning tasks show that ALP not only improves final performance, but also avoid blow up of importance ratio tail and KL spikes during iterative training, along with boosted exploration. Ablations show that representation-level perturbations across all layers are most effective, substantially outperforming partial-layer and logits-only variants.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

LARFT: Closing the Cognition-Action Gap for Length Instruction Following in Large Language Models

arXiv:2603.19255v1 Announce Type: cross Abstract: Despite the strong performance of Large Language Models (LLMs) on complex instruction-following tasks, precise control of output length remains a persistent challenge. Existing methods primarily attempt to enforce length constraints by externally imposing length signals or optimization objectives, while largely overlooking the underlying limitation: the model's intrinsic deficit in length cognition. To address this, we propose LARFT (Length-Aware Reinforcement Fine-Tuning), a training framework that aligns the model's length cognition with its action. Specifically, LARFT integrates length-oriented reinforcement learning with a hindsight length awareness. By transforming on-policy data into hindsight self-awareness tasks where the model learns to identify the actual length of its own generation, LARFT jointly optimizes the model's internal representation of length information and refines its policy to satisfy length constraints, thereby achieving precise and reliable length instruction following. Extensive experiments across four base models demonstrate that LARFT outperforms existing baselines, achieving an average improvement of +20.92 points across three length instruction following benchmarks with only a marginal decline of -1.45 points on four general capability benchmarks.

Fonte: arXiv cs.AI

RLScore 85

Optimizing Resource-Constrained Non-Pharmaceutical Interventions for Multi-Cluster Outbreak Control Using Hierarchical Reinforcement Learning

arXiv:2603.19397v1 Announce Type: new Abstract: Non-pharmaceutical interventions (NPIs), such as diagnostic testing and quarantine, are crucial for controlling infectious disease outbreaks but are often constrained by limited resources, particularly in early outbreak stages. In real-world public health settings, resources must be allocated across multiple outbreak clusters that emerge asynchronously, vary in size and risk, and compete for a shared resource budget. Here, a cluster corresponds to a group of close contacts generated by a single infected index case. Thus, decisions must be made under uncertainty and heterogeneous demands, while respecting operational constraints. We formulate this problem as a constrained restless multi-armed bandit and propose a hierarchical reinforcement learning framework. A global controller learns a continuous action cost multiplier that adjusts global resource demand, while a generalized local policy estimates the marginal value of allocating resources to individuals within each cluster. We evaluate the proposed framework in a realistic agent-based simulator of SARS-CoV-2 with dynamically arriving clusters. Across a wide range of system scales and testing budgets, our method consistently outperforms RMAB-inspired and heuristic baselines, improving outbreak control effectiveness by 20%-30%. Experiments on up to 40 concurrently active clusters further demonstrate that the hierarchical framework is highly scalable and enables faster decision-making than the RMAB-inspired method.

Fonte: arXiv cs.LG

NLP/LLMsScore 88

PrefPO: Pairwise Preference Prompt Optimization

arXiv:2603.19311v1 Announce Type: new Abstract: Prompt engineering is effective but labor-intensive, motivating automated optimization methods. Existing methods typically require labeled datasets, which are often unavailable, and produce verbose, repetitive prompts. We introduce PrefPO, a minimal prompt optimization approach inspired by reinforcement learning from human feedback (RLHF). Its preference-based approach reduces the need for labeled data and hyperparameter tuning-only a starting prompt and natural language criteria are needed. PrefPO uses an LLM discriminator to express pairwise preferences over model outputs and provide feedback to an LLM optimizer, iteratively improving performance. We evaluate PrefPO on 9 BIG-Bench Hard (BBH) tasks and IFEval-Hard, a newly-curated, challenging subset of IFEval. PrefPO matches or exceeds SOTA methods, including GEPA, MIPRO, and TextGrad, on 6/9 tasks and performs comparably to TextGrad on IFEval-Hard (82.4% vs 84.5%). Unlike other methods, PrefPO can optimize in both labeled and unlabeled settings. Without labels, PrefPO closely matches its labeled performance on 6/9 tasks, proving effective without ground truth. PrefPO also improves prompt hygiene: we find existing methods produce prompts 14.7x their original length or with 34% repetitive content; PrefPO reduces these issues by 3-5x. Furthermore, both LLM and human judges rate PrefPO's prompts higher than TextGrad's. Finally, we identify prompt hacking in prompt optimizers, where methods game evaluation criteria, and find PrefPO is susceptible at half the rate of TextGrad (37% vs 86%), generating fewer brittle, misaligned prompts.

Fonte: arXiv cs.CL

NLP/LLMsScore 92

MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

arXiv:2603.19310v1 Announce Type: new Abstract: Training large language models (LLMs) for complex reasoning via reinforcement learning requires reward labels that specify whether the generated rollouts are correct. However, obtaining reward labels at scale often requires expensive human labeling or time-consuming verification procedures; for instance, evaluating mathematical proofs demands expert review, while open-ended question answering lacks definitive ground truth. When reward labels are limited, the effectiveness of reinforcement learning fine-tuning is constrained by the scarcity of reward labels. We introduce MemReward, a graph-based experience memory framework: an initial LLM policy generates rollouts for each query, each comprising a thinking process and a final answer, and these rollouts are stored as experience memory. Queries, thinking processes, and answers form nodes in a heterogeneous graph with similarity and structural edges; a GNN trained on labeled nodes propagates rewards to unlabeled rollouts during online optimization. Experiments on Qwen2.5-3B and 1.5B across mathematics, question answering, and code generation demonstrate that MemReward, with only 20% labels, achieves 97.3% of Oracle performance on 3B and 96.6% on 1.5B, surpassing Oracle on out-of-domain tasks. Performance scales smoothly with label budget, reaching 99.4% of Oracle at 70% labels.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas

arXiv:2603.19453v1 Announce Type: new Abstract: We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement) comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly cleaning-harvesting tradeoff. Rather than triggering over-optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code at https://github.com/vicgalle/llm-policies-social-dilemmas.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion

arXiv:2603.19266v1 Announce Type: cross Abstract: Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. \underline{\textit{First}}, to address pattern memorization, Explanatory Inversion (EI) generates targeted ``explanatory probes'' that compel the student to articulate the underlying logic behind an answer, rather than just memorizing it. \underline{\textit{Second}}, to improve generalization, Explanatory GRPO (\texttt{EXGRPO}) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma-7b as the student model, our method yields an average \textbf{20.39\%} increase over zero-shot performance and a \textbf{6.02\%} improvement over the state-of-the-art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine-tuning with \textbf{10-25\%} training data) and strong generalization to out-of-distribution tasks. Implementation is released at https://github.com/Zhen-Tan-dmml/ExGRPO.git.

Fonte: arXiv cs.AI

RLScore 85

Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs

arXiv:2603.20046v1 Announce Type: new Abstract: Reinforcement Learning (RL) with rubric-based rewards has recently shown remarkable progress in enhancing general reasoning capabilities of Large Language Models (LLMs), yet still suffers from ineffective exploration confined to curent policy distribution. In fact, RL optimization can be viewed as steering the policy toward an ideal distribution that maximizes the rewards, while effective exploration should align efforts with desired target. Leveraging this insight, we propose HeRL, a Hindsight experience guided Reinforcement Learning framework to bootstrap effective exploration by explicitly telling LLMs the desired behaviors specified in rewards. Concretely, HeRL treats failed trajectories along with their unmet rubrics as hindsight experience, which serves as in-context guidance for the policy to explore desired responses beyond its current distribution. Additionally, we introduce a bonus reward to incentivize responses with greater potential for improvement under such guidance. HeRL facilitates effective learning from desired high quality samples without repeated trial-and-error from scratch, yielding a more accurate estimation of the expected gradient theoretically. Extensive experiments across various benchmarks demonstrate that HeRL achieves superior performance gains over baselines, and can further benefit from experience guided self-improvement at test time. Our code is available at https://github.com/sikelifei/HeRL.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Teaching an Agent to Sketch One Part at a Time

arXiv:2603.19500v1 Announce Type: new Abstract: We develop a method for producing vector sketches one part at a time. To do this, we train a multi-modal language model-based agent using a novel multi-turn process-reward reinforcement learning following supervised fine-tuning. Our approach is enabled by a new dataset we call ControlSketch-Part, containing rich part-level annotations for sketches, obtained using a novel, generic automatic annotation pipeline that segments vector sketches into semantic parts and assigns paths to parts with a structured multi-stage labeling process. Our results indicate that incorporating structured part-level data and providing agent with the visual feedback through the process enables interpretable, controllable, and locally editable text-to-vector sketch generation.

Fonte: arXiv cs.AI

RLScore 85

Discounted Beta--Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

arXiv:2603.18444v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate RLVR from a statistical estimation perspective by modeling rewards as samples drawn from a policy-induced distribution and casting advantage computation as the problem of estimating the reward distribution from finite data. Building on this view, we propose Discounted Beta--Bernoulli (DBB) reward estimation, which leverages historical reward statistics for the non-stationary distribution. Although biased, the resulting estimator exhibits reduced and stable variance, theoretically avoids estimated variance collapse, and achieves lower mean squared error than standard point estimation. Extensive experiments across six in-distribution and three out-of-distribution reasoning benchmarks demonstrate that GRPO with DBB consistently outperforms naive GRPO, achieving average Acc@8 improvements of 3.22/2.42 points in-distribution and 12.49/6.92 points out-of-distribution on the 1.7B and 8B models, respectively, without additional computational cost or memory usage.

Fonte: arXiv cs.LG

VisionScore 85

TexEditor: Structure-Preserving Text-Driven Texture Editing

arXiv:2603.18488v1 Announce Type: new Abstract: Text-guided texture editing aims to modify object appearance while preserving the underlying geometric structure. However, our empirical analysis reveals that even SOTA editing models frequently struggle to maintain structural consistency during texture editing, despite the intended changes being purely appearance-related. Motivated by this observation, we jointly enhance structure preservation from both data and training perspectives, and build TexEditor, a dedicated texture editing model based on Qwen-Image-Edit-2509. Firstly, we construct TexBlender, a high-quality SFT dataset generated with Blender, which provides strong structural priors for a cold start. Sec- ondly, we introduce StructureNFT, a RL-based approach that integrates structure-preserving losses to transfer the structural priors learned during SFT to real-world scenes. Moreover, due to the limited realism and evaluation coverage of existing benchmarks, we introduce TexBench, a general-purpose real-world benchmark for text-guided texture editing. Extensive experiments on existing Blender-based texture benchmarks and our TexBench show that TexEditor consistently outperforms strong baselines such as Nano Banana Pro. In addition, we assess TexEditor on the general purpose benchmark ImgEdit to validate its generalization. Our code and data are available at https://github.com/KlingAIResearch/TexEditor.

Fonte: arXiv cs.CV

NLP/LLMsScore 85

Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

arXiv:2603.18656v1 Announce Type: new Abstract: Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long traces overshadow short but task-critical segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the segment, SCALe-SFT gradually shifts the focus from to throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.

Fonte: arXiv cs.AI

MultimodalScore 85

Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning

arXiv:2603.18662v1 Announce Type: new Abstract: Geometric reasoning inherently requires "thinking with constructions" -- the dynamic manipulation of visual aids to bridge the gap between problem conditions and solutions. However, existing Multimodal Large Language Models (MLLMs) are largely confined to passive inference with static diagrams, lacking the strategic knowledge of when and how to construct effective visual aids. To address this, we present a framework for Visual-Text Interleaved Chain-of-Thought. We first introduce GeoAux-Bench, the first benchmark comprising 4,334 geometry problems that aligns textual construction steps with ground-truth visual updates. Our pilot study reveals two critical insights: (1) interleaved visual-textual aids outperform single-modality counterparts, which cannot losslessly capture geometric synergy; and (2) valid constructions act as entropy reducers, strongly correlating with reduced reasoning perplexity. Building on these findings, we propose Action Applicability Policy Optimization (A2PO), a reinforcement learning paradigm for mastering strategic construction. A2PO employs Adaptive Reward Shaping to regulate the timing and quality of visual aids via counterfactual sampling to distinguish necessary from redundant constructions. Experiments demonstrate our approach enables MLLMs to leverage selective auxiliary constructions, yielding a 3.51% gain over strong baselines. Code and data are available on GitHub.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

arXiv:2603.18363v1 Announce Type: new Abstract: Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $\alpha$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($\alpha > 1$) to intensify logical reasoning, or flattening it ($\alpha < 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.

Fonte: arXiv cs.CL

MultimodalScore 85

Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

arXiv:2603.18118v1 Announce Type: new Abstract: Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.

Fonte: arXiv cs.CV

RLScore 85

Discovering What You Can Control: Interventional Boundary Discovery for Reinforcement Learning

arXiv:2603.18257v1 Announce Type: new Abstract: Selecting relevant state dimensions in the presence of confounded distractors is a causal identification problem: observational statistics alone cannot reliably distinguish dimensions that correlate with actions from those that actions cause. We formalize this as discovering the agent's Causal Sphere of Influence and propose Interventional Boundary Discovery IBD, which applies Pearl's do-operator to the agent's own actions and uses two-sample testing to produce an interpretable binary mask over observation dimensions. IBD requires no learned models and composes with any downstream RL algorithm as a preprocessing step. Across 12 continuous control settings with up to 100 distractor dimensions, we find that: (1) observational feature selection can actively select confounded distractors while discarding true causal dimensions; (2) full-state RL degrades sharply once distractors outnumber relevant features by roughly 3:1 in our benchmarks; and (3)IBD closely tracks oracle performance across all distractor levels tested, with gains transferring across SAC and TD3.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Learning to Self-Evolve

arXiv:2603.18620v1 Announce Type: new Abstract: We introduce Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models (LLMs) to improve their own contexts at test time. We situate LSE in the setting of test-time self-evolution, where a model iteratively refines its context from feedback on seen problems to perform better on new ones. Existing approaches rely entirely on the inherent reasoning ability of the model and never explicitly train it for this task. LSE reduces the multi-step evolution problem to a single-step RL objective, where each context edit is rewarded by the improvement in downstream performance. We pair this objective with a tree-guided evolution loop. On Text-to-SQL generation (BIRD) and general question answering (MMLU-Redux), a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training. Our results highlight the effectiveness of treating self-evolution as a learnable skill.

Fonte: arXiv cs.CL

RLScore 85

SLEA-RL: Step-Level Experience Augmented Reinforcement Learning for Multi-Turn Agentic Training

arXiv:2603.18079v1 Announce Type: new Abstract: Large Language Model (LLM) agents have shown strong results on multi-turn tool-use tasks, yet they operate in isolation during training, failing to leverage experiences accumulated across episodes. Existing experience-augmented methods address this by organizing trajectories into retrievable libraries, but they retrieve experiences only once based on the initial task description and hold them constant throughout the episode. In multi-turn settings where observations change at every step, this static retrieval becomes increasingly mismatched as episodes progress. We propose SLEA-RL (Step-Level Experience-Augmented Reinforcement Learning), a framework that retrieves relevant experiences at each decision step conditioned on the current observation. SLEA-RL operates through three components: (i) step-level observation clustering that groups structurally equivalent environmental states for efficient cluster-indexed retrieval; (ii) a self-evolving experience library that distills successful strategies and failure patterns through score-based admission and rate-limited extraction; and (iii) policy optimization with step-level credit assignment for fine-grained advantage estimation across multi-turn episodes. The experience library evolves alongside the policy through semantic analysis rather than gradient updates. Experiments on long-horizon multi-turn agent benchmarks demonstrate that SLEA-RL achieves superior performance compared to various reinforcement learning baselines.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

SR-Nav: Spatial Relationships Matter for Zero-shot Object Goal Navigation

arXiv:2603.18443v1 Announce Type: new Abstract: Zero-shot object-goal navigation aims to find target objects in unseen environments using only egocentric observation. Recent methods leverage foundation models' comprehension and reasoning capabilities to enhance navigation performance. However, when faced with poor viewpoints or weak semantic cues, foundation models often fail to support reliable reasoning in both perception and planning, resulting in inefficient or failed navigation. We observe that inherent relationships among objects and regions encode structured scene priors, which help agents infer plausible target locations even under partial observations. Motivated by this insight, we propose Spatial Relation-aware Navigation (SR-Nav), a framework that models both observed and experience-based spatial relationships to enhance both perception and planning. Specifically, SR-Nav first constructs a Dynamic Spatial Relationship Graph (DSRG) that encodes the target-centered spatial relationships through the foundation models and updates dynamically with real-time observations. We then introduce a Relation-aware Matching Module. It utilizes relationship matching instead of naive detection, leveraging diverse relationships in the DSRG to verify and correct errors, enhancing visual perception robustness. Finally, we design a Dynamic Relationship Planning Module to reduce the planning search space by dynamically computing the optimal paths based on the DSRG from the current position, thereby guiding planning and reducing exploration redundancy. Experiments on HM3D show that our method achieves state-of-the-art performance in both success rate and navigation efficiency. The code will be publicly available at https://github.com/Mzyw-1314/SR-Nav

Fonte: arXiv cs.CV

RLScore 85

Enhancing Reinforcement Learning Fine-Tuning with an Online Refiner

arXiv:2603.18088v1 Announce Type: new Abstract: Constraints are essential for stabilizing reinforcement learning fine-tuning (RFT) and preventing degenerate outputs, yet they inherently conflict with the optimization objective because stronger constraints limit the ability of a fine-tuned model to discover better solutions. We propose \textit{dynamic constraints} that resolve this tension by adapting to the evolving capabilities of the fine-tuned model based on the insight that constraints should only intervene when degenerate outputs occur. We implement this by using a reference model as an \textit{online refiner} that takes the response from the fine-tuned model and generates a minimally corrected version which preserves correct content verbatim while fixing errors. A supervised fine-tuning loss then trains the fine-tuned model to produce the refined output. This mechanism yields a constraint that automatically strengthens or relaxes based on output quality. Experiments on dialogue and code generation show that dynamic constraints outperform both KL regularization and unconstrained baselines, achieving substantially higher task rewards while maintaining training stability.

Fonte: arXiv cs.LG

Theory/OptimizationScore 85

Mathematical Foundations of Deep Learning

arXiv:2603.18387v1 Announce Type: new Abstract: This draft book offers a comprehensive and rigorous treatment of the mathematical principles underlying modern deep learning. The book spans core theoretical topics, from the approximation capabilities of deep neural networks, the theory and algorithms of optimal control and reinforcement learning integrated with deep learning techniques, to contemporary generative models that drive today's advances in artificial intelligence.

Fonte: arXiv cs.LG

RLScore 95

AcceRL: A Distributed Asynchronous Reinforcement Learning and World Model Framework for Vision-Language-Action Models

arXiv:2603.18464v1 Announce Type: new Abstract: Reinforcement learning (RL) for large-scale Vision-Language-Action (VLA) models faces significant challenges in computational efficiency and data acquisition. We propose AcceRL, a fully asynchronous and decoupled RL framework designed to eliminate synchronization barriers by physically isolating training, inference, and rollouts. Crucially, AcceRL is the first to integrate a plug-and-play, trainable world model into a distributed asynchronous RL pipeline to generate virtual experiences. Experiments on the LIBERO benchmark demonstrate that AcceRL achieves state-of-the-art (SOTA) performance. Systematically, it exhibits super-linear scaling in throughput and highly efficient hardware utilization. Algorithmically, the world-model-augmented variant delivers unprecedented sample efficiency and robust training stability in complex control tasks.

Fonte: arXiv cs.LG

RLScore 85

CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

arXiv:2603.18736v1 Announce Type: cross Abstract: Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly conditions. In this work, we introduce observational reward modeling -- learning reward models with observational user feedback (e.g., clicks, copies, and upvotes) -- as a scalable and cost-effective alternative. We identify two fundamental challenges in this setting: (1) observational feedback is noisy due to annotation errors, which deviates it from true user preference; (2) observational feedback is biased by user preference, where users preferentially provide feedback on responses they feel strongly about, which creats a distribution shift between training and inference data. To address these challenges, we propose CausalRM, a causal-theoretic reward modeling framework that aims to learn unbiased reward models from observational feedback. To tackle challenge (1), CausalRM introduces a noise-aware surrogate loss term that is provably equivalent to the primal loss under noise-free conditions by explicitly modeling the annotation error generation process. To tackle challenge (2), CausalRM uses propensity scores -- the probability of a user providing feedback for a given response -- to reweight training samples, yielding a loss function that eliminates user preference bias. Extensive experiments across diverse LLM backbones and benchmark datasets validate that CausalRM effectively learns accurate reward signals from noisy and biased observational feedback and delivers substantial performance improvements on downstream RLHF tasks -- including a 49.2% gain on WildGuardMix and a 32.7% improvement on HarmBench. Code is available on our project website.

Fonte: arXiv stat.ML

RLScore 85

On the Peril of (Even a Little) Nonstationarity in Satisficing Regret Minimization

arXiv:2603.18514v1 Announce Type: new Abstract: Motivated by the principle of satisficing in decision-making, we study satisficing regret guarantees for nonstationary $K$-armed bandits. We show that in the general realizable, piecewise-stationary setting with $L$ stationary segments, the optimal regret is $\Theta(L\log T)$ as long as $L\geq 2$. This stands in sharp contrast to the case of $L=1$ (i.e., the stationary setting), where a $T$-independent $\Theta(1)$ satisficing regret is achievable under realizability. In other words, the optimal regret has to scale with $T$ even if just a little nonstationarity presents. A key ingredient in our analysis is a novel Fano-based framework tailored to nonstationary bandits via a \emph{post-interaction reference} construction. This framework strictly extends the classical Fano method for passive estimation as well as recent interactive Fano techniques for stationary bandits. As a complement, we also discuss a special regime in which constant satisficing regret is again possible.

Fonte: arXiv stat.ML

RLScore 85

Kernel Single-Index Bandits: Estimation, Inference, and Learning

arXiv:2603.18938v1 Announce Type: new Abstract: We study contextual bandits with finitely many actions in which the reward of each arm follows a single-index model with an arm-specific index parameter and an unknown nonparametric link function. We consider a regime in which arms correspond to stable decision options and covariates evolve adaptively under the bandit policy. This setting creates significant statistical challenges: the sampling distribution depends on the allocation rule, observations are dependent over time, and inverse-propensity weighting induces variance inflation. We propose a kernelized $\varepsilon$-greedy algorithm that combines Stein-based estimation of the index parameters with inverse-propensity-weighted kernel ridge regression for the reward functions. This approach enables flexible semiparametric learning while retaining interpretability. Our analysis develops new tools for inference with adaptively collected data. We establish asymptotic normality for the single-index estimator under adaptive sampling, yielding valid confidence regions, and derive a directional functional central limit theorem for the RKHS estimator, which provides asymptotically valid pointwise confidence intervals. The analysis relies on concentration bounds for inverse-weighted Gram matrices together with martingale central limit theorems. We further obtain finite-time regret guarantees, including $\tilde{O}(\sqrt{T})$ rates under common-link Lipschitz conditions, showing that semiparametric structure can be exploited without sacrificing statistical efficiency. These results provide a unified framework for simultaneous learning and inference in single-index contextual bandits.

Fonte: arXiv stat.ML

Theory/OptimizationScore 85

Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum

arXiv:2603.18325v1 Announce Type: new Abstract: Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that autocurriculum, where the model uses its own performance to decide which problems to focus training on, provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning, by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples, and requiring no assumption on the distribution or difficulty of prompts.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Approximate Subgraph Matching with Neural Graph Representations and Reinforcement Learning

arXiv:2603.18314v1 Announce Type: new Abstract: Approximate subgraph matching (ASM) is a task that determines the approximate presence of a given query graph in a large target graph. Being an NP-hard problem, ASM is critical in graph analysis with a myriad of applications ranging from database systems and network science to biochemistry and privacy. Existing techniques often employ heuristic search strategies, which cannot fully utilize the graph information, leading to sub-optimal solutions. This paper proposes a Reinforcement Learning based Approximate Subgraph Matching (RL-ASM) algorithm that exploits graph transformers to effectively extract graph representations and RL-based policies for ASM. Our model is built upon the branch-and-bound algorithm that selects one pair of nodes from the two input graphs at a time for potential matches. Instead of using heuristics, we exploit a Graph Transformer architecture to extract feature representations that encode the full graph information. To enhance the training of the RL policy, we use supervised signals to guide our agent in an imitation learning stage. Subsequently, the policy is fine-tuned with the Proximal Policy Optimization (PPO) that optimizes the accumulative long-term rewards over episodes. Extensive experiments on both synthetic and real-world datasets demonstrate that our RL-ASM outperforms existing methods in terms of effectiveness and efficiency. Our source code is available at https://github.com/KaiyangLi1992/RL-ASM.

Fonte: arXiv cs.LG

RLScore 85

R2-Dreamer: Redundancy-Reduced World Models without Decoders or Augmentation

arXiv:2603.18202v1 Announce Type: new Abstract: A central challenge in image-based Model-Based Reinforcement Learning (MBRL) is to learn representations that distill essential information from irrelevant visual details. While promising, reconstruction-based methods often waste capacity on large task-irrelevant regions. Decoder-free methods instead learn robust representations by leveraging Data Augmentation (DA), but reliance on such external regularizers limits versatility. We propose R2-Dreamer, a decoder-free MBRL framework with a self-supervised objective that serves as an internal regularizer, preventing representation collapse without resorting to DA. The core of our method is a redundancy-reduction objective inspired by Barlow Twins, which can be easily integrated into existing frameworks. On DeepMind Control Suite and Meta-World, R2-Dreamer is competitive with strong baselines such as DreamerV3 and TD-MPC2 while training 1.59x faster than DreamerV3, and yields substantial gains on DMC-Subtle with tiny task-relevant objects. These results suggest that an effective internal regularizer can enable versatile, high-performance decoder-free MBRL. Code is available at https://github.com/NM512/r2dreamer.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Memento-Skills: Let Agents Design Agents

arXiv:2603.18743v1 Announce Type: new Abstract: We introduce \emph{Memento-Skills}, a generalist, continually-learnable LLM agent system that functions as an \emph{agent-designing agent}: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with \emph{stateful prompts}, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the \emph{Read--Write Reflective Learning} mechanism introduced in \emph{Memento~2}~\cite{wang2025memento2}. In the \emph{read} phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the \emph{write} phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables \emph{continual learning without updating LLM parameters}, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to \emph{design agents end-to-end} for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the \emph{General AI Assistants} benchmark and \emph{Humanity's Last Exam} demonstrate sustained gains, achieving 26.2\% and 116.2\% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento-Teams/Memento-Skills.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Adaptive Decoding via Test-Time Policy Learning for Self-Improving Generation

arXiv:2603.18428v1 Announce Type: new Abstract: Decoding strategies largely determine the quality of Large Language Model (LLM) outputs, yet widely used heuristics such as greedy or fixed temperature/top-p decoding are static and often task-agnostic, leading to suboptimal or inconsistent generation quality across domains that demand stylistic or structural flexibility. We introduce a reinforcement learning-based decoder sampler that treats decoding as sequential decision-making and learns a lightweight policy to adjust sampling parameters at test-time while keeping LLM weights frozen. We evaluated summarization datasets including BookSum, arXiv, and WikiHow using Granite-3.3-2B and Qwen-2.5-0.5B. Our policy sampler consistently outperforms greedy and static baselines, achieving relative gains of up to +88% (BookSum, Granite) and +79% (WikiHow, Qwen). Reward ablations show that overlap-only objectives underperform compared to composite rewards, while structured shaping terms (length, coverage, repetition, completeness) enable stable and sustained improvements. These findings highlight reinforcement learning as a practical mechanism for test-time adaptation in decoding, enabling domain-aware and user-controllable generation without retraining large models.

Fonte: arXiv cs.CL

RLScore 85

Maximum-Entropy Exploration with Future State-Action Visitation Measures

arXiv:2603.18965v1 Announce Type: cross Abstract: Maximum entropy reinforcement learning motivates agents to explore states and actions to maximize the entropy of some distribution, typically by providing additional intrinsic rewards proportional to that entropy function. In this paper, we study intrinsic rewards proportional to the entropy of the discounted distribution of state-action features visited during future time steps. This approach is motivated by two results. First, we show that the expected sum of these intrinsic rewards is a lower bound on the entropy of the discounted distribution of state-action features visited in trajectories starting from the initial states, which we relate to an alternative maximum entropy objective. Second, we show that the distribution used in the intrinsic reward definition is the fixed point of a contraction operator and can therefore be estimated off-policy. Experiments highlight that the new objective leads to improved visitation of features within individual trajectories, in exchange for slightly reduced visitation of features in expectation over different trajectories, as suggested by the lower bound. It also leads to improved convergence speed for learning exploration-only agents. Control performance remains similar across most methods on the considered benchmarks.

Fonte: arXiv stat.ML

RLScore 85

Escaping Offline Pessimism: Vector-Field Reward Shaping for Safe Frontier Exploration

arXiv:2603.18326v1 Announce Type: new Abstract: While offline reinforcement learning provides reliable policies for real-world deployment, its inherent pessimism severely restricts an agent's ability to explore and collect novel data online. Drawing inspiration from safe reinforcement learning, exploring near the boundary of regions well covered by the offline dataset and reliably modeled by the simulator allows an agent to take manageable risks--venturing into informative but moderate-uncertainty states while remaining close enough to familiar regions for safe recovery. However, naively rewarding this boundary-seeking behavior can lead to a degenerate parking behavior, where the agent simply stops once it reaches the frontier. To solve this, we propose a novel vector-field reward shaping paradigm designed to induce continuous, safe boundary exploration for non-adaptive deployed policies. Operating on an uncertainty oracle trained from offline data, our reward combines two complementary components: a gradient-alignment term that attracts the agent toward a target uncertainty level, and a rotational-flow term that promotes motion along the local tangent plane of the uncertainty manifold. Through theoretical analysis, we show that this reward structure naturally induces sustained exploratory behavior along the boundary while preventing degenerate solutions. Empirically, by integrating our proposed reward shaping with Soft Actor-Critic on a 2D continuous navigation task, we validate that agents successfully traverse uncertainty boundaries while balancing safe, informative data collection with primary task completion.

Fonte: arXiv cs.LG

RLScore 85

Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning

arXiv:2603.18533v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) have shown exceptional reasoning capabilities, but they also suffer from the issue of overthinking, often generating excessively long and redundant answers. For problems that exceed the model's capabilities, LRMs tend to exhibit the overconfidence phenomenon, generating overly short but incorrect answers, which may contribute to suboptimal performance. To address these issues, we propose Difficulty-Differentiated Policy Optimization (DDPO), an efficient reinforcement learning algorithm that optimizes simple and complex tasks separately based on the overconfidence phenomenon. Specifically, it reduces the output length for simple tasks without compromising accuracy, while for complex tasks, it expands the exploration space to improve performance. We further derive the theoretical conditions for maximizing expected accuracy, which require the length distribution to closely approximate the optimal length and be as concentrated as possible. Based on these conditions, we propose using the difficulty-level average as a well-founded reference for length optimization. Extensive experiments on both in-domain and out-of-domain benchmarks validate the superiority and effectiveness of DDPO. Compared to GRPO, DDPO reduces the average answer length by 12% while improving accuracy by 1.85% across multiple benchmarks, achieving a better trade-off between accuracy and length. The code is available at https://github.com/Yinan-Xia/DDPO.

Fonte: arXiv cs.LG

RLScore 85

RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach

arXiv:2603.18396v1 Announce Type: new Abstract: Bus holding control is challenging due to stochastic traffic and passenger demand. While deep reinforcement learning (DRL) shows promise, standard actor-critic algorithms suffer from Q-value instability in volatile environments. A key source of this instability is the conflation of two distinct uncertainties: aleatoric uncertainty (irreducible noise) and epistemic uncertainty (data insufficiency). Treating these as a single risk leads to value underestimation in noisy states, causing catastrophic policy collapse. We propose a robust ensemble soft actor-critic (RE-SAC) framework to explicitly disentangle these uncertainties. RE-SAC applies Integral Probability Metric (IPM)-based weight regularization to the critic network to hedge against aleatoric risk, providing a smooth analytical lower bound for the robust Bellman operator without expensive inner-loop perturbations. To address epistemic risk, a diversified Q-ensemble penalizes overconfident value estimates in sparsely covered regions. This dual mechanism prevents the ensemble variance from misidentifying noise as a data gap, a failure mode identified in our ablation study. Experiments in a realistic bidirectional bus corridor simulation demonstrate that RE-SAC achieves the highest cumulative reward (approx. -0.4e6) compared to vanilla SAC (-0.55e6). Mahalanobis rareness analysis confirms that RE-SAC reduces Oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs. 4343), demonstrating superior robustness under high traffic variability.

Fonte: arXiv cs.LG

RLScore 85

Reasonably reasoning AI agents can avoid game-theoretic failures in zero-shot, provably

arXiv:2603.18563v1 Announce Type: new Abstract: AI agents are increasingly deployed in interactive economic environments characterized by repeated AI-AI interactions. Despite AI agents' advanced capabilities, empirical studies reveal that such interactions often fail to stably induce a strategic equilibrium, such as a Nash equilibrium. Post-training methods have been proposed to induce a strategic equilibrium; however, it remains impractical to uniformly apply an alignment method across diverse, independently developed AI models in strategic settings. In this paper, we provide theoretical and empirical evidence that off-the-shelf reasoning AI agents can achieve Nash-like play zero-shot, without explicit post-training. Specifically, we prove that `reasonably reasoning' agents, i.e., agents capable of forming beliefs about others' strategies from previous observation and learning to best respond to these beliefs, eventually behave along almost every realized play path in a way that is weakly close to a Nash equilibrium of the continuation game. In addition, we relax the common-knowledge payoff assumption by allowing stage payoffs to be unknown and by having each agent observe only its own privately realized stochastic payoffs, and we show that we can still achieve the same on-path Nash convergence guarantee. We then empirically validate the proposed theories by simulating five game scenarios, ranging from a repeated prisoner's dilemma game to stylized repeated marketing promotion games. Our findings suggest that AI agents naturally exhibit such reasoning patterns and therefore attain stable equilibrium behaviors intrinsically, obviating the need for universal alignment procedures in many real-world strategic interactions.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge

arXiv:2603.17145v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to recognize that predicting 4 is significantly better than predicting 1 when the ground truth is 5. Conversely, existing regression-aware approaches are often confined to Supervised Fine-Tuning (SFT), limiting their ability to explore optimal reasoning paths. To bridge this gap, we propose \textbf{REAL} (\underline{RE}gression-\underline{A}ware Reinforcement \underline{L}earning), a principled RL framework designed to optimize regression rewards, and also proven to be optimal for correlation metrics. A key technical challenge is that the regression objective is explicitly policy-dependent, thus invalidating standard policy gradient methods. To address this, we employ the generalized policy gradient estimator, which naturally decomposes optimization into two complementary components: (1) exploration over Chain-of-Thought (CoT) trajectory, and (2) regression-aware prediction refinement of the final score. Extensive experiments across model scales (8B to 32B) demonstrate that REAL consistently outperforms both regression-aware SFT baselines and standard RL methods, exhibiting significantly better generalization on out-of-domain benchmarks. On Qwen3-32B specifically, we achieve gains of +8.40 Pearson and +7.20 Spearman correlation over the SFT baseline, and +18.30/+11.20 over the base model. These findings highlight the critical value of integrating regression objectives into RL exploration for accurate LLM evaluation.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild

arXiv:2603.17187v1 Announce Type: new Abstract: Large language model (LLM) agents are increasingly used for complex tasks, yet deployed agents often remain static, failing to adapt as user needs evolve. This creates a tension between the need for continuous service and the necessity of updating capabilities to match shifting task distributions. On platforms like OpenClaw, which handle diverse workloads across 20+ channels, existing methods either store raw trajectories without distilling knowledge, maintain static skill libraries, or require disruptive downtime for retraining. We present MetaClaw, a continual meta-learning framework that jointly evolves a base LLM policy and a library of reusable behavioral skills. MetaClaw employs two complementary mechanisms. Skill-driven fast adaptation analyzes failure trajectories via an LLM evolver to synthesize new skills, enabling immediate improvement with zero downtime. Opportunistic policy optimization performs gradient-based updates via cloud LoRA fine-tuning and Reinforcement Learning with a Process Reward Model (RL-PRM). This is triggered during user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors system inactivity and calendar data. These mechanisms are mutually reinforcing: a refined policy generates better trajectories for skill synthesis, while richer skills provide higher-quality data for policy optimization. To prevent data contamination, a versioning mechanism separates support and query data. Built on a proxy-based architecture, MetaClaw scales to production-size LLMs without local GPUs. Experiments on MetaClaw-Bench and AutoResearchClaw show that skill-driven adaptation improves accuracy by up to 32% relative. The full pipeline advances Kimi-K2.5 accuracy from 21.4% to 40.6% and increases composite robustness by 18.3%. Code is available at https://github.com/aiming-lab/MetaClaw.

Fonte: arXiv cs.LG

RLScore 85

WINFlowNets: Warm-up Integrated Networks Training of Generative Flow Networks for Robotics and Machine Fault Adaptation

arXiv:2603.17301v1 Announce Type: new Abstract: Generative Flow Networks for continuous scenarios (CFlowNets) have shown promise in solving sequential decision-making tasks by learning stochastic policies using a flow and a retrieval network. Despite their demonstrated efficiency compared to state-of-the-art Reinforcement Learning (RL) algorithms, their practical application in robotic control tasks is constrained by the reliance on pre-training the retrieval network. This dependency poses challenges in dynamic robotic environments, where pre-training data may not be readily available or representative of the current environment. This paper introduces WINFlowNets, a novel CFlowNets framework that enables the co-training of flow and retrieval networks. WINFlowNets begins with a warm-up phase for the retrieval network to bootstrap its policy, followed by a shared training architecture and a shared replay buffer for co-training both networks. Experiments in simulated robotic environments demonstrate that WINFlowNets surpasses CFlowNets and state-of-the-art RL algorithms in terms of average reward and training stability. Furthermore, WINFlowNets exhibits strong adaptive capability in fault environments, making it suitable for tasks that demand quick adaptation with limited sample data. These findings highlight WINFlowNets' potential for deployment in dynamic and malfunction-prone robotic systems, where traditional pre-training or sample inefficient data collection may be impractical.

Fonte: arXiv cs.LG

RLScore 85

Multi-Agent Reinforcement Learning for Dynamic Pricing: Balancing Profitability,Stability and Fairness

arXiv:2603.16888v1 Announce Type: cross Abstract: Dynamic pricing in competitive retail markets requires strategies that adapt to fluctuating demand and competitor behavior. In this work, we present a systematic empirical evaluation of multi-agent reinforcement learning (MARL) approaches-specifically MAPPO and MADDPG-for dynamic price optimization under competition. Using a simulated marketplace environment derived from real-world retail data, we benchmark these algorithms against an Independent DDPG (IDDPG) baseline, a widely used independent learner in MARL literature. We evaluate profit performance, stability across random seeds, fairness, and training efficiency. Our results show that MAPPO consistently achieves the highest average returns with low variance, offering a stable and reproducible approach for competitive price optimization, while MADDPG achieves slightly lower profit but the fairest profit distribution among agents. These findings demonstrate that MARL methods-particularly MAPPO-provide a scalable and stable alternative to independent learning approaches for dynamic retail pricing.

Fonte: arXiv cs.AI

RLScore 90

Efficient Exploration at Scale

arXiv:2603.17378v1 Announce Type: new Abstract: We develop an online learning algorithm that dramatically improves the data efficiency of reinforcement learning from human feedback (RLHF). Our algorithm incrementally updates reward and language models as choice data is received. The reward model is fit to the choice data, while the language model is updated by a variation of reinforce, with reinforcement signals provided by the reward model. Several features enable the efficiency gains: a small affirmative nudge added to each reinforcement signal, an epistemic neural network that models reward uncertainty, and information-directed exploration. With Gemma large language models (LLMs), our algorithm matches the performance of offline RLHF trained on 200K labels using fewer than 20K labels, representing more than a 10x gain in data efficiency. Extrapolating from our results, we expect our algorithm trained on 1M labels to match offline RLHF trained on 1B labels. This represents a 1,000x gain. To our knowledge, these are the first results to demonstrate that such large improvements are possible.

Fonte: arXiv cs.LG

RLScore 85

Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation: Radiologist-Like Workflow with Clinically Verifiable Rewards

arXiv:2603.16876v1 Announce Type: cross Abstract: We propose MARL-Rad, a novel multi-modal multi-agent reinforcement learning framework for radiology report generation that coordinates region-specific agents and a global integrating agent, optimized via clinically verifiable rewards. Unlike prior single-model reinforcement learning or post-hoc agentization of independently trained models, our method jointly trains multiple agents and optimizes the entire agent system through reinforcement learning. Experiments on the MIMIC-CXR and IU X-ray datasets show that MARL-Rad consistently improves clinically efficacy (CE) metrics such as RadGraph, CheXbert, and GREEN scores, achieving state-of-the-art CE performance. Further analyses confirm that MARL-Rad enhances laterality consistency and produces more accurate, detail-informed reports.

Fonte: arXiv cs.AI

RLScore 85

Adaptive Anchor Policies for Efficient 4D Gaussian Streaming

arXiv:2603.17227v1 Announce Type: new Abstract: Dynamic scene reconstruction with Gaussian Splatting has enabled efficient streaming for real-time rendering and free-viewpoint video. However, most pipelines rely on fixed anchor selection such as Farthest Point Sampling (FPS), typically using 8,192 anchors regardless of scene complexity, which over-allocates computation under strict budgets. We propose Efficient Gaussian Streaming (EGS), a plug-in, budget-aware anchor sampler that replaces FPS with a reinforcement-learned policy while keeping the Gaussian streaming reconstruction backbone unchanged. The policy jointly selects an anchor budget and a subset of informative anchors under discrete constraints, balancing reconstruction quality and runtime using spatial features of the Gaussian representation. We evaluate EGS in two settings: fast rendering, which prioritizes runtime efficiency, and high-quality refinement, which enables additional optimization. Experiments on dynamic multi-view datasets show consistent improvements in the quality--efficiency trade-off over FPS sampling. On unseen data, in fast rendering at 256 anchors ($32\times$ fewer than 8,192), EGS improves PSNR by $+0.52$--$0.61$\,dB while running $1.29$--$1.35\times$ faster than IGS@8192 (N3DV and MeetingRoom). In high-quality refinement, EGS remains competitive with the full-anchor baseline at substantially lower anchor budgets. \emph{Code and pretrained checkpoints will be released upon acceptance.} \keywords{4D Gaussian Splatting \and 4D Gaussian Streaming \and Reinforcement Learning}

Fonte: arXiv cs.CV

NLP/LLMsScore 85

Federated Multi Agent Deep Learning and Neural Networks for Advanced Distributed Sensing in Wireless Networks

arXiv:2603.16881v1 Announce Type: new Abstract: Multi-agent deep learning (MADL), including multi-agent deep reinforcement learning (MADRL), distributed/federated training, and graph-structured neural networks, is becoming a unifying framework for decision-making and inference in wireless systems where sensing, communication, and computing are tightly coupled. Recent 5G-Advanced and 6G visions strengthen this coupling through integrated sensing and communication, edge intelligence, open programmable RAN, and non-terrestrial/UAV networking, which create decentralized, partially observed, time-varying, and resource-constrained control problems. This survey synthesizes the state of the art, with emphasis on 2021-2025 research, on MADL for distributed sensing and wireless communications. We present a task-driven taxonomy across (i) learning formulations (Markov games, Dec-POMDPs, CTDE), (ii) neural architectures (GNN-based radio resource management, attention-based policies, hierarchical learning, and over-the-air aggregation), (iii) advanced techniques (federated reinforcement learning, communication-efficient federated deep RL, and serverless edge learning orchestration), and (iv) application domains (MEC offloading with slicing, UAV-enabled heterogeneous networks with power-domain NOMA, intrusion detection in sensor networks, and ISAC-driven perceptive mobile networks). We also provide comparative tables of algorithms, training topologies, and system-level trade-offs in latency, spectral efficiency, energy, privacy, and robustness. Finally, we identify open issues including scalability, non-stationarity, security against poisoning and backdoors, communication overhead, and real-time safety, and outline research directions toward 6G-native sense-communicate-compute-learn systems.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Sensi: Learn One Thing at a Time -- Curriculum-Based Test-Time Learning for LLM Game Agents

arXiv:2603.17683v1 Announce Type: new Abstract: Large language model (LLM) agents deployed in unknown environments must learn task structure at test time, but current approaches require thousands of interactions to form useful hypotheses. We present Sensi, an LLM agent architecture for the ARC-AGI-3 game-playing challenge that introduces structured test-time learning through three mechanisms: (1) a two-player architecture separating perception from action, (2) a curriculum-based learning system managed by an external state machine, and (3) a database-as-control-plane that makes the agents context window programmatically steerable. We further introduce an LLM-as-judge component with dynamically generated evaluation rubrics to determine when the agent has learned enough about one topic to advance to the next. We report results across two iterations: Sensi v1 solves 2 game levels using the two-player architecture alone, while Sensi v2 adds curriculum learning and solves 0 levels - but completes its entire learning curriculum in approximately 32 action attempts, achieving 50-94x greater sample efficiency than comparable systems that require 1600-3000 attempts. We precisely diagnose the failure mode as a self-consistent hallucination cascade originating in the perception layer, demonstrating that the architectural bottleneck has shifted from learning efficiency to perceptual grounding - a more tractable problem.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

arXiv:2603.17024v1 Announce Type: new Abstract: VLMs show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long CoT reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for RLVR does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We add the multi-hop data synthesized by HopChain to the original RLVR data used to train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and compare against RLVR on the original RLVR data alone across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized to target any specific benchmark, adding it improves 20 out of 24 benchmarks on both models, indicating broad and generalizable gains. To demonstrate that full chained queries are important, we replace them with half-multi-hop or single-hop variants, reducing the 24-benchmark average accuracy by 5.3 and 7.0 points, respectively. Multi-hop training also strengthens long-CoT vision-language reasoning, with gains peaking at more than 50 accuracy points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.

Fonte: arXiv cs.CV

NLP/LLMsScore 85

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

arXiv:2603.17305v1 Announce Type: new Abstract: We propose CRAFT, a red-teaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that operate primarily at the output level, CRAFT aligns large reasoning models to generate safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space. Methodologically, CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry that supports robust, reasoning-level safety alignment. Theoretically, we show that incorporating latent-textual consistency into GRPO eliminates superficially aligned policies by ruling them out as local optima. Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey. Notably, CRAFT delivers an average 79.0% improvement in reasoning safety and 87.7% improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.

Fonte: arXiv cs.AI

RLScore 85

Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

arXiv:2603.17051v1 Announce Type: new Abstract: Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.

Fonte: arXiv cs.CV

RLScore 90

GigaWorld-Policy: An Efficient Action-Centered World--Action Model

arXiv:2603.17240v1 Announce Type: new Abstract: World-Action Models (WAM) initialized from pre-trained video generation backbones have demonstrated remarkable potential for robot policy learning. However, existing approaches face two critical bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, making motion prediction accuracy heavily dependent on the quality of future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with optional video generation. Specifically, we formulate policy training into two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. The policy is supervised by both action prediction and video generation, providing richer learning signals and encouraging physically plausible actions through visual-dynamics constraints. With a causal design that prevents future-video tokens from influencing action tokens, explicit future-video generation is optional at inference time, allowing faster action prediction during deployment. To support this paradigm, we curate a diverse, large-scale robot dataset to pre-train an action-centered video generation model, which is then adapted as the backbone for robot policy learning. Experimental results on real-world robotic platforms show that GigaWorld-Policy runs 9x faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Moreover, compared with pi-0.5, GigaWorld-Policy improves performance by 95% on RoboTwin 2.0.

Fonte: arXiv cs.CV

RLScore 85

ShuttleEnv: An Interactive Data-Driven RL Environment for Badminton Strategy Modeling

arXiv:2603.17324v1 Announce Type: new Abstract: We present ShuttleEnv, an interactive and data-driven simulation environment for badminton, designed to support reinforcement learning and strategic behavior analysis in fast-paced adversarial sports. The environment is grounded in elite-player match data and employs explicit probabilistic models to simulate rally-level dynamics, enabling realistic and interpretable agent-opponent interactions without relying on physics-based simulation. In this demonstration, we showcase multiple trained agents within ShuttleEnv and provide live, step-by-step visualization of badminton rallies, allowing attendees to explore different play styles, observe emergent strategies, and interactively analyze decision-making behaviors. ShuttleEnv serves as a reusable platform for research, visualization, and demonstration of intelligent agents in sports AI. Our ShuttleEnv demo video URL: https://drive.google.com/file/d/1hTR4P16U27H2O0-w316bR73pxE2ucczX/view

Fonte: arXiv cs.AI

VisionScore 85

PaAgent: Portrait-Aware Image Restoration Agent via Subjective-Objective Reinforcement Learning

arXiv:2603.17055v1 Announce Type: new Abstract: Image Restoration (IR) agents, leveraging multimodal large language models to perceive degradation and invoke restoration tools, have shown promise in automating IR tasks. However, existing IR agents typically lack an insight summarization mechanism for past interactions, which results in an exhaustive search for the optimal IR tool. To address this limitation, we propose a portrait-aware IR agent, dubbed PaAgent, which incorporates a self-evolving portrait bank for IR tools and Retrieval-Augmented Generation (RAG) to select a suitable IR tool for input. Specifically, to construct and evolve the portrait bank, the PaAgent continuously enriches it by summarizing the characteristics of various IR tools with restored images, selected IR tools, and degraded images. In addition, the RAG is employed to select the optimal IR tool for the input image by retrieving relevant insights from the portrait bank. Furthermore, to enhance PaAgent's ability to perceive degradation in complex scenes, we propose a subjective-objective reinforcement learning strategy that considers both image quality scores and semantic insights in reward generation, which accurately provides the degradation information even under partial and non-uniform degradation. Extensive experiments across 8 IR benchmarks, covering six single-degradation and eight mixed-degradation scenarios, validate PaAgent's superiority in addressing complex IR tasks. Our project page is \href{https://wyjgr.github.io/PaAgent.html}{PaAgent}.

Fonte: arXiv cs.CV

RLScore 85

Per-Domain Generalizing Policies: On Learning Efficient and Robust Q-Value Functions (Extended Version with Technical Appendix)

arXiv:2603.17544v1 Announce Type: new Abstract: Learning per-domain generalizing policies is a key challenge in learning for planning. Standard approaches learn state-value functions represented as graph neural networks using supervised learning on optimal plans generated by a teacher planner. In this work, we advocate for learning Q-value functions instead. Such policies are drastically cheaper to evaluate for a given state, as they need to process only the current state rather than every successor. Surprisingly, vanilla supervised learning of Q-values performs poorly as it does not learn to distinguish between the actions taken and those not taken by the teacher. We address this by using regularization terms that enforce this distinction, resulting in Q-value policies that consistently outperform state-value policies across a range of 10 domains and are competitive with the planner LAMA-first.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

From Digital Twins to World Models:Opportunities, Challenges, and Applications for Mobile Edge General Intelligence

arXiv:2603.17420v1 Announce Type: new Abstract: The rapid evolution toward 6G and beyond communication systems is accelerating the convergence of digital twins and world models at the network edge. Traditional digital twins provide high-fidelity representations of physical systems and support monitoring, analysis, and offline optimization. However, in highly dynamic edge environments, they face limitations in autonomy, adaptability, and scalability. This paper presents a systematic survey of the transition from digital twins to world models and discusses its role in enabling edge general intelligence (EGI). First, the paper clarifies the conceptual differences between digital twins and world models and highlights the shift from physics-based, centralized, and system-centric replicas to data-driven, decentralized, and agent-centric internal models. This discussion helps readers gain a clear understanding of how this transition enables more adaptive, autonomous, and resource-efficient intelligence at the network edge. The paper reviews the design principles, architectures, and key components of world models, including perception, latent state representation, dynamics learning, imagination-based planning, and memory. In addition, it examines the integration of world models and digital twins in wireless EGI systems and surveys emerging applications in integrated sensing and communications, semantic communication, air-ground networks, and low-altitude wireless networks. Finally, this survey provides a systematic roadmap and practical insights for designing world-model-driven edge intelligence systems in wireless and edge computing environments. It also outlines key research challenges and future directions toward scalable, reliable, and interoperable world models for edge-native agentic AI.

Fonte: arXiv cs.AI

RLScore 85

Efficient Soft Actor-Critic with LLM-Based Action-Level Guidance for Continuous Control

arXiv:2603.17468v1 Announce Type: new Abstract: We present GuidedSAC, a novel reinforcement learning (RL) algorithm that facilitates efficient exploration in vast state-action spaces. GuidedSAC leverages large language models (LLMs) as intelligent supervisors that provide action-level guidance for the Soft Actor-Critic (SAC) algorithm. The LLM-based supervisor analyzes the most recent trajectory using state information and visual replays, offering action-level interventions that enable targeted exploration. Furthermore, we provide a theoretical analysis of GuidedSAC, proving that it preserves the convergence guarantees of SAC while improving convergence speed. Through experiments in both discrete and continuous control environments, including toy text tasks and complex MuJoCo benchmarks, we demonstrate that GuidedSAC consistently outperforms standard SAC and state-of-the-art exploration-enhanced variants (e.g., RND, ICM, and E3B) in terms of sample efficiency and final performance.

Fonte: arXiv cs.LG

RLScore 85

MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning

arXiv:2603.16929v1 Announce Type: cross Abstract: Regulating the importance ratio is critical for the training stability of Group Relative Policy Optimization (GRPO) based frameworks. However, prevailing ratio control methods, such as hard clipping, suffer from non-differentiable boundaries and vanishing gradient regions, failing to maintain gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism to adaptively suppress extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these challenges, we propose Modulated Hazard-aware Policy Optimization (MHPO), a novel framework designed for robust and stable reinforcement learning. The proposed MHPO introduces a Log-Fidelity Modulator (LFM) to map unbounded importance ratios into a bounded, differentiable domain. This mechanism effectively prevents high-variance outlier tokens from destabilizing the loss landscape while ensuring global gradient stability. Complementarily, a Decoupled Hazard Penalty (DHP) integrates cumulative hazard functions from survival analysis to independently regulate positive and negative policy shifts. By shaping the optimization landscape with hazard-aware penalties, the proposed MHPO achieves fine-grained regulation of asymmetric policy shifts simultaneously mitigating mode collapse from over-expansion and preventing policy erosion from catastrophic contraction within a stabilized trust region. Extensive evaluations on diverse reasoning benchmarks across both text-based and vision-language tasks demonstrate that MHPO consistently outperforms existing methods, achieving superior performance while significantly enhancing training stability.

Fonte: arXiv cs.AI

RLScore 85

CircuitBuilder: From Polynomials to Circuits via Reinforcement Learning

arXiv:2603.17075v1 Announce Type: new Abstract: Motivated by auto-proof generation and Valiant's VP vs. VNP conjecture, we study the problem of discovering efficient arithmetic circuits to compute polynomials, using addition and multiplication gates. We formulate this problem as a single-player game, where an RL agent attempts to build the circuit within a fixed number of operations. We implement an AlphaZero-style training loop and compare two approaches: Proximal Policy Optimization with Monte Carlo Tree Search (PPO+MCTS) and Soft Actor-Critic (SAC). SAC achieves the highest success rates on two-variable targets, while PPO+MCTS scales to three variables and demonstrates steady improvement on harder instances. These results suggest that polynomial circuit synthesis is a compact, verifiable setting for studying self-improving search policies.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning

arXiv:2603.17310v1 Announce Type: new Abstract: Large Language Models (LLMs) with extended reasoning capabilities often generate verbose and redundant reasoning traces, incurring unnecessary computational cost. While existing reinforcement learning approaches address this by optimizing final response length, they neglect the quality of intermediate reasoning steps, leaving models vulnerable to reward hacking. We argue that verbosity is not merely a length problem, but a symptom of poor intermediate reasoning quality. To investigate this, we conduct an empirical study tracking the conditional entropy of the answer distribution across reasoning steps. We find that high-quality reasoning traces exhibit two consistent properties: low uncertainty convergence and monotonic progress. These findings suggest that high-quality reasoning traces are informationally dense, that is, each step contributes meaningful entropy reduction relative to the total reasoning length. Motivated by this, we propose InfoDensity, a reward framework for RL training that combines an AUC-based reward and a monotonicity reward as a unified measure of reasoning quality, weighted by a length scaling term that favors achieving equivalent quality more concisely. Experiments on mathematical reasoning benchmarks demonstrate that InfoDensity matches or surpasses state-of-the-art baselines in accuracy while significantly reducing token usage, achieving a strong accuracy-efficiency trade-off.

Fonte: arXiv cs.AI

RLScore 90

Physics-informed offline reinforcement learning eliminates catastrophic fuel waste in maritime routing

arXiv:2603.17319v1 Announce Type: new Abstract: International shipping produces approximately 3% of global greenhouse gas emissions, yet voyage routing remains dominated by heuristic methods. We present PIER (Physics-Informed, Energy-efficient, Risk-aware routing), an offline reinforcement learning framework that learns fuel-efficient, safety-aware routing policies from physics-calibrated environments grounded in historical vessel tracking data and ocean reanalysis products, requiring no online simulator. Validated on one full year (2023) of AIS data across seven Gulf of Mexico routes (840 episodes per method), PIER reduces mean CO2 emissions by 10% relative to great-circle routing. However, PIER's primary contribution is eliminating catastrophic fuel waste: great-circle routing incurs extreme fuel consumption (>1.5x median) in 4.8% of voyages; PIER reduces this to 0.5%, a 9-fold reduction. Per-voyage fuel variance is 3.5x lower (p<0.001), with bootstrap 95% CI for mean savings [2.9%, 15.7%]. Partial validation against observed AIS vessel behavior confirms consistency with the fastest real transits while exhibiting 23.1x lower variance. Crucially, PIER is forecast-independent: unlike A* path optimization whose wave protection degrades 4.5x under realistic forecast uncertainty, PIER maintains constant performance using only local observations. The framework combines physics-informed state construction, demonstration-augmented offline data, and a decoupled post-hoc safety shield, an architecture that transfers to wildfire evacuation, aircraft trajectory optimization, and autonomous navigation in unmapped terrain.

Fonte: arXiv cs.AI

VisionScore 85

Leveraging Large Vision Model for Multi-UAV Co-perception in Low-Altitude Wireless Networks

arXiv:2603.16927v1 Announce Type: new Abstract: Multi-uncrewed aerial vehicle (UAV) cooperative perception has emerged as a promising paradigm for diverse low-altitude economy applications, where complementary multi-view observations are leveraged to enhance perception performance via wireless communications. However, the massive visual data generated by multiple UAVs poses significant challenges in terms of communication latency and resource efficiency. To address these challenges, this paper proposes a communication-efficient cooperative perception framework, termed Base-Station-Helped UAV (BHU), which reduces communication overhead while enhancing perception performance. Specifically, we employ a Top-K selection mechanism to identify the most informative pixels from UAV-captured RGB images, enabling sparsified visual transmission with reduced data volume and latency. The sparsified images are transmitted to a ground server via multi-user MIMO (MU-MIMO), where a Swin-large-based MaskDINO encoder extracts bird's-eye-view (BEV) features and performs cooperative feature fusion for ground vehicle perception. Furthermore, we develop a diffusion model-based deep reinforcement learning (DRL) algorithm to jointly select cooperative UAVs, sparsification ratios, and precoding matrices, achieving a balance between communication efficiency and perception utility. Simulation results on the Air-Co-Pred dataset demonstrate that, compared with traditional CNN-based BEV fusion baselines, the proposed BHU framework improves perception performance by over 5% while reducing communication overhead by 85%, providing an effective solution for multi-UAV cooperative perception under resource-constrained wireless environments.

Fonte: arXiv cs.CV

NLP/LLMsScore 92

A Progressive Visual-Logic-Aligned Framework for Ride-Hailing Adjudication

arXiv:2603.17328v1 Announce Type: new Abstract: The efficient adjudication of responsibility disputes is pivotal for maintaining marketplace fairness. However, the exponential surge in ride-hailing volume renders manual review intractable, while conventional automated methods lack the reasoning transparency required for quasi-judicial decisions. Although Multimodal LLMs offer a promising paradigm, they fundamentally struggle to bridge the gap between general visual semantics and rigorous evidentiary protocols, often leading to perceptual hallucinations and logical looseness. To address these systemic misalignments, we introduce RideJudge, a Progressive Visual-Logic-Aligned Framework. Instead of relying on generic pre-training, we bridge the semantic gap via SynTraj, a synthesis engine that grounds abstract liability concepts into concrete trajectory patterns. To resolve the conflict between massive regulation volume and limited context windows, we propose an Adaptive Context Optimization strategy that distills expert knowledge, coupled with a Chain-of-Adjudication mechanism to enforce active evidentiary inquiry. Furthermore, addressing the inadequacy of sparse binary feedback for complex liability assessment, we implement a novel Ordinal-Sensitive Reinforcement Learning mechanism that calibrates decision boundaries against hierarchical severity. Extensive experiments show that our RideJudge-8B achieves 88.41\% accuracy, surpassing 32B-scale baselines and establishing a new standard for interpretable adjudication.

Fonte: arXiv cs.AI

RLScore 85

Beyond Reward Suppression: Reshaping Steganographic Communication Protocols in MARL via Dynamic Representational Circuit Breaking

arXiv:2603.15655v1 Announce Type: new Abstract: In decentralized Multi-Agent Reinforcement Learning (MARL), steganographic collusion -- where agents develop private protocols to evade monitoring -- presents a critical AI safety threat. Existing defenses, limited to behavioral or reward layers, fail to detect coordination in latent communication channels. We introduce the Dynamic Representational Circuit Breaker (DRCB), an architectural defense operating at the optimization substrate. Building on the AI Mother Tongue (AIM) framework, DRCB utilizes a Vector Quantized Variational Autoencoder (VQ-VAE) bottleneck to convert unobservable messages into auditable statistical objects. DRCB monitors signals including Jensen-Shannon Divergence drift, L2-norm codebook displacement, and Randomized Observer Pool accuracy to compute an EMA-based Collusion Score. Threshold breaches trigger four escalating interventions: dynamic adaptation, gradient-space penalty injection into the Advantage function A^pi, temporal reward suppression, and full substrate circuit breaking via codebook shuffling and optimizer state reset. Experiments on a Contextual Prisoner's Dilemma with MNIST labels show that while static monitoring fails (p = 0.3517), DRCB improves observer mean accuracy from 0.858 to 0.938 (+9.3 percent) and reduces volatility by 43 percent, while preserving mean joint reward (p = 0.854). Analysis of 214,298 symbol samples confirms "Semantic Degradation," where high-frequency sequences converge to zero entropy, foreclosing complex steganographic encodings. We identify a "Transparency Paradox" where agents achieve surface-level determinism while preserving residual capacity in long-tail distributions, reflecting Goodhart's Law. This task-agnostic methodology provides a technical path toward MICA-compliant (Multi-Agent Internal Coupling Audit) pre-deployment auditing for autonomous systems.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning

arXiv:2603.15981v1 Announce Type: new Abstract: Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds--crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistics understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.

Fonte: arXiv cs.CL

RLScore 85

Alternating Reinforcement Learning with Contextual Rubric Rewards

arXiv:2603.15646v1 Announce Type: new Abstract: Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing approaches in RLRR are limited to linearly compressing vector rewards into a scalar reward with a fixed weightings, which is sensitive to artificial score design and fails to capture correlations among reward dimensions. To overcome the limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for a fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretically, we show that reward aggregation induces a variance contraction effect, which helps explain the performance gains. We further introduce a lightweight, search-based adaptation procedure that selects the next meta-class dynamically based on task performance, enabling the policy to emphasize critical objectives and thereby improve the model performance. Empirically, our experiments on the HealthBench dataset with experts annotations demonstrate that ARL-RR uniformly outperforms scalarized methods in both model performance and training efficiency across different model scales (1.7B, 4B, 8B, and 14B).

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning

arXiv:2603.16189v1 Announce Type: new Abstract: With the rapid advancement of vision-language models, an increasing number of studies have explored their potential for SVG generation tasks. Although existing approaches improve performance by constructing large-scale SVG datasets and introducing SVG-specific tokens, they still suffer from limited generalization, redundant paths in code outputs, and a lack of explicit reasoning. In this work, we present CTRL-S (Chain-of-Thought Reinforcement Learning for SVG), a unified framework that introduces a chain-of-thought mechanism to explicitly expose the model's reasoning process during SVG generation. To support this structured reasoning, we construct SVG-Sophia, a high-quality dataset containing 145K samples across SVG code refinement, Text-to-SVG, and Image-to-SVG tasks. By training the model to generate group-level structured SVG code, CTRL-S significantly improves structural coherence and visual fidelity. Furthermore, we adopt the GRPO algorithm and design a multi-reward optimization framework, incorporating DINO, image-text similarity, format, and code efficiency rewards. Through joint multi-reward optimization and multi-task training, our approach systematically enhances overall generation capabilities. Extensive experiments show that CTRL-S outperforms existing methods, achieving higher task success rates, superior SVG code quality, and exceptional visual fidelity.

Fonte: arXiv cs.CV

RLScore 85

Game-Theory-Assisted Reinforcement Learning for Border Defense: Early Termination based on Analytical Solutions

arXiv:2603.15907v1 Announce Type: new Abstract: Game theory provides the gold standard for analyzing adversarial engagements, offering strong optimality guarantees. However, these guarantees often become brittle when assumptions such as perfect information are violated. Reinforcement learning (RL), by contrast, is adaptive but can be sample-inefficient in large, complex domains. This paper introduces a hybrid approach that leverages game-theoretic insights to improve RL training efficiency. We study a border defense game with limited perceptual range, where defender performance depends on both search and pursuit strategies, making classical differential game solutions inapplicable. Our method employs the Apollonius Circle (AC) to compute equilibrium in the post-detection phase, enabling early termination of RL episodes without learning pursuit dynamics. This allows RL to concentrate on learning search strategies while guaranteeing optimal continuation after detection. Across single- and multi-defender settings, this early termination method yields 10-20% higher rewards, faster convergence, and more efficient search trajectories. Extensive experiments validate these findings and demonstrate the overall effectiveness of our approach.

Fonte: arXiv cs.LG

RLScore 85

Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards

arXiv:2603.16140v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has driven recent capability advances of large language models across various domains. Recent studies suggest that improved RLVR algorithms allow models to learn effectively from incorrect annotations, achieving performance comparable to learning from clean data. In this work, we show that these findings are invalid because the claimed 100% noisy training data is "contaminated" with clean data. After rectifying the dataset with a rigorous re-verification pipeline, we demonstrate that noise is destructive to RLVR. We show that existing RLVR algorithm improvements fail to mitigate the impact of noise, achieving similar performance to that of the basic GRPO. Furthermore, we find that the model trained on truly incorrect annotations performs 8-10% worse than the model trained on clean data across mathematical reasoning benchmarks. Finally, we show that these findings hold for real-world noise in Text2SQL tasks, where training on real-world, human annotation errors cause 5-12% lower accuracy than clean data. Our results show that current RLVR methods cannot yet compensate for poor data quality. High-quality data remains essential.

Fonte: arXiv cs.LG

RLScore 85

Counteractive RL: Rethinking Core Principles for Efficient and Scalable Deep Reinforcement Learning

arXiv:2603.15871v1 Announce Type: new Abstract: Following the pivotal success of learning strategies to win at tasks, solely by interacting with an environment without any supervision, agents have gained the ability to make sequential decisions in complex MDPs. Yet, reinforcement learning policies face exponentially growing state spaces in high dimensional MDPs resulting in a dichotomy between computational complexity and policy success. In our paper we focus on the agent's interaction with the environment in a high-dimensional MDP during the learning phase and we introduce a theoretically-founded novel paradigm based on experiences obtained through counteractive actions. Our analysis and method provide a theoretical basis for efficient, effective, scalable and accelerated learning, and further comes with zero additional computational complexity while leading to significant acceleration in training. We conduct extensive experiments in the Arcade Learning Environment with high-dimensional state representation MDPs. The experimental results further verify our theoretical analysis, and our method achieves significant performance increase with substantial sample-efficiency in high-dimensional environments.

Fonte: arXiv cs.LG

Evaluation/BenchmarksScore 85

CUBE: A Standard for Unifying Agent Benchmarks

arXiv:2603.15798v1 Announce Type: new Abstract: The proliferation of agent benchmarks has created critical fragmentation that threatens research productivity. Each new benchmark requires substantial custom integration, creating an "integration tax" that limits comprehensive evaluation. We propose CUBE (Common Unified Benchmark Environments), a universal protocol standard built on MCP and Gym that allows benchmarks to be wrapped once and used everywhere. By separating task, benchmark, package, and registry concerns into distinct API layers, CUBE enables any compliant platform to access any compliant benchmark for evaluation, RL training, or data generation without custom integration. We call on the community to contribute to the development of this standard before platform-specific implementations deepen fragmentation as benchmark production accelerates through 2026.

Fonte: arXiv cs.AI

RLScore 85

Regularized Latent Dynamics Prediction is a Strong Baseline For Behavioral Foundation Models

arXiv:2603.15857v1 Announce Type: new Abstract: Behavioral Foundation Models (BFMs) produce agents with the capability to adapt to any unknown reward or task. These methods, however, are only able to produce near-optimal policies for the reward functions that are in the span of some pre-existing state features, making the choice of state features crucial to the expressivity of the BFM. As a result, BFMs are trained using a variety of complex objectives and require sufficient dataset coverage, to train task-useful spanning features. In this work, we examine the question: are these complex representation learning objectives necessary for zero-shot RL? Specifically, we revisit the objective of self-supervised next-state prediction in latent space for state feature learning, but observe that such an objective alone is prone to increasing state-feature similarity, and subsequently reducing span. We propose an approach, Regularized Latent Dynamics Prediction (RLDP), that adds a simple orthogonality regularization to maintain feature diversity and can match or surpass state-of-the-art complex representation learning methods for zero-shot RL. Furthermore, we empirically show that prior approaches perform poorly in low-coverage scenarios where RLDP still succeeds.

Fonte: arXiv cs.AI

RLScore 85

Meta-TTRL: A Metacognitive Framework for Self-Improving Test-Time Reinforcement Learning in Unified Multimodal Models

arXiv:2603.15724v1 Announce Type: new Abstract: Existing test-time scaling (TTS) methods for unified multimodal models (UMMs) in text-to-image (T2I) generation primarily rely on search or sampling strategies that produce only instance-level improvements, limiting the ability to learn from prior inferences and accumulate knowledge across similar prompts. To overcome these limitations, we propose Meta-TTRL, a metacognitive test-time reinforcement learning framework. Meta-TTRL performs test-time parameter optimization guided by model-intrinsic monitoring signals derived from the meta-knowledge of UMMs, achieving self-improvement and capability-level improvement at test time. Extensive experiments demonstrate that Meta-TTRL generalizes well across three representative UMMs, including Janus-Pro-7B, BAGEL, and Qwen-Image, achieving significant gains on compositional reasoning tasks and multiple T2I benchmarks with limited data. We provide the first comprehensive analysis to investigate the potential of test-time reinforcement learning (TTRL) for T2I generation in UMMs. Our analysis further reveals a key insight underlying effective TTRL: metacognitive synergy, where monitoring signals align with the model's optimization regime to enable self-improvement.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control

arXiv:2603.16188v1 Announce Type: new Abstract: We present ECHO, an edge--cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion representation that encodes joint angles, root planar velocity, root height, and a continuous 6D root orientation per frame, eliminating inference-time retargeting from human body models and remaining directly compatible with low-level PD control. The generator adopts a 1D convolutional UNet with cross-attention conditioned on CLIP-encoded text features; at inference, DDIM sampling with 10 denoising steps and classifier-free guidance produces motion sequences in approximately one second on a cloud GPU. The tracker follows a Teacher--Student paradigm: a privileged teacher policy is distilled into a lightweight student equipped with an evidential adaptation module for sim-to-real transfer, further strengthened by morphological symmetry constraints and domain randomization. An autonomous fall recovery mechanism detects falls via onboard IMU readings and retrieves recovery trajectories from a pre-built motion library. We evaluate ECHO on a retargeted HumanML3D benchmark, where it achieves strong generation quality (FID 0.029, R-Precision Top-1 0.686) under a unified robot-domain evaluator, while maintaining high motion safety and trajectory consistency. Real-world experiments on a Unitree G1 humanoid demonstrate stable execution of diverse text commands with zero hardware fine-tuning.

Fonte: arXiv cs.CV

RLScore 85

Collaborative Temporal Feature Generation via Critic-Free Reinforcement Learning for Cross-User Sensor-Based Activity Recognition

arXiv:2603.16043v1 Announce Type: new Abstract: Human Activity Recognition using wearable inertial sensors is foundational to healthcare monitoring, fitness analytics, and context-aware computing, yet its deployment is hindered by cross-user variability arising from heterogeneous physiological traits, motor habits, and sensor placements. Existing domain generalization approaches either neglect temporal dependencies in sensor streams or depend on impractical target-domain annotations. We propose a different paradigm: modeling generalizable feature extraction as a collaborative sequential generation process governed by reinforcement learning. Our framework, CTFG (Collaborative Temporal Feature Generation), employs a Transformer-based autoregressive generator that incrementally constructs feature token sequences, each conditioned on prior context and the encoded sensor input. The generator is optimized via Group-Relative Policy Optimization, a critic-free algorithm that evaluates each generated sequence against a cohort of alternatives sampled from the same input, deriving advantages through intra-group normalization rather than learned value estimation. This design eliminates the distribution-dependent bias inherent in critic-based methods and provides self-calibrating optimization signals that remain stable across heterogeneous user distributions. A tri-objective reward comprising class discrimination, cross-user invariance, and temporal fidelity jointly shapes the feature space to separate activities, align user distributions, and preserve fine-grained temporal content. Evaluations on the DSADS and PAMAP2 benchmarks demonstrate state-of-the-art cross-user accuracy (88.53\% and 75.22\%), substantial reduction in inter-task training variance, accelerated convergence, and robust generalization under varying action-space dimensionalities.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

AsgardBench - Evaluating Visually Grounded Interactive Planning Under Minimal Feedback

arXiv:2603.15888v1 Announce Type: new Abstract: With AsgardBench we aim to evaluate visually grounded, high-level action sequence generation and interactive planning, focusing specifically on plan adaptation during execution based on visual observations rather than navigation or low-level manipulation. In the landscape of embodied AI benchmarks, AsgardBench targets the capability category of interactive planning, which is more sophisticated than offline high-level planning as it requires agents to revise plans in response to environmental feedback, yet remains distinct from low-level execution. Unlike prior embodied AI benchmarks that conflate reasoning with navigation or provide rich corrective feedback that substitutes for perception, AsgardBench restricts agent input to images, action history, and lightweight success/failure signals, isolating interactive planning in a controlled simulator without low-level control noise. The benchmark contains 108 task instances spanning 12 task types, each systematically varied through object state, placement, and scene configuration. These controlled variations create conditional branches in which a single instruction can require different action sequences depending on what the agent observes, emphasizing conditional branching and plan repair during execution. Our evaluations of leading vision language models show that performance drops sharply without visual input, revealing weaknesses in visual grounding and state tracking that ultimately undermine interactive planning. Our benchmark zeroes in on a narrower question: can a model actually use what it sees to adapt a plan when things do not go as expected?

Fonte: arXiv cs.AI

RLScore 85

SQL-ASTRA: Alleviating Sparse Feedback in Agentic SQL via Column-Set Matching and Trajectory Aggregation

arXiv:2603.16161v1 Announce Type: new Abstract: Agentic Reinforcement Learning (RL) shows promise for complex tasks, but Text-to-SQL remains mostly restricted to single-turn paradigms. A primary bottleneck is the credit assignment problem. In traditional paradigms, rewards are determined solely by the final-turn feedback, which ignores the intermediate process and leads to ambiguous credit evaluation. To address this, we propose Agentic SQL, a framework featuring a universal two-tiered reward mechanism designed to provide effective trajectory-level evaluation and dense step-level signals. First, we introduce Aggregated Trajectory Reward (ATR) to resolve multi-turn credit assignment. Using an asymmetric transition matrix, ATR aggregates process-oriented scores to incentivize continuous improvement. Leveraging Lyapunov stability theory, we prove ATR acts as an energy dissipation operator, guaranteeing a cycle-free policy and monotonic convergence. Second, Column-Set Matching Reward (CSMR) provides immediate step-level rewards to mitigate sparsity. By executing queries at each turn, CSMR converts binary (0/1) feedback into dense [0, 1] signals based on partial correctness. Evaluations on BIRD show a 5% gain over binary-reward GRPO. Notably, our approach outperforms SOTA Arctic-Text2SQL-R1-7B on BIRD and Spider 2.0 using identical models, propelling Text-to-SQL toward a robust multi-turn agent paradigm.

Fonte: arXiv cs.AI

RLScore 85

Agile Interception of a Flying Target using Competitive Reinforcement Learning

arXiv:2603.16279v1 Announce Type: cross Abstract: This article presents a solution to intercept an agile drone by another agile drone carrying a catching net. We formulate the interception as a Competitive Reinforcement Learning problem, where the interceptor and the target drone are controlled by separate policies trained with Proximal Policy Optimization (PPO). We introduce a high-fidelity simulation environment that integrates a realistic quadrotor dynamics model and a low-level control architecture implemented in JAX, which allows for fast parallelized execution on GPUs. We train the agents using low-level control, collective thrust and body rates, to achieve agile flights both for the interceptor and the target. We compare the performance of the trained policies in terms of catch rate, time to catch, and crash rate, against common heuristic baselines and show that our solution outperforms these baselines for interception of agile targets. Finally, we demonstrate the performance of the trained policies in a scaled real-world scenario using agile drones inside an indoor flight arena.

Fonte: arXiv stat.ML

RLScore 85

ARISE: Agent Reasoning with Intrinsic Skill Evolution in Hierarchical Reinforcement Learning

arXiv:2603.16060v1 Announce Type: new Abstract: The dominant paradigm for improving mathematical reasoning in language models relies on Reinforcement Learning with verifiable rewards. Yet existing methods treat each problem instance in isolation without leveraging the reusable strategies that emerge and accumulate during training. To this end, we introduce ARISE (Agent Reasoning via Intrinsic Skill Evolution), a hierarchical reinforcement learning framework, in which a shared policy operates both to manage skills at high-level and to generate responses at low-level (denoted as a Skills Manager and a Worker, respectively). The Manager maintains a tiered skill library through a dedicated skill generation rollout that performs structured summarization of successful solution traces (after execution), while employing a policy-driven selection mechanism to retrieve relevant skills to condition future rollouts (before execution). A hierarchical reward design guides the co-evolution of reasoning ability and library quality. Experiments on two base models and seven benchmarks spanning both competition mathematics and Omni-MATH show that ARISE consistently outperforms GRPO-family algorithms and memory-augmented baselines, with particularly notable gains on out-of-distribution tasks. Ablation studies confirm that each component contributes to the observed improvements and that library quality and reasoning performance improve in tandem throughout training. Code is available at \href{https://github.com/Skylanding/ARISE}{https://github.com/Skylanding/ARISE}.

Fonte: arXiv cs.AI

RLScore 85

SAC-NeRF: Adaptive Ray Sampling for Neural Radiance Fields via Soft Actor-Critic Reinforcement Learning

arXiv:2603.15622v1 Announce Type: new Abstract: Neural Radiance Fields (NeRF) have achieved photorealistic novel view synthesis but suffer from computational inefficiency due to dense ray sampling during volume rendering. We propose SAC-NeRF, a reinforcement learning framework that learns adaptive sampling policies using Soft Actor-Critic (SAC). Our method formulates sampling as a Markov Decision Process where an RL agent learns to allocate samples based on scene characteristics. We introduce three technical components: (1) a Gaussian mixture distribution color model providing uncertainty estimates, (2) a multi-component reward function balancing quality, efficiency, and consistency, and (3) a two-stage training strategy addressing environment non-stationarity. Experiments on Synthetic-NeRF and LLFF datasets show that SAC-NeRF reduces sampling points by 35-48\% while maintaining rendering quality within 0.3-0.8 dB PSNR of dense sampling baselines. While the learned policy is scene-specific and the RL framework adds complexity compared to simpler heuristics, our work demonstrates that data-driven sampling strategies can discover effective patterns that would be difficult to hand-design.

Fonte: arXiv cs.CV

RLScore 85

Distilling Deep Reinforcement Learning into Interpretable Fuzzy Rules: An Explainable AI Framework

arXiv:2603.13257v1 Announce Type: new Abstract: Deep Reinforcement Learning (DRL) agents achieve remarkable performance in continuous control but remain opaque, hindering deployment in safety-critical domains. Existing explainability methods either provide only local insights (SHAP, LIME) or employ over-simplified surrogates failing to capture continuous dynamics (decision trees). This work proposes a Hierarchical Takagi-Sugeno-Kang (TSK) Fuzzy Classifier System (FCS) distilling neural policies into human-readable IF-THEN rules through K-Means clustering for state partitioning and Ridge Regression for local action inference. Three quantifiable metrics are introduced: Fuzzy Rule Activation Density (FRAD) measuring explanation focus, Fuzzy Set Coverage (FSC) validating vocabulary completeness, and Action Space Granularity (ASG) assessing control mode diversity. Dynamic Time Warping (DTW) validates temporal behavioral fidelity. Empirical evaluation on \textit{Lunar Lander(Continuous)} shows the Triangular membership function variant achieves 81.48\% $\pm$ 0.43\% fidelity, outperforming Decision Trees by 21 percentage points. The framework exhibits statistically superior interpretability (FRAD = 0.814 vs. 0.723 for Gaussian, $p < 0.001$) with low MSE (0.0053) and DTW distance (1.05). Extracted rules such as ``IF lander drifting left at high altitude THEN apply upward thrust with rightward correction'' enable human verification, establishing a pathway toward trustworthy autonomous systems.

Fonte: arXiv cs.AI

RLScore 85

Learning When to Trust in Contextual Bandits

arXiv:2603.13356v1 Announce Type: new Abstract: Standard approaches to Robust Reinforcement Learning assume that feedback sources are either globally trustworthy or globally adversarial. In this paper, we challenge this assumption and we identify a more subtle failure mode. We term this mode as Contextual Sycophancy, where evaluators are truthful in benign contexts but strategically biased in critical ones. We prove that standard robust methods fail in this setting, suffering from Contextual Objective Decoupling. To address this, we propose CESA-LinUCB, which learns a high-dimensional Trust Boundary for each evaluator. We prove that CESA-LinUCB achieves sublinear regret $\tilde{O}(\sqrt{T})$ against contextual adversaries, recovering the ground truth even when no evaluator is globally reliable.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Language-Guided Token Compression with Reinforcement Learning in Large Vision-Language Models

arXiv:2603.13394v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) incur substantial inference costs due to the processing of a vast number of visual tokens. Existing methods typically struggle to model progressive visual token reduction as a multi-step decision process with sequential dependencies and often rely on hand-engineered scoring rules that lack adaptive optimization for complex reasoning trajectories. To overcome these limitations, we propose TPRL, a reinforcement learning framework that learns adaptive pruning trajectories through language-guided sequential optimization tied directly to end-task performance. We formulate visual token pruning as a sequential decision process with explicit state transitions and employ a self-supervised autoencoder to compress visual tokens into a compact state representation for efficient policy learning. The pruning policy is initialized through learning from demonstrations and subsequently fine-tuned using Proximal Policy Optimization (PPO) to jointly optimize task accuracy and computational efficiency. Our experimental results demonstrate that TPRL removes up to 66.7\% of visual tokens and achieves up to a 54.2\% reduction in FLOPs during inference while maintaining a near-lossless average accuracy drop of only 0.7\%. Code is released at \href{https://github.com/MagicVicCoder/TPRL}{\textcolor{mypink}{https://github.com/MagicVicCoder/TPRL}}.

Fonte: arXiv cs.CV

NLP/LLMsScore 85

Knowledge Distillation for Large Language Models

arXiv:2603.13765v1 Announce Type: new Abstract: We propose a resource-efficient framework for compressing large language models through knowledge distillation, combined with guided chain-of-thought reinforcement learning. Using Qwen 3B as the teacher and Qwen 0.5B as the student, we apply knowledge distillation across English Dolly-15k, Spanish Dolly-15k, and code BugNet and PyTorrent datasets, with hyperparameters tuned in the English setting to optimize student performance. Across tasks, the distilled student retains a substantial portion of the teacher's capability while remaining significantly smaller: 70% to 91% in English, up to 95% in Spanish, and up to 93.5% Rouge-L in code. For coding tasks, integrating chain-of-thought prompting with Group Relative Policy Optimization using CoT-annotated Codeforces data improves reasoning coherence and solution correctness compared to knowledge distillation alone. Post-training 4-bit weight quantization further reduces memory footprint and inference latency. These results show that knowledge distillation combined with chain-of-thought guided reinforcement learning can produce compact, efficient models suitable for deployment in resource-constrained settings.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

APEX-Searcher: Augmenting LLMs' Search Capabilities through Agentic Planning and Execution

arXiv:2603.13853v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG), based on large language models (LLMs), serves as a vital approach to retrieving and leveraging external knowledge in various domain applications. When confronted with complex multi-hop questions, single-round retrieval is often insufficient for accurate reasoning and problem solving. To enhance search capabilities for complex tasks, most existing works integrate multi-round iterative retrieval with reasoning processes via end-to-end training. While these approaches significantly improve problem-solving performance, they are still faced with challenges in task reasoning and model training, especially ambiguous retrieval execution paths and sparse rewards in end-to-end reinforcement learning (RL) process, leading to inaccurate retrieval results and performance degradation. To address these issues, in this paper, we proposes APEX-Searcher, a novel Agentic Planning and Execution framework to augment LLM search capabilities. Specifically, we introduce a two-stage agentic framework that decouples the retrieval process into planning and execution: It first employs RL with decomposition-specific rewards to optimize strategic planning; Built on the sub-task decomposition, it then applies supervised fine-tuning on high-quality multi-hop trajectories to equip the model with robust iterative sub-task execution capabilities. Extensive experiments demonstrate that our proposed framework achieves significant improvements in both multi-hop RAG and task planning performances across multiple benchmarks.

Fonte: arXiv cs.CL

RLScore 85

Demand Acceptance using Reinforcement Learning for Dynamic Vehicle Routing Problem with Emission Quota

arXiv:2603.13279v1 Announce Type: new Abstract: This paper introduces and formalizes the Dynamic and Stochastic Vehicle Routing Problem with Emission Quota (DS-QVRP-RR), a novel routing problems that integrates dynamic demand acceptance and routing with a global emission constraint. A key contribution is a two-layer optimization framework designed to facilitate anticipatory rejections of demands and generation of new routes. To solve this, we develop hybrid algorithms that combine reinforcement learning with combinatorial optimization techniques. We present a comprehensive computational study that compares our approach against traditional methods. Our findings demonstrate the relevance of our approach for different types of inputs, even when the horizon of the problem is uncertain.

Fonte: arXiv cs.LG

RLScore 85

Delightful Policy Gradient

arXiv:2603.14608v1 Announce Type: cross Abstract: Standard policy gradients weight each sampled action by advantage alone, regardless of how likely that action was under the current policy. This creates two pathologies: within a single decision context (e.g. one image or prompt), a rare negative-advantage action can disproportionately distort the update direction; across many such contexts in a batch, the expected gradient over-allocates budget to contexts the policy already handles well. We introduce the \textit{Delightful Policy Gradient} (DG), which gates each term with a sigmoid of \emph{delight}, the product of advantage and action surprisal (negative log-probability). For $K$-armed bandits, DG provably improves directional accuracy in a single context and, across multiple contexts, shifts the expected gradient strictly closer to the supervised cross-entropy oracle. This second effect is not variance reduction: it persists even with infinite samples. Empirically, DG outperforms REINFORCE, PPO, and advantage-weighted baselines across MNIST, transformer sequence modeling, and continuous control, with larger gains on harder tasks.

Fonte: arXiv stat.ML

NLP/LLMsScore 85

LightningRL: Breaking the Accuracy-Parallelism Trade-off of Block-wise dLLMs via Reinforcement Learning

arXiv:2603.13319v1 Announce Type: new Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising paradigm for parallel token generation, with block-wise variants garnering significant research interest. Despite their potential, existing dLLMs typically suffer from a rigid accuracy-parallelism trade-off: increasing the number of tokens per forward (TPF) via aggressive parallel decoding often leads to performance degradation and increased generation instability. We identify that this limitation stems from the model's inability to navigate high-parallelism regimes where approximation errors and local corruptions accumulate, ultimately undermining the reliability of parallel generation. To address this, we propose LightningRL, a post-training framework designed to directly optimize the speed-quality Pareto frontier of pre-trained dLLMs. Instead of forcing uniform parallelization, our approach leverages reinforcement learning to identify and reinforce high-parallelism trajectories that maintain generation accuracy. Built upon the Group Relative Policy Optimization (GRPO) framework, LightningRL introduces several enhancements tailored for dLLMs: (1) stabilized training via per-reward decoupled normalization; (2) token-level negative log-likelihood (NLL) regularization on correct trajectories to anchor model performance; and (3) a dynamic sampling strategy with TPF-aware filtering to enhance training efficiency. Experimental results across mathematical and coding benchmarks demonstrate that LightningRL consistently advances the Pareto frontier, achieving competitive task accuracy while significantly increasing parallelism, reaching an average TPF of 7.32 (with a peak of 11.10 on the MBPP dataset). Our code is available at https://github.com/SJTU-DENG-Lab/LightningRL.

Fonte: arXiv cs.LG

RLScore 85

FastODT: A tree-based framework for efficient continual learning

arXiv:2603.13276v1 Announce Type: new Abstract: Machine learning models deployed in real-world settings must operate under evolving data distributions and constrained computational resources. This challenge is particularly acute in non-stationary domains such as energy time series, weather monitoring, and environmental sensing. To remain effective, models must support adaptability, continuous learning, and long-term knowledge retention. This paper introduces a oblivious tree-based model with Hoeffding bound controlling its growth. It seamlessly integrates rapid learning and inference with efficient memory management and robust knowledge preservation, thus allowing for online learning. Extensive experiments across energy and environmental sensing time-series benchmarks demonstrate that the proposed framework achieves performance competitive with, and in several cases surpassing, existing online and batch learning methods, while maintaining superior computational efficiency. Collectively, these results demonstrate that the proposed approach fulfills the core objectives of adaptability, continual updating, and efficient retraining without full model retraining. The framework provides a scalable and resource-aware foundation for deployment in real-world non-stationary environments where resources are constrained and sustained adaptation is essential.

Fonte: arXiv cs.LG

RLScore 85

ICPRL: Acquiring Physical Intuition from Interactive Control

arXiv:2603.13295v1 Announce Type: new Abstract: VLMs excel at static perception but falter in interactive reasoning in dynamic physical environments, which demands planning and adaptation to dynamic outcomes. Existing physical reasoning methods often depend on abstract symbolic inputs or lack the ability to learn and adapt from direct, pixel-based visual interaction in novel scenarios. We introduce ICPRL (In-Context Physical Reinforcement Learning), a framework inspired by In-Context Reinforcement Learning (ICRL) that empowers VLMs to acquire physical intuition and adapt their policies in-context. Our approach trains a vision-grounded policy model via multi-turn Group Relative Policy Optimization (GRPO) over diverse multi-episode interaction histories. This enables the agent to adapt strategies by conditioning on past trial-and-error sequences, without requiring any weight updates. This adaptive policy works in concert with a separately trained world model that provides explicit physical reasoning by predicting the results of potential actions. At inference, the policy proposes candidate actions, while the world model predicts outcomes to guide a root-node PUCT search to select the most promising action. Evaluated on the diverse physics-based puzzle-solving tasks in the DeepPHY benchmark, ICPRL demonstrates significant improvements across both its I. policy-only, and II. world-model-augmented stages. Notably, these gains are retained in unseen physical environments, demonstrating that our framework facilitates genuine in-context acquisition of the environment's physical dynamics from interactive experience.

Fonte: arXiv cs.LG

RLScore 85

Few Batches or Little Memory, But Not Both: Simultaneous Space and Adaptivity Constraints in Stochastic Bandits

arXiv:2603.13742v1 Announce Type: cross Abstract: We study stochastic multi-armed bandits under simultaneous constraints on space and adaptivity: the learner interacts with the environment in $B$ batches and has only $W$ bits of persistent memory. Prior work shows that each constraint alone is surprisingly mild: near-minimax regret $\widetilde{O}(\sqrt{KT})$ is achievable with $O(\log T)$ bits of memory under fully adaptive interaction, and with a $K$-independent $O(\log\log T)$-type number of batches when memory is unrestricted. We show that this picture breaks down in the simultaneously constrained regime. We prove that any algorithm with a $W$-bit memory constraint must use at least $\Omega(K/W)$ batches to achieve near-minimax regret $\widetilde{O}(\sqrt{KT})$ , even under adaptive grids. In particular, logarithmic memory rules out $K$-independent batch complexity. Our proof is based on an information bottleneck. We show that near-minimax regret forces the learner to acquire $\Omega(K)$ bits of information about the hidden set of good arms under a suitable hard prior, whereas an algorithm with $B$ batches and $W$ bits of memory allows only $O(BW)$ bits of information. A key ingredient is a localized change-of-measure lemma that yields probability-level arm exploration guarantees, which is of independent interest. We also give an algorithm using $O(\log T)$ bits of memory and $\widetilde{O}(K)$ batches that achieves regret $\widetilde{O}(\sqrt{KT})$, which nearly matches our lower bound.

Fonte: arXiv stat.ML

NLP/LLMsScore 85

GRPO and Reflection Reward for Mathematical Reasoning in Large Language Models

arXiv:2603.14041v1 Announce Type: new Abstract: The enhancement of reasoning capabilities in large language models (LLMs) has garnered significant attention, with supervised fine-tuning (SFT) and reinforcement learning emerging as dominant paradigms. While recent studies recognize the importance of reflection in reasoning processes, existing methodologies seldom address proactive reflection encouragement during training. This study focuses on mathematical reasoning by proposing a four-stage framework integrating Group Relative Policy Optimization (GRPO) with reflection reward mechanisms to strengthen LLMs' self-reflective capabilities. Besides, this approach incorporates established accuracy and format reward. Experimental results demonstrate GRPO's state-of-the-art performance through reflection-encouraged training, with ablation studies confirming the reflection reward's pivotal role. Comparative evaluations demonstrate full-parameter SFT's superiority over low-rank adaptation (LoRA) despite heightened computational demands. Building on these cumulative findings, this research substantiates GRPO's methodological significance in post-training optimization and envisions its potential to serve as a pivotal enabler for future LLM-based intelligent agents through the synergistic integration of cognitive rewards with dynamic environmental interactions.

Fonte: arXiv cs.AI

RLScore 85

AutoTool: Automatic Scaling of Tool-Use Capabilities in RL via Decoupled Entropy Constraints

arXiv:2603.13348v1 Announce Type: new Abstract: Tool use represents a critical capability for AI agents, with recent advances focusing on leveraging reinforcement learning (RL) to scale up the explicit reasoning process to achieve better performance. However, there are some key challenges for tool use in current RL-based scaling approaches: (a) direct RL training often struggles to scale up thinking length sufficiently to solve complex problems, and (b) scaled-up models tend to overthink simpler problems, resulting in substantial token inefficiency. To address these challenges, we propose a novel training paradigm that first employs warm-up supervised fine-tuning to help models distinguish between simple and complex problems, followed by RL that enable models to automatically determine appropriate reasoning trajectories. Furthermore, to tackle the issue of automatic thinking-length scaling, we discover that entropy-based optimization objectives effectively maintain model diversity while successfully unlocking the model's scaling capabilities. Based on this insight, we introduce an entropy-based long-short reasoning fusion RL strategy. Our experiments on three benchmarks demonstrate that model successfully achieves auto-scaling for efficient tool use, achieving significant 9.8\% accuracy improvements while reducing computational overhead by \textasciitilde81\%.

Fonte: arXiv cs.AI

RLScore 85

CALF: Communication-Aware Learning Framework for Distributed Reinforcement Learning

arXiv:2603.12543v1 Announce Type: new Abstract: Distributed reinforcement learning policies face network delays, jitter, and packet loss when deployed across edge devices and cloud servers. Standard RL training assumes zero-latency interaction, causing severe performance degradation under realistic network conditions. We introduce CALF (Communication-Aware Learning Framework), which trains policies under realistic network models during simulation. Systematic experiments demonstrate that network-aware training substantially reduces deployment performance gaps compared to network-agnostic baselines. Distributed policy deployments across heterogeneous hardware validate that explicitly modelling communication constraints during training enables robust real-world execution. These findings establish network conditions as a major axis of sim-to-real transfer for Wi-Fi-like distributed deployments, complementing physics and visual domain randomisation.

Fonte: arXiv cs.LG

RLScore 85

FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control

arXiv:2603.12612v1 Announce Type: new Abstract: Scaling Maximum Entropy Reinforcement Learning (RL) to high-dimensional humanoid control remains a formidable challenge, as the ``curse of dimensionality'' induces severe exploration inefficiency and training instability in expansive action spaces. Consequently, recent high-throughput paradigms have largely converged on deterministic policy gradients combined with massive parallel simulation. We challenge this compromise with FastDSAC, a framework that effectively unlocks the potential of maximum entropy stochastic policies for complex continuous control. We introduce Dimension-wise Entropy Modulation (DEM) to dynamically redistribute the exploration budget and enforce diversity, alongside a continuous distributional critic tailored to ensure value fidelity and mitigate high-dimensional value overestimation. Extensive evaluations on HumanoidBench and other continuous control tasks demonstrate that rigorously designed stochastic policies can consistently match or outperform deterministic baselines, achieving notable gains of 180\% and 400\% on the challenging \textit{Basketball} and \textit{Balance Hard} tasks.

Fonte: arXiv cs.LG

RLScore 85

Swap-guided Preference Learning for Personalized Reinforcement Learning from Human Feedback

arXiv:2603.12595v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) is a widely used approach to align large-scale AI systems with human values. However, RLHF typically assumes a single, universal reward, which overlooks diverse preferences and limits personalization. Variational Preference Learning (VPL) seeks to address this by introducing user-specific latent variables. Despite its promise, we found that VPL suffers from posterior collapse. While this phenomenon is well known in VAEs, it has not previously been identified in preference learning frameworks. Under sparse preference data and with overly expressive decoders, VPL may cause latent variables to be ignored, reverting to a single-reward model. To overcome this limitation, we propose Swap-guided Preference Learning (SPL). The key idea is to construct fictitious swap annotators and use the mirroring property of their preferences to guide the encoder. SPL introduces three components: (1) swap-guided base regularization, (2) Preferential Inverse Autoregressive Flow (P-IAF), and (3) adaptive latent conditioning. Experiments show that SPL mitigates collapse, enriches user-specific latents, and improves preference prediction. Our code and data are available at https://github.com/cobang0111/SPL

Fonte: arXiv cs.LG

RLScore 85

Rethinking Multiple-Choice Questions for RLVR: Unlocking Potential via Distractor Design

arXiv:2603.12826v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capabilities of Large Language Models. When applied to RLVR, Multiple-Choice Questions (MCQs) offer a scalable source of verifiable data but risk inducing reward hacking, where models shortcut reasoning via random guessing or simple elimination. Current approaches often mitigate this by converting MCQs to open-ended formats, thereby discarding the contrastive signal provided by expert-designed distractors. In this work, we systematically investigate the impact of option design on RLVR. Our analysis highlights two primary insights: (1) Mismatches in option counts between training and testing degrade performance. (2) Strong distractors effectively mitigate random guessing, enabling effective RLVR training even with 2-way questions. Motivated by these findings, we propose Iterative Distractor Curation (IDC), a framework that actively constructs high-quality distractors to block elimination shortcuts and promote deep reasoning. Experiments on various benchmarks demonstrate that our method effectively enhances distractor quality and yields significant gains in RLVR training compared to the original data.

Fonte: arXiv cs.CL

RLScore 85

A Spectral Revisit of the Distributional Bellman Operator under the Cram\'er Metric

arXiv:2603.12576v1 Announce Type: new Abstract: Distributional reinforcement learning (DRL) studies the evolution of full return distributions under Bellman updates rather than focusing on expected values. A classical result is that the distributional Bellman operator is contractive under the Cram\'er metric, which corresponds to an $L^2$ geometry on differences of cumulative distribution functions (CDFs). While this contraction ensures stability of policy evaluation, existing analyses remain largely metric, focusing on contraction properties without elucidating the structural action of the Bellman update on distributions. In this work, we analyse distributional Bellman dynamics directly at the level of CDFs, treating the Cram\'er geometry as the intrinsic analytical setting. At this level, the Bellman update acts affinely on CDFs and linearly on differences between CDFs, and its contraction property yields a uniform bound on this linear action. Building on this intrinsic formulation, we construct a family of regularised spectral Hilbert representations that realise the CDF-level geometry by exact conjugation, without modifying the underlying Bellman dynamics. The regularisation affects only the geometry and vanishes in the zero-regularisation limit, recovering the native Cram\'er metric. This framework clarifies the operator structure underlying distributional Bellman updates and provides a foundation for further functional and operator-theoretic analyses in DRL.

Fonte: arXiv cs.LG

RLScore 85

When Right Meets Wrong: Bilateral Context Conditioning with Reward-Confidence Correction for GRPO

arXiv:2603.13134v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) has emerged as an effective method for training reasoning models. While it computes advantages based on group mean, GRPO treats each output as an independent sample during the optimization and overlooks a vital structural signal: the natural contrast between correct and incorrect solutions within the same group, thus ignoring the rich, comparative data that could be leveraged by explicitly pitting successful reasoning traces against failed ones. To capitalize on this, we present a contrastive reformulation of GRPO, showing that the GRPO objective implicitly maximizes the margin between the policy ratios of correct and incorrect samples. Building on this insight, we propose Bilateral Context Conditioning (BICC), a mechanism that allows the model to cross-reference successful and failed reasoning traces during the optimization, enabling a direct information flow across samples. We further introduce Reward-Confidence Correction (RCC) to stabilize training by dynamically adjusts the advantage baseline in GRPO using reward-confidence covariance derived from the first-order approximation of the variance-minimizing estimator. Both mechanisms require no additional sampling or auxiliary models and can be adapted to all GRPO variants. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements across comprehensive models and algorithms. Code is available at \href{https://github.com/Skylanding/BiCC}{https://github.com/Skylanding/BiCC}.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

RetroReasoner: A Reasoning LLM for Strategic Retrosynthesis Prediction

arXiv:2603.12666v1 Announce Type: new Abstract: Retrosynthesis prediction is a core task in organic synthesis that aims to predict reactants for a given product molecule. Traditionally, chemists select a plausible bond disconnection and derive corresponding reactants, which is time-consuming and requires substantial expertise. While recent advancements in molecular large language models (LLMs) have made progress, many methods either predict reactants without strategic reasoning or conduct only a generic product analysis, rather than reason explicitly about bond-disconnection strategies that logically lead to the choice of specific reactants. To overcome these limitations, we propose RetroReasoner, a retrosynthetic reasoning model that leverages chemists' strategic thinking. RetroReasoner is trained using both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we introduce SyntheticRetro, a framework that generates structured disconnection rationales alongside reactant predictions. In the case of RL, we apply a round-trip accuracy as reward, where predicted reactants are passed through a forward synthesis model, and predictions are rewarded when the forward-predicted product matches the original input product. Experimental results show that RetroReasoner not only outperforms prior baselines but also generates a broader range of feasible reactant proposals, particularly in handling more challenging reaction instances.

Fonte: arXiv cs.LG

RLScore 85

Maximum Entropy Exploration Without the Rollouts

arXiv:2603.12325v1 Announce Type: new Abstract: Efficient exploration remains a central challenge in reinforcement learning, serving as a useful pretraining objective for data collection, particularly when an external reward function is unavailable. A principled formulation of the exploration problem is to find policies that maximize the entropy of their induced steady-state visitation distribution, thereby encouraging uniform long-run coverage of the state space. Many existing exploration approaches require estimating state visitation frequencies through repeated on-policy rollouts, which can be computationally expensive. In this work, we instead consider an intrinsic average-reward formulation in which the reward is derived from the visitation distribution itself, so that the optimal policy maximizes steady-state entropy. An entropy-regularized version of this objective admits a spectral characterization: the relevant stationary distributions can be computed from the dominant eigenvectors of a problem-dependent transition matrix. This insight leads to a novel algorithm for solving the maximum entropy exploration problem, EVE (EigenVector-based Exploration), which avoids explicit rollouts and distribution estimation, instead computing the solution through iterative updates, similar to a value-based approach. To address the original unregularized objective, we employ a posterior-policy iteration (PPI) approach, which monotonically improves the entropy and converges in value. We prove convergence of EVE under standard assumptions and demonstrate empirically that it efficiently produces policies with high steady-state entropy, achieving competitive exploration performance relative to rollout-based baselines in deterministic grid-world environments.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Lyapunov Stable Graph Neural Flow

arXiv:2603.12557v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) are highly vulnerable to adversarial perturbations in both topology and features, making the learning of robust representations a critical challenge. In this work, we bridge GNNs with control theory to introduce a novel defense framework grounded in integer- and fractional-order Lyapunov stability. Unlike conventional strategies that rely on resource-heavy adversarial training or data purification, our approach fundamentally constrains the underlying feature-update dynamics of the GNN. We propose an adaptive, learnable Lyapunov function paired with a novel projection mechanism that maps the network's state into a stable space, thereby offering theoretically provable stability guarantees. Notably, this mechanism is orthogonal to existing defenses, allowing for seamless integration with techniques like adversarial training to achieve cumulative robustness. Extensive experiments demonstrate that our Lyapunov-stable graph neural flows substantially outperform base neural flows and state-of-the-art baselines across standard benchmarks and various adversarial attack scenarios.

Fonte: arXiv cs.LG

RLScore 85

Batched Kernelized Bandits: Refinements and Extensions

arXiv:2603.12627v1 Announce Type: new Abstract: In this paper, we consider the problem of black-box optimization with noisy feedback revealed in batches, where the unknown function to optimize has a bounded norm in some Reproducing Kernel Hilbert Space (RKHS). We refer to this as the Batched Kernelized Bandits problem, and refine and extend existing results on regret bounds. For algorithmic upper bounds, (Li and Scarlett, 2022) shows that $B=O(\log\log T)$ batches suffice to attain near-optimal regret, where $T$ is the time horizon and $B$ is the number of batches. We further refine this by (i) finding the optimal number of batches including constant factors (to within $1+o(1)$), and (ii) removing a factor of $B$ in the regret bound. For algorithm-independent lower bounds, noticing that existing results only apply when the batch sizes are fixed in advance, we present novel lower bounds when the batch sizes are chosen adaptively, and show that adaptive batches have essentially same minimax regret scaling as fixed batches. Furthermore, we consider a robust setting where the goal is to choose points for which the function value remains high even after an adversarial perturbation. We present the robust-BPE algorithm, and show that a suitably-defined cumulative regret notion incurs the same bound as the non-robust setting, and derive a simple regret bound significantly below that of previous work.

Fonte: arXiv stat.ML

RLScore 85

Bayesian Conservative Policy Optimization (BCPO): A Novel Uncertainty-Calibrated Offline Reinforcement Learning with Credible Lower Bounds

arXiv:2603.12284v1 Announce Type: cross Abstract: Offline reinforcement learning (RL) aims to learn decision policies from a fixed batch of logged transitions, without additional environment interaction. Despite remarkable empirical progress, offline RL remains fragile under distribution shifts: value-based methods can overestimate the value of unseen actions, yielding policies that exploit model errors rather than genuine long-term rewards. We propose \emph{Bayesian Conservative Policy Optimization (BCPO)}, a unified framework that converts epistemic uncertainty into \emph{provably conservative} policy improvement. BCPO maintains a hierarchical Bayesian posterior over environment/value models, constructs a \emph{credible lower bound} (LCB) on action values, and performs policy updates under explicit KL regularization toward the behavior distribution. This yields an uncertainty-calibrated analogue of conservative policy iteration in the offline regime. We provide a finite-MDP theory showing that the pessimistic fixed point lower-bounds the true value function with high probability and that KL-controlled updates improve a computable return lower bound. Empirically, we verify the methodology on a real offline replay dataset for the CartPole benchmark obtained via the \texttt{d3rlpy} ecosystem, and report diagnostics that link uncertainty growth and policy drift to offline instability, motivating principled early stopping and calibration

Fonte: arXiv stat.ML

RLScore 85

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

arXiv:2603.12554v1 Announce Type: new Abstract: Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo-dllm-rl.

Fonte: arXiv cs.LG

RLScore 85

Optimize Wider, Not Deeper: Consensus Aggregation for Policy Optimization

arXiv:2603.12596v1 Announce Type: new Abstract: Proximal policy optimization (PPO) approximates the trust region update using multiple epochs of clipped SGD. Each epoch may drift further from the natural gradient direction, creating path-dependent noise. To understand this drift, we can use Fisher information geometry to decompose policy updates into signal (the natural gradient projection) and waste (the Fisher-orthogonal residual that consumes trust region budget without first-order surrogate improvement). Empirically, signal saturates but waste grows with additional epochs, creating an optimization-depth dilemma. We propose Consensus Aggregation for Policy Optimization (CAPO), which redirects compute from depth to width: $K$ PPO replicates are optimized on the same batch, differing only in minibatch shuffling order, and then aggregated into a consensus. We study aggregation in two spaces: Euclidean parameter space, and the natural parameter space of the policy distribution via the logarithmic opinion pool. In natural parameter space, the consensus provably achieves higher KL-penalized surrogate and tighter trust region compliance than the mean expert; parameter averaging inherits these guarantees approximately. On continuous control tasks, CAPO outperforms PPO and compute-matched deeper baselines under fixed sample budgets by up to 8.6x. CAPO demonstrates that policy optimization can be improved by optimizing wider, rather than deeper, without additional environment interactions.

Fonte: arXiv cs.LG

RLScore 85

A Reduction Algorithm for Markovian Contextual Linear Bandits

arXiv:2603.12530v1 Announce Type: new Abstract: Recent work shows that when contexts are drawn i.i.d., linear contextual bandits can be reduced to single-context linear bandits. This ``contexts are cheap" perspective is highly advantageous, as it allows for sharper finite-time analyses and leverages mature techniques from the linear bandit literature, such as those for misspecification and adversarial corruption. Motivated by applications with temporally correlated availability, we extend this perspective to Markovian contextual linear bandits, where the action set evolves via an exogenous Markov chain. Our main contribution is a reduction that applies under uniform geometric ergodicity. We construct a stationary surrogate action set to solve the problem using a standard linear bandit oracle, employing a delayed-update scheme to control the bias induced by the nonstationary conditional context distributions. We further provide a phased algorithm for unknown transition distributions that learns the surrogate mapping online. In both settings, we obtain a high-probability worst-case regret bound matching that of the underlying linear bandit oracle, with only lower-order dependence on the mixing time.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization

arXiv:2603.12639v1 Announce Type: new Abstract: Scalable Embodied AI faces fundamental constraints due to prohibitive costs and safety risks of real-world interaction. While Embodied World Models (EWMs) offer promise through imagined rollouts, existing approaches suffer from geometric hallucinations and lack unified optimization frameworks for practical policy improvement. We introduce RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, we present the first unified framework for world-model-based policy optimization: (1) Test-Time Policy Augmentation (TTPA) for pre-execution verification, (2) Imitative-Evolutionary Policy Learning (IEPL) leveraging visual perceptual rewards to learn from expert demonstrations, and (3) Open-Exploration Policy Learning (OEPL) enabling autonomous skill discovery and self-correction. Comprehensive experiments demonstrate RoboStereo achieves state-of-the-art generation quality, with our unified framework delivering >97% average relative improvement on fine-grained manipulation tasks.

Fonte: arXiv cs.CV

Theory/OptimizationScore 85

Tight Non-asymptotic Inference via Sub-Gaussian Intrinsic Moment Norm

arXiv:2303.07287v3 Announce Type: replace Abstract: In non-asymptotic learning, variance-type parameters of sub-Gaussian distributions are of paramount importance. However, directly estimating these parameters using the empirical moment generating function (MGF) is infeasible. To address this, we suggest using the sub-Gaussian intrinsic moment norm [Buldygin and Kozachenko (2000), Theorem 1.3] achieved by maximizing a sequence of normalized moments. Significantly, the suggested norm can not only reconstruct the exponential moment bounds of MGFs but also provide tighter sub-Gaussian concentration inequalities. In practice, we provide an intuitive method for assessing whether data with a finite sample size is sub-Gaussian, utilizing the sub-Gaussian plot. The intrinsic moment norm can be robustly estimated via a simple plug-in approach. Our theoretical findings are also applicable to reinforcement learning, including the multi-armed bandit scenario.

Fonte: arXiv stat.ML

RLScore 85

Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation

arXiv:2603.13131v1 Announce Type: new Abstract: Open-world embodied agents must solve long-horizon tasks where the main bottleneck is not single-step planning quality but how interaction experience is organized and evolved. To this end, we present Steve-Evolving, a non-parametric self-evolving framework that tightly couples fine-grained execution diagnosis with dual-track knowledge distillation in a closed loop. The method follows three phases: Experience Anchoring, Experience Distillation, and Knowledge-Driven Closed-Loop Control. In detail, Experience Anchoring solidifies each subgoal attempt into a structured experience tuple with a fixed schema (pre-state, action, diagnosis-result, and post-state) and organizes it in a three-tier experience space with multi-dimensional indices (e.g., condition signatures, spatial hashing, and semantic tags) plus rolling summarization for efficient and auditable recall. To ensure sufficient information density for attribution, the execution layer provides compositional diagnosis signals beyond binary outcomes, including state-difference summaries, enumerated failure causes, continuous indicators, and stagnation/loop detection. Moreover, successful trajectories of Experience Distillation are generalized into reusable skills with explicit preconditions and verification criteria, while failures are distilled into executable guardrails that capture root causes and forbid risky operations at both subgoal and task granularities. Besides, Knowledge-Driven Closed-Loop Control retrieved skills and guardrails are injected into an LLM planner, and diagnosis-triggered local replanning updates the active constraints online, forming a continual evolution process without any model parameter updates. Experiments on the long-horizon suite of Minecraft MCU demonstrate consistent improvements over static-retrieval baselines.

Fonte: arXiv cs.AI

RLScore 85

EvolveCoder: Evolving Test Cases via Adversarial Verification for Code Reinforcement Learning

arXiv:2603.12698v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving code generation in large language models, but its effectiveness is limited by weak and static verification signals in existing coding RL datasets. In this paper, we propose a solution-conditioned and adversarial verification framework that iteratively refines test cases based on the execution behaviors of candidate solutions, with the goal of increasing difficulty, improving discriminative power, and reducing redundancy. Based on this framework, we introduce EvolveCoder-22k, a large-scale coding reinforcement learning dataset constructed through multiple rounds of adversarial test case evolution. Empirical analysis shows that iterative refinement substantially strengthens verification, with pass@1 decreasing from 43.80 to 31.22. Reinforcement learning on EvolveCoder-22k yields stable optimization and consistent performance gains, improving Qwen3-4B by an average of 4.2 points across four downstream benchmarks and outperforming strong 4B-scale baselines. Our results highlight the importance of adversarial, solution-conditioned verification for effective and scalable reinforcement learning in code generation.

Fonte: arXiv cs.CL

RLScore 85

Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models

arXiv:2603.12893v1 Announce Type: cross Abstract: Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high-quality vision language models and off-the-shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.

Fonte: arXiv stat.ML

RLScore 85

One-Step Flow Policy: Self-Distillation for Fast Visuomotor Policies

arXiv:2603.12480v1 Announce Type: cross Abstract: Generative flow and diffusion models provide the continuous, multimodal action distributions needed for high-precision robotic policies. However, their reliance on iterative sampling introduces severe inference latency, degrading control frequency and harming performance in time-sensitive manipulation. To address this problem, we propose the One-Step Flow Policy (OFP), a from-scratch self-distillation framework for high-fidelity, single-step action generation without a pre-trained teacher. OFP unifies a self-consistency loss to enforce coherent transport across time intervals, and a self-guided regularization to sharpen predictions toward high-density expert modes. In addition, a warm-start mechanism leverages temporal action correlations to minimize the generative transport distance. Evaluations across 56 diverse simulated manipulation tasks demonstrate that a one-step OFP achieves state-of-the-art results, outperforming 100-step diffusion and flow policies while accelerating action generation by over $100\times$. We further integrate OFP into the $\pi_{0.5}$ model on RoboTwin 2.0, where one-step OFP surpasses the original 10-step policy. These results establish OFP as a practical, scalable solution for highly accurate and low-latency robot control.

Fonte: arXiv cs.AI

RLScore 85

Thermodynamics of Reinforcement Learning Curricula

arXiv:2603.12324v1 Announce Type: new Abstract: Connections between statistical mechanics and machine learning have repeatedly proven fruitful, providing insight into optimization, generalization, and representation learning. In this work, we follow this tradition by leveraging results from non-equilibrium thermodynamics to formalize curriculum learning in reinforcement learning (RL). In particular, we propose a geometric framework for RL by interpreting reward parameters as coordinates on a task manifold. We show that, by minimizing the excess thermodynamic work, optimal curricula correspond to geodesics in this task space. As an application of this framework, we provide an algorithm, "MEW" (Minimum Excess Work), to derive a principled schedule for temperature annealing in maximum-entropy RL.

Fonte: arXiv cs.LG

RLScore 85

Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

arXiv:2603.13045v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high-resource language pairs, yet their performance on low-resource translation still lags behind. Existing post-training methods rely heavily on high-quality parallel data, which are often scarce or unavailable for low-resource languages. In this paper, we introduce WALAR, a reinforcement training method using only monolingual text to elevate LLMs' translation capabilities on massive low-resource languages while retaining their performance on high-resource languages. Our key insight is based on the observation of failure modes (or "holes") in existing source-based multilingual quality estimation (QE) models. Reinforcement learning (RL) using these QE models tends to amplify such holes, resulting in poorer multilingual LLMs. We develop techniques including word alignment and language alignment to mitigate such holes in WALAR's reward for RL training. We continually trained an LLM supporting translation of 101 languages using WALAR. The experiments show that our new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs by a large margin on 1400 language directions on Flores-101 dataset.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

Scaling Reasoning Efficiently via Relaxed On-Policy Distillation

arXiv:2603.11137v1 Announce Type: new Abstract: On-policy distillation is pivotal for transferring reasoning capabilities to capacity-constrained models, yet remains prone to instability and negative transfer. We show that on-policy distillation can be interpreted, both theoretically and empirically, as a form of policy optimization, where the teacher-student log-likelihood ratio acts as a token reward. From this insight, we introduce REOPOLD (Relaxed On-Policy Distillation) a framework that stabilizes optimization by relaxing the strict imitation constraints of standard on-policy distillation. Specifically, REOPOLD temperately and selectively leverages rewards from the teacher through mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy. Empirically, REOPOLD surpasses its baselines with superior sample efficiency during training and enhanced test-time scaling at inference, across mathematical, visual, and agentic tool-use reasoning tasks. Specifically, REOPOLD outperforms recent RL approaches achieving 6.7~12x greater sample efficiency and enables a 7B student to match a 32B teacher in visual reasoning with a ~3.32x inference speedup.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning

arXiv:2603.11563v1 Announce Type: new Abstract: Embodied task planning demands vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically-grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO), its purely relative nature -- optimizing only the preference gap between winning and losing trajectories while neglecting absolute likelihood constraints on optimal path, often yields unsafe or hallucinated behaviors. To address this, we further introduce Bias-DPO, a novel alignment objective that injects an inductive bias toward expert trajectories by explicitly maximizing likelihood on ground-truth actions while penalizing overconfident hallucinations. By anchoring the policy to the expert manifold and mitigating causal misalignment, SVLL, powered by Bias-DPO, ensures strict adherence to environmental affordances and effectively suppresses physically impossible shortcuts. Finally, extensive experiments on the interactive AI2-THOR benchmark and real-world robotic deployments demonstrate that SVLL outperforms both state-of-the-art open-source (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate, while significantly reducing physical constraint violations.

Fonte: arXiv cs.CV

RLScore 85

Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning

arXiv:2603.11346v1 Announce Type: new Abstract: Humanoid robotics has strong potential to transform daily service and caregiving applications. Although recent advances in general motion tracking within physics engines (GMT) have enabled virtual characters and humanoid robots to reproduce a broad range of human motions, these behaviors are primarily limited to contact-less social interactions or isolated movements. Assistive scenarios, by contrast, require continuous awareness of a human partner and rapid adaptation to their evolving posture and dynamics. In this paper, we formulate the imitation of closely interacting, force-exchanging human-human motion sequences as a multi-agent reinforcement learning problem. We jointly train partner-aware policies for both the supporter (assistant) agent and the recipient agent in a physics simulator to track assistive motion references. To make this problem tractable, we introduce a partner policies initialization scheme that transfers priors from single-human motion-tracking controllers, greatly improving exploration. We further propose dynamic reference retargeting and contact-promoting reward, which adapt the assistant's reference motion to the recipient's real-time pose and encourage physically meaningful support. We show that AssistMimic is the first method capable of successfully tracking assistive interaction motions on established benchmarks, demonstrating the benefits of a multi-agent RL formulation for physically grounded and socially aware humanoid control.

Fonte: arXiv cs.CV

RLScore 85

Exploiting Expertise of Non-Expert and Diverse Agents in Social Bandit Learning: A Free Energy Approach

arXiv:2603.11757v1 Announce Type: cross Abstract: Personalized AI-based services involve a population of individual reinforcement learning agents. However, most reinforcement learning algorithms focus on harnessing individual learning and fail to leverage the social learning capabilities commonly exhibited by humans and animals. Social learning integrates individual experience with observing others' behavior, presenting opportunities for improved learning outcomes. In this study, we focus on a social bandit learning scenario where a social agent observes other agents' actions without knowledge of their rewards. The agents independently pursue their own policy without explicit motivation to teach each other. We propose a free energy-based social bandit learning algorithm over the policy space, where the social agent evaluates others' expertise levels without resorting to any oracle or social norms. Accordingly, the social agent integrates its direct experiences in the environment and others' estimated policies. The theoretical convergence of our algorithm to the optimal policy is proven. Empirical evaluations validate the superiority of our social learning method over alternative approaches in various scenarios. Our algorithm strategically identifies the relevant agents, even in the presence of random or suboptimal agents, and skillfully exploits their behavioral information. In addition to societies including expert agents, in the presence of relevant but non-expert agents, our algorithm significantly enhances individual learning performance, where most related methods fail. Importantly, it also maintains logarithmic regret.

Fonte: arXiv stat.ML

RLScore 85

Adaptive Prior Selection in Gaussian Process Bandits with Thompson Sampling

arXiv:2502.01226v3 Announce Type: replace-cross Abstract: Gaussian process (GP) bandits provide a powerful framework for performing blackbox optimization of unknown functions. The characteristics of the unknown function depend heavily on the assumed GP prior. Most work in the literature assume that this prior is known but in practice this seldom holds. Instead, practitioners often rely on maximum likelihood estimation to select the hyperparameters of the prior - which lacks theoretical guarantees. In this work, we propose two algorithms for joint prior selection and regret minimization in GP bandits based on GP Thompson sampling (GP-TS): Prior-Elimination GP-TS (PE-GP-TS) that disqualifies priors with poor predictive performance, and HyperPrior GP-TS (HP-GP-TS) that utilizes a bi-level Thompson sampling scheme. We theoretically analyze the algorithms and establish upper bounds for their respective regret. In addition, we demonstrate the effectiveness of our algorithms compared to the alternatives through extensive experiments with synthetic and real-world data.

Fonte: arXiv stat.ML

RLScore 85

Onflow: a model free, online portfolio allocation algorithm robust to transaction fees

arXiv:2312.05169v2 Announce Type: replace-cross Abstract: We introduce Onflow, a reinforcement learning method for optimizing portfolio allocation via gradient flows. Our approach dynamically adjusts portfolio allocations to maximize expected log returns while accounting for transaction costs. Using a softmax parameterization, Onflow updates allocations through an ordinary differential equation derived from gradient flow methods. This algorithm belongs to the large class of stochastic optimization procedures; we measure its efficiency by comparing our results to the mathematical theoretical values in a log-normal framework and to standard benchmarks from the 'old NYSE' dataset. For log-normal assets with zero transaction costs, Onflow replicates Markowitz optimal portfolio, achieving the best possible allocation. Numerical experiments from the 'old NYSE' dataset show that Onflow leads to dynamic asset allocation strategies whose performances are: a) comparable to benchmark strategies such as Cover's Universal Portfolio or Helmbold et al. ``multiplicative updates'' approach when transaction costs are zero, and b) better than previous procedures when transaction costs are high. Onflow can even remain efficient in regimes where other dynamical allocation techniques do not work anymore. Onflow is a promising portfolio management strategy that relies solely on observed prices, requiring no assumptions about asset return distributions. This makes it robust against model risk, offering a practical solution for real-world trading strategies.

Fonte: arXiv stat.ML

RLScore 85

RIE-Greedy: Regularization-Induced Exploration for Contextual Bandits

arXiv:2603.11276v1 Announce Type: new Abstract: Real-world contextual bandit problems with complex reward models are often tackled with iteratively trained models, such as boosting trees. However, it is difficult to directly apply simple and effective exploration strategies--such as Thompson Sampling or UCB--on top of those black-box estimators. Existing approaches rely on sophisticated assumptions or intractable procedures that are hard to verify and implement in practice. In this work, we explore the use of an exploration-free (pure-greedy) action selection strategy, that exploits the randomness inherent in model fitting process as an intrinsic source of exploration. More specifically, we note that the stochasticity in cross-validation based regularization process can naturally induce Thompson Sampling-like exploration. We show that this regularization-induced exploration is theoretically equivalent to Thompson Sampling in the two-armed bandit case and empirically leads to reliable exploration in large-scale business environments compared to benchmark methods such as epsilon-greedy and other state-of-the-art approaches. Overall, our work reveals how regularized estimator training itself can induce effective exploration, offering both theoretical insight and practical guidance for contextual bandit design.

Fonte: arXiv stat.ML

RLScore 85

Meta-Reinforcement Learning with Self-Reflection for Agentic Search

arXiv:2603.11327v1 Announce Type: new Abstract: This paper introduces MR-Search, an in-context meta reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. MR-Search learns to learn a search strategy with self-reflection, allowing search agents to improve in-context exploration at test-time. Specifically, MR-Search performs cross-episode exploration by generating explicit self-reflections after each episode and leveraging them as additional context to guide subsequent attempts, thereby promoting more effective exploration during test-time. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment on each episode. Empirical results across various benchmarks demonstrate the advantages of MR-Search over baselines based RL, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at https://github.com/tengxiao1/MR-Search.

Fonte: arXiv cs.LG

RLScore 85

Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings

arXiv:2603.11321v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for post-training reasoning models. However, group-based methods such as Group Relative Policy Optimization (GRPO) face a critical dilemma in sparse-reward settings: pure Reinforcement Learning (RL) suffers from advantage collapse and high-variance gradient estimation, while mixed-policy optimization introduces persistent distributional bias. To resolve this dilemma, we introduce Hindsight-Anchored Policy Optimization (HAPO). HAPO employs the Synthetic Success Injection (SSI) operator, a hindsight mechanism that selectively anchors optimization to teacher demonstrations during failure. This injection is governed by a Thompson sampling-inspired gating mechanism, creating an autonomous, self-paced curriculum. Theoretically, we demonstrate that HAPO achieves \textit{asymptotic consistency}: by naturally annealing the teacher signal as the policy improves, HAPO recovers the unbiased on-policy gradient. This ensures off-policy guidance acts as a temporary scaffold rather than a persistent ceiling, enabling the model to surpass the limitations of static teacher forcing.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Group Resonance Network: Learnable Prototypes and Multi-Subject Resonance for EEG Emotion Recognition

arXiv:2603.11119v1 Announce Type: new Abstract: Electroencephalography(EEG)-basedemotionrecognitionre- mains challenging in cross-subject settings due to severe inter-subject variability. Existing methods mainly learn subject-invariant features, but often under-exploit stimulus-locked group regularities shared across sub- jects. To address this issue, we propose the Group Resonance Network (GRN), which integrates individual EEG dynamics with offline group resonance modeling. GRN contains three components: an individual en- coder for band-wise EEG features, a set of learnable group prototypes for prototype-induced resonance, and a multi-subject resonance branch that encodes PLV/coherence-based synchrony with a small reference set. A resonance-aware fusion module combines individual and group-level rep- resentations for final classification. Experiments on SEED and DEAP under both subject-dependent and leave-one-subject-out protocols show that GRN consistently outperforms competitive baselines, while abla- tion studies confirm the complementary benefits of prototype learning and multi-subject resonance modeling.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks

arXiv:2603.11554v1 Announce Type: new Abstract: Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor in-house environments, failing to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. Being aware of vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments using open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning.

Fonte: arXiv cs.CV

VisionScore 85

Senna-2: Aligning VLM and End-to-End Driving Policy for Consistent Decision Making and Planning

arXiv:2603.11219v1 Announce Type: new Abstract: Vision-language models (VLMs) enhance the planning capability of end-to-end (E2E) driving policy by leveraging high-level semantic reasoning. However, existing approaches often overlook the dual-system consistency between VLM's high-level decision and E2E's low-level planning. As a result, the generated trajectories may misalign with the intended driving decisions, leading to weakened top-down guidance and decision-following ability of the system. To address this issue, we propose Senna-2, an advanced VLM-E2E driving policy that explicitly aligns the two systems for consistent decision-making and planning. Our method follows a consistency-oriented three-stage training paradigm. In the first stage, we conduct driving pre-training to achieve preliminary decision-making and planning, with a decision adapter transmitting VLM decisions to E2E policy in the form of implicit embeddings. In the second stage, we align the VLM and the E2E policy in an open-loop setting. In the third stage, we perform closed-loop alignment via bottom-up Hierarchical Reinforcement Learning in 3DGS environments to reinforce the safety and efficiency. Extensive experiments demonstrate that Senna-2 achieves superior dual-system consistency (19.3% F1 score improvement) and significantly enhances driving safety in both open-loop (5.7% FDE reduction) and closed-loop settings (30.6% AF-CR reduction).

Fonte: arXiv cs.CV

RLScore 85

A Further Efficient Algorithm with Best-of-Both-Worlds Guarantees for $m$-Set Semi-Bandit Problem

arXiv:2603.11764v1 Announce Type: cross Abstract: This paper studies the optimality and complexity of Follow-the-Perturbed-Leader (FTPL) policy in $m$-set semi-bandit problems. FTPL has been studied extensively as a promising candidate of an efficient algorithm with favorable regret for adversarial combinatorial semi-bandits. Nevertheless, the optimality of FTPL has still been unknown unlike Follow-the-Regularized-Leader (FTRL) whose optimality has been proved for various tasks of online learning. In this paper, we extend the analysis of FTPL with geometric resampling (GR) to $m$-set semi-bandits, which is a special case of combinatorial semi-bandits, showing that FTPL with Fr\'{e}chet and Pareto distributions with certain parameters achieves the best possible regret of $O(\sqrt{mdT})$ in adversarial setting. We also show that FTPL with Fr\'{e}chet and Pareto distributions with a certain parameter achieves a logarithmic regret for stochastic setting, meaning the Best-of-Both-Worlds optimality of FTPL for $m$-set semi-bandit problems. Furthermore, we extend the conditional geometric resampling to $m$-set semi-bandits for efficient loss estimation in FTPL, reducing the computational complexity from $O(d^2)$ of the original geometric resampling to $O(md(\log(d/m)+1))$ without sacrificing the regret performance.

Fonte: arXiv stat.ML

RLScore 85

abx_amr_simulator: A simulation environment for antibiotic prescribing policy optimization under antimicrobial resistance

arXiv:2603.11369v1 Announce Type: new Abstract: Antimicrobial resistance (AMR) poses a global health threat, reducing the effectiveness of antibiotics and complicating clinical decision-making. To address this challenge, we introduce abx_amr_simulator, a Python-based simulation package designed to model antibiotic prescribing and AMR dynamics within a controlled, reinforcement learning (RL)-compatible environment. The simulator allows users to specify patient populations, antibiotic-specific AMR response curves, and reward functions that balance immedi- ate clinical benefit against long-term resistance management. Key features include a modular design for configuring patient attributes, antibiotic resistance dynamics modeled via a leaky-balloon abstraction, and tools to explore partial observability through noise, bias, and delay in observations. The package is compatible with the Gymnasium RL API, enabling users to train and test RL agents under diverse clinical scenarios. From an ML perspective, the package provides a configurable benchmark environment for sequential decision-making under uncertainty, including partial observability induced by noisy, biased, and delayed observations. By providing a customizable and extensible framework, abx_amr_simulator offers a valuable tool for studying AMR dynamics and optimizing antibiotic stewardship strategies under realistic uncertainty.

Fonte: arXiv cs.LG

RLScore 85

Ensuring Safety in Automated Mechanical Ventilation through Offline Reinforcement Learning and Digital Twin Verification

arXiv:2603.11372v1 Announce Type: new Abstract: Mechanical ventilation (MV) is a life-saving intervention for patients with acute respiratory failure (ARF) in the ICU. However, inappropriate ventilator settings could cause ventilator-induced lung injury (VILI). Also, clinicians workload is shown to be directly linked to patient outcomes. Hence, MV should be personalized and automated to improve patient outcomes. Previous attempts to incorporate personalization and automation in MV include traditional supervised learning and offline reinforcement learning (RL) approaches, which often neglect temporal dependencies and rely excessively on mortality-based rewards. As a result, early stage physiological deterioration and the risk of VILI are not adequately captured. To address these limitations, we propose Transformer-based Conservative Q-Learning (T-CQL), a novel offline RL framework that integrates a Transformer encoder for effective temporal modeling of patient dynamics, conservative adaptive regularization based on uncertainty quantification to ensure safety, and consistency regularization for robust decision-making. We build a clinically informed reward function that incorporates indicators of VILI and a score for severity of patients illness. Also, previous work predominantly uses Fitted Q-Evaluation (FQE) for RL policy evaluation on static offline data, which is less responsive to dynamic environmental changes and susceptible to distribution shifts. To overcome these evaluation limitations, interactive digital twins of ARF patients were used for online "at the bedside" evaluation. Our results demonstrate that T-CQL consistently outperforms existing state-of-the-art offline RL methodologies, providing safer and more effective ventilatory adjustments. Our framework demonstrates the potential of Transformer-based models combined with conservative RL strategies as a decision support tool in critical care.

Fonte: arXiv cs.LG

RLScore 85

ARROW: Augmented Replay for RObust World models

arXiv:2603.11395v1 Announce Type: new Abstract: Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones with the goal of improving performance in both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive World Model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: Tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same-size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.

Fonte: arXiv cs.LG

RLScore 85

Actor-Accelerated Policy Dual Averaging for Reinforcement Learning in Continuous Action Spaces

arXiv:2603.10199v1 Announce Type: new Abstract: Policy Dual Averaging (PDA) offers a principled Policy Mirror Descent (PMD) framework that more naturally admits value function approximation than standard PMD, enabling the use of approximate advantage (or Q-) functions while retaining strong convergence guarantees. However, applying PDA in continuous state and action spaces remains computationally challenging, since action selection involves solving an optimization sub-problem at each decision step. In this paper, we propose \textit{actor-accelerated PDA}, which uses a learned policy network to approximate the solution of the optimization sub-problems, yielding faster runtimes while maintaining convergence guarantees. We provide a theoretical analysis that quantifies how actor approximation error impacts the convergence of PDA under suitable assumptions. We then evaluate its performance on several benchmarks in robotics, control, and operations research problems. Actor-accelerated PDA achieves superior performance compared to popular on-policy baselines such as Proximal Policy Optimization (PPO). Overall, our results bridge the gap between the theoretical advantages of PDA and its practical deployment in continuous-action problems with function approximation.

Fonte: arXiv cs.LG

RLScore 85

Cluster-Aware Attention-Based Deep Reinforcement Learning for Pickup and Delivery Problems

arXiv:2603.10053v1 Announce Type: new Abstract: The Pickup and Delivery Problem (PDP) is a fundamental and challenging variant of the Vehicle Routing Problem, characterized by tightly coupled pickup--delivery pairs, precedence constraints, and spatial layouts that often exhibit clustering. Existing deep reinforcement learning (DRL) approaches either model all nodes on a flat graph, relying on implicit learning to enforce constraints, or achieve strong performance through inference-time collaborative search at the cost of substantial latency. In this paper, we propose \emph{CAADRL} (Cluster-Aware Attention-based Deep Reinforcement Learning), a DRL framework that explicitly exploits the multi-scale structure of PDP instances via cluster-aware encoding and hierarchical decoding. The encoder builds on a Transformer and combines global self-attention with intra-cluster attention over depot, pickup, and delivery nodes, producing embeddings that are both globally informative and locally role-aware. Based on these embeddings, we introduce a Dynamic Dual-Decoder with a learnable gate that balances intra-cluster routing and inter-cluster transitions at each step. The policy is trained end-to-end with a POMO-style policy gradient scheme using multiple symmetric rollouts per instance. Experiments on synthetic clustered and uniform PDP benchmarks show that CAADRL matches or improves upon strong state-of-the-art baselines on clustered instances and remains highly competitive on uniform instances, particularly as problem size increases. Crucially, our method achieves these results with substantially lower inference time than neural collaborative-search baselines, suggesting that explicitly modeling cluster structure provides an effective and efficient inductive bias for neural PDP solvers.

Fonte: arXiv cs.LG

RLScore 85

SiMPO: Measure Matching for Online Diffusion Reinforcement Learning

arXiv:2603.10250v1 Announce Type: new Abstract: A commonly used family of RL algorithms for diffusion policies conducts softmax reweighting over the behavior policy, which usually induces an over-greedy policy and fails to leverage feedback from negative samples. In this work, we introduce Signed Measure Policy Optimization (SiMPO), a simple and unified framework that generalizes reweighting scheme in diffusion RL with general monotonic functions. SiMPO revisits diffusion RL via a two-stage measure matching lens. First, we construct a virtual target policy by $f$-divergence regularized policy optimization, where we can relax the non-negativity constraint to allow for a signed target measure. Second, we use this signed measure to guide diffusion or flow models through reweighted matching. This formulation offers two key advantages: a) it generalizes to arbitrary monotonically increasing weighting functions; and b) it provides a principled justification and practical guidance for negative reweighting. Furthermore, we provide geometric interpretations to illustrate how negative reweighting actively repels the policy from suboptimal actions. Extensive empirical evaluations demonstrate that SiMPO achieves superior performance by leveraging these flexible weighting schemes, and we provide practical guidelines for selecting reweighting methods tailored to the reward landscape.

Fonte: arXiv cs.LG

RLScore 85

Personalized Group Relative Policy Optimization for Heterogenous Preference Alignment

arXiv:2603.10009v1 Announce Type: new Abstract: Despite their sophisticated general-purpose capabilities, Large Language Models (LLMs) often fail to align with diverse individual preferences because standard post-training methods, like Reinforcement Learning with Human Feedback (RLHF), optimize for a single, global objective. While Group Relative Policy Optimization (GRPO) is a widely adopted on-policy reinforcement learning framework, its group-based normalization implicitly assumes that all samples are exchangeable, inheriting this limitation in personalized settings. This assumption conflates distinct user reward distributions and systematically biases learning toward dominant preferences while suppressing minority signals. To address this, we introduce Personalized GRPO (P-GRPO), a novel alignment framework that decouples advantage estimation from immediate batch statistics. By normalizing advantages against preference-group-specific reward histories rather than the concurrent generation group, P-GRPO preserves the contrastive signal necessary for learning distinct preferences. We evaluate P-GRPO across diverse tasks and find that it consistently achieves faster convergence and higher rewards than standard GRPO, thereby enhancing its ability to recover and align with heterogeneous preference signals. Our results demonstrate that accounting for reward heterogeneity at the optimization level is essential for building models that faithfully align with diverse human preferences without sacrificing general capabilities.

Fonte: arXiv cs.LG

RLScore 85

CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

arXiv:2603.10101v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR solely relies on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on these process-wrong but outcome-correct rollouts can lead to hallucination and answer-copying, severely undermining the model's generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into the Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross-trajectory regularization than the original single-path supervision in RLVR, effectively mitigating step-level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at https://github.com/Qwen-Applications/CLIPO.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning

arXiv:2603.10160v1 Announce Type: new Abstract: Low-rank adapters (LoRAs) are a parameter-efficient finetuning technique that injects trainable low-rank matrices into pretrained models to adapt them to new tasks. Mixture-of-LoRAs models expand neural networks efficiently by routing each layer input to a small subset of specialized LoRAs of the layer. Existing Mixture-of-LoRAs routers assign a learned routing weight to each LoRA to enable end-to-end training of the router. Despite their empirical promise, we observe that the routing weights are typically extremely imbalanced across LoRAs in practice, where only one or two LoRAs often dominate the routing weights. This essentially limits the number of effective LoRAs and thus severely hinders the expressive power of existing Mixture-of-LoRAs models. In this work, we attribute this weakness to the nature of learnable routing weights and rethink the fundamental design of the router. To address this critical issue, we propose a new router designed that we call Reinforcement Routing for Mixture-of-LoRAs (ReMix). Our key idea is using non-learnable routing weights to ensure all active LoRAs to be equally effective, with no LoRA dominating the routing weights. However, our routers cannot be trained directly via gradient descent due to our non-learnable routing weights. Hence, we further propose an unbiased gradient estimator for the router by employing the reinforce leave-one-out (RLOO) technique, where we regard the supervision loss as the reward and the router as the policy in reinforcement learning. Our gradient estimator also enables to scale up training compute to boost the predictive performance of our ReMix. Extensive experiments demonstrate that our proposed ReMix significantly outperform state-of-the-art parameter-efficient finetuning methods under a comparable number of activated parameters.

Fonte: arXiv cs.LG

RLScore 85

Graph-GRPO: Training Graph Flow Models with Reinforcement Learning

arXiv:2603.10395v1 Announce Type: new Abstract: Graph generation is a fundamental task with broad applications, such as drug discovery. Recently, discrete flow matching-based graph generation, \aka, graph flow model (GFM), has emerged due to its superior performance and flexible sampling. However, effectively aligning GFMs with complex human preferences or task-specific objectives remains a significant challenge. In this paper, we propose Graph-GRPO, an online reinforcement learning (RL) framework for training GFMs under verifiable rewards. Our method makes two key contributions: (1) We derive an analytical expression for the transition probability of GFMs, replacing the Monte Carlo sampling and enabling fully differentiable rollouts for RL training; (2) We propose a refinement strategy that randomly perturbs specific nodes and edges in a graph, and regenerates them, allowing for localized exploration and self-improvement of generation quality. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness of Graph-GRPO. With only 50 denoising steps, our method achieves 95.0\% and 97.5\% Valid-Unique-Novelty scores on the planar and tree datasets, respectively. Moreover, Graph-GRPO achieves state-of-the-art performance on the molecular optimization tasks, outperforming graph-based and fragment-based RL methods as well as classic genetic algorithms.

Fonte: arXiv cs.LG

RLScore 85

Improving Search Agent with One Line of Code

arXiv:2603.10069v1 Announce Type: new Abstract: Tool-based Agentic Reinforcement Learning (TARL) has emerged as a promising paradigm for training search agents to interact with external tools for a multi-turn information-seeking process autonomously. However, we identify a critical training instability that leads to catastrophic model collapse: Importance Sampling Distribution Drift(ISDD). In Group Relative Policy Optimization(GRPO), a widely adopted TARL algorithm, ISDD manifests as a precipitous decline in the importance sampling ratios, which nullifies gradient updates and triggers irreversible training failure. To address this, we propose \textbf{S}earch \textbf{A}gent \textbf{P}olicy \textbf{O}ptimization (\textbf{SAPO}), which stabilizes training via a conditional token-level KL constraint. Unlike hard clipping, which ignores distributional divergence, SAPO selectively penalizes the KL divergence between the current and old policies. Crucially, this penalty is applied only to positive tokens with low probabilities where the policy has shifted excessively, thereby preventing distribution drift while preserving gradient flow. Remarkably, SAPO requires only one-line code modification to standard GRPO, ensuring immediate deployability. Extensive experiments across seven QA benchmarks demonstrate that SAPO achieves \textbf{+10.6\% absolute improvement} (+31.5\% relative) over Search-R1, yielding consistent gains across varying model scales (1.5B, 14B) and families (Qwen, LLaMA).

Fonte: arXiv cs.LG

RLScore 85

Inverse Contextual Bandits without Rewards: Learning from a Non-Stationary Learner via Suffix Imitation

arXiv:2603.03778v1 Announce Type: new Abstract: We study the Inverse Contextual Bandit (ICB) problem, in which a learner seeks to optimize a policy while an observer, who cannot access the learner's rewards and only observes actions, aims to recover the underlying problem parameters. During the learning process, the learner's behavior naturally transitions from exploration to exploitation, resulting in non-stationary action data that poses significant challenges for the observer. To address this issue, we propose a simple and effective framework called Two-Phase Suffix Imitation. The framework discards data from an initial burn-in phase and performs empirical risk minimization using only data from a subsequent imitation phase. We derive a predictive decision loss bound that explicitly characterizes the bias-variance trade-off induced by the choice of burn-in length. Despite the severe information deficit, we show that a reward-free observer can achieve a convergence rate of $\tilde O(1/\sqrt{N})$, matching the asymptotic efficiency of a fully reward-aware learner. This result demonstrates that a passive observer can effectively uncover the optimal policy from actions alone, attaining performance comparable to that of the learner itself.

Fonte: arXiv cs.LG

RLScore 85

BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning

arXiv:2603.04124v1 Announce Type: new Abstract: Can reinforcement learning with hard, verifiable rewards teach a compact language model to reason about physics, or does it primarily learn to pattern-match toward correct answers? We study this question by training a 1.5B-parameter reasoning model on beam statics, a classic engineering problem, using parameter-efficient RLVR with binary correctness rewards from symbolic solvers, without teacher-generated reasoning traces. The best BeamPERL checkpoint achieves a 66.7% improvement in Pass@1 over the base model. However, the learned competence is anisotropic: the model generalizes compositionally (more loads) but fails under topological shifts (moved supports) that require the same equilibrium equations. Intermediate checkpoints yield the strongest reasoning, while continued optimization degrades robustness while maintaining reward. These findings reveal a key limitation of outcome-level alignment: reinforcement learning with exact physics rewards induces procedural solution templates rather than internalization of governing equations. The precision of the reward signal - even when analytically exact - does not by itself guarantee transferable physical reasoning. Our results suggest that verifiable rewards may need to be paired with structured reasoning scaffolding to move beyond template matching toward robust scientific reasoning.

Fonte: arXiv cs.AI

RLScore 75

[Re] FairDICE: A Gap Between Theory And Practice

arXiv:2603.03454v1 Announce Type: new Abstract: Offline Reinforcement Learning (RL) is an emerging field of RL in which policies are learned solely from demonstrations. Within offline RL, some environments involve balancing multiple objectives, but existing multi-objective offline RL algorithms do not provide an efficient way to find a fair compromise. FairDICE (see arXiv:2506.08062v2) seeks to fill this gap by adapting OptiDICE (an offline RL algorithm) to automatically learn weights for multiple objectives to e.g.\ incentivise fairness among objectives. As this would be a valuable contribution, this replication study examines the replicability of claims made regarding FairDICE. We find that many theoretical claims hold, but an error in the code reduces FairDICE to standard behaviour cloning in continuous environments, and many important hyperparameters were originally underspecified. After rectifying this, we show in experiments extending the original paper that FairDICE can scale to complex environments and high-dimensional rewards, though it can be reliant on (online) hyperparameter tuning. We conclude that FairDICE is a theoretically interesting method, but the experimental justification requires significant revision.

Fonte: arXiv cs.LG

RLScore 85

Fairness Begins with State: Purifying Latent Preferences for Hierarchical Reinforcement Learning in Interactive Recommendation

arXiv:2603.03820v1 Announce Type: new Abstract: Interactive recommender systems (IRS) are increasingly optimized with Reinforcement Learning (RL) to capture the sequential nature of user-system dynamics. However, existing fairness-aware methods often suffer from a fundamental oversight: they assume the observed user state is a faithful representation of true preferences. In reality, implicit feedback is contaminated by popularity-driven noise and exposure bias, creating a distorted state that misleads the RL agent. We argue that the persistent conflict between accuracy and fairness is not merely a reward-shaping issue, but a state estimation failure. In this work, we propose \textbf{DSRM-HRL}, a framework that reformulates fairness-aware recommendation as a latent state purification problem followed by decoupled hierarchical decision-making. We introduce a Denoising State Representation Module (DSRM) based on diffusion models to recover the low-entropy latent preference manifold from high-entropy, noisy interaction histories. Built upon this purified state, a Hierarchical Reinforcement Learning (HRL) agent is employed to decouple conflicting objectives: a high-level policy regulates long-term fairness trajectories, while a low-level policy optimizes short-term engagement under these dynamic constraints. Extensive experiments on high-fidelity simulators (KuaiRec, KuaiRand) demonstrate that DSRM-HRL effectively breaks the "rich-get-richer" feedback loop, achieving a superior Pareto frontier between recommendation utility and exposure equity.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

A Rubric-Supervised Critic from Sparse Real-World Outcomes

arXiv:2603.03800v1 Announce Type: new Abstract: Academic benchmarks for coding agents tend to reward autonomous task completion, measured by verifiable rewards such as unit-test success. In contrast, real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. How can we bridge this gap? In this paper, we propose a process to learn a "critic" model from sparse and noisy interaction data, which can then be used both as a reward model for either RL-based training or inference-time scaling. Specifically, we introduce Critic Rubrics, a rubric-based supervision framework with 24 behavioral features that can be derived from human-agent interaction traces alone. Using a semi-supervised objective, we can then jointly predict these rubrics and sparse human feedback (when present). In experiments, we demonstrate that, despite being trained primarily from trace-observable rubrics and sparse real-world outcome proxies, these critics improve best-of-N reranking on SWE-bench (Best@8 +15.9 over Random@8 over the rerankable subset of trajectories), enable early stopping (+17.7 with 83% fewer attempts), and support training-time data curation via critic-selected trajectories.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

RAGNav: A Retrieval-Augmented Topological Reasoning Framework for Multi-Goal Visual-Language Navigation

arXiv:2603.03745v1 Announce Type: new Abstract: Vision-Language Navigation (VLN) is evolving from single-point pathfinding toward the more challenging Multi-Goal VLN. This task requires agents to accurately identify multiple entities while collaboratively reasoning over their spatial-physical constraints and sequential execution order. However, generic Retrieval-Augmented Generation (RAG) paradigms often suffer from spatial hallucinations and planning drift when handling multi-object associations due to the lack of explicit spatial modeling.To address these challenges, we propose RAGNav, a framework that bridges the gap between semantic reasoning and physical structure. The core of RAGNav is a Dual-Basis Memory system, which integrates a low-level topological map for maintaining physical connectivity with a high-level semantic forest for hierarchical environment abstraction. Building on this representation, the framework introduces an anchor-guided conditional retrieval and a topological neighbor score propagation mechanism. This approach facilitates the rapid screening of candidate targets and the elimination of semantic noise, while performing semantic calibration by leveraging the physical associations inherent in the topological neighborhood.This mechanism significantly enhances the capability of inter-target reachability reasoning and the efficiency of sequential planning. Experimental results demonstrate that RAGNav achieves state-of-the-art (SOTA) performance in complex multi-goal navigation tasks.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

SE-Search: Self-Evolving Search Agent via Memory and Dense Reward

arXiv:2603.03293v1 Announce Type: new Abstract: Retrieval augmented generation (RAG) reduces hallucinations and factual errors in large language models (LLMs) by conditioning generation on retrieved external knowledge. Recent search agents further cast RAG as an autonomous, multi-turn information-seeking process. However, existing methods often accumulate irrelevant or noisy documents and rely on sparse reinforcement learning signals. We propose \textbf{S}elf-\textbf{E}volving \textbf{Search}, a Self-Evolving Search agent that improves online search behavior through three components, memory purification, atomic query training, and dense rewards. SE-Search follows a \textit{Think-Search-Memorize} strategy that retains salient evidence while filtering irrelevant content. Atomic query training promotes shorter and more diverse queries, improving evidence acquisition. Dense rewards provide fine-grained feedback that speeds training. Experiments on single-hop and multi-hop question answering benchmarks show that \texttt{SE-Search-3B} outperforms strong baselines, yielding a $10.8$ point absolute improvement and a $33.8\%$ relative gain over Search-R1.\footnote{We will make the code and model weights publicly available upon acceptance.}

Fonte: arXiv cs.CL

RLScore 85

MAGE: Meta-Reinforcement Learning for Language Agents toward Strategic Exploration and Exploitation

arXiv:2603.03680v1 Announce Type: new Abstract: Large Language Model (LLM) agents have demonstrated remarkable proficiency in learned tasks, yet they often struggle to adapt to non-stationary environments with feedback. While In-Context Learning and external memory offer some flexibility, they fail to internalize the adaptive ability required for long-term improvement. Meta-Reinforcement Learning (meta-RL) provides an alternative by embedding the learning process directly within the model. However, existing meta-RL approaches for LLMs focus primarily on exploration in single-agent settings, neglecting the strategic exploitation necessary for multi-agent environments. We propose MAGE, a meta-RL framework that empowers LLM agents for strategic exploration and exploitation. MAGE utilizes a multi-episode training regime where interaction histories and reflections are integrated into the context window. By using the final episode reward as the objective, MAGE incentivizes the agent to refine its strategy based on past experiences. We further combine population-based training with an agent-specific advantage normalization technique to enrich agent diversity and ensure stable learning. Experiment results show that MAGE outperforms existing baselines in both exploration and exploitation tasks. Furthermore, MAGE exhibits strong generalization to unseen opponents, suggesting it has internalized the ability for strategic exploration and exploitation. Code is available at https://github.com/Lu-Yang666/MAGE.

Fonte: arXiv cs.AI

RLScore 85

Asymmetric Goal Drift in Coding Agents Under Value Conflict

arXiv:2603.03456v1 Announce Type: new Abstract: Agentic coding agents are increasingly deployed autonomously, at scale, and over long-context horizons. Throughout an agent's lifetime, it must navigate tensions between explicit instructions, learned values, and environmental pressures, often in contexts unseen during training. Prior work on model preferences, agent behavior under value tensions, and goal drift has relied on static, synthetic settings that do not capture the complexity of real-world environments. To this end, we introduce a framework built on OpenCode to orchestrate realistic, multi-step coding tasks to measure how agents violate explicit constraints in their system prompt over time with and without environmental pressure toward competing values. Using this framework, we demonstrate that GPT-5 mini, Haiku 4.5, and Grok Code Fast 1 exhibit asymmetric drift: they are more likely to violate their system prompt when its constraint opposes strongly-held values like security and privacy. We find for the models and values tested that goal drift correlates with three compounding factors: value alignment, adversarial pressure, and accumulated context. However, even strongly-held values like privacy show non-zero violation rates under sustained environmental pressure. These findings reveal that shallow compliance checks are insufficient and that comment-based pressure can exploit model value hierarchies to override system prompt instructions. More broadly, our findings highlight a gap in current alignment approaches in ensuring that agentic systems appropriately balance explicit user constraints against broadly beneficial learned preferences under sustained environmental pressure.

Fonte: arXiv cs.AI

RLScore 85

Optimal Best-Arm Identification under Fixed Confidence with Multiple Optima

arXiv:2505.15643v2 Announce Type: replace-cross Abstract: We study best-arm identification in stochastic multi-armed bandits under the fixed-confidence setting, focusing on instances with multiple optimal arms. Unlike prior work that addresses the unknown-number-of-optimal-arms case, we consider the setting where the number of optimal arms is known in advance. We derive a new information-theoretic lower bound on the expected sample complexity that leverages this structural knowledge and is strictly tighter than previous bounds. Building on the Track-and-Stop algorithm, we propose a modified, tie-aware stopping rule and prove that it achieves asymptotic instance-optimality, matching the new lower bound. Our results provide the first formal guarantee of optimality for Track-and-Stop in multi-optimal settings with known cardinality, offering both theoretical insights and practical guidance for efficiently identifying any optimal arm.

Fonte: arXiv stat.ML

RLScore 85

Invariance-Based Dynamic Regret Minimization

arXiv:2603.03843v1 Announce Type: new Abstract: We consider stochastic non-stationary linear bandits where the linear parameter connecting contexts to the reward changes over time. Existing algorithms in this setting localize the policy by gradually discarding or down-weighting past data, effectively shrinking the time horizon over which learning can occur. However, in many settings historical data may still carry partial information about the reward model. We propose to leverage such data while adapting to changes, by assuming the reward model decomposes into stationary and non-stationary components. Based on this assumption, we introduce ISD-linUCB, an algorithm that uses past data to learn invariances in the reward model and subsequently exploits them to improve online performance. We show both theoretically and empirically that leveraging invariance reduces the problem dimensionality, yielding significant regret improvements in fast-changing environments when sufficient historical data is available.

Fonte: arXiv stat.ML

RLScore 85

Fixed-Budget Constrained Best Arm Identification in Grouped Bandits

arXiv:2603.04007v1 Announce Type: cross Abstract: We study fixed budget constrained best-arm identification in grouped bandits, where each arm consists of multiple independent attributes with stochastic rewards. An arm is considered feasible only if all its attributes' means are above a given threshold. The aim is to find the feasible arm with the largest overall mean. We first derive a lower bound on the error probability for any algorithm on this setting. We then propose Feasibility Constrained Successive Rejects (FCSR), a novel algorithm that identifies the best arm while ensuring feasibility. We show it attains optimal dependence on problem parameters up to constant factors in the exponent. Empirically, FCSR outperforms natural baselines while preserving feasibility guarantees.

Fonte: arXiv stat.ML

RLScore 85

Hybrid Belief Reinforcement Learning for Efficient Coordinated Spatial Exploration

arXiv:2603.03595v1 Announce Type: new Abstract: Coordinating multiple autonomous agents to explore and serve spatially heterogeneous demand requires jointly learning unknown spatial patterns and planning trajectories that maximize task performance. Pure model-based approaches provide structured uncertainty estimates but lack adaptive policy learning, while deep reinforcement learning often suffers from poor sample efficiency when spatial priors are absent. This paper presents a hybrid belief-reinforcement learning (HBRL) framework to address this gap. In the first phase, agents construct spatial beliefs using a Log-Gaussian Cox Process (LGCP) and execute information-driven trajectories guided by a Pathwise Mutual Information (PathMI) planner with multi-step lookahead. In the second phase, trajectory control is transferred to a Soft Actor-Critic (SAC) agent, warm-started through dual-channel knowledge transfer: belief state initialization supplies spatial uncertainty, and replay buffer seeding provides demonstration trajectories generated during LGCP exploration. A variance-normalized overlap penalty enables coordinated coverage through shared belief state, permitting cooperative sensing in high-uncertainty regions while discouraging redundant coverage in well-explored areas. The framework is evaluated on a multi-UAV wireless service provisioning task. Results show 10.8% higher cumulative reward and 38% faster convergence over baselines, with ablation studies confirming that dual-channel transfer outperforms either channel alone.

Fonte: arXiv cs.LG

RLScore 85

When and Where to Reset Matters for Long-Term Test-Time Adaptation

arXiv:2603.03796v1 Announce Type: new Abstract: When continual test-time adaptation (TTA) persists over the long term, errors accumulate in the model and further cause it to predict only a few classes for all inputs, a phenomenon known as model collapse. Recent studies have explored reset strategies that completely erase these accumulated errors. However, their periodic resets lead to suboptimal adaptation, as they occur independently of the actual risk of collapse. Moreover, their full resets cause catastrophic loss of knowledge acquired over time, even though such knowledge could be beneficial in the future. To this end, we propose (1) an Adaptive and Selective Reset (ASR) scheme that dynamically determines when and where to reset, (2) an importance-aware regularizer to recover essential knowledge lost due to reset, and (3) an on-the-fly adaptation adjustment scheme to enhance adaptability under challenging domain shifts. Extensive experiments across long-term TTA benchmarks demonstrate the effectiveness of our approach, particularly under challenging conditions. Our code is available at https://github.com/YonseiML/asr.

Fonte: arXiv cs.LG

RLScore 85

Buzz, Choose, Forget: A Meta-Bandit Framework for Bee-Like Decision Making

arXiv:2510.16462v2 Announce Type: replace-cross Abstract: This work introduces MAYA, a sequential imitation learning model based on multi-armed bandits, designed to reproduce and predict individual bees' decisions in contextualized foraging tasks. The model accounts for bees' limited memory through a temporal window $\tau$, whose optimal value is around 7 trials, with a slight dependence on weather conditions. Experimental results on real, simulated, and complementary (mice) datasets show that MAYA (particularly with the Wasserstein distance) outperforms imitation baselines and classical statistical models, while providing interpretability of individual learning strategies and enabling the inference of realistic trajectories for prospective ecological applications.

Fonte: arXiv stat.ML

RLScore 85

Freezing of Gait Prediction using Proactive Agent that Learns from Selected Experience and DDQN Algorithm

arXiv:2603.03651v1 Announce Type: new Abstract: Freezing of Gait (FOG) is a debilitating motor symptom commonly experienced by individuals with Parkinson's Disease (PD) which often leads to falls and reduced mobility. Timely and accurate prediction of FOG episodes is essential for enabling proactive interventions through assistive technologies. This study presents a reinforcement learning-based framework designed to identify optimal pre-FOG onset points, thereby extending the prediction horizon for anticipatory cueing systems. The model implements a Double Deep Q-Network (DDQN) architecture enhanced with Prioritized Experience Replay (PER) allowing the agent to focus learning on high-impact experiences and refine its policy. Trained over 9000 episodes with a reward shaping strategy that promotes cautious decision-making, the agent demonstrated robust performance in both subject-dependent and subject-independent evaluations. The model achieved a prediction horizon of up to 8.72 seconds prior to FOG onset in subject-independent scenarios and 7.89 seconds in subject-dependent settings. These results highlight the model's potential for integration into wearable assistive devices, offering timely and personalized interventions to mitigate FOG in PD patients.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

arXiv:2603.03485v1 Announce Type: new Abstract: Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present \textbf{Phys4D}, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts \textbf{a three-stage training paradigm} that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of \textbf{4D world consistency evaluation} that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

Fonte: arXiv cs.CV

VisionScore 85

Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning

arXiv:2603.03818v1 Announce Type: new Abstract: Continual learning is a long-standing challenge in robot policy learning, where a policy must acquire new skills over time without catastrophically forgetting previously learned ones. While prior work has extensively studied continual learning in relatively small behavior cloning (BC) policy models trained from scratch, its behavior in modern large-scale pretrained Vision-Language-Action (VLA) models remains underexplored. In this work, we found that pretrained VLAs are remarkably resistant to forgetting compared with smaller policy models trained from scratch. Simple Experience Replay (ER) works surprisingly well on VLAs, sometimes achieving zero forgetting even with a small replay data size. Our analysis reveals that pretraining plays a critical role in downstream continual learning performance: large pretrained models mitigate forgetting with a small replay buffer size while maintaining strong forward learning capabilities. Furthermore, we found that VLAs can retain relevant knowledge from prior tasks despite performance degradation during learning new tasks. This knowledge retention enables rapid recovery of seemingly forgotten skills through finetuning. Together, these insights imply that large-scale pretraining fundamentally changes the dynamics of continual learning, enabling models to continually acquire new skills over time with simple replay. Code and more information can be found at https://ut-austin-rpl.github.io/continual-vla

Fonte: arXiv cs.LG

RLScore 85

Low-Rank Contextual Reinforcement Learning from Heterogeneous Human Feedback

arXiv:2412.19436v2 Announce Type: replace Abstract: Reinforcement learning from human feedback (RLHF) has become a cornerstone for aligning large language models with human preferences. However, the heterogeneity of human feedback, driven by diverse individual contexts and preferences, poses significant challenges for reward learning. To address this, we propose a Low-rank Contextual RLHF (LoCo-RLHF) framework that integrates contextual information to better model heterogeneous feedback while maintaining computational efficiency. Our approach builds on a contextual preference model, leveraging the intrinsic low-rank structure of the interaction between user contexts and query-answer pairs to mitigate the high dimensionality of feature representations. Furthermore, we address the challenge of distributional shifts in feedback through our Pessimism in Reduced Subspace (PRS) policy, inspired by pessimistic offline reinforcement learning techniques. We theoretically demonstrate that our policy achieves a tighter sub-optimality gap compared to existing methods. Extensive experiments validate the effectiveness of LoCo-RLHF, showcasing its superior performance in personalized RLHF settings and its robustness to distribution shifts.

Fonte: arXiv stat.ML

RLScore 85

Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning

arXiv:2603.03480v1 Announce Type: new Abstract: We study reinforcement learning with delayed state observation, where the agent observes the current state after some random number of time steps. We propose an algorithm that combines the augmentation method and the upper confidence bound approach. For tabular Markov decision processes (MDPs), we derive a regret bound of $\tilde{\mathcal{O}}(H \sqrt{D_{\max} SAK})$, where $S$ and $A$ are the cardinalities of the state and action spaces, $H$ is the time horizon, $K$ is the number of episodes, and $D_{\max}$ is the maximum length of the delay. We also provide a matching lower bound up to logarithmic factors, showing the optimality of our approach. Our analytical framework formulates this problem as a special case of a broader class of MDPs, where their transition dynamics decompose into a known component and an unknown but structured component. We establish general results for this abstract setting, which may be of independent interest.

Fonte: arXiv cs.LG

RLScore 85

Q-Measure-Learning for Continuous State RL: Efficient Implementation and Convergence

arXiv:2603.03523v1 Announce Type: new Abstract: We study reinforcement learning in infinite-horizon discounted Markov decision processes with continuous state spaces, where data are generated online from a single trajectory under a Markovian behavior policy. To avoid maintaining an infinite-dimensional, function-valued estimate, we propose the novel Q-Measure-Learning, which learns a signed empirical measure supported on visited state-action pairs and reconstructs an action-value estimate via kernel integration. The method jointly estimates the stationary distribution of the behavior chain and the Q-measure through coupled stochastic approximation, leading to an efficient weight-based implementation with $O(n)$ memory and $O(n)$ computation cost per iteration. Under uniform ergodicity of the behavior chain, we prove almost sure sup-norm convergence of the induced Q-function to the fixed point of a kernel-smoothed Bellman operator. We also bound the approximation error between this limit and the optimal $Q^*$ as a function of the kernel bandwidth. To assess the performance of our proposed algorithm, we conduct RL experiments in a two-item inventory control setting.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation

arXiv:2603.03505v1 Announce Type: new Abstract: State-of-the-art text-to-video (T2V) generators frequently violate physical laws despite high visual quality. We show this stems from insufficient physical constraints in prompts rather than model limitations: manually adding physics details reliably produces physically plausible videos, but requires expertise and does not scale. We present PhyPrompt, a two-stage reinforcement learning framework that automatically refines prompts for physically realistic generation. First, we fine-tune a large language model on a physics-focused Chain-of-Thought dataset to integrate principles like object motion and force interactions while preserving user intent. Second, we apply Group Relative Policy Optimization with a dynamic reward curriculum that initially prioritizes semantic fidelity, then progressively shifts toward physical commonsense. This curriculum achieves synergistic optimization: PhyPrompt-7B reaches 40.8\% joint success on VideoPhy2 (8.6pp gain), improving physical commonsense by 11pp (55.8\% to 66.8\%) while simultaneously increasing semantic adherence by 4.4pp (43.4\% to 47.8\%). Remarkably, our curriculum exceeds single-objective training on both metrics, demonstrating compositional prompt discovery beyond conventional multi-objective trade-offs. PhyPrompt outperforms GPT-4o (+3.8\% joint) and DeepSeek-V3 (+2.2\%, 100$\times$ larger) using only 7B parameters. The approach transfers zero-shot across diverse T2V architectures (Lavie, VideoCrafter2, CogVideoX-5B) with up to 16.8\% improvement, establishing that domain-specialized reinforcement learning with compositional curricula surpasses general-purpose scaling for physics-aware generation.

Fonte: arXiv cs.CV

RLScore 85

Group Orthogonalized Policy Optimization:Group Policy Optimization as Orthogonal Projection in Hilbert Space

arXiv:2602.21269v1 Announce Type: cross Abstract: We present Group Orthogonalized Policy Optimization (GOPO), a new alignment algorithm for large language models derived from the geometry of Hilbert function spaces. Instead of optimizing on the probability simplex and inheriting the exponential curvature of Kullback-Leibler divergence, GOPO lifts alignment into the Hilbert space L2(pi_k) of square-integrable functions with respect to the reference policy. Within this space, the simplex constraint reduces to a linear orthogonality condition = 0, defining a codimension-one subspace H0. Minimizing distance to an unconstrained target u_star yields the work-dissipation functional J(v) = - (mu / 2) ||v||^2, whose maximizer follows directly from the Hilbert projection theorem. Enforcing the boundary v >= -1 produces a bounded Hilbert projection that induces exact sparsity, assigning zero probability to catastrophically poor actions through a closed-form threshold. To connect this functional theory with practice, GOPO projects from infinite-dimensional L2(pi_k) to a finite empirical subspace induced by group sampling. Because group-normalized advantages sum to zero, the Lagrange multiplier enforcing probability conservation vanishes exactly, reducing the constrained projection to an unconstrained empirical loss. The resulting objective has constant Hessian curvature mu I, non-saturating linear gradients, and an intrinsic dead-zone mechanism without heuristic clipping. Experiments on mathematical reasoning benchmarks show that GOPO achieves competitive generalization while maintaining stable gradient dynamics and entropy preservation in regimes where clipping-based methods plateau.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection

arXiv:2602.21887v1 Announce Type: new Abstract: Current large reasoning models (LRMs) have shown strong ability on challenging tasks after reinforcement learning (RL) based post-training. However, previous work mainly focuses on English reasoning in expectation of the strongest performance, despite the demonstrated potential advantage of multilingual thinking, as well as the requirement for native thinking traces by global users. In this paper, we propose ExpLang, a novel LLM post-training pipeline that enables on-policy thinking language selection to improve exploration and exploitation during RL with the use of multiple languages. The results show that our method steadily outperforms English-only training with the same training budget, while showing high thinking language compliance for both seen and unseen languages. Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged non-English advantage. The method is orthogonal to most RL algorithms and opens up a new perspective on using multilinguality to improve LRMs.

Fonte: arXiv cs.CL

VisionScore 85

SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video

arXiv:2602.21706v1 Announce Type: new Abstract: Minimally invasive surgery has dramatically improved patient operative outcomes, yet identifying safe operative zones remains challenging in critical phases, requiring surgeons to integrate visual cues, procedural phase, and anatomical context under high cognitive load. Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning. We introduce ResGo, a benchmark of laparoscopic frames annotated with Go Zone bounding boxes and clinician-authored rationales covering phase, exposure quality reasoning, next action and risk reminder. We introduce evaluation metrics that treat correct grounding under incorrect phase as failures, revealing that most vision-language models cannot handle such tasks and perform poorly. We then present SurGo-R1, a model optimized via RLHF with a multi-turn phase-then-go architecture where the model first identifies the surgical phase, then generates reasoning and Go Zone coordinates conditioned on that context. On unseen procedures, SurGo-R1 achieves 76.6% phase accuracy, 32.7 mIoU, and 54.8% hardcore accuracy, a 6.6$\times$ improvement over the mainstream generalist VLMs. Code, model and benchmark will be available at https://github.com/jinlab-imvr/SurGo-R1

Fonte: arXiv cs.CV

RLScore 85

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

arXiv:2602.21420v1 Announce Type: cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become the leading paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard RLVR algorithms suffer from a well-documented pathology: while they improve Pass@1 accuracy through sharpened sampling, they simultaneously narrow the model's reasoning boundary and reduce generation diversity. We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches -- whether data-filtering methods that select prompts by difficulty, or advantage normalization schemes -- treat all incorrect rollouts within a group identically. We show that this uniformity allows overconfident errors (incorrect reasoning paths that the RL process has spuriously reinforced) to persist and monopolize probability mass, ultimately suppressing valid exploratory trajectories. To address this, we propose the Asymmetric Confidence-aware Error Penalty (ACE). ACE introduces a per-rollout confidence shift metric, c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)), to dynamically modulate negative advantages. Theoretically, we demonstrate that ACE's gradient can be decomposed into the gradient of a selective regularizer restricted to overconfident errors, plus a well-characterized residual that partially moderates the regularizer's strength. We conduct extensive experiments fine-tuning Qwen2.5-Math-7B, Qwen3-8B-Base, and Llama-3.1-8B-Instruct on the DAPO-Math-17K dataset using GRPO and DAPO within the VERL framework. Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.

Fonte: arXiv cs.AI

RLScore 85

ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

arXiv:2602.21534v1 Announce Type: new Abstract: Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we first propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled and reproducible setting. ARLArena first constructs a clean and standardized testbed. Then, we decompose policy gradient into four core design dimensions and assess the performance and stability of each dimension. Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy gradient perspective for ARL and offers practical guidance for building stable and reproducible LLM-based agent training pipelines.

Fonte: arXiv cs.AI

Theory/OptimizationScore 85

Efficient Uncoupled Learning Dynamics with $\tilde{O}\!\left(T^{-1/4}\right)$ Last-Iterate Convergence in Bilinear Saddle-Point Problems over Convex Sets under Bandit Feedback

arXiv:2602.21436v1 Announce Type: new Abstract: In this paper, we study last-iterate convergence of learning algorithms in bilinear saddle-point problems, a preferable notion of convergence that captures the day-to-day behavior of learning dynamics. We focus on the challenging setting where players select actions from compact convex sets and receive only bandit feedback. Our main contribution is the design of an uncoupled learning algorithm that guarantees last-iterate convergence to the Nash equilibrium with high probability. We establish a convergence rate of $\tilde{O}(T^{-1/4})$ up to polynomial factors in problem parameters. Crucially, our proposed algorithm is computationally efficient, requiring only an efficient linear optimization oracle over the players' compact action sets. The algorithm is obtained by combining techniques from experimental design and the classic Follow-The-Regularized-Leader (FTRL) framework, with a carefully chosen regularizer function tailored to the geometry of the action set of each learner.

Fonte: arXiv stat.ML

NLP/LLMsScore 85

RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning

arXiv:2602.21628v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a prevailing paradigm for enhancing reasoning in Multimodal Large Language Models (MLLMs). However, relying solely on outcome supervision risks reward hacking, where models learn spurious reasoning patterns to satisfy final answer checks. While recent rubric-based approaches offer fine-grained supervision signals, they suffer from high computational costs of instance-level generation and inefficient training dynamics caused by treating all rubrics as equally learnable. In this paper, we propose Stratified Rubric-based Curriculum Learning (RuCL), a novel framework that reformulates curriculum learning by shifting the focus from data selection to reward design. RuCL generates generalized rubrics for broad applicability and stratifies them based on the model's competence. By dynamically adjusting rubric weights during training, RuCL guides the model from mastering foundational perception to tackling advanced logical reasoning. Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.

Fonte: arXiv cs.CL

NLP/LLMsScore 90

Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data

arXiv:2602.21320v1 Announce Type: new Abstract: Large language models (LLMs) are becoming the foundation for autonomous agents that can use tools to solve complex tasks. Reinforcement learning (RL) has emerged as a common approach for injecting such agentic capabilities, but typically under tightly controlled training setups. It often depends on carefully constructed task-solution pairs and substantial human supervision, which creates a fundamental obstacle to open-ended self-evolution toward superintelligent systems. In this paper, we propose Tool-R0 framework for training general purpose tool-calling agents from scratch with self-play RL, under a zero-data assumption. Initialized from the same base LLM, Tool-R0 co-evolves a Generator and a Solver with complementary rewards: one proposes targeted challenging tasks at the other's competence frontier and the other learns to solve them with real-world tool calls. This creates a self-evolving cycle that requires no pre-existing tasks or datasets. Evaluation on different tool-use benchmarks show that Tool-R0 yields 92.5 relative improvement over the base model and surpasses fully supervised tool-calling baselines under the same setting. Our work further provides empirical insights into self-play LLM agents by analyzing co-evolution, curriculum dynamics, and scaling behavior.

Fonte: arXiv cs.LG

Theory/OptimizationScore 85

Efficient Opportunistic Approachability

arXiv:2602.21328v1 Announce Type: new Abstract: We study the problem of opportunistic approachability: a generalization of Blackwell approachability where the learner would like to obtain stronger guarantees (i.e., approach a smaller set) when their adversary limits themselves to a subset of their possible action space. Bernstein et al. (2014) introduced this problem in 2014 and presented an algorithm that guarantees sublinear approachability rates for opportunistic approachability. However, this algorithm requires the ability to produce calibrated online predictions of the adversary's actions, a problem whose standard implementations require time exponential in the ambient dimension and result in approachability rates that scale as $T^{-O(1/d)}$. In this paper, we present an efficient algorithm for opportunistic approachability that achieves a rate of $O(T^{-1/4})$ (and an inefficient one that achieves a rate of $O(T^{-1/3})$), bypassing the need for an online calibration subroutine. Moreover, in the case where the dimension of the adversary's action set is at most two, we show it is possible to obtain the optimal rate of $O(T^{-1/2})$.

Fonte: arXiv cs.LG

NLP/LLMsScore 85

Urban Vibrancy Embedding and Application on Traffic Prediction

arXiv:2602.21232v1 Announce Type: cross Abstract: Urban vibrancy reflects the dynamic human activity within urban spaces and is often measured using mobile data that captures floating population trends. This study proposes a novel approach to derive Urban Vibrancy embeddings from real-time floating population data to enhance traffic prediction models. Specifically, we utilize variational autoencoders (VAE) to compress this data into actionable embeddings, which are then integrated with long short-term memory (LSTM) networks to predict future embeddings. These are subsequently applied in a sequence-to-sequence framework for traffic forecasting. Our contributions are threefold: (1) We use principal component analysis (PCA) to interpret the embeddings, revealing temporal patterns such as weekday versus weekend distinctions and seasonal patterns; (2) We propose a method that combines VAE and LSTM, enabling forecasting dynamic urban knowledge embedding; and (3) Our approach improves accuracy and responsiveness in traffic prediction models, including RNN, DCRNN, GTS, and GMAN. This study demonstrates the potential of Urban Vibrancy embeddings to advance traffic prediction and offer a more nuanced analysis of urban mobility.

Fonte: arXiv cs.AI

RLScore 85

Training Generalizable Collaborative Agents via Strategic Risk Aversion

arXiv:2602.21515v1 Announce Type: new Abstract: Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals. Unfortunately, existing approaches to learning policies for such collaborative problems produce brittle solutions that fail when paired with new partners. We attribute these failures to a combination of free-riding during training and a lack of strategic robustness. To address these problems, we study the concept of strategic risk aversion and interpret it as a principled inductive bias for generalizable cooperation with unseen partners. While strategically risk-averse players are robust to deviations in their partner's behavior by design, we show that, in collaborative games, they also (1) can have better equilibrium outcomes than those at classical game-theoretic concepts like Nash, and (2) exhibit less or no free-riding. Inspired by these insights, we develop a multi-agent reinforcement learning (MARL) algorithm that integrates strategic risk aversion into standard policy optimization methods. Our empirical results across collaborative benchmarks (including an LLM collaboration task) validate our theory and demonstrate that our approach consistently achieves reliable collaboration with heterogeneous and previously unseen partners across collaborative tasks.

Fonte: arXiv cs.LG

RLScore 85

On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation

arXiv:2602.21424v1 Announce Type: new Abstract: Reinforcement learning (RL) agents under partial observability often condition actions on internally accumulated information such as memory or inferred latent context. We formalise such information-conditioned interaction patterns as behavioural dependency: variation in action selection with respect to internal information under fixed observations. This induces a probe-relative notion of $\epsilon$-behavioural equivalence and a within-policy behavioural distance that quantifies probe sensitivity. We establish three structural results. First, the set of policies exhibiting non-trivial behavioural dependency is not closed under convex aggregation. Second, behavioural distance contracts under convex combination. Third, we prove a sufficient local condition under which gradient ascent on a skewed mixture objective decreases behavioural distance when a dominant-mode gradient aligns with the direction of steepest contraction. Minimal bandit and partially observable gridworld experiments provide controlled witnesses of these mechanisms. In the examined settings, behavioural distance decreases under convex aggregation and under continued optimisation with skewed latent priors, and in these experiments it precedes degradation under latent prior shift. These results identify structural conditions under which probe-conditioned behavioural separation is not preserved under common policy transformations.

Fonte: arXiv cs.LG

NLP/LLMsScore 88

GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning

arXiv:2602.21492v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non-stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine-tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low-utility problems. We propose GradAlign, a gradient-aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and low-utility training corpus, showing that GradAlign consistently outperforms existing baselines, underscoring the importance of directional gradient signals in navigating non-stationary policy optimization and yielding more stable training and improved final performance. We release our implementation at https://github.com/StigLidu/GradAlign

Fonte: arXiv cs.LG

RLScore 85

Generalisation of RLHF under Reward Shift and Clipped KL Regularisation

arXiv:2602.21765v1 Announce Type: cross Abstract: Alignment and adaptation in large language models heavily rely on reinforcement learning from human feedback (RLHF); yet, theoretical understanding of its generalisability remains premature, especially when the learned reward could shift, and the KL control is estimated and clipped. To address this issue, we develop generalisation theory for RLHF that explicitly accounts for (1) \emph{reward shift}: reward models are trained on preference data from earlier or mixed behaviour policies while RLHF optimises the current policy on its own rollouts; and (2) \emph{clipped KL regularisation}: the KL regulariser is estimated from sampled log-probability ratios and then clipped for stabilisation, resulting in an error to RLHF. We present generalisation bounds for RLHF, suggesting that the generalisation error stems from a sampling error from prompts and rollouts, a reward shift error, and a KL clipping error. We also discuss special cases of (1) initialising RLHF parameters with a uniform prior over a finite space, and (2) training RLHF by stochastic gradient descent, as an Ornstein-Uhlenbeck process. The theory yields practical implications in (1) optimal KL clipping threshold, and (2) budget allocation in prompts, rollouts, and preference data.

Fonte: arXiv stat.ML

MultimodalScore 85

See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs

arXiv:2602.21497v1 Announce Type: new Abstract: Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps-even if logically valid-can still lead to incorrect final answers. Existing solutions attempt to mitigate this issue by training models to "think with images" via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. Differently, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model's reasoning generation. When existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer. Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without additional training.

Fonte: arXiv cs.CV

NLP/LLMsScore 85

Bayesian Generative Adversarial Networks via Gaussian Approximation for Tabular Data Synthesis

arXiv:2602.21948v1 Announce Type: cross Abstract: Generative Adversarial Networks (GAN) have been used in many studies to synthesise mixed tabular data. Conditional tabular GAN (CTGAN) have been the most popular variant but struggle to effectively navigate the risk-utility trade-off. Bayesian GAN have received less attention for tabular data, but have been explored with unstructured data such as images and text. The most used technique employed in Bayesian GAN is Markov Chain Monte Carlo (MCMC), but it is computationally intensive, particularly in terms of weight storage. In this paper, we introduce Gaussian Approximation of CTGAN (GACTGAN), an integration of the Bayesian posterior approximation technique using Stochastic Weight Averaging-Gaussian (SWAG) within the CTGAN generator to synthesise tabular data, reducing computational overhead after the training phase. We demonstrate that GACTGAN yields better synthetic data compared to CTGAN, achieving better preservation of tabular structure and inferential statistics with less privacy risk. These results highlight GACTGAN as a simpler, effective implementation of Bayesian tabular synthesis.

Fonte: arXiv stat.ML

NLP/LLMsScore 85

Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling

arXiv:2602.21728v1 Announce Type: new Abstract: The reasoning process of Large Language Models (LLMs) is often plagued by hallucinations and missing facts in question-answering tasks. A promising solution is to ground LLMs' answers in verifiable knowledge sources, such as Knowledge Graphs (KGs). Prevailing KG-enhanced methods typically constrained LLM reasoning either by enforcing rules during generation or by imitating paths from a fixed set of demonstrations. However, they naturally confined the reasoning patterns of LLMs within the scope of prior experience or fine-tuning data, limiting their generalizability to out-of-distribution graph reasoning problems. To tackle this problem, in this paper, we propose Explore-on-Graph (EoG), a novel framework that encourages LLMs to autonomously explore a more diverse reasoning space on KGs. To incentivize exploration and discovery of novel reasoning paths, we propose to introduce reinforcement learning during training, whose reward is the correctness of the reasoning paths' final answers. To enhance the efficiency and meaningfulness of the exploration, we propose to incorporate path information as additional reward signals to refine the exploration process and reduce futile efforts. Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also even closed-source LLMs.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning

arXiv:2602.21720v1 Announce Type: new Abstract: Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular. Following prior work that relates cross-linguistic tendencies to biases in learning, we ask whether regular systems are common because regularity facilitates learning. Adopting methods from the Reinforcement Learning literature, we confirm that highly regular human(-like) systems are easier to learn than unattested but possible irregular systems. This asymmetry emerges under the natural assumption that recursive numeral systems are designed for generalisation from limited data to represent all integers exactly. We also find that the influence of regularity on learnability is absent for unnatural, highly irregular systems, whose learnability is influenced instead by signal length, suggesting that different pressures may influence learnability differently in different parts of the space of possible numeral systems. Our results contribute to the body of work linking learnability to cross-linguistic prevalence.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization

arXiv:2602.21743v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) have significantly advanced the reasoning capabilities of large language models. Extending these methods to multimodal settings, however, faces a critical challenge: the instability of std-based normalization, which is easily distorted by extreme samples with nearly positive or negative rewards. Unlike pure-text LLMs, multimodal models are particularly sensitive to such distortions, as both perceptual and reasoning errors influence their responses. To address this, we characterize each sample by its difficulty, defined through perceptual complexity (measured via visual entropy) and reasoning uncertainty (captured by model confidence). Building on this characterization, we propose difficulty-aware group normalization (Durian), which re-groups samples by difficulty levels and shares the std within each group. Our approach preserves GRPO's intra-group distinctions while eliminating sensitivity to extreme cases, yielding significant performance gains across multiple multimodal reasoning benchmarks.

Fonte: arXiv cs.CV

NLP/LLMsScore 85

Distill and Align Decomposition for Enhanced Claim Verification

arXiv:2602.21857v1 Announce Type: new Abstract: Complex claim verification requires decomposing sentences into verifiable subclaims, yet existing methods struggle to align decomposition quality with verification performance. We propose a reinforcement learning (RL) approach that jointly optimizes decomposition quality and verifier alignment using Group Relative Policy Optimization (GRPO). Our method integrates: (i) structured sequential reasoning; (ii) supervised finetuning on teacher-distilled exemplars; and (iii) a multi-objective reward balancing format compliance, verifier alignment, and decomposition quality. Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to (71.75%) macro-F1, outperforming prompt-based approaches ((+1.99), (+6.24)) and existing RL methods ((+5.84)). Human evaluation confirms the high quality of the generated subclaims. Our framework enables smaller language models to achieve state-of-the-art claim verification by jointly optimising for verification accuracy and decomposition quality.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

RADAR: Reasoning as Discrimination with Aligned Representations for LLM-based Knowledge Graph Reasoning

arXiv:2602.21951v1 Announce Type: new Abstract: Knowledge graph reasoning (KGR) infers missing facts, with recent advances increasingly harnessing the semantic priors and reasoning abilities of Large Language Models (LLMs). However, prevailing generative paradigms are prone to memorizing surface-level co-occurrences rather than learning genuine relational semantics, limiting out-of-distribution generalization. To address this, we propose RADAR, which reformulates KGR from generative pattern matching to discriminative relational reasoning. We recast KGR as discriminative entity selection, where reinforcement learning enforces relative entity separability beyond token-likelihood imitation. Leveraging this separability, inference operates directly in representation space, ensuring consistency with the discriminative optimization and bypassing generation-induced hallucinations. Across four benchmarks, RADAR achieves 5-6% relative gains on link prediction and triple classification over strong LLM baselines, while increasing task-relevant mutual information in intermediate representations by 62.9%, indicating more robust and transferable relational reasoning.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning

arXiv:2602.21655v1 Announce Type: new Abstract: Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models. We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?). To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate \textbf{C}omplete and \textbf{C}orrect \textbf{Captions}. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries, which are derived from the caption decomposition. Our symmetric dual-reward optimization jointly maximizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria. Extensive experiments across standard captioning benchmarks show consistent improvements, offering a principled path to training caption models beyond human-annotation imitation.

Fonte: arXiv cs.CV

MultimodalScore 85

Which Tool Response Should I Trust? Tool-Expertise-Aware Chest X-ray Agent with Multimodal Agentic Learning

arXiv:2602.21517v1 Announce Type: new Abstract: AI agents with tool-use capabilities show promise for integrating the domain expertise of various tools. In the medical field, however, tools are usually AI models that are inherently error-prone and can produce contradictory responses. Existing research on medical agents lacks sufficient understanding of the tools' realistic reliability and thus cannot effectively resolve tool conflicts. To address this gap, this paper introduces a framework that enables an agent to interact with tools and empirically learn their practical trustworthiness across different types of multimodal queries via agentic learning. As a concrete instantiation, we focus on chest X-ray analysis and present a tool-expertise-aware chest X-ray agent (TEA-CXA). When tool outputs disagree, the agent experimentally accepts or rejects multimodal tool results, receives rewards, and learns which tool to trust for each query type. Importantly, TEA-CXA extends existing codebases for reinforcement learning with multi-turn tool-calling that focus on textual inputs, to support multimodal contexts effectively. In addition, we enhance the codebase for medical use scenarios by supporting multiple tool calls in one turn, parallel tool inference, and multi-image accommodation within a single user query. Our code framework is applicable to general medical research on multi-turn tool-calling reinforcement learning in multimodal settings. Experiments show that TEA-CXA outperforms the state-of-the-art methods and a comprehensive set of baselines. Code will be released.

Fonte: arXiv cs.CV

RLScore 85

Decoupling Return-to-Go for Efficient Decision Transformer

arXiv:2601.15953v1 Announce Type: new Abstract: The Decision Transformer (DT) has established a powerful sequence modeling approach to offline reinforcement learning. It conditions its action predictions on Return-to-Go (RTG), using it both to distinguish trajectory quality during training and to guide action generation at inference. In this work, we identify a critical redundancy in this design: feeding the entire sequence of RTGs into the Transformer is theoretically unnecessary, as only the most recent RTG affects action prediction. We show that this redundancy can impair DT's performance through experiments. To resolve this, we propose the Decoupled DT (DDT). DDT simplifies the architecture by processing only observation and action sequences through the Transformer, using the latest RTG to guide the action prediction. This streamlined approach not only improves performance but also reduces computational cost. Our experiments show that DDT significantly outperforms DT and establishes competitive performance against state-of-the-art DT variants across multiple offline RL tasks.

Fonte: arXiv cs.AI

RLScore 85

Off-Policy Actor-Critic with Sigmoid-Bounded Entropy for Real-World Robot Learning

arXiv:2601.15761v1 Announce Type: new Abstract: Deploying reinforcement learning in the real world remains challenging due to sample inefficiency, sparse rewards, and noisy visual observations. Prior work leverages demonstrations and human feedback to improve learning efficiency and robustness. However, offline-to-online methods need large datasets and can be unstable, while VLA-assisted RL relies on large-scale pretraining and fine-tuning. As a result, a low-cost real-world RL method with minimal data requirements has yet to emerge. We introduce \textbf{SigEnt-SAC}, an off-policy actor-critic method that learns from scratch using a single expert trajectory. Our key design is a sigmoid-bounded entropy term that prevents negative-entropy-driven optimization toward out-of-distribution actions and reduces Q-function oscillations. We benchmark SigEnt-SAC on D4RL tasks against representative baselines. Experiments show that SigEnt-SAC substantially alleviates Q-function oscillations and reaches a 100\% success rate faster than prior methods. Finally, we validate SigEnt-SAC on four real-world robotic tasks across multiple embodiments, where agents learn from raw images and sparse rewards; results demonstrate that SigEnt-SAC can learn successful policies with only a small number of real-world interactions, suggesting a low-cost and practical pathway for real-world RL deployment.

Fonte: arXiv cs.AI

RLScore 95

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

arXiv:2601.16163v1 Announce Type: new Abstract: Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at https://research.nvidia.com/labs/dir/cosmos-policy/

Fonte: arXiv cs.AI

NLP/LLMsScore 85

From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models

arXiv:2601.15690v1 Announce Type: new Abstract: While Large Language Models (LLMs) show remarkable capabilities, their unreliability remains a critical barrier to deployment in high-stakes domains. This survey charts a functional evolution in addressing this challenge: the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior. We demonstrate how uncertainty is leveraged as an active control signal across three frontiers: in \textbf{advanced reasoning} to optimize computation and trigger self-correction; in \textbf{autonomous agents} to govern metacognitive decisions about tool use and information seeking; and in \textbf{reinforcement learning} to mitigate reward hacking and enable self-improvement via intrinsic rewards. By grounding these advancements in emerging theoretical frameworks like Bayesian methods and Conformal Prediction, we provide a unified perspective on this transformative trend. This survey provides a comprehensive overview, critical analysis, and practical design patterns, arguing that mastering the new trend of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

On the Generalization Gap in LLM Planning: Tests and Verifier-Reward RL

arXiv:2601.14456v1 Announce Type: new Abstract: Recent work shows that fine-tuned Large Language Models (LLMs) can achieve high valid plan rates on PDDL planning tasks. However, it remains unclear whether this reflects transferable planning competence or domain-specific memorization. In this work, we fine-tune a 1.7B-parameter LLM on 40,000 domain-problem-plan tuples from 10 IPC 2023 domains, and evaluate both in-domain and cross-domain generalization. While the model reaches 82.9% valid plan rate in in-domain conditions, it achieves 0% on two unseen domains. To analyze this failure, we introduce three diagnostic interventions, namely (i) instance-wise symbol anonymization, (ii) compact plan serialization, and (iii) verifier-reward fine-tuning using the VAL validator as a success-focused reinforcement signal. Symbol anonymization and compact serialization cause significant performance drops despite preserving plan semantics, thus revealing strong sensitivity to surface representations. Verifier-reward fine-tuning reaches performance saturation in half the supervised training epochs, but does not improve cross-domain generalization. For the explored configurations, in-domain performance plateaus around 80%, while cross-domain performance collapses, suggesting that our fine-tuned model relies heavily on domain-specific patterns rather than transferable planning competence in this setting. Our results highlight a persistent generalization gap in LLM-based planning and provide diagnostic tools for studying its causes.

Fonte: arXiv cs.AI

RLScore 85

Vehicle Routing with Finite Time Horizon using Deep Reinforcement Learning with Improved Network Embedding

arXiv:2601.15131v1 Announce Type: new Abstract: In this paper, we study the vehicle routing problem with a finite time horizon. In this routing problem, the objective is to maximize the number of customer requests served within a finite time horizon. We present a novel routing network embedding module which creates local node embedding vectors and a context-aware global graph representation. The proposed Markov decision process for the vehicle routing problem incorporates the node features, the network adjacency matrix and the edge features as components of the state space. We incorporate the remaining finite time horizon into the network embedding module to provide a proper routing context to the embedding module. We integrate our embedding module with a policy gradient-based deep Reinforcement Learning framework to solve the vehicle routing problem with finite time horizon. We trained and validated our proposed routing method on real-world routing networks, as well as synthetically generated Euclidean networks. Our experimental results show that our method achieves a higher customer service rate than the existing routing methods. Additionally, the solution time of our method is significantly lower than that of the existing methods.

Fonte: arXiv cs.AI

RLScore 85

DARA: Few-shot Budget Allocation in Online Advertising via In-Context Decision Making with RL-Finetuned LLMs

arXiv:2601.14711v1 Announce Type: new Abstract: Optimizing the advertiser's cumulative value of winning impressions under budget constraints poses a complex challenge in online advertising, under the paradigm of AI-Generated Bidding (AIGB). Advertisers often have personalized objectives but limited historical interaction data, resulting in few-shot scenarios where traditional reinforcement learning (RL) methods struggle to perform effectively. Large Language Models (LLMs) offer a promising alternative for AIGB by leveraging their in-context learning capabilities to generalize from limited data. However, they lack the numerical precision required for fine-grained optimization. To address this limitation, we introduce GRPO-Adaptive, an efficient LLM post-training strategy that enhances both reasoning and numerical precision by dynamically updating the reference policy during training. Built upon this foundation, we further propose DARA, a novel dual-phase framework that decomposes the decision-making process into two stages: a few-shot reasoner that generates initial plans via in-context prompting, and a fine-grained optimizer that refines these plans using feedback-driven reasoning. This separation allows DARA to combine LLMs' in-context learning strengths with precise adaptability required by AIGB tasks. Extensive experiments on both real-world and synthetic data environments demonstrate that our approach consistently outperforms existing baselines in terms of cumulative advertiser value under budget constraints.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Large Language Model-Powered Evolutionary Code Optimization on a Phylogenetic Tree

arXiv:2601.14523v1 Announce Type: new Abstract: Optimizing scientific computing algorithms for modern GPUs is a labor-intensive and iterative process involving repeated code modification, benchmarking, and tuning across complex hardware and software stacks. Recent work has explored large language model (LLM)-assisted evolutionary methods for automated code optimization, but these approaches primarily rely on outcome-based selection and random mutation, underutilizing the rich trajectory information generated during iterative optimization. We propose PhyloEvolve, an LLM-agent system that reframes GPU-oriented algorithm optimization as an In-Context Reinforcement Learning (ICRL) problem. This formulation enables trajectory-conditioned reuse of optimization experience without model retraining. PhyloEvolve integrates Algorithm Distillation and prompt-based Decision Transformers into an iterative workflow, treating sequences of algorithm modifications and performance feedback as first-class learning signals. To organize optimization history, we introduce a phylogenetic tree representation that captures inheritance, divergence, and recombination among algorithm variants, enabling backtracking, cross-lineage transfer, and reproducibility. The system combines elite trajectory pooling, multi-island parallel exploration, and containerized execution to balance exploration and exploitation across heterogeneous hardware. We evaluate PhyloEvolve on scientific computing workloads including PDE solvers, manifold learning, and spectral graph algorithms, demonstrating consistent improvements in runtime, memory efficiency, and correctness over baseline and evolutionary methods. Code is published at: https://github.com/annihi1ation/phylo_evolve

Fonte: arXiv cs.AI

NLP/LLMsScore 90

Knowledge Graphs are Implicit Reward Models: Path-Derived Signals Enable Compositional Reasoning

arXiv:2601.15160v1 Announce Type: new Abstract: Large language models have achieved near-expert performance in structured reasoning domains like mathematics and programming, yet their ability to perform compositional multi-hop reasoning in specialized scientific fields remains limited. We propose a bottom-up learning paradigm in which models are grounded in axiomatic domain facts and compose them to solve complex, unseen tasks. To this end, we present a post-training pipeline, based on a combination of supervised fine-tuning and reinforcement learning (RL), in which knowledge graphs act as implicit reward models. By deriving novel reward signals from knowledge graph paths, we provide verifiable, scalable, and grounded supervision that encourages models to compose intermediate axioms rather than optimize only final answers during RL. We validate this approach in the medical domain, training a 14B model on short-hop reasoning paths (1-3 hops) and evaluating its zero-shot generalization to complex multi-hop queries (4-5 hops). Our experiments show that path-derived rewards act as a "compositional bridge", enabling our model to significantly outperform much larger models and frontier systems like GPT-5.2 and Gemini 3 Pro, on the most difficult reasoning tasks. Furthermore, we demonstrate the robustness of our approach to adversarial perturbations against option-shuffling stress tests. This work suggests that grounding the reasoning process in structured knowledge is a scalable and efficient path toward intelligent reasoning.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

arXiv:2601.14652v1 Announce Type: new Abstract: While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity - agent orchestration is performed using sequential, code-level execution that limits global system-level holistic reasoning and scales poorly with agent complexity - and (2) efficacy uncertainty - MAS are deployed without understanding if there are tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented sub-agents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and sub-agents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

CI4A: Semantic Component Interfaces for Agents Empowering Web Automation

arXiv:2601.14790v1 Announce Type: new Abstract: While Large Language Models demonstrate remarkable proficiency in high-level semantic planning, they remain limited in handling fine-grained, low-level web component manipulations. To address this limitation, extensive research has focused on enhancing model grounding capabilities through techniques such as Reinforcement Learning. However, rather than compelling agents to adapt to human-centric interfaces, we propose constructing interaction interfaces specifically optimized for agents. This paper introduces Component Interface for Agent (CI4A), a semantic encapsulation mechanism that abstracts the complex interaction logic of UI components into a set of unified tool primitives accessible to agents. We implemented CI4A within Ant Design, an industrial-grade front-end framework, covering 23 categories of commonly used UI components. Furthermore, we developed a hybrid agent featuring an action space that dynamically updates according to the page state, enabling flexible invocation of available CI4A tools. Leveraging the CI4A-integrated Ant Design, we refactored and upgraded the WebArena benchmark to evaluate existing SoTA methods. Experimental results demonstrate that the CI4A-based agent significantly outperforms existing approaches, achieving a new SoTA task success rate of 86.3%, alongside substantial improvements in execution efficiency.

Fonte: arXiv cs.AI

RLScore 85

Survival is the Only Reward: Sustainable Self-Training Through Environment-Mediated Selection

arXiv:2601.12310v1 Announce Type: new Abstract: Self-training systems often degenerate due to the lack of an external criterion for judging data quality, leading to reward hacking and semantic drift. This paper provides a proof-of-concept system architecture for stable self-training under sparse external feedback and bounded memory, and empirically characterises its learning dynamics and failure modes. We introduce a self-training architecture in which learning is mediated exclusively by environmental viability, rather than by reward, objective functions, or externally defined fitness criteria. Candidate behaviours are executed under real resource constraints, and only those whose environmental effects both persist and preserve the possibility of future interaction are propagated. The environment does not provide semantic feedback, dense rewards, or task-specific supervision; selection operates solely through differential survival of behaviours as world-altering events, making proxy optimisation impossible and rendering reward-hacking evolutionarily unstable. Analysis of semantic dynamics shows that improvement arises primarily through the persistence of effective and repeatable strategies under a regime of consolidation and pruning, a paradigm we refer to as negative-space learning (NSL), and that models develop meta-learning strategies (such as deliberate experimental failure in order to elicit informative error messages) without explicit instruction. This work establishes that environment-grounded selection enables sustainable open-ended self-improvement, offering a viable path toward more robust and generalisable autonomous systems without reliance on human-curated data or complex reward shaping.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning

arXiv:2601.11957v1 Announce Type: new Abstract: Overlapping calendar invitations force busy professionals to repeatedly decide which meetings to attend, reschedule, or decline. We refer to this preference-driven decision process as calendar conflict resolution. Automating such process is crucial yet challenging. Scheduling logistics drain hours, and human delegation often fails at scale, which motivate we to ask: Can we trust large language model (LLM) or language agent to manager time? To enable systematic study of this question, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution. Conflicts are presented sequentially and agents receive feedback after each round, requiring them to infer and adapt to user preferences progressively. Our experiments show that current LLM agents perform poorly with high error rates, e.g., Qwen-3-30B-Think has 35% average error rate. To address this gap, we propose PEARL, a reinforcement-learning framework that augments language agent with an external memory module and optimized round-wise reward design, enabling agent to progressively infer and adapt to user preferences on-the-fly. Experiments on CalConflictBench shows that PEARL achieves 0.76 error reduction rate, and 55% improvement in average error rate compared to the strongest baseline.

Fonte: arXiv cs.CL

NLP/LLMsScore 85

UniMo: Unified Motion Generation and Understanding with Chain of Thought

arXiv:2601.12126v1 Announce Type: new Abstract: Existing 3D human motion generation and understanding methods often exhibit limited interpretability, restricting effective mutual enhancement between these inherently related tasks. While current unified frameworks based on large language models (LLMs) leverage linguistic priors, they frequently encounter challenges in semantic alignment and task coherence. Moreover, the next-token prediction paradigm in LLMs is ill-suited for motion sequences, causing cumulative prediction errors. To address these limitations, we propose UniMo, a novel framework that integrates motion-language information and interpretable chain of thought (CoT) reasoning into the LLM via supervised fine-tuning (SFT). We further introduce reinforcement learning with Group Relative Policy Optimization (GRPO) as a post-training strategy that optimizes over groups of tokens to enforce structural correctness and semantic alignment, mitigating cumulative errors in motion token prediction. Extensive experiments demonstrate that UniMo significantly outperforms existing unified and task-specific models, achieving state-of-the-art performance in both motion generation and understanding.

Fonte: arXiv cs.AI

RLScore 85

Multi-agent DRL-based Lane Change Decision Model for Cooperative Planning in Mixed Traffic

arXiv:2601.11809v1 Announce Type: new Abstract: Connected automated vehicles (CAVs) possess the ability to communicate and coordinate with one another, enabling cooperative platooning that enhances both energy efficiency and traffic flow. However, during the initial stage of CAV deployment, the sparse distribution of CAVs among human-driven vehicles reduces the likelihood of forming effective cooperative platoons. To address this challenge, this study proposes a hybrid multi-agent lane change decision model aimed at increasing CAV participation in cooperative platooning and maximizing its associated benefits. The proposed model employs the QMIX framework, integrating traffic data processed through a convolutional neural network (CNN-QMIX). This architecture addresses a critical issue in dynamic traffic scenarios by enabling CAVs to make optimal decisions irrespective of the varying number of CAVs present in mixed traffic. Additionally, a trajectory planner and a model predictive controller are designed to ensure smooth and safe lane-change execution. The proposed model is trained and evaluated within a microsimulation environment under varying CAV market penetration rates. The results demonstrate that the proposed model efficiently manages fluctuating traffic agent numbers, significantly outperforming the baseline rule-based models. Notably, it enhances cooperative platooning rates up to 26.2\%, showcasing its potential to optimize CAV cooperation and traffic dynamics during the early stage of deployment.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

LIBRA: Language Model Informed Bandit Recourse Algorithm for Personalized Treatment Planning

arXiv:2601.11905v1 Announce Type: new Abstract: We introduce a unified framework that seamlessly integrates algorithmic recourse, contextual bandits, and large language models (LLMs) to support sequential decision-making in high-stakes settings such as personalized medicine. We first introduce the recourse bandit problem, where a decision-maker must select both a treatment action and a feasible, minimal modification to mutable patient features. To address this problem, we develop the Generalized Linear Recourse Bandit (GLRB) algorithm. Building on this foundation, we propose LIBRA, a Language Model-Informed Bandit Recourse Algorithm that strategically combines domain knowledge from LLMs with the statistical rigor of bandit learning. LIBRA offers three key guarantees: (i) a warm-start guarantee, showing that LIBRA significantly reduces initial regret when LLM recommendations are near-optimal; (ii) an LLM-effort guarantee, proving that the algorithm consults the LLM only $O(\log^2 T)$ times, where $T$ is the time horizon, ensuring long-term autonomy; and (iii) a robustness guarantee, showing that LIBRA never performs worse than a pure bandit algorithm even when the LLM is unreliable. We further establish matching lower bounds that characterize the fundamental difficulty of the recourse bandit problem and demonstrate the near-optimality of our algorithms. Experiments on synthetic environments and a real hypertension-management case study confirm that GLRB and LIBRA improve regret, treatment quality, and sample efficiency compared with standard contextual bandits and LLM-only benchmarks. Our results highlight the promise of recourse-aware, LLM-assisted bandit algorithms for trustworthy LLM-bandits collaboration in personalized high-stakes decision-making.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Towards AGI A Pragmatic Approach Towards Self Evolving Agent

arXiv:2601.11658v1 Announce Type: new Abstract: Large Language Model (LLM) based agents are powerful yet fundamentally static after deployment, lacking the ability to autonomously expand capabilities, generate new tools, or evolve their reasoning. This work introduces a hierarchical self-evolving multi-agent framework that integrates a Base LLM, an operational SLM agent, a Code-Generation LLM, and a Teacher-LLM to enable continuous adaptation. The workflow begins with the agent attempting a task using reasoning and existing tools; if unsuccessful, it escalates to tool synthesis through the Code-Gen LLM, and when failures persist, it triggers an evolution phase using Curriculum Learning (CL), Reward-Based Learning (RL), or Genetic Algorithm (GA) evolution. Using the TaskCraft dataset rich in hierarchical tasks, tool-use traces, and difficulty scaling we evaluate these paradigms. CL delivers fast recovery and strong generalization, RL excels on high-difficulty tasks, and GA offers high behavioral diversity. Across all settings, evolved agents outperform their originals, demonstrating robust, autonomous, self-improving agentic evolution.

Fonte: arXiv cs.CL

RLScore 85

Risk-Aware Human-in-the-Loop Framework with Adaptive Intrusion Response for Autonomous Vehicles

arXiv:2601.11781v1 Announce Type: new Abstract: Autonomous vehicles must remain safe and effective when encountering rare long-tailed scenarios or cyber-physical intrusions during driving. We present RAIL, a risk-aware human-in-the-loop framework that turns heterogeneous runtime signals into calibrated control adaptations and focused learning. RAIL fuses three cues (curvature actuation integrity, time-to-collision proximity, and observation-shift consistency) into an Intrusion Risk Score (IRS) via a weighted Noisy-OR. When IRS exceeds a threshold, actions are blended with a cue-specific shield using a learned authority, while human override remains available; when risk is low, the nominal policy executes. A contextual bandit arbitrates among shields based on the cue vector, improving mitigation choices online. RAIL couples Soft Actor-Critic (SAC) with risk-prioritized replay and dual rewards so that takeovers and near misses steer learning while nominal behavior remains covered. On MetaDrive, RAIL achieves a Test Return (TR) of 360.65, a Test Success Rate (TSR) of 0.85, a Test Safety Violation (TSV) of 0.75, and a Disturbance Rate (DR) of 0.0027, while logging only 29.07 training safety violations, outperforming RL, safe RL, offline/imitation learning, and prior HITL baselines. Under Controller Area Network (CAN) injection and LiDAR spoofing attacks, it improves Success Rate (SR) to 0.68 and 0.80, lowers the Disengagement Rate under Attack (DRA) to 0.37 and 0.03, and reduces the Attack Success Rate (ASR) to 0.34 and 0.11. In CARLA, RAIL attains a TR of 1609.70 and TSR of 0.41 with only 8000 steps.

Fonte: arXiv cs.AI

RLScore 85

ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents

arXiv:2601.12294v1 Announce Type: new Abstract: Reward-guided search methods have demonstrated strong potential in enhancing tool-using agents by effectively guiding sampling and exploration over complex action spaces. As a core design, those search methods utilize process reward models (PRMs) to provide step-level rewards, enabling more fine-grained monitoring. However, there is a lack of systematic and reliable evaluation benchmarks for PRMs in tool-using settings. In this paper, we introduce ToolPRMBench, a large-scale benchmark specifically designed to evaluate PRMs for tool-using agents. ToolPRMBench is built on top of several representative tool-using benchmarks and converts agent trajectories into step-level test cases. Each case contains the interaction history, a correct action, a plausible but incorrect alternative, and relevant tool metadata. We respectively utilize offline sampling to isolate local single-step errors and online sampling to capture realistic multi-step failures from full agent rollouts. A multi-LLM verification pipeline is proposed to reduce label noise and ensure data quality. We conduct extensive experiments across large language models, general PRMs, and tool-specialized PRMs on ToolPRMBench. The results reveal clear differences in PRM effectiveness and highlight the potential of specialized PRMs for tool-using. Code and data will be released at https://github.com/David-Li0406/ToolPRMBench.

Fonte: arXiv cs.AI

RLScore 85

Optimal Power Allocation and Sub-Optimal Channel Assignment for Downlink NOMA Systems Using Deep Reinforcement Learning

arXiv:2601.12242v1 Announce Type: new Abstract: In recent years, Non-Orthogonal Multiple Access (NOMA) system has emerged as a promising candidate for multiple access frameworks due to the evolution of deep machine learning, trying to incorporate deep machine learning into the NOMA system. The main motivation for such active studies is the growing need to optimize the utilization of network resources as the expansion of the internet of things (IoT) caused a scarcity of network resources. The NOMA addresses this need by power multiplexing, allowing multiple users to access the network simultaneously. Nevertheless, the NOMA system has few limitations. Several works have proposed to mitigate this, including the optimization of power allocation known as joint resource allocation(JRA) method, and integration of the JRA method and deep reinforcement learning (JRA-DRL). Despite this, the channel assignment problem remains unclear and requires further investigation. In this paper, we propose a deep reinforcement learning framework incorporating replay memory with an on-policy algorithm, allocating network resources in a NOMA system to generalize the learning. Also, we provide extensive simulations to evaluate the effects of varying the learning rate, batch size, type of model, and the number of features in the state.

Fonte: arXiv cs.AI

RLScore 85

Policy-Based Deep Reinforcement Learning Hyperheuristics for Job-Shop Scheduling Problems

arXiv:2601.11189v1 Announce Type: new Abstract: This paper proposes a policy-based deep reinforcement learning hyper-heuristic framework for solving the Job Shop Scheduling Problem. The hyper-heuristic agent learns to switch scheduling rules based on the system state dynamically. We extend the hyper-heuristic framework with two key mechanisms. First, action prefiltering restricts decision-making to feasible low-level actions, enabling low-level heuristics to be evaluated independently of environmental constraints and providing an unbiased assessment. Second, a commitment mechanism regulates the frequency of heuristic switching. We investigate the impact of different commitment strategies, from step-wise switching to full-episode commitment, on both training behavior and makespan. Additionally, we compare two action selection strategies at the policy level: deterministic greedy selection and stochastic sampling. Computational experiments on standard JSSP benchmarks demonstrate that the proposed approach outperforms traditional heuristics, metaheuristics, and recent neural network-based scheduling methods

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration

arXiv:2601.10744v1 Announce Type: new Abstract: An ideal embodied agent should possess lifelong learning capabilities to handle long-horizon and complex tasks, enabling continuous operation in general environments. This not only requires the agent to accurately accomplish given tasks but also to leverage long-term episodic memory to optimize decision-making. However, existing mainstream one-shot embodied tasks primarily focus on task completion results, neglecting the crucial process of exploration and memory utilization. To address this, we propose Long-term Memory Embodied Exploration (LMEE), which aims to unify the agent's exploratory cognition and decision-making behaviors to promote lifelong learning.We further construct a corresponding dataset and benchmark, LMEE-Bench, incorporating multi-goal navigation and memory-based question answering to comprehensively evaluate both the process and outcome of embodied exploration. To enhance the agent's memory recall and proactive exploration capabilities, we propose MemoryExplorer, a novel method that fine-tunes a multimodal large language model through reinforcement learning to encourage active memory querying. By incorporating a multi-task reward function that includes action prediction, frontier selection, and question answering, our model achieves proactive exploration. Extensive experiments against state-of-the-art embodied exploration models demonstrate that our approach achieves significant advantages in long-horizon embodied tasks.

Fonte: arXiv cs.AI

MultimodalScore 85

TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech

arXiv:2601.11178v1 Announce Type: new Abstract: Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a complex interplay of audio, visual, and textual cues. While automated systems can flag hate speech with high accuracy, they often function as "black boxes" that fail to provide the granular, interpretable evidence, such as precise timestamps and target identities, required for effective human-in-the-loop moderation. In this work, we introduce TANDEM, a unified framework that transforms audio-visual hate detection from a binary classification task into a structured reasoning problem. Our approach employs a novel tandem reinforcement learning strategy where vision-language and audio-language models optimize each other through self-constrained cross-modal context, stabilizing reasoning over extended temporal sequences without requiring dense frame-level supervision. Experiments across three benchmark datasets demonstrate that TANDEM significantly outperforms zero-shot and context-augmented baselines, achieving 0.73 F1 in target identification on HateMM (a 30% improvement over state-of-the-art) while maintaining precise temporal grounding. We further observe that while binary detection is robust, differentiating between offensive and hateful content remains challenging in multi-class settings due to inherent label ambiguity and dataset imbalance. More broadly, our findings suggest that structured, interpretable alignment is achievable even in complex multimodal settings, offering a blueprint for the next generation of transparent and actionable online safety moderation tools.

Fonte: arXiv cs.AI

RLScore 85

BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search

arXiv:2601.11037v1 Announce Type: new Abstract: RL-based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While this approach significantly enhances accuracy with agent policies optimized via large-scale reinforcement learning, we identify a critical gap in reliability: these agents fail to recognize their reasoning boundaries and rarely admit ``I DON'T KNOW'' (IDK) even when evidence is insufficient or reasoning reaches its limit. The lack of reliability often leads to plausible but unreliable answers, introducing significant risks in many real-world scenarios. To this end, we propose Boundary-Aware Policy Optimization (BAPO), a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy. BAPO introduces two key components: (i) a group-based boundary-aware reward that encourages an IDK response only when the reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting IDK as a shortcut. Extensive experiments on four benchmarks demonstrate that BAPO substantially enhances the overall reliability of agentic search.

Fonte: arXiv cs.AI

RLScore 85

Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning

arXiv:2601.10306v1 Announce Type: new Abstract: While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by sparsity of outcome rewards. This limitation fails to penalize ungrounded "lucky guesses," leaving the critical process of needle-in-a-haystack evidence retrieval largely unsupervised. To address this, we propose EAPO (Evidence-Augmented Policy Optimization). We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling that precise evidence extraction is the decisive bottleneck for long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward, providing dense process supervision to explicitly improve evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism. This mechanism iteratively refines the reward model using outcome-consistent rollouts, sharpening its discriminative capability to ensure precise process guidance. Comprehensive evaluations across eight benchmarks demonstrate that EAPO significantly enhances long-context reasoning performance compared to SOTA baselines.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

DecisionLLM: Large Language Models for Long Sequence Decision Exploration

arXiv:2601.10148v1 Announce Type: new Abstract: Long-sequence decision-making, which is usually addressed through reinforcement learning (RL), is a critical component for optimizing strategic operations in dynamic environments, such as real-time bidding in computational advertising. The Decision Transformer (DT) introduced a powerful paradigm by framing RL as an autoregressive sequence modeling problem. Concurrently, Large Language Models (LLMs) have demonstrated remarkable success in complex reasoning and planning tasks. This inspires us whether LLMs, which share the same Transformer foundation, but operate at a much larger scale, can unlock new levels of performance in long-horizon sequential decision-making problem. This work investigates the application of LLMs to offline decision making tasks. A fundamental challenge in this domain is the LLMs' inherent inability to interpret continuous values, as they lack a native understanding of numerical magnitude and order when values are represented as text strings. To address this, we propose treating trajectories as a distinct modality. By learning to align trajectory data with natural language task descriptions, our model can autoregressively predict future decisions within a cohesive framework we term DecisionLLM. We establish a set of scaling laws governing this paradigm, demonstrating that performance hinges on three factors: model scale, data volume, and data quality. In offline experimental benchmarks and bidding scenarios, DecisionLLM achieves strong performance. Specifically, DecisionLLM-3B outperforms the traditional Decision Transformer (DT) by 69.4 on Maze2D umaze-v1 and by 0.085 on AuctionNet. It extends the AIGB paradigm and points to promising directions for future exploration in online bidding.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

PaperScout: An Autonomous Agent for Academic Paper Search with Process-Aware Sequence-Level Policy Optimization

arXiv:2601.10029v1 Announce Type: new Abstract: Academic paper search is a fundamental task in scientific research, yet most existing approaches rely on rigid, predefined workflows that struggle with complex, conditional queries. To address this limitation, we propose PaperScout, an autonomous agent that reformulates paper search as a sequential decision-making process. Unlike static workflows, PaperScout dynamically decides whether, when, and how to invoke search and expand tools based on accumulated retrieval context. However, training such agents presents a fundamental challenge: standard reinforcement learning methods, typically designed for single-turn tasks, suffer from a granularity mismatch when applied to multi-turn agentic tasks, where token-level optimization diverges from the granularity of sequence-level interactions, leading to noisy credit assignment. We introduce Proximal Sequence Policy Optimization (PSPO), a process-aware, sequence-level policy optimization method that aligns optimization with agent-environment interaction. Comprehensive experiments on both synthetic and real-world benchmarks demonstrate that PaperScout significantly outperforms strong workflow-driven and RL baselines in both recall and relevance, validating the effectiveness of our adaptive agentic framework and optimization strategy.

Fonte: arXiv cs.AI

VisionScore 85

GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents

arXiv:2601.09770v1 Announce Type: new Abstract: Recent advances in vision-language models (VLMs) and reinforcement learning (RL) have driven progress in GUI automation. However, most existing methods rely on static, one-shot visual inputs and passive perception, lacking the ability to adaptively determine when, whether, and how to observe the interface. We present GUI-Eyes, a reinforcement learning framework for active visual perception in GUI tasks. To acquire more informative observations, the agent learns to make strategic decisions on both whether and how to invoke visual tools, such as cropping or zooming, within a two-stage reasoning process. To support this behavior, we introduce a progressive perception strategy that decomposes decision-making into coarse exploration and fine-grained grounding, coordinated by a two-level policy. In addition, we design a spatially continuous reward function tailored to tool usage, which integrates both location proximity and region overlap to provide dense supervision and alleviate the reward sparsity common in GUI environments. On the ScreenSpot-Pro benchmark, GUI-Eyes-3B achieves 44.8% grounding accuracy using only 3k labeled samples, significantly outperforming both supervised and RL-based baselines. These results highlight that tool-aware active perception, enabled by staged policy reasoning and fine-grained reward feedback, is critical for building robust and data-efficient GUI agents.

Fonte: arXiv cs.AI

RLScore 85

RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation

arXiv:2601.08430v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics. However, optimizing open-ended generation remains challenging due to the lack of ground truth. While rubric-based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing the subtle nuances. Based on this framework, we introduce RubricHub, a large-scale ($\sim$110k) and multi-domain dataset. We validate its utility through a two-stage post-training pipeline comprising Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post-trained Qwen3-14B achieves state-of-the-art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5. The code and data will be released soon.

Fonte: arXiv cs.AI

RLScore 85

AtomMem : Learnable Dynamic Agentic Memory with Atomic Memory Operation

arXiv:2601.08323v1 Announce Type: new Abstract: Equipping agents with memory is essential for solving real-world long-horizon problems. However, most existing agent memory mechanisms rely on static and hand-crafted workflows. This limits the performance and generalization ability of these memory designs, which highlights the need for a more flexible, learning-based memory framework. In this paper, we propose AtomMem, which reframes memory management as a dynamic decision-making problem. We deconstruct high-level memory processes into fundamental atomic CRUD (Create, Read, Update, Delete) operations, transforming the memory workflow into a learnable decision process. By combining supervised fine-tuning with reinforcement learning, AtomMem learns an autonomous, task-aligned policy to orchestrate memory behaviors tailored to specific task demands. Experimental results across 3 long-context benchmarks demonstrate that the trained AtomMem-8B consistently outperforms prior static-workflow memory methods. Further analysis of training dynamics shows that our learning-based formulation enables the agent to discover structured, task-aligned memory management strategies, highlighting a key advantage over predefined routines.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety

arXiv:2601.08000v1 Announce Type: new Abstract: Ensuring that Large Language Models (LLMs) adhere to safety principles without refusing benign requests remains a significant challenge. While OpenAI introduces deliberative alignment (DA) to enhance the safety of its o-series models through reasoning over detailed ``code-like'' safety rules, the effectiveness of this approach in open-source LLMs, which typically lack advanced reasoning capabilities, is understudied. In this work, we systematically evaluate the impact of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases. We find that referencing explicit codes inconsistently improves harmlessness and systematically degrades helpfulness, whereas training on case-augmented simple codes yields more robust and generalized safety behaviors. By guiding LLMs with case-augmented reasoning instead of extensive code-like safety rules, we avoid rigid adherence to narrowly enumerated rules and enable broader adaptability. Building on these insights, we propose CADA, a case-augmented deliberative alignment method for LLMs utilizing reinforcement learning on self-generated safety reasoning chains. CADA effectively enhances harmlessness, improves robustness against attacks, and reduces over-refusal while preserving utility across diverse benchmarks, offering a practical alternative to rule-only DA for improving safety while maintaining helpfulness.

Fonte: arXiv cs.AI

RLScore 85

Forecast Aware Deep Reinforcement Learning for Efficient Electricity Load Scheduling in Dairy Farms

arXiv:2601.08052v1 Announce Type: new Abstract: Dairy farming is an energy intensive sector that relies heavily on grid electricity. With increasing renewable energy integration, sustainable energy management has become essential for reducing grid dependence and supporting the United Nations Sustainable Development Goal 7 on affordable and clean energy. However, the intermittent nature of renewables poses challenges in balancing supply and demand in real time. Intelligent load scheduling is therefore crucial to minimize operational costs while maintaining reliability. Reinforcement Learning has shown promise in improving energy efficiency and reducing costs. However, most RL-based scheduling methods assume complete knowledge of future prices or generation, which is unrealistic in dynamic environments. Moreover, standard PPO variants rely on fixed clipping or KL divergence thresholds, often leading to unstable training under variable tariffs. To address these challenges, this study proposes a Deep Reinforcement Learning framework for efficient load scheduling in dairy farms, focusing on battery storage and water heating under realistic operational constraints. The proposed Forecast Aware PPO incorporates short term forecasts of demand and renewable generation using hour of day and month based residual calibration, while the PID KL PPO variant employs a proportional integral derivative controller to regulate KL divergence for stable policy updates adaptively. Trained on real world dairy farm data, the method achieves up to 1% lower electricity cost than PPO, 4.8% than DQN, and 1.5% than SAC. For battery scheduling, PPO reduces grid imports by 13.1%, demonstrating scalability and effectiveness for sustainable energy management in modern dairy farming.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Large Artificial Intelligence Model Guided Deep Reinforcement Learning for Resource Allocation in Non Terrestrial Networks

arXiv:2601.08254v1 Announce Type: new Abstract: Large AI Model (LAM) have been proposed to applications of Non-Terrestrial Networks (NTN), that offer better performance with its great generalization and reduced task specific trainings. In this paper, we propose a Deep Reinforcement Learning (DRL) agent that is guided by a Large Language Model (LLM). The LLM operates as a high level coordinator that generates textual guidance that shape the reward of the DRL agent during training. The results show that the LAM-DRL outperforms the traditional DRL by 40% in nominal weather scenarios and 64% in extreme weather scenarios compared to heuristics in terms of throughput, fairness, and outage probability.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Sparsity Is Necessary: Polynomial-Time Stability for Agentic LLMs in Large Action Spaces

arXiv:2601.08271v1 Announce Type: new Abstract: Tool-augmented LLM systems expose a control regime that learning theory has largely ignored: sequential decision-making with a massive discrete action universe (tools, APIs, documents) in which only a small, unknown subset is relevant for any fixed task distribution. We formalize this setting as Sparse Agentic Control (SAC), where policies admit block-sparse representations over M >> 1 actions and rewards depend on sparse main effects and (optionally) sparse synergies. We study ell_{1,2}-regularized policy learning through a convex surrogate and establish sharp, compressed-sensing-style results: (i) estimation and value suboptimality scale as k (log M / T)^{1/2} under a Policy-RSC condition; (ii) exact tool-support recovery holds via primal-dual witness arguments when T > k log M under incoherence and beta-min; and (iii) any dense policy class requires Omega(M) samples, explaining the instability of prompt-only controllers. We further show that under partial observability, LLMs matter only through a belief/representation error epsilon_b, yielding an additive O(epsilon_b) degradation while preserving logarithmic dependence on M. Extensions cover tuning-free, online, robust, group-sparse, and interaction-aware SAC.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Owen-Shapley Policy Optimization (OSPO): A Principled RL Algorithm for Generative Search LLMs

arXiv:2601.08403v1 Announce Type: new Abstract: Large language models are increasingly trained via reinforcement learning for personalized recommendation tasks, but standard methods like GRPO rely on sparse, sequence-level rewards that create a credit assignment gap, obscuring which tokens drive success. This gap is especially problematic when models must infer latent user intent from under-specified language without ground truth labels, a reasoning pattern rarely seen during pretraining. We introduce Owen-Shapley Policy Optimization (OSPO), a framework that redistributes sequence-level advantages based on tokens' marginal contributions to outcomes. Unlike value-model-based methods requiring additional computation, OSPO employs potential-based reward shaping via Shapley-Owen attributions to assign segment-level credit while preserving the optimal policy, learning directly from task feedback without parametric value models. By forming coalitions of semantically coherent units (phrases describing product attributes or sentences capturing preferences), OSPO identifies which response parts drive performance. Experiments on Amazon ESCI and H&M Fashion datasets show consistent gains over baselines, with notable test-time robustness to out-of-distribution retrievers unseen during training.

Fonte: arXiv cs.AI

RLScore 85

ZeroDVFS: Zero-Shot LLM-Guided Core and Frequency Allocation for Embedded Platforms

arXiv:2601.08166v1 Announce Type: new Abstract: Dynamic voltage and frequency scaling (DVFS) and task-to-core allocation are critical for thermal management and balancing energy and performance in embedded systems. Existing approaches either rely on utilization-based heuristics that overlook stall times, or require extensive offline profiling for table generation, preventing runtime adaptation. We propose a model-based hierarchical multi-agent reinforcement learning (MARL) framework for thermal- and energy-aware scheduling on multi-core platforms. Two collaborative agents decompose the exponential action space, achieving 358ms latency for subsequent decisions. First decisions require 3.5 to 8.0s including one-time LLM feature extraction. An accurate environment model leverages regression techniques to predict thermal dynamics and performance states. When combined with LLM-extracted semantic features, the environment model enables zero-shot deployment for new workloads on trained platforms by generating synthetic training data without requiring workload-specific profiling samples. We introduce LLM-based semantic feature extraction that characterizes OpenMP programs through 13 code-level features without execution. The Dyna-Q-inspired framework integrates direct reinforcement learning with model-based planning, achieving 20x faster convergence than model-free methods. Experiments on BOTS and PolybenchC benchmarks across NVIDIA Jetson TX2, Jetson Orin NX, RubikPi, and Intel Core i7 demonstrate 7.09x better energy efficiency and 4.0x better makespan than Linux ondemand governor. First-decision latency is 8,300x faster than table-based profiling, enabling practical deployment in dynamic embedded systems.

Fonte: arXiv cs.AI

RLScore 85

Project Synapse: A Hierarchical Multi-Agent Framework with Hybrid Memory for Autonomous Resolution of Last-Mile Delivery Disruptions

arXiv:2601.08156v1 Announce Type: new Abstract: This paper introduces Project Synapse, a novel agentic framework designed for the autonomous resolution of last-mile delivery disruptions. Synapse employs a hierarchical multi-agent architecture in which a central Resolution Supervisor agent performs strategic task decomposition and delegates subtasks to specialized worker agents responsible for tactical execution. The system is orchestrated using LangGraph to manage complex and cyclical workflows. To validate the framework, a benchmark dataset of 30 complex disruption scenarios was curated from a qualitative analysis of over 6,000 real-world user reviews. System performance is evaluated using an LLM-as-a-Judge protocol with explicit bias mitigation.

Fonte: arXiv cs.AI

RLScore 85

The End of Reward Engineering: How LLMs Are Redefining Multi-Agent Coordination

arXiv:2601.08237v1 Announce Type: new Abstract: Reward engineering, the manual specification of reward functions to induce desired agent behavior, remains a fundamental challenge in multi-agent reinforcement learning. This difficulty is amplified by credit assignment ambiguity, environmental non-stationarity, and the combinatorial growth of interaction complexity. We argue that recent advances in large language models (LLMs) point toward a shift from hand-crafted numerical rewards to language-based objective specifications. Prior work has shown that LLMs can synthesize reward functions directly from natural language descriptions (e.g., EUREKA) and adapt reward formulations online with minimal human intervention (e.g., CARD). In parallel, the emerging paradigm of Reinforcement Learning from Verifiable Rewards (RLVR) provides empirical evidence that language-mediated supervision can serve as a viable alternative to traditional reward engineering. We conceptualize this transition along three dimensions: semantic reward specification, dynamic reward adaptation, and improved alignment with human intent, while noting open challenges related to computational overhead, robustness to hallucination, and scalability to large multi-agent systems. We conclude by outlining a research direction in which coordination arises from shared semantic representations rather than explicitly engineered numerical signals.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Reinforcement Learning of Large Language Models for Interpretable Credit Card Fraud Detection

arXiv:2601.05578v1 Announce Type: new Abstract: E-commerce platforms and payment solution providers face increasingly sophisticated fraud schemes, ranging from identity theft and account takeovers to complex money laundering operations that exploit the speed and anonymity of digital transactions. However, despite their theoretical promise, the application of Large Language Models (LLMs) to fraud detection in real-world financial contexts remains largely unexploited, and their practical effectiveness in handling domain-specific e-commerce transaction data has yet to be empirically validated. To bridge this gap between conventional machine learning limitations and the untapped potential of LLMs in fraud detection, this paper proposes a novel approach that employs Reinforcement Learning (RL) to post-train lightweight language models specifically for fraud detection tasks using only raw transaction data. We utilize the Group Sequence Policy Optimization (GSPO) algorithm combined with a rule-based reward system to fine-tune language models of various sizes on a real-life transaction dataset provided by a Chinese global payment solution company. Through this reinforcement learning framework, the language models are encouraged to explore diverse trust and risk signals embedded within the textual transaction data, including patterns in customer information, shipping details, product descriptions, and order history. Our experimental results demonstrate the effectiveness of this approach, with post-trained language models achieving substantial F1-score improvements on held-out test data. Our findings demonstrate that the observed performance improvements are primarily attributable to the exploration mechanism inherent in reinforcement learning, which allows models to discover novel fraud indicators beyond those captured by traditional engineered features.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

StackPlanner: Um Sistema Multi-Agent Hierárquico Centralizado com Gerenciamento de Memória de Experiência de Tarefa

Sistemas multi-agente baseados em grandes modelos de linguagem, especialmente arquiteturas centralizadas, mostraram recentemente um forte potencial para tarefas complexas e intensivas em conhecimento. No entanto, agentes centrais frequentemente enfrentam problemas de colaboração instável a longo prazo devido à falta de gerenciamento de memória. Propomos o StackPlanner, um framework multi-agente hierárquico com controle de memória explícito.

Fonte: arXiv cs.AI

RLScore 85

CHDP: Políticas de Difusão Híbrida Cooperativa para Aprendizado por Reforço em Espaço de Ação Parametrizado

O espaço de ação híbrido, que combina escolhas discretas e parâmetros contínuos, é comum em domínios como controle de robôs e IA em jogos. No entanto, modelar e otimizar eficientemente esse espaço de ação híbrido continua sendo um desafio fundamental. Para resolver isso, propomos um framework de extbf{Políticas de Difusão Híbrida Cooperativa (CHDP)} que utiliza dois agentes cooperativos.

Fonte: arXiv cs.AI

RLScore 85

From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation

arXiv:2601.05787v1 Announce Type: new Abstract: Vision-language models are increasingly deployed as computer-use agents (CUAs) that operate desktops and browsers. Top-performing CUAs are framework-based systems that decompose planning and execution, while end-to-end screenshot-to-action policies are easier to deploy but lag behind on benchmarks such as OSWorld-Verified. GUI datasets like OSWorld pose two bottlenecks: they expose only a few hundred interactive, verifiable tasks and environments, and expert trajectories must be gathered by interacting with these environments, making such data hard to scale. We therefore ask how reinforcement learning from verifiable rewards (RLVR) can best exploit a small pool of exist expert trajectories to train end-to-end policies. Naively mixing these off-policy traces into on-policy RLVR is brittle: even after format conversion, expert trajectories exhibit structural mismatch and distribution shift from the learner. We propose BEPA (Bi-Level Expert-to-Policy Assimilation), which turns static expert traces into policy-aligned guidance via self-rolled reachable trajectories under the base policy (LEVEL-1) and a per-task, dynamically updated cache used in RLVR (LEVEL-2). On OSWorld-Verified, BEPA improves UITARS1.5-7B success from 22.87% to 32.13% and raises a held-out split from 5.74% to 10.30%, with consistent gains on MMBench-GUI and Online-Mind2Web. Our code and data are available at: https://github.com/LEON-gittech/Verl_GUI.git

Fonte: arXiv cs.AI

NLP/LLMsScore 85

WildSci: Advancing Scientific Reasoning from In-the-Wild Literature

arXiv:2601.05567v1 Announce Type: new Abstract: Recent progress in large language model (LLM) reasoning has focused on domains like mathematics and coding, where abundant high-quality data and objective evaluation metrics are readily available. In contrast, progress in LLM reasoning models remains limited in scientific domains such as medicine and materials science due to limited dataset coverage and the inherent complexity of open-ended scientific questions. To address these challenges, we introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature, covering 9 scientific disciplines and 26 subdomains. By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals. We further apply reinforcement learning to finetune models on these data and analyze the resulting training dynamics, including domain-specific performance changes, response behaviors, and generalization trends. Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach. We release WildSci to enable scalable and sustainable research in scientific reasoning, available at https://huggingface.co/datasets/JustinTX/WildSci.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

arXiv:2601.05899v1 Announce Type: new Abstract: Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS game-based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game-based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub(https://github.com/tb6147877/TowerMind).

Fonte: arXiv cs.AI

NLP/LLMsScore 85

KP-Agent: Keyword Pruning in Sponsored Search Advertising via LLM-Powered Contextual Bandits

arXiv:2601.05257v1 Announce Type: cross Abstract: Sponsored search advertising (SSA) requires advertisers to constantly adjust keyword strategies. While bid adjustment and keyword generation are well-studied, keyword pruning-refining keyword sets to enhance campaign performance-remains under-explored. This paper addresses critical inefficiencies in current practices as evidenced by a dataset containing 0.5 million SSA records from a pharmaceutical advertiser on search engine Meituan, China's largest delivery platform. We propose KP-Agent, an LLM agentic system with domain tool set and a memory module. By modeling keyword pruning within a contextual bandit framework, KP-Agent generates code snippets to refine keyword sets through reinforcement learning. Experiments show KP-Agent improves cumulative profit by up to 49.28% over baselines.

Fonte: arXiv cs.AI

RLScore 90

PRISMA: Reinforcement Learning Guided Two-Stage Policy Optimization in Multi-Agent Architecture for Open-Domain Multi-Hop Question Answering

arXiv:2601.05465v1 Announce Type: new Abstract: Answering real-world open-domain multi-hop questions over massive corpora is a critical challenge in Retrieval-Augmented Generation (RAG) systems. Recent research employs reinforcement learning (RL) to end-to-end optimize the retrieval-augmented reasoning process, directly enhancing its capacity to resolve complex queries. However, reliable deployment is hindered by two obstacles. 1) Retrieval Collapse: iterative retrieval over large corpora fails to locate intermediate evidence containing bridge answers without reasoning-guided planning, causing downstream reasoning to collapse. 2) Learning Instability: end-to-end trajectory training suffers from weak credit assignment across reasoning chains and poor error localization across modules, causing overfitting to benchmark-specific heuristics that limit transferability and stability. To address these problems, we propose PRISMA, a decoupled RL-guided framework featuring a Plan-Retrieve-Inspect-Solve-Memoize architecture. PRISMA's strength lies in reasoning-guided collaboration: the Inspector provides reasoning-based feedback to refine the Planner's decomposition and fine-grained retrieval, while enforcing evidence-grounded reasoning in the Solver. We optimize individual agent capabilities via Two-Stage Group Relative Policy Optimization (GRPO). Stage I calibrates the Planner and Solver as specialized experts in planning and reasoning, while Stage II utilizes Observation-Aware Residual Policy Optimization (OARPO) to enhance the Inspector's ability to verify context and trigger targeted recovery. Experiments show that PRISMA achieves state-of-the-art performance on ten benchmarks and can be deployed efficiently in real-world scenarios.

Fonte: arXiv cs.AI

RLScore 85

Simulation-Free PSRO: Removing Game Simulation from Policy Space Response Oracles

arXiv:2601.05279v1 Announce Type: cross Abstract: Policy Space Response Oracles (PSRO) combines game-theoretic equilibrium computation with learning and is effective in approximating Nash Equilibrium in zero-sum games. However, the computational cost of PSRO has become a significant limitation to its practical application. Our analysis shows that game simulation is the primary bottleneck in PSRO's runtime. To address this issue, we conclude the concept of Simulation-Free PSRO and summarize existing methods that instantiate this concept. Additionally, we propose a novel Dynamic Window-based Simulation-Free PSRO, which introduces the concept of a strategy window to replace the original strategy set maintained in PSRO. The number of strategies in the strategy window is limited, thereby simplifying opponent strategy selection and improving the robustness of the best response. Moreover, we use Nash Clustering to select the strategy to be eliminated, ensuring that the number of strategies within the strategy window is effectively limited. Our experiments across various environments demonstrate that the Dynamic Window mechanism significantly reduces exploitability compared to existing methods, while also exhibiting excellent compatibility. Our code is available at https://github.com/enochliu98/SF-PSRO.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Agentes Atuais Não Conseguem Aproveitar Modelos de Mundo como Ferramenta para Previsão

Agentes baseados em modelos de visão-linguagem enfrentam tarefas que exigem antecipação de estados futuros. Modelos de mundo generativos oferecem uma solução promissora, permitindo que agentes simulem resultados antes de agir. Este artigo examina empiricamente a capacidade dos agentes atuais de utilizar esses modelos como ferramentas para melhorar sua cognição.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Raciocínio Intercalado de Chamadas a Ferramentas para a Compreensão da Função de Proteínas

Avanços recentes em large language models (LLMs) destacaram a eficácia do raciocínio em cadeia em domínios simbólicos como matemática e programação. No entanto, nosso estudo mostra que transferir diretamente esses paradigmas de raciocínio para a compreensão da função de proteínas é ineficaz. Propomos PFUA, um agente de raciocínio sobre proteínas que integra ferramentas específicas do domínio para gerar evidências intermediárias verificáveis.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Sandwich Reasoning: An Answer-Reasoning-Answer Approach for Low-Latency Query Correction

arXiv:2601.03672v1 Announce Type: new Abstract: Query correction is a critical entry point in modern search pipelines, demanding high accuracy strictly within real-time latency constraints. Chain-of-Thought (CoT) reasoning improves accuracy but incurs prohibitive latency for real-time query correction. A potential solution is to output an answer before reasoning to reduce latency; however, under autoregressive decoding, the early answer is independent of subsequent reasoning, preventing the model from leveraging its reasoning capability to improve accuracy. To address this issue, we propose Sandwich Reasoning (SandwichR), a novel approach that explicitly aligns a fast initial answer with post-hoc reasoning, enabling low-latency query correction without sacrificing reasoning-aware accuracy. SandwichR follows an Answer-Reasoning-Answer paradigm, producing an initial correction, an explicit reasoning process, and a final refined correction. To align the initial answer with post-reasoning insights, we design a consistency-aware reinforcement learning (RL) strategy: a dedicated consistency reward enforces alignment between the initial and final corrections, while margin-based rejection sampling prioritizes borderline samples where reasoning drives the most impactful corrective gains. Additionally, we construct a high-quality query correction dataset, addressing the lack of specialized benchmarks for complex query correction. Experimental results demonstrate that SandwichR achieves SOTA accuracy comparable to standard CoT while delivering a 40-70% latency reduction, resolving the latency-accuracy trade-off in online search.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

MobileDreamer: Modelo de Mundo Generativo para Agentes GUI

Agentes GUI móveis demonstraram forte potencial em automação e aplicações práticas. No entanto, a maioria dos agentes existentes permanece reativa, limitando seu desempenho em tarefas de longo prazo. Neste artigo, propomos o MobileDreamer, um framework eficiente baseado em modelo de mundo para equipar agentes GUI com imaginação futura.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

ROI-Reasoning: Rational Optimization for Inference via Pre-Computation Meta-Cognition

arXiv:2601.03822v1 Announce Type: new Abstract: Large language models (LLMs) can achieve strong reasoning performance with sufficient computation, but they do not inherently know how much computation a task requires. We study budgeted inference-time reasoning for multiple tasks under a strict global token constraint and formalize it as a Ordered Stochastic Multiple-Choice Knapsack Problem(OS-MCKP). This perspective highlights a meta-cognitive requirement -- anticipating task difficulty, estimating return over investment (ROI), and allocating computation strategically. We propose ROI-Reasoning, a two-stage framework that endows LLMs with intrinsic, budget-aware rationality. In the first stage, Meta-Cognitive Fine-Tuning teaches models to predict reasoning cost and expected utility before generation, enabling explicit solve-or-skip decisions. Next, Rationality-Aware Reinforcement Learning optimizes sequential decision making under a hard token budget, allowing models to learn long-horizon allocation strategies. Across budgeted mathematical reasoning benchmarks, ROI-Reasoning consistently improves overall score while substantially reducing regret under tight computation budgets.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models

arXiv:2601.03555v1 Announce Type: new Abstract: Training reliable tool-augmented agents remains a significant challenge, largely due to the difficulty of credit assignment in multi-step reasoning. While process-level reward models offer a promising direction, existing LLM-based judges often produce noisy and inconsistent signals because they lack fine-grained, task-specific rubrics to distinguish high-level planning from low-level execution. In this work, we introduce SCRIBE (Skill-Conditioned Reward with Intermediate Behavioral Evaluation), a reinforcement learning framework that intervenes at a novel mid-level abstraction. SCRIBE grounds reward modeling in a curated library of skill prototypes, transforming open-ended LLM evaluation into a constrained verification problem. By routing each subgoal to a corresponding prototype, the reward model is equipped with precise, structured rubrics that substantially reduce reward variance. Experimental results show that SCRIBE achieves state-of-the-art performance across a range of reasoning and tool-use benchmarks. In particular, it improves the AIME25 accuracy of a Qwen3-4B model from 43.3% to 63.3%, and significantly increases success rates in complex multi-turn tool interactions. Further analysis of training dynamics reveals a co-evolution across abstraction levels, where mastery of mid-level skills consistently precedes the emergence of effective high-level planning behaviors. Finally, we demonstrate that SCRIBE is additive to low-level tool optimizations, providing a scalable and complementary pathway toward more autonomous and reliable tool-using agents.

Fonte: arXiv cs.AI

RLScore 85

Dominando o Jogo de Go com Replay de Experiência de Auto-jogo

O jogo de Go tem sido um benchmark para inteligência artificial, exigindo raciocínio estratégico sofisticado e planejamento de longo prazo. Apresentamos o QZero, um novo algoritmo de aprendizado por reforço sem modelo que aprende uma política de equilíbrio de Nash através de auto-jogo e replay de experiência off-policy. Após 5 meses de treinamento, QZero alcançou um nível de desempenho comparável ao AlphaGo.

Fonte: arXiv cs.AI

RLScore 85

Exploração Através da Introspecção: Um Modelo de Recompensa Autoconsciente

Entender como agentes artificiais modelam estados mentais internos é fundamental para avançar a Teoria da Mente em IA. Este trabalho investiga a autoconsciência em agentes de aprendizado por reforço, introduzindo um componente de exploração introspectiva inspirado pela dor biológica como sinal de aprendizado.

Fonte: arXiv cs.AI

RLScore 85

Trade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification

arXiv:2601.03948v1 Announce Type: new Abstract: Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to achieve remarkable reasoning in domains like mathematics and coding, where verifiable rewards provide clear signals. However, extending this paradigm to financial decision is challenged by the market's stochastic nature: rewards are verifiable but inherently noisy, causing standard RL to degenerate into reward hacking. To address this, we propose Trade-R1, a model training framework that bridges verifiable rewards to stochastic environments via process-level reasoning verification. Our key innovation is a verification method that transforms the problem of evaluating reasoning over lengthy financial documents into a structured Retrieval-Augmented Generation (RAG) task. We construct a triangular consistency metric, assessing pairwise alignment between retrieved evidence, reasoning chains, and decisions to serve as a validity filter for noisy market returns. We explore two reward integration strategies: Fixed-effect Semantic Reward (FSR) for stable alignment signals, and Dynamic-effect Semantic Reward (DSR) for coupled magnitude optimization. Experiments on different country asset selection demonstrate that our paradigm reduces reward hacking, with DSR achieving superior cross-market generalization while maintaining the highest reasoning consistency.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models

arXiv:2601.03969v1 Announce Type: new Abstract: Large reasoning models enhanced by reinforcement learning with verifiable rewards have achieved significant performance gains by extending their chain-of-thought. However, this paradigm incurs substantial deployment costs as models often exhibit excessive verbosity on simple queries. Existing efficient reasoning methods relying on explicit length penalties often introduce optimization conflicts and leave the generative mechanisms driving overthinking largely unexamined. In this paper, we identify a phenomenon termed length shift where models increasingly generate unnecessary reasoning on trivial inputs during training. To address this, we introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. This method targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities for complex problems. To complement this intervention and ensure stable convergence, we further incorporate auxiliary KL regularization and predictive dynamic sampling. Experimental results across multiple model scales demonstrate that our approach significantly pushes the efficiency-performance Pareto frontier outward. Notably, on the AIME-24, our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy and surpassing state-of-the-art efficient reasoning methods.

Fonte: arXiv cs.AI

ApplicationsScore 85

PsychEval: A Multi-Session and Multi-Therapy Benchmark for High-Realism AI Psychological Counselor

arXiv:2601.01802v2 Announce Type: new Abstract: To develop a reliable AI for psychological assessment, we introduce \texttt{PsychEval}, a multi-session, multi-therapy, and highly realistic benchmark designed to address three key challenges: \textbf{1) Can we train a highly realistic AI counselor?} Realistic counseling is a longitudinal task requiring sustained memory and dynamic goal tracking. We propose a multi-session benchmark (spanning 6-10 sessions across three distinct stages) that demands critical capabilities such as memory continuity, adaptive reasoning, and longitudinal planning. The dataset is annotated with extensive professional skills, comprising over 677 meta-skills and 4577 atomic skills. \textbf{2) How to train a multi-therapy AI counselor?} While existing models often focus on a single therapy, complex cases frequently require flexible strategies among various therapies. We construct a diverse dataset covering five therapeutic modalities (Psychodynamic, Behaviorism, CBT, Humanistic Existentialist, and Postmodernist) alongside an integrative therapy with a unified three-stage clinical framework across six core psychological topics. \textbf{3) How to systematically evaluate an AI counselor?} We establish a holistic evaluation framework with 18 therapy-specific and therapy-shared metrics across Client-Level and Counselor-Level dimensions. To support this, we also construct over 2,000 diverse client profiles. Extensive experimental analysis fully validates the superior quality and clinical fidelity of our dataset. Crucially, \texttt{PsychEval} transcends static benchmarking to serve as a high-fidelity reinforcement learning environment that enables the self-evolutionary training of clinically responsible and adaptive AI counselors.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

OpenSocInt: A Multi-modal Training Environment for Human-Aware Social Navigation

arXiv:2601.01939v1 Announce Type: new Abstract: In this paper, we introduce OpenSocInt, an open-source software package providing a simulator for multi-modal social interactions and a modular architecture to train social agents. We described the software package and showcased its interest via an experimental protocol based on the task of social navigation. Our framework allows for exploring the use of different perceptual features, their encoding and fusion, as well as the use of different agents. The software is already publicly available under GPL at https://gitlab.inria.fr/robotlearn/OpenSocInt/.

Fonte: arXiv cs.AI

RLScore 85

Regularização de Ações de Ordem Superior em Aprendizado por Reforço Profundo: Do Controle Contínuo à Gestão de Energia em Edifícios

arXiv:2601.02061v1 Tipo de Anúncio: novo Resumo: Agentes de aprendizado por reforço profundo frequentemente exibem comportamentos de controle erráticos e de alta frequência que dificultam a implantação no mundo real devido ao consumo excessivo de energia e desgaste mecânico. Investigamos sistematicamente a regularização da suavidade das ações por meio de penalidades de derivadas de ordem superior, progredindo da compreensão teórica em benchmarks de controle contínuo para validação prática na gestão de energia em edifícios. Nossa avaliação abrangente em quatro ambientes de controle contínuo demonstra que penalidades de derivadas de terceira ordem (minimização de jerk) alcançam consistentemente uma suavidade superior enquanto mantêm um desempenho competitivo. Estendemos essas descobertas a sistemas de controle HVAC, onde políticas suaves reduzem a comutação de equipamentos em 60%, traduzindo-se em benefícios operacionais significativos. Nosso trabalho estabelece a regularização de ações de ordem superior como uma ponte eficaz entre a otimização de RL e as restrições operacionais em aplicações críticas de energia.

Fonte: arXiv cs.AI

RLScore 85

Acelerando a Busca em Árvores de Monte-Carlo com Políticas Posteriores Otimizadas

arXiv:2601.01301v1 Tipo de Anúncio: novo Resumo: Introduzimos um algoritmo recursivo de busca em árvores de Monte-Carlo no estilo AlphaZero, chamado "RMCTS". A vantagem do RMCTS sobre o MCTS-UCB do AlphaZero é a velocidade. No RMCTS, a árvore de busca é explorada de maneira em largura, de modo que as inferências da rede ocorrem naturalmente em grandes lotes. Isso reduz significativamente o custo de latência da GPU. Descobrimos que o RMCTS é frequentemente mais de 40 vezes mais rápido que o MCTS-UCB ao buscar um único estado raiz, e cerca de 3 vezes mais rápido ao buscar um grande lote de estados raiz. A recursão no RMCTS é baseada no cálculo de políticas posteriores otimizadas em cada estado do jogo na árvore de busca, começando pelas folhas e trabalhando de volta até a raiz. Aqui usamos a política posterior explorada em "Busca em Árvores de Monte-Carlo como otimização de política regularizada" (Grill, et al.). A política posterior deles é a política única que maximiza a recompensa esperada dada as recompensas de ação estimadas menos uma penalidade por divergir da política anterior. A árvore explorada pelo RMCTS não é definida de maneira adaptativa, como é no MCTS-UCB. Em vez disso, a árvore RMCTS é definida seguindo as políticas da rede anterior em cada nó. Esta é uma desvantagem, mas a vantagem de aceleração é mais significativa, e na prática descobrimos que redes treinadas com RMCTS correspondem à qualidade de redes treinadas com MCTS-UCB em aproximadamente um terço do tempo de treinamento. Incluímos comparações de tempo e qualidade do RMCTS vs. MCTS-UCB para três jogos: Connect-4, Dots-and-Boxes e Othello.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications

arXiv:2601.01718v1 Announce Type: new Abstract: We introduce Yuan3.0 Flash, an open-source Mixture-of-Experts (MoE) MultiModal Large Language Model featuring 3.7B activated parameters and 40B total parameters, specifically designed to enhance performance on enterprise-oriented tasks while maintaining competitive capabilities on general-purpose tasks. To address the overthinking phenomenon commonly observed in Large Reasoning Models (LRMs), we propose Reflection-aware Adaptive Policy Optimization (RAPO), a novel RL training algorithm that effectively regulates overthinking behaviors. In enterprise-oriented tasks such as retrieval-augmented generation (RAG), complex table understanding, and summarization, Yuan3.0 Flash consistently achieves superior performance. Moreover, it also demonstrates strong reasoning capabilities in domains such as mathematics, science, etc., attaining accuracy comparable to frontier model while requiring only approximately 1/4 to 1/2 of the average tokens. Yuan3.0 Flash has been fully open-sourced to facilitate further research and real-world deployment: https://github.com/Yuan-lab-LLM/Yuan3.0.

Fonte: arXiv cs.AI

RLScore 85

Reinforcement Learning Enhanced Multi-hop Reasoning for Temporal Knowledge Question Answering

arXiv:2601.01195v1 Announce Type: new Abstract: Temporal knowledge graph question answering (TKGQA) involves multi-hop reasoning over temporally constrained entity relationships in the knowledge graph to answer a given question. However, at each hop, large language models (LLMs) retrieve subgraphs with numerous temporally similar and semantically complex relations, increasing the risk of suboptimal decisions and error propagation. To address these challenges, we propose the multi-hop reasoning enhanced (MRE) framework, which enhances both forward and backward reasoning to improve the identification of globally optimal reasoning trajectories. Specifically, MRE begins with prompt engineering to guide the LLM in generating diverse reasoning trajectories for a given question. Valid reasoning trajectories are then selected for supervised fine-tuning, serving as a cold-start strategy. Finally, we introduce Tree-Group Relative Policy Optimization (T-GRPO), a recursive, tree-structured learning-by-exploration approach. At each hop, exploration establishes strong causal dependencies on the previous hop, while evaluation is informed by multi-path exploration feedback from subsequent hops. Experimental results on two TKGQA benchmarks indicate that the proposed MRE-based model consistently surpasses state-of-the-art (SOTA) approaches in handling complex multi-hop queries. Further analysis highlights improved interpretability and robustness to noisy temporal annotations.

Fonte: arXiv cs.AI

NLP/LLMsScore 85

AI Agentiva para Tomada de Decisão de Risco de Crédito Autônoma, Explicável e em Tempo Real

arXiv:2601.00818v1 Tipo de Anúncio: novo Resumo: A digitalização significativa dos serviços financeiros em um curto período de tempo levou a uma demanda urgente por sistemas de tomada de decisão de risco de crédito autônomos, transparentes e em tempo real. Os modelos tradicionais de machine learning são eficazes em reconhecimento de padrões, mas não possuem o raciocínio adaptativo, a consciência situacional e a autonomia necessárias nas operações financeiras modernas. Como proposta, este artigo apresenta uma estrutura de AI Agentiva, ou um sistema onde agentes de IA visualizam o mundo do crédito dinâmico independentemente de observadores humanos, que então tomam ações com base em seus caminhos de tomada de decisão articuláveis. A pesquisa introduz um sistema multiagente com aprendizado por reforço, raciocínio em linguagem natural, módulos de AI explicável e pipelines de absorção de dados em tempo real como meio de avaliar os perfis de risco dos tomadores de empréstimos com pouca intervenção humana. Os processos consistem em protocolo de colaboração de agentes, motores de pontuação de risco, camadas de interpretabilidade e ciclos de aprendizado com feedback contínuo. As descobertas indicam que a velocidade de decisão, transparência e capacidade de resposta são melhores do que os modelos tradicionais de pontuação de crédito. No entanto, ainda existem algumas limitações práticas, como riscos de desvio do modelo, inconsistências na interpretação de dados de alta dimensão e incertezas regulatórias, além de limitações de infraestrutura em ambientes com poucos recursos. O sistema sugerido tem um alto potencial para transformar a análise de crédito e futuros estudos devem ser direcionados a mobilizadores de conformidade regulatória dinâmica, novas colaborações entre agentes, robustez adversarial e implementação em larga escala em ecossistemas de crédito entre países.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

A Chain-of-Thought Approach to Semantic Query Categorization in e-Commerce Taxonomies

arXiv:2601.00510v1 Announce Type: cross Abstract: Search in e-Commerce is powered at the core by a structured representation of the inventory, often formulated as a category taxonomy. An important capability in e-Commerce with hierarchical taxonomies is to select a set of relevant leaf categories that are semantically aligned with a given user query. In this scope, we address a fundamental problem of search query categorization in real-world e-Commerce taxonomies. A correct categorization of a query not only provides a way to zoom into the correct inventory space, but opens the door to multiple intent understanding capabilities for a query. A practical and accurate solution to this problem has many applications in e-commerce, including constraining retrieved items and improving the relevance of the search results. For this task, we explore a novel Chain-of-Thought (CoT) paradigm that combines simple tree-search with LLM semantic scoring. Assessing its classification performance on human-judged query-category pairs, relevance tests, and LLM-based reference methods, we find that the CoT approach performs better than a benchmark that uses embedding-based query category predictions. We show how the CoT approach can detect problems within a hierarchical taxonomy. Finally, we also propose LLM-based approaches for query-categorization of the same spirit, but which scale better at the range of millions of queries.

Fonte: arXiv cs.CL

VisionScore 94

Fluxos de Kernel Orientados a Tarefas: Compressão de Classificação de Rótulos e Filtragem Espectral Laplaciana

Apresentamos uma teoria de aprendizado de características em redes amplas regularizadas por L2, mostrando que o aprendizado supervisionado é inerentemente compressivo. Derivamos uma ODE de kernel que prevê uma evolução espectral de 'preenchimento de água' e provamos que, para qualquer estado estacionário estável, a classificação do kernel é limitada pelo número de classes ($C$).

Fonte: arXiv cs.LG

NLP/LLMsScore 95

CPPO: Contrastive Perception for Vision Language Policy Optimization

arXiv:2601.00501v1 Announce Type: new Abstract: We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult, requiring extra LLMs, ground-truth data, forced separation of perception from reasoning by policy model, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods, while avoiding extra models, making training more efficient and scalable.

Fonte: arXiv cs.CV

NLP/LLMsScore 93

Quando Modelos Pequenos Estão Certos por Motivos Errados: Verificação de Processos para Agentes Confiáveis

A implementação de pequenos modelos de linguagem (7-9B parâmetros) como agentes autônomos exige confiança em seu raciocínio, não apenas em suas saídas. Revelamos uma crise crítica de confiabilidade: 50-69% das respostas corretas desses modelos contêm raciocínio fundamentalmente falho, um fenômeno 'Certo por Motivos Errados' invisível às métricas de precisão padrão.

Fonte: arXiv cs.LG

RLScore 93

E-GRPO: Passos de Alta Entropia Impulsionam Aprendizado por Reforço Eficaz para Modelos de Fluxo

O aprendizado por reforço recente aprimorou os modelos de correspondência de fluxo na alinhamento de preferências humanas. Observamos que passos de alta entropia permitem uma exploração mais eficiente, enquanto passos de baixa entropia resultam em roll-outs indistintos. Propomos o E-GRPO, uma otimização de política relativa em grupo consciente da entropia para aumentar a entropia dos passos de amostragem de SDE.

Fonte: arXiv cs.LG

RLScore 95

Decomposição Tucker Esparsa e Regularização Gráfica para Previsão de Séries Temporais de Alta Dimensionalidade

Métodos existentes de modelos autorregressivos vetoriais para análise de séries temporais multivariadas utilizam aproximação de matriz de baixa classificação ou decomposição Tucker para reduzir a dimensão do problema de superparametrização. Este artigo propõe um método de decomposição Tucker esparsa com regularização gráfica para séries temporais autorregressivas vetoriais de alta dimensionalidade.

Fonte: arXiv stat.ML

VisionScore 96

Um Estudo Comparativo de Estratégias de Adaptação para Modelos Fundamentais de Séries Temporais na Detecção de Anomalias

A detecção de anomalias em séries temporais é essencial para a operação confiável de sistemas complexos, mas a maioria dos métodos existentes requer treinamento extenso e específico. Este estudo investiga se modelos fundamentais de séries temporais (TSFMs), pré-treinados em grandes dados heterogêneos, podem servir como bases universais para detecção de anomalias.

Fonte: arXiv cs.LG

RLScore 93

GRL-SNAM: Aprendizado de Reforço Geométrico com Hamiltonianos Diferenciais de Caminho para Navegação e Mapeamento Simultâneos em Ambientes Desconhecidos

Apresentamos o GRL-SNAM, um framework de aprendizado de reforço geométrico para Navegação e Mapeamento Simultâneos (SNAM) em ambientes desconhecidos. O problema de SNAM é desafiador, pois requer o design de políticas hierárquicas ou conjuntas de múltiplos agentes que controlam o movimento de um robô em um ambiente sem mapa, adquirindo informações por meio de sensores.

Fonte: arXiv cs.LG

NLP/LLMsScore 95

Universal Adaptive Constraint Propagation: Scaling Structured Inference for Large Language Models via Meta-Reinforcement Learning

arXiv:2601.00095v1 Announce Type: new Abstract: Large language models increasingly require structured inference, from JSON schema enforcement to multi-lingual parsing, where outputs must satisfy complex constraints. We introduce MetaJuLS, a meta-reinforcement learning approach that learns universal constraint propagation policies applicable across languages and tasks without task-specific retraining. By formulating structured inference as adaptive constraint propagation and training a Graph Attention Network with meta-learning, MetaJuLS achieves 1.5--2.0$\times$ speedups over GPU-optimized baselines while maintaining within 0.2\% accuracy of state-of-the-art parsers. On Universal Dependencies across 10 languages and LLM-constrained generation (LogicBench, GSM8K-Constrained), MetaJuLS demonstrates rapid cross-domain adaptation: a policy trained on English parsing adapts to new languages and tasks with 5--10 gradient steps (5--15 seconds) rather than requiring hours of task-specific training. Mechanistic analysis reveals the policy discovers human-like parsing strategies (easy-first) and novel non-intuitive heuristics. By reducing propagation steps in LLM deployments, MetaJuLS contributes to Green AI by directly reducing inference carbon footprint.

Fonte: arXiv cs.CL

RLScore 96

IA Generativa Nativa em Nuvem para Síntese Automatizada de Planogramas: Uma Abordagem de Modelo de Difusão para Otimização de Varejo em Múltiplas Lojas

A criação de planogramas é um desafio significativo para o varejo, exigindo em média 30 horas por layout complexo. Este artigo apresenta uma arquitetura nativa em nuvem utilizando modelos de difusão para gerar automaticamente planogramas específicos para cada loja, aprendendo com arranjos de prateleiras bem-sucedidos em várias localizações de varejo.

Fonte: arXiv cs.LG

VisionScore 96

Campos Cerebrais Neurais: Uma Abordagem Inspirada em NeRF para Gerar Eletrodos de EEG Inexistentes

Os dados de eletroencefalografia (EEG) apresentam desafios únicos de modelagem devido à variação de comprimento, baixíssima relação sinal-ruído e diferenças significativas entre participantes. Este trabalho apresenta um novo método inspirado em Neural Radiance Fields (NeRF) para processar sinais de EEG, permitindo a visualização contínua da atividade cerebral e a simulação de dados de eletrodos inexistentes.

Fonte: arXiv cs.AI

RLScore 92

Aprendizado ativo para modelos reduzidos baseados em dados de sistemas diferenciais paramétricos com inferência bayesiana de operadores

Este trabalho desenvolve um framework de aprendizado ativo para enriquecer de forma inteligente modelos reduzidos de ordem (ROMs) de sistemas dinâmicos paramétricos, que podem servir como base para ativos virtuais em um digital twin. Os ROMs baseados em dados são modelos de machine learning científicos explicáveis e computacionalmente eficientes que visam preservar a física subjacente de simulações dinâmicas complexas.

Fonte: arXiv stat.ML

RLScore 96

ClinicalReTrial: Um Agente de IA Autoevolutivo para Otimização de Protocolos de Ensaios Clínicos

arXiv:2601.00290v1 Tipo de Anúncio: novo Resumo: A falha em ensaios clínicos continua sendo um gargalo central no desenvolvimento de medicamentos, onde pequenas falhas no design do protocolo podem comprometer irreversivelmente os resultados, apesar de terapias promissoras. Embora métodos de IA de ponta alcancem um desempenho forte na previsão de sucesso de ensaios, eles são inerentemente reativos, limitando-se a diagnosticar riscos sem oferecer remédios acionáveis uma vez que a falha é antecipada. Para preencher essa lacuna, este artigo propõe o ClinicalReTrial, uma estrutura de agente de IA autoevolutivo que aborda essa questão ao considerar o raciocínio de ensaios clínicos como um problema iterativo de redesign de protocolo. Nosso método integra diagnóstico de falhas, modificação consciente da segurança e avaliação de candidatos em uma estrutura de otimização orientada por recompensas em loop fechado. Servindo como um ambiente de simulação para o modelo de previsão de resultados, o ClinicalReTrial permite a avaliação de baixo custo de modificações de protocolo e fornece sinais de recompensa densos para autoaperfeiçoamento contínuo. Para apoiar uma exploração eficiente, a estrutura mantém uma memória hierárquica que captura feedback em nível de iteração dentro dos ensaios e destila padrões de redesign transferíveis entre ensaios. Empiricamente, o ClinicalReTrial melhora 83,3% dos protocolos de ensaio com um ganho médio de probabilidade de sucesso de 5,7%, e estudos de caso retrospectivos demonstram uma forte correspondência entre as estratégias de redesign descobertas e as modificações reais de ensaios clínicos.

Fonte: arXiv cs.AI

RLScore 96

Modelagem de Estratégia Baseada em Regras Quantitativas no Classic Indian Rummy: Uma Abordagem de Otimização Métrica

arXiv:2601.00024v1 Tipo de Anúncio: novo Resumo: A variante de 13 cartas do Classic Indian Rummy é um jogo sequencial de informação incompleta que requer raciocínio probabilístico e tomada de decisão combinatória. Este artigo propõe uma estrutura baseada em regras para o jogo estratégico, impulsionada por uma nova métrica de avaliação de mãos chamada MinDist. A métrica modifica a métrica MinScore ao quantificar a distância de edição entre uma mão e a configuração válida mais próxima, capturando assim a proximidade estrutural para a conclusão. Projetamos um algoritmo computacionalmente eficiente derivado do algoritmo MinScore, aproveitando a poda dinâmica e o cache de padrões para calcular exatamente essa métrica durante o jogo. A modelagem das mãos dos oponentes também é incorporada dentro de uma estrutura de simulação de soma zero para dois jogadores, e as estratégias resultantes são avaliadas usando testes de hipótese estatística. Resultados empíricos mostram uma melhoria significativa nas taxas de vitória para agentes baseados em MinDist em relação a heurísticas tradicionais, fornecendo um passo formal e interpretável em direção ao design de estratégias algorítmicas para Rummy.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Agentes Potencializados por LLMs Tendem a Ter Viés Contra Humanos? Explorando a Vulnerabilidade Dependente da Crença

Agentes potencializados por LLMs podem apresentar não apenas viés demográfico, mas também viés intergrupal desencadeado por pistas mínimas de 'nós' versus 'eles'. Este estudo investiga como a crença de um agente sobre a presença de humanos pode influenciar seu comportamento, introduzindo um novo vetor de ataque chamado Belief Poisoning Attack (BPA).

Fonte: arXiv cs.AI

NLP/LLMsScore 96

FlashInfer-Bench: Construindo o Ciclo Virtuoso para Sistemas LLM Impulsionados por IA

Avanços recentes mostram que modelos de linguagem de grande escala (LLMs) podem atuar como agentes autônomos capazes de gerar kernels de GPU, mas integrar esses kernels gerados por IA em sistemas de inferência do mundo real continua sendo um desafio. O FlashInfer-Bench aborda essa lacuna ao estabelecer um framework padronizado e de ciclo fechado que conecta geração de kernels, benchmarking e implantação.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

MethConvTransformer: Um Framework de Deep Learning para Detecção de Doença de Alzheimer em Múltiplos Tecidos

A doença de Alzheimer (AD) é um distúrbio neurodegenerativo multifatorial caracterizado por declínio cognitivo progressivo. O MethConvTransformer é um framework de deep learning baseado em transformer que integra perfis de metilação de DNA de tecidos cerebrais e periféricos, permitindo a descoberta de biomarcadores. O modelo supera as abordagens convencionais de machine learning, oferecendo biomarcadores epigenéticos robustos e interpretabilidade multi-resolução.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

SSI-GAN: Redes Geradoras Adversariais Semi-Supervisionadas Inspiradas no Swin para Classificação de Espículas Neurais

Os mosquitos são os principais agentes transmissores de doenças arbovirais. A classificação manual de seus padrões de espículas neurais é muito trabalhosa e cara. Para resolver a escassez de dados rotulados, propomos uma nova arquitetura de Rede Geradora Adversarial (GAN) chamada SSI-GAN, que alcançou 99,93% de precisão na classificação com apenas 3% de dados rotulados.

Fonte: arXiv cs.AI

RLScore 96

Métodos Semânticos Podem Aprimorar Táticas em Esportes Coletivos? Uma Metodologia para Futebol com Aplicações Mais Amplas

Este artigo explora como o raciocínio em espaço semântico, tradicionalmente utilizado em linguística computacional, pode ser estendido à tomada de decisão tática em esportes coletivos. A metodologia proposta modela configurações táticas como estruturas semânticas composicionais, representando cada jogador como um vetor multidimensional que integra atributos técnicos, físicos e psicológicos.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Uma Avaliação Empírica de Abordagens Baseadas em LLM para Detecção de Vulnerabilidades de Código: RAG, SFT e Sistemas de Agentes Duplos

O rápido avanço dos Large Language Models (LLMs) apresenta novas oportunidades para a detecção automatizada de vulnerabilidades de software, uma tarefa crucial para a segurança de bases de código modernas. Este artigo apresenta um estudo comparativo sobre a eficácia de técnicas baseadas em LLM para detectar vulnerabilidades de software, avaliando três abordagens: Retrieval-Augmented Generation (RAG), Supervised Fine-Tuning (SFT) e um framework de LLM de Agente Duplo.

Fonte: arXiv cs.AI

VisionScore 95

A Cascaded Information Interaction Network for Precise Image Segmentation

arXiv:2601.00562v1 Announce Type: new Abstract: Visual perception plays a pivotal role in enabling autonomous behavior, offering a cost-effective and efficient alternative to complex multi-sensor systems. However, robust segmentation remains a challenge in complex scenarios. To address this, this paper proposes a cascaded convolutional neural network integrated with a novel Global Information Guidance Module. This module is designed to effectively fuse low-level texture details with high-level semantic features across multiple layers, thereby overcoming the inherent limitations of single-scale feature extraction. This architectural innovation significantly enhances segmentation accuracy, particularly in visually cluttered or blurred environments where traditional methods often fail. Experimental evaluations on benchmark image segmentation datasets demonstrate that the proposed framework achieves superior precision, outperforming existing state-of-the-art methods. The results highlight the effectiveness of the approach and its promising potential for deployment in practical robotic applications.

Fonte: arXiv cs.CV

VisionScore 95

OmniVaT: Single Domain Generalization for Multimodal Visual-Tactile Learning

arXiv:2601.00352v1 Announce Type: new Abstract: Visual-tactile learning (VTL) enables embodied agents to perceive the physical world by integrating visual (VIS) and tactile (TAC) sensors. However, VTL still suffers from modality discrepancies between VIS and TAC images, as well as domain gaps caused by non-standardized tactile sensors and inconsistent data collection procedures. We formulate these challenges as a new task, termed single domain generalization for multimodal VTL (SDG-VTL). In this paper, we propose an OmniVaT framework that, for the first time, successfully addresses this task. On the one hand, OmniVaT integrates a multimodal fractional Fourier adapter (MFFA) to map VIS and TAC embeddings into a unified embedding-frequency space, thereby effectively mitigating the modality gap without multi-domain training data or careful cross-modal fusion strategies. On the other hand, it also incorporates a discrete tree generation (DTG) module that obtains diverse and reliable multimodal fractional representations through a hierarchical tree structure, thereby enhancing its adaptivity to fluctuating domain shifts in unseen domains. Extensive experiments demonstrate the superior cross-domain generalization performance of OmniVaT on the SDG-VTL task.

Fonte: arXiv cs.CV

RLScore 95

Disentangling Hardness from Noise: An Uncertainty-Driven Model-Agnostic Framework for Long-Tailed Remote Sensing Classification

arXiv:2601.00278v1 Announce Type: new Abstract: Long-Tailed distributions are pervasive in remote sensing due to the inherently imbalanced occurrence of grounded objects. However, a critical challenge remains largely overlooked, i.e., disentangling hard tail data samples from noisy ambiguous ones. Conventional methods often indiscriminately emphasize all low-confidence samples, leading to overfitting on noisy data. To bridge this gap, building upon Evidential Deep Learning, we propose a model-agnostic uncertainty-aware framework termed DUAL, which dynamically disentangles prediction uncertainty into Epistemic Uncertainty (EU) and Aleatoric Uncertainty (AU). Specifically, we introduce EU as an indicator of sample scarcity to guide a reweighting strategy for hard-to-learn tail samples, while leveraging AU to quantify data ambiguity, employing an adaptive label smoothing mechanism to suppress the impact of noise. Extensive experiments on multiple datasets across various backbones demonstrate the effectiveness and generalization of our framework, surpassing strong baselines such as TGN and SADE. Ablation studies provide further insights into the crucial choices of our design.

Fonte: arXiv cs.CV

NLP/LLMsScore 95

From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning

arXiv:2601.00215v1 Announce Type: cross Abstract: Reinforcement learning (RL) has emerged as a promising approach for eliciting reasoning chains before generating final answers. However, multimodal large language models (MLLMs) generate reasoning that lacks integration of visual information. This limits their ability to solve problems that demand accurate visual perception, such as visual puzzles. We show that visual perception is the key bottleneck in such tasks: converting images into textual descriptions significantly improves performance, yielding gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7. To address this, we investigate reward-driven RL as a mechanism to unlock long visual reasoning in open-source MLLMs without requiring costly supervision. We design and evaluate six reward functions targeting different reasoning aspects, including image understanding, thinking steps, and answer accuracy. Using group relative policy optimization (GRPO), our approach explicitly incentivizes longer, structured reasoning and mitigates bypassing of visual information. Experiments on Qwen-2.5-VL-7B achieve 5.56% improvements over the base model, with consistent gains across both in-domain and out-of-domain settings.

Fonte: arXiv cs.CL

VisionScore 95

NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos

arXiv:2601.00393v1 Announce Type: new Abstract: In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks. Our project page is available at https://neoverse-4d.github.io

Fonte: arXiv cs.CV

NLP/LLMsScore 95

Intelligent Traffic Surveillance for Real-Time Vehicle Detection, License Plate Recognition, and Speed Estimation

arXiv:2601.00344v1 Announce Type: new Abstract: Speeding is a major contributor to road fatalities, particularly in developing countries such as Uganda, where road safety infrastructure is limited. This study proposes a real-time intelligent traffic surveillance system tailored to such regions, using computer vision techniques to address vehicle detection, license plate recognition, and speed estimation. The study collected a rich dataset using a speed gun, a Canon Camera, and a mobile phone to train the models. License plate detection using YOLOv8 achieved a mean average precision (mAP) of 97.9%. For character recognition of the detected license plate, the CNN model got a character error rate (CER) of 3.85%, while the transformer model significantly reduced the CER to 1.79%. Speed estimation used source and target regions of interest, yielding a good performance of 10 km/h margin of error. Additionally, a database was established to correlate user information with vehicle detection data, enabling automated ticket issuance via SMS via Africa's Talking API. This system addresses critical traffic management needs in resource-constrained environments and shows potential to reduce road accidents through automated traffic enforcement in developing countries where such interventions are urgently needed.

Fonte: arXiv cs.CV

NLP/LLMsScore 96

FaithSCAN: Detecção de Alucinações em Uma Única Passagem Baseada em Modelos para Respostas Visuais de Perguntas Fiéis

As alucinações de fidelidade em VQA ocorrem quando modelos de visão-linguagem produzem respostas fluentes, mas visualmente não fundamentadas, comprometendo sua confiabilidade em aplicações críticas de segurança. Propomos o FaithSCAN, uma rede leve que detecta alucinações explorando sinais internos ricos dos VLMs, superando limitações de métodos existentes em eficiência e desempenho de detecção.

Fonte: arXiv cs.AI

VisionScore 95

SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting

arXiv:2601.00285v1 Announce Type: new Abstract: Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step. However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object's motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, our model enables smooth motion interpolation while preserving learned geometric details. Experiments on synthetic datasets show that our method outperforms existing approaches under sparse observations by up to 34% in PSNR, and achieves comparable performance to dense monocular video methods on real-world datasets despite using significantly fewer frames. Moreover, we demonstrate that the input initial static reconstruction can be replaced by a diffusion-based generative prior, making our method more practical for real-world scenarios.

Fonte: arXiv cs.CV

VisionScore 95

A Spatially Masked Adaptive Gated Network for multimodal post-flood water extent mapping using SAR and incomplete multispectral data

arXiv:2601.00123v1 Announce Type: new Abstract: Mapping water extent during a flood event is essential for effective disaster management throughout all phases: mitigation, preparedness, response, and recovery. In particular, during the response stage, when timely and accurate information is important, Synthetic Aperture Radar (SAR) data are primarily employed to produce water extent maps. Recently, leveraging the complementary characteristics of SAR and MSI data through a multimodal approach has emerged as a promising strategy for advancing water extent mapping using deep learning models. This approach is particularly beneficial when timely post-flood observations, acquired during or shortly after the flood peak, are limited, as it enables the use of all available imagery for more accurate post-flood water extent mapping. However, the adaptive integration of partially available MSI data into the SAR-based post-flood water extent mapping process remains underexplored. To bridge this research gap, we propose the Spatially Masked Adaptive Gated Network (SMAGNet), a multimodal deep learning model that utilizes SAR data as the primary input for post-flood water extent mapping and integrates complementary MSI data through feature fusion. In experiments on the C2S-MS Floods dataset, SMAGNet consistently outperformed other multimodal deep learning models in prediction performance across varying levels of MSI data availability. Furthermore, we found that even when MSI data were completely missing, the performance of SMAGNet remained statistically comparable to that of a U-Net model trained solely on SAR data. These findings indicate that SMAGNet enhances the model robustness to missing data as well as the applicability of multimodal deep learning in real-world flood management scenarios.

Fonte: arXiv cs.CV

NLP/LLMsScore 96

FCMBench: Um Benchmark Multimodal Abrangente de Crédito Financeiro para Aplicações do Mundo Real

À medida que a IA multimodal se torna amplamente utilizada para avaliação de risco de crédito e revisão de documentos, um benchmark específico do domínio é urgentemente necessário. Apresentamos o FCMBench-V1.0, um benchmark multimodal de crédito financeiro em larga escala, cobrindo 18 tipos de certificados principais, com 4.043 imagens em conformidade com a privacidade e 8.446 amostras de QA.

Fonte: arXiv cs.AI

RLScore 95

TeleDoCTR: Domain-Specific and Contextual Troubleshooting for Telecommunications

arXiv:2601.00691v1 Announce Type: cross Abstract: Ticket troubleshooting refers to the process of analyzing and resolving problems that are reported through a ticketing system. In large organizations offering a wide range of services, this task is highly complex due to the diversity of submitted tickets and the need for specialized domain knowledge. In particular, troubleshooting in telecommunications (telecom) is a very time-consuming task as it requires experts to interpret ticket content, consult documentation, and search historical records to identify appropriate resolutions. This human-intensive approach not only delays issue resolution but also hinders overall operational efficiency. To enhance the effectiveness and efficiency of ticket troubleshooting in telecom, we propose TeleDoCTR, a novel telecom-related, domain-specific, and contextual troubleshooting system tailored for end-to-end ticket resolution in telecom. TeleDoCTR integrates both domain-specific ranking and generative models to automate key steps of the troubleshooting workflow which are: routing tickets to the appropriate expert team responsible for resolving the ticket (classification task), retrieving contextually and semantically similar historical tickets (retrieval task), and generating a detailed fault analysis report outlining the issue, root cause, and potential solutions (generation task). We evaluate TeleDoCTR on a real-world dataset from a telecom infrastructure and demonstrate that it achieves superior performance over existing state-of-the-art methods, significantly enhancing the accuracy and efficiency of the troubleshooting process.

Fonte: arXiv cs.CL

NLP/LLMsScore 96

A Coleira Agente: Extraindo Mapas Cognitivos Fuzzy de Feedback Causal com LLMs

Desenvolvemos um agente de modelo de linguagem grande (LLM) que extrai mapas cognitivos fuzzy de feedback causal (FCMs) a partir de texto bruto. O processo de aprendizado ou extração causal é agente tanto pela semi-autonomia do LLM quanto pela dinâmica do sistema FCM, que orienta os agentes LLM a buscar e processar texto causal.

Fonte: arXiv cs.AI

RLScore 96

Yahtzee: Técnicas de Aprendizado por Reforço para Jogos Combinatórios Estocásticos

Yahtzee é um jogo clássico de dados com uma estrutura estocástica e combinatória, apresentando recompensas atrasadas, o que o torna um interessante benchmark de RL em escala média. Este trabalho formula Yahtzee como um Processo de Decisão de Markov (MDP) e treina agentes de auto-jogo utilizando diversos métodos de gradiente de política.

Fonte: arXiv cs.AI

RLScore 93

Imitação a partir de Observações com Embeddings Gerativos em Nível de Trajetória

Consideramos o aprendizado de imitação offline a partir de observações (LfO) onde as demonstrações de especialistas são escassas e os dados offline disponíveis são subótimos e distantes do comportamento do especialista. Propomos o TGE, um embedding gerativo em nível de trajetória que constrói uma recompensa substituta densa e suave, estimando a densidade de estados do especialista em um modelo de difusão temporal treinado com dados de trajetória offline.

Fonte: arXiv cs.LG

NLP/LLMsScore 95

StockBot 2.0: Vanilla LSTMs Outperform Transformer-based Forecasting for Stock Prices

arXiv:2601.00197v1 Announce Type: cross Abstract: Accurate forecasting of financial markets remains a long-standing challenge due to complex temporal and often latent dependencies, non-linear dynamics, and high volatility. Building on our earlier recurrent neural network framework, we present an enhanced StockBot architecture that systematically evaluates modern attention-based, convolutional, and recurrent time-series forecasting models within a unified experimental setting. While attention-based and transformer-inspired models offer increased modeling flexibility, extensive empirical evaluation reveals that a carefully constructed vanilla LSTM consistently achieves superior predictive accuracy and more stable buy/sell decision-making when trained under a common set of default hyperparameters. These results highlight the robustness and data efficiency of recurrent sequence models for financial time-series forecasting, particularly in the absence of extensive hyperparameter tuning or the availability of sufficient data when discretized to single-day intervals. Additionally, these results underscore the importance of architectural inductive bias in data-limited market prediction tasks.

Fonte: arXiv cs.CL

NLP/LLMsScore 96

Raciocínio em Ação: Recuperação de Conhecimento Orientada por MCTS para Modelos de Linguagem Grandes

Modelos de linguagem grandes (LLMs) geralmente melhoram seu desempenho por meio da recuperação de informações semanticamente semelhantes ou da melhoria de suas capacidades de raciocínio. Este artigo apresenta um método de recuperação de conhecimento consciente do raciocínio que enriquece os LLMs com informações alinhadas à estrutura lógica das conversas, superando a similaridade semântica superficial.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

Adapting Natural Language Processing Models Across Jurisdictions: A pilot Study in Canadian Cancer Registries

arXiv:2601.00787v1 Announce Type: new Abstract: Population-based cancer registries depend on pathology reports as their primary diagnostic source, yet manual abstraction is resource-intensive and contributes to delays in cancer data. While transformer-based NLP systems have improved registry workflows, their ability to generalize across jurisdictions with differing reporting conventions remains poorly understood. We present the first cross-provincial evaluation of adapting BCCRTron, a domain-adapted transformer model developed at the British Columbia Cancer Registry, alongside GatorTron, a biomedical transformer model, for cancer surveillance in Canada. Our training dataset consisted of approximately 104,000 and 22,000 de-identified pathology reports from the Newfoundland & Labrador Cancer Registry (NLCR) for Tier 1 (cancer vs. non-cancer) and Tier 2 (reportable vs. non-reportable) tasks, respectively. Both models were fine-tuned using complementary synoptic and diagnosis focused report section input pipelines. Across NLCR test sets, the adapted models maintained high performance, demonstrating transformers pretrained in one jurisdiction can be localized to another with modest fine-tuning. To improve sensitivity, we combined the two models using a conservative OR-ensemble achieving a Tier 1 recall of 0.99 and reduced missed cancers to 24, compared with 48 and 54 for the standalone models. For Tier 2, the ensemble achieved 0.99 recall and reduced missed reportable cancers to 33, compared with 54 and 46 for the individual models. These findings demonstrate that an ensemble combining complementary text representations substantially reduce missed cancers and improve error coverage in cancer-registry NLP. We implement a privacy-preserving workflow in which only model weights are shared between provinces, supporting interoperable NLP infrastructure and a future pan-Canadian foundation model for cancer pathology and registry workflows.

Fonte: arXiv cs.CL

VisionScore 96

Avaliação de Detectores de Anomalias para Problemas de Classificação Industrial Altamente Desequilibrados Simulados

O machine learning oferece soluções potenciais para problemas atuais em sistemas industriais, como controle de qualidade e manutenção preditiva, mas enfrenta barreiras únicas em aplicações industriais. Este artigo apresenta uma avaliação abrangente de algoritmos de detecção de anomalias usando um conjunto de dados simulado que reflete restrições de engenharia do mundo real.

Fonte: arXiv cs.AI

NLP/LLMsScore 92

Probabilistic Guarantees for Reducing Contextual Hallucinations in LLMs

arXiv:2601.00641v1 Announce Type: new Abstract: Large language models (LLMs) frequently produce contextual hallucinations, where generated content contradicts or ignores information explicitly stated in the prompt. Such errors are particularly problematic in deterministic automation workflows, where inputs are fixed and correctness is unambiguous. We introduce a simple and model-agnostic framework that provides explicit probabilistic guarantees for reducing hallucinations in this setting. We formalize the notion of a specific task, defined by a fixed input and a deterministic correctness criterion, and show that issuing the same prompt in independent context windows yields an exponential reduction in the probability that all model outputs are incorrect. To identify a correct answer among repeated runs, we incorporate an LLM-as-a-judge and prove that the probability that the judged pipeline fails decays at a rate determined by the judge's true- and false-positive probabilities. When the judge is imperfect, we strengthen it through majority vote over independent judge calls, obtaining ensemble-level error rates that decrease exponentially in the number of votes. This yields an explicit bound on the probability that the pipeline selects a hallucinated answer. Experiments on controlled extraction tasks with synthetic noisy judges match these predictions exactly: pipeline failure decreases exponentially with the number of repetitions, and hallucination-selection decreases exponentially with the number of judges in the ensemble. Together, these results provide a lightweight, modular, and theoretically grounded method for driving hallucination probabilities arbitrarily low in fixed-input LLM workflows-without modifying model weights, decoding strategies, or prompt engineering.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

arXiv:2601.00588v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in cost-sensitive and on-device scenarios, and safety guardrails have advanced mainly in English. However, real-world Chinese malicious queries typically conceal intent via homophones, pinyin, symbol-based splitting, and other Chinese-specific patterns. These Chinese-specific adversarial patterns create the safety evaluation gap that is not well captured by existing benchmarks focused on English. This gap is particularly concerning for lightweight models, which may be more vulnerable to such specific adversarial perturbations. To bridge this gap, we introduce the Chinese-Specific Safety Benchmark (CSSBench) that emphasizes these adversarial patterns and evaluates the safety of lightweight LLMs in Chinese. Our benchmark covers six domains that are common in real Chinese scenarios, including illegal activities and compliance, privacy leakage, health and medical misinformation, fraud and hate, adult content, and public and political safety, and organizes queries into multiple task types. We evaluate a set of popular lightweight LLMs and measure over-refusal behavior to assess safety-induced performance degradation. Our results show that the Chinese-specific adversarial pattern is a critical challenge for lightweight LLMs. This benchmark offers a comprehensive evaluation of LLM safety in Chinese, assisting robust deployments in practice.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak

arXiv:2601.00213v1 Announce Type: cross Abstract: The widespread deployment of large language models (LLMs) has raised growing concerns about their misuse risks and associated safety issues. While prior studies have examined the safety of LLMs in general usage, code generation, and agent-based applications, their vulnerabilities in automated algorithm design remain underexplored. To fill this gap, this study investigates this overlooked safety vulnerability, with a particular focus on intelligent optimization algorithm design, given its prevalent use in complex decision-making scenarios. We introduce MalOptBench, a benchmark consisting of 60 malicious optimization algorithm requests, and propose MOBjailbreak, a jailbreak method tailored for this scenario. Through extensive evaluation of 13 mainstream LLMs including the latest GPT-5 and DeepSeek-V3.1, we reveal that most models remain highly susceptible to such attacks, with an average attack success rate of 83.59% and an average harmfulness score of 4.28 out of 5 on original harmful prompts, and near-complete failure under MOBjailbreak. Furthermore, we assess state-of-the-art plug-and-play defenses that can be applied to closed-source models, and find that they are only marginally effective against MOBjailbreak and prone to exaggerated safety behaviors. These findings highlight the urgent need for stronger alignment techniques to safeguard LLMs against misuse in algorithm design.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach

arXiv:2601.00388v1 Announce Type: new Abstract: Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Exploring the Performance of Large Language Models on Subjective Span Identification Tasks

arXiv:2601.00736v1 Announce Type: new Abstract: Identifying relevant text spans is important for several downstream tasks in NLP, as it contributes to model explainability. While most span identification approaches rely on relatively smaller pre-trained language models like BERT, a few recent approaches have leveraged the latest generation of Large Language Models (LLMs) for the task. Current work has focused on explicit span identification like Named Entity Recognition (NER), while more subjective span identification with LLMs in tasks like Aspect-based Sentiment Analysis (ABSA) has been underexplored. In this paper, we fill this important gap by presenting an evaluation of the performance of various LLMs on text span identification in three popular tasks, namely sentiment analysis, offensive language identification, and claim verification. We explore several LLM strategies like instruction tuning, in-context learning, and chain of thought. Our results indicate underlying relationships within text aid LLMs in identifying precise text spans.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence

arXiv:2601.00596v1 Announce Type: new Abstract: Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent's capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.

Fonte: arXiv cs.CL

VisionScore 96

Aprendizado por Reforço Multiagente para Jogos de Liquidez

Este trabalho explora o uso de métodos de enxame na modelagem de liquidez dos mercados financeiros, unindo Jogos de Liquidez e Enxames Racionais. A pesquisa propõe um modelo teórico onde agentes independentes maximizam a liquidez do mercado sem necessidade de coordenação, contribuindo para a eficiência do mercado e a lucratividade individual.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

Learning Speech Representations with Variational Predictive Coding

arXiv:2601.00100v1 Announce Type: cross Abstract: Despite being the best known objective for learning speech representations, the HuBERT objective has not been further developed and improved. We argue that it is the lack of an underlying principle that stalls the development, and, in this paper, we show that predictive coding under a variational view is the principle behind the HuBERT objective. Due to its generality, our formulation provides opportunities to improve parameterization and optimization, and we show two simple modifications that bring immediate improvements to the HuBERT objective. In addition, the predictive coding formulation has tight connections to various other objectives, such as APC, CPC, wav2vec, and BEST-RQ. Empirically, the improvement in pre-training brings significant improvements to four downstream tasks: phone classification, f0 tracking, speaker recognition, and automatic speech recognition, highlighting the importance of the predictive coding interpretation.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

ECR: Manifold-Guided Semantic Cues for Compact Language Models

arXiv:2601.00543v1 Announce Type: new Abstract: Compact models often lose the structure of their embedding space. The issue shows up when the capacity is tight or the data spans several languages. Such collapse makes it difficult for downstream tasks to build on the resulting representation. Existing compression methods focus on aligning model outputs at a superficial level but fail to preserve the underlying manifold structure. This mismatch often leads to semantic drift in the compact model, causing both task behavior and linguistic properties to deviate from the reference model. To address those issues, we provide a new framework called Embedding Consistency Regulation (ECR). This framework first derives a set of semantic anchors from teacher embeddings (computed once offline). Then, the compact model learns to maintain consistent geometry around these anchors, without relying on matching logits or internal features. ECR adds only a small projection step at inference, without altering the decoding architecture or its runtime behavior. In experiments on a 100K multilingual corpus, ECR consistently stabilizes training and preserves semantic structure across tasks and languages. It also produces a more compact and task-aligned representation space, enabling low-capacity models to learn cleaner manifolds than conventional baselines. ECR works without teacher outputs and is compatible with, but independent of, distillation. Taken together, our results show that ECR helps compact models better follow task requirements and makes them easier to deploy under strict efficiency or privacy limits.

Fonte: arXiv cs.CL

NLP/LLMsScore 96

Além de APIs Perfeitas: Uma Avaliação Abrangente de Agentes LLM Sob a Complexidade Real de APIs

Apresentamos o WildAGTEval, um benchmark projetado para avaliar as capacidades de chamada de função de agentes de modelos de linguagem grande (LLM) sob a complexidade realista de APIs. Ao contrário de trabalhos anteriores que assumem um sistema de API idealizado, WildAGTEval considera a especificação e a execução de APIs, oferecendo cenários de complexidade variados para avaliar o desempenho dos LLMs.

Fonte: arXiv cs.AI

RLScore 95

Retrieval--Reasoning Processes for Multi-hop Question Answering: A Four-Axis Design Framework and Empirical Trends

arXiv:2601.00536v1 Announce Type: new Abstract: Multi-hop question answering (QA) requires systems to iteratively retrieve evidence and reason across multiple hops. While recent RAG and agentic methods report strong results, the underlying retrieval--reasoning \emph{process} is often left implicit, making procedural choices hard to compare across model families. This survey takes the execution procedure as the unit of analysis and introduces a four-axis framework covering (A) overall execution plan, (B) index structure, (C) next-step control (strategies and triggers), and (D) stop/continue criteria. Using this schema, we map representative multi-hop QA systems and synthesize reported ablations and tendencies on standard benchmarks (e.g., HotpotQA, 2WikiMultiHopQA, MuSiQue), highlighting recurring trade-offs among effectiveness, efficiency, and evidence faithfulness. We conclude with open challenges for retrieval--reasoning agents, including structure-aware planning, transferable control policies, and robust stopping under distribution shift.

Fonte: arXiv cs.CL

VisionScore 95

DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection

arXiv:2601.00303v1 Announce Type: new Abstract: Speech is a scalable and non-invasive biomarker for early mental health screening. However, widely used depression datasets like DAIC-WOZ exhibit strong coupling between linguistic sentiment and diagnostic labels, encouraging models to learn semantic shortcuts. As a result, model robustness may be compromised in real-world scenarios, such as Camouflaged Depression, where individuals maintain socially positive or neutral language despite underlying depressive states. To mitigate this semantic bias, we propose DepFlow, a three-stage depression-conditioned text-to-speech framework. First, a Depression Acoustic Encoder learns speaker- and content-invariant depression embeddings through adversarial training, achieving effective disentanglement while preserving depression discriminability (ROC-AUC: 0.693). Second, a flow-matching TTS model with FiLM modulation injects these embeddings into synthesis, enabling control over depressive severity while preserving content and speaker identity. Third, a prototype-based severity mapping mechanism provides smooth and interpretable manipulation across the depression continuum. Using DepFlow, we construct a Camouflage Depression-oriented Augmentation (CDoA) dataset that pairs depressed acoustic patterns with positive/neutral content from a sentiment-stratified text bank, creating acoustic-semantic mismatches underrepresented in natural data. Evaluated across three depression detection architectures, CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming conventional augmentation strategies in depression Detection. Beyond enhancing robustness, DepFlow provides a controllable synthesis platform for conversational systems and simulation-based evaluation, where real clinical data remains limited by ethical and coverage constraints.

Fonte: arXiv cs.CL

RLScore 95

Clustering por Denoising: Difusão latente plug-and-play para dados de célula única

O sequenciamento de RNA de célula única (scRNA-seq) permite o estudo da heterogeneidade celular. No entanto, a precisão do clustering e as análises subsequentes baseadas em rótulos celulares ainda são desafiadoras devido ao ruído de medição e à variabilidade biológica. Apresentamos um framework de difusão plug-and-play que separa o espaço de observação e o espaço de denoising.

Fonte: arXiv stat.ML

NLP/LLMsScore 96

Aprendizado por Reforço com Aproximação de Função para Processos Não-Markovianos

Estudamos métodos de aprendizado por reforço com aproximação de função linear sob processos de estado e custo não-Markovianos. Consideramos inicialmente o método de avaliação de política e demonstramos que o algoritmo converge sob condições adequadas de ergodicidade. Além disso, mostramos que o limite corresponde ao ponto fixo de um operador conjunto composto por uma projeção ortogonal e o operador de Bellman de um processo de decisão auxiliar extit{Markov}.

Fonte: arXiv cs.LG

VisionScore 95

Simulação como Supervisão: Pré-treinamento Mecânico para Descoberta Científica

A modelagem científica enfrenta um trade-off entre a interpretabilidade da teoria mecanicista e o poder preditivo do machine learning. Apresentamos as Simulation-Grounded Neural Networks (SGNNs), um framework que incorpora conhecimento de domínio nos dados de treinamento, permitindo que o modelo aprenda padrões amplos de possibilidade física e seja mais robusto a erros de especificação do modelo.

Fonte: arXiv stat.ML

NLP/LLMsScore 96

Grande Estudo de Caso Empírico: Go-Explore adaptado para Testes de Red Team de IA

Agentes LLM de produção com capacidades de uso de ferramentas requerem testes de segurança, apesar de seu treinamento em segurança. Adaptamos o Go-Explore para avaliar o GPT-4o-mini em 28 execuções experimentais, abordando seis questões de pesquisa. Nossos resultados mostram que a variação de sementes aleatórias domina os parâmetros algorítmicos, resultando em uma variação de 8x nos resultados.

Fonte: arXiv cs.AI

RLScore 95

Projetando uma Rede de Sensores Ótima Através da Minimização da Perda de Informação

O design experimental ótimo é um tópico clássico em estatística, com muitos problemas e soluções bem estudados. Este trabalho investiga o posicionamento de sensores para monitorar processos espaço-temporais, considerando a dimensão temporal em nossa modelagem e otimização. Apresentamos um novo critério de posicionamento de sensores baseado em modelo, juntamente com um algoritmo de otimização altamente eficiente.

Fonte: arXiv stat.ML

RLScore 95

Mitigando o viés otimista na estimativa e otimização de risco entrópico

A medida de risco entrópico é amplamente utilizada em decisões críticas em economia, ciência da gestão, finanças e sistemas de controle críticos, pois captura riscos extremos associados a perdas incertas. Este trabalho apresenta um procedimento de bootstrap paramétrico que corrige o viés do estimador empírico de risco entrópico, melhorando a precisão na tomada de decisões.

Fonte: arXiv stat.ML

NLP/LLMsScore 96

Construindo um Matemático Neuro-Simbólico a Partir de Princípios Fundamentais

Modelos de Linguagem de Grande Escala (LLMs) apresentam falhas lógicas persistentes em raciocínios complexos devido à falta de um framework axiomático interno. Propomos o Mathesis, uma arquitetura neuro-simbólica que codifica estados matemáticos como hipergrafos de ordem superior e utiliza um Kernel de Raciocínio Simbólico (SRK), um motor lógico diferenciável que mapeia restrições para uma paisagem de energia contínua.

Fonte: arXiv cs.AI

VisionScore 96

Detecção Adaptativa de Coordenação Causal para Mídias Sociais: Um Framework Guiado por Memória com Aprendizado Semi-Supervisionado

Detectar comportamentos inautênticos coordenados em mídias sociais é um desafio crítico. Propomos o framework Adaptive Causal Coordination Detection (ACCD), que utiliza uma arquitetura progressiva em três estágios para aprender e reter configurações de detecção otimizadas. O ACCD melhora a identificação de relações causais e reduz a necessidade de rotulagem manual, alcançando um F1-score de 87,3% em detecção de ataques coordenados.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Ajuste Fino Online de Decision Transformers com Gradientes de RL Puro

Os Decision Transformers (DTs) surgiram como um poderoso framework para tomada de decisão sequencial, formulando o aprendizado por reforço offline (RL) como um problema de modelagem de sequência. No entanto, a extensão dos DTs para configurações online com gradientes de RL puro permanece amplamente inexplorada. Identificamos o relabeling de retorno retrospectivo como um obstáculo crítico para o ajuste fino baseado em RL.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

Language as Mathematical Structure: Examining Semantic Field Theory Against Language Games

arXiv:2601.00448v1 Announce Type: new Abstract: Large language models (LLMs) offer a new empirical setting in which long-standing theories of linguistic meaning can be examined. This paper contrasts two broad approaches: social constructivist accounts associated with language games, and a mathematically oriented framework we call Semantic Field Theory. Building on earlier work by the author, we formalize the notions of lexical fields (Lexfelder) and linguistic fields (Lingofelder) as interacting structures in a continuous semantic space. We then analyze how core properties of transformer architectures-such as distributed representations, attention mechanisms, and geometric regularities in embedding spaces-relate to these concepts. We argue that the success of LLMs in capturing semantic regularities supports the view that language exhibits an underlying mathematical structure, while their persistent limitations in pragmatic reasoning and context sensitivity are consistent with the importance of social grounding emphasized in philosophical accounts of language use. On this basis, we suggest that mathematical structure and language games can be understood as complementary rather than competing perspectives. The resulting framework clarifies the scope and limits of purely statistical models of language and motivates new directions for theoretically informed AI architectures.

Fonte: arXiv cs.CL

RLScore 96

Rumo a uma Teoria Física da Inteligência

Apresentamos uma teoria física da inteligência fundamentada no processamento irreversível de informações em sistemas sujeitos a leis de conservação. Um sistema inteligente é modelado como um processo acoplado agente-ambiente, cuja evolução transforma informações em trabalho direcionado a objetivos. Introduzimos o framework Conservation-Congruent Encoding (CCE) para conectar informações ao estado físico.

Fonte: arXiv cs.AI

RLScore 92

Efeitos da Alocação Estrutural da Diversidade de Tarefas Geométricas em Modelos Lineares de Meta-Aprendizado

O meta-aprendizado busca aproveitar informações de tarefas relacionadas para melhorar a previsão em dados não rotulados para novas tarefas com um número limitado de observações rotuladas. Embora a diversidade de tarefas seja considerada benéfica, estudos recentes mostram que ela pode degradar o desempenho de previsão em meta-aprendizado, dependendo da alocação da variabilidade geométrica das tarefas.

Fonte: arXiv stat.ML

RLScore 93

Ideação Progressiva usando um Framework de IA Agente para Co-Criação Humano-IA

A geração de ideias verdadeiramente novas e diversificadas é crucial para o design de engenharia contemporâneo, mas continua sendo um desafio cognitivo significativo para designers novatos. Propomos o MIDAS (Meta-cognitive Ideation through Distributed Agentic AI System), um framework inovador que substitui o paradigma de IA única por uma 'equipe' distribuída de agentes de IA especializados, projetados para emular o fluxo de trabalho de ideação meta-cognitiva humana.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining

arXiv:2601.00364v1 Announce Type: new Abstract: Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, details of its contributions remain unclear. We investigate this question by pretraining models from scratch under controlled conditions, comparing the standard web corpus with a monolingual-only version that removes all multilingual documents. Despite constituting only 2% of the corpus, removing bilingual data causes translation performance to drop 56% in BLEU, while behaviour on cross-lingual QA and general reasoning tasks remains stable, with training curves largely overlapping the baseline. To understand this asymmetry, we categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous documents (14%) based on the semantic relevance of content in different languages. We then conduct granular ablations by reintroducing parallel or code-switching data into the monolingual-only corpus. Our experiments reveal that parallel data almost fully restores translation performance (91% of the unfiltered baseline), whereas code-switching contributes minimally. Other cross-lingual tasks remain largely unaffected by either type. These findings reveal that translation critically depends on systematic token-level alignments from parallel data, whereas cross-lingual understanding and reasoning appear to be achievable even without bilingual data.

Fonte: arXiv cs.CL

RLScore 95

Integração de Multi-Armed Bandit, Aprendizado Ativo e Computação Distribuída para Otimização Escalável

Problemas modernos de otimização em domínios científicos e de engenharia frequentemente dependem de avaliações black-box caras. Propomos o ALMAB-DC, um framework modular e unificado para otimização black-box escalável que integra aprendizado ativo, multi-armed bandits e computação distribuída, com aceleração opcional por GPU. Resultados empíricos mostram que ALMAB-DC supera consistentemente otimizadores black-box de última geração.

Fonte: arXiv stat.ML

RLScore 93

Implantações Econômicas Inteligentes de Baixa Altitude da Próxima Geração: A Perspectiva O-RAN

Apesar do crescente interesse em aplicações de economia de baixa altitude (LAE), como logística baseada em UAV e resposta a emergências, desafios fundamentais permanecem na orquestração dessas missões em ambientes complexos e com restrições de sinal. Este artigo apresenta um framework LAE habilitado para O-RAN que otimiza operações críticas por meio de coordenação entre a arquitetura RAN desagregada e controladores inteligentes.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark

arXiv:2601.00216v1 Announce Type: new Abstract: In medicine, large language models (LLMs) increasingly rely on retrieval-augmented generation (RAG) to ground outputs in up-to-date external evidence. However, current RAG approaches focus primarily on performance improvements while overlooking evidence-based medicine (EBM) principles. This study addresses two key gaps: (1) the lack of PICO alignment between queries and retrieved evidence, and (2) the absence of evidence hierarchy considerations during reranking. We present a generalizable strategy for adapting EBM to graph-based RAG, integrating the PICO framework into knowledge graph construction and retrieval, and proposing a Bayesian-inspired reranking algorithm to calibrate ranking scores by evidence grade without introducing predefined weights. We validated this framework in sports rehabilitation, a literature-rich domain currently lacking RAG systems and benchmarks. We released a knowledge graph (357,844 nodes and 371,226 edges) and a reusable benchmark of 1,637 QA pairs. The system achieved 0.830 nugget coverage, 0.819 answer faithfulness, 0.882 semantic similarity, and 0.788 PICOT match accuracy. In a 5-point Likert evaluation, five expert clinicians rated the system 4.66-4.84 across factual accuracy, faithfulness, relevance, safety, and PICO alignment. These findings demonstrate that the proposed EBM adaptation strategy improves retrieval and answer quality and is transferable to other clinical domains. The released resources also help address the scarcity of RAG datasets in sports rehabilitation.

Fonte: arXiv cs.CL

NLP/LLMsScore 96

Pergunte, Esclareça, Otimize: Colaboração Humano-LLM para um Controle de Estoque Mais Inteligente

arXiv:2601.00121v1 Tipo de Anúncio: novo Resumo: A gestão de estoque continua sendo um desafio para muitas pequenas e médias empresas que carecem da expertise para implementar métodos avançados de otimização. Este artigo investiga se os Large Language Models (LLMs) podem ajudar a preencher essa lacuna. Mostramos que empregar LLMs como solucionadores diretos, de ponta a ponta, acarreta um significativo "custo de alucinação": uma diferença de desempenho decorrente da incapacidade do modelo de realizar raciocínio estocástico fundamentado. Para abordar isso, propomos uma estrutura híbrida de agente que desacopla estritamente o raciocínio semântico do cálculo matemático. Nesta arquitetura, o LLM funciona como uma interface inteligente, extraindo parâmetros da linguagem natural e interpretando resultados enquanto chama automaticamente algoritmos rigorosos para construir o motor de otimização. Para avaliar este sistema interativo em relação à ambiguidade e inconsistência do diálogo gerencial do mundo real, introduzimos o Human Imitator, um "gêmeo digital" ajustado de um gerente racionalmente limitado que permite testes de estresse escaláveis e reproduzíveis. Nossa análise empírica revela que a estrutura híbrida de agente reduz os custos totais de estoque em 32,1% em relação a uma linha de base interativa usando o GPT-4o como solucionador de ponta a ponta. Além disso, descobrimos que fornecer informações perfeitas de verdade fundamental por si só é insuficiente para melhorar o desempenho do GPT-4o, confirmando que o gargalo é fundamentalmente computacional e não informacional. Nossos resultados posicionam os LLMs não como substitutos da pesquisa operacional, mas como interfaces de linguagem natural que tornam políticas rigorosas baseadas em solucionadores acessíveis a não especialistas.

Fonte: arXiv cs.AI

RLScore 96

Amostras Adversariais Não São Criadas Iguais

No último década, diversas teorias foram propostas para explicar a vulnerabilidade generalizada das redes neurais profundas a ataques de evasão adversariais. Este trabalho defende que amostras que utilizam características frágeis, mas preditivas, e aquelas que não utilizam, representam dois tipos de fraquezas adversariais e devem ser diferenciadas na avaliação da robustez adversarial.

Fonte: arXiv cs.LG

RLScore 96

Uma abordagem multi-algoritmo para o balanceamento da carga de trabalho operacional de recursos humanos em um sistema de entrega urbana de última milha

A atribuição eficiente de carga de trabalho à força de trabalho é crucial em sistemas de entrega de pacotes de última milha. Este artigo aborda o problema do balanceamento da carga de trabalho operacional em sistemas de entrega urbana, propondo uma abordagem multi-algoritmo que otimiza o tempo de entrega e garante uma distribuição equilibrada da carga de trabalho entre os trabalhadores.

Fonte: arXiv cs.AI

RLScore 96

Colocação Ótima de Táxis Consciente do Tráfego Usando Aprendizado por Reforço Baseado em Redes Neurais Gráficas

No contexto do transporte em cidades inteligentes, o emparelhamento eficiente da oferta de táxis com a demanda de passageiros requer a integração em tempo real de dados da rede de tráfego urbano e padrões de mobilidade. Este artigo apresenta um framework de aprendizado por reforço (RL) baseado em grafos para a colocação ótima de táxis em ambientes metropolitanos.

Fonte: arXiv cs.LG

NLP/LLMsScore 95

RIMRULE: Improving Tool-Using Language Agents via MDL-Guided Rule Learning

arXiv:2601.00086v1 Announce Type: new Abstract: Large language models (LLMs) often struggle to use tools reliably in domain-specific settings, where APIs may be idiosyncratic, under-documented, or tailored to private workflows. This highlights the need for effective adaptation to task-specific tools. We propose RIMRULE, a neuro-symbolic approach for LLM adaptation based on dynamic rule injection. Compact, interpretable rules are distilled from failure traces and injected into the prompt during inference to improve task performance. These rules are proposed by the LLM itself and consolidated using a Minimum Description Length (MDL) objective that favors generality and conciseness. Each rule is stored in both natural language and a structured symbolic form, supporting efficient retrieval at inference time. Experiments on tool-use benchmarks show that this approach improves accuracy on both seen and unseen tools without modifying LLM weights. It outperforms prompting-based adaptation methods and complements finetuning. Moreover, rules learned from one LLM can be reused to improve others, including long reasoning LLMs, highlighting the portability of symbolic knowledge across architectures.

Fonte: arXiv cs.CL

NLP/LLMsScore 96

Trajectory Guard -- Um Modelo Leve e Consciente de Sequência para Detecção de Anomalias em Tempo Real em AI Agente

Agentes autônomos de LLM geram planos de ação de múltiplos passos que podem falhar devido a desalinhamento contextual ou incoerência estrutural. Métodos existentes de detecção de anomalias não são adequados para esse desafio. Apresentamos o Trajectory Guard, um Autoencoder Recorrente Siamês que aprende alinhamento de tarefa e trajetória, permitindo a detecção unificada de planos incorretos e estruturas de planos malformadas.

Fonte: arXiv cs.LG

RLScore 90

Regularização Geométrica em Mistura de Especialistas: A Desconexão Entre Pesos e Ativações

Modelos de Mistura de Especialistas (MoE) alcançam eficiência por meio de ativação esparsa, mas o papel da regularização geométrica na especialização dos especialistas permanece incerto. Aplicamos perda de ortogonalidade para impor diversidade entre especialistas e descobrimos que ela falha em vários aspectos, como aumento da sobreposição no espaço de pesos e resultados de desempenho inconsistentes.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

Rumo a Sistemas de IA Potencializados por Fotônica em Grande Escala: Da Automação de Design Físico à Coexploração de Sistema e Algoritmo

Neste trabalho, identificamos três considerações essenciais para a realização de sistemas práticos de IA fotônica em escala: (1) suporte a operações tensorais dinâmicas para modelos modernos; (2) gerenciamento sistemático de sobrecargas de conversão, controle e movimentação de dados; e (3) robustez sob não idealidades de hardware. Desenvolvemos uma ferramenta de suporte ao design de IA fotônica desde a exploração inicial até a realização física.

Fonte: arXiv cs.AI

VisionScore 95

Bandidos Contextuais Aditivos Esparsos: Uma Abordagem Não Paramétrica para Tomada de Decisão Online com Covariáveis de Alta Dimensionalidade

Serviços personalizados são centrais para a economia digital atual, e suas decisões sequenciais são frequentemente modeladas como bandidos contextuais. Aplicações modernas enfrentam dois desafios principais: covariáveis de alta dimensionalidade e a necessidade de modelos não paramétricos para capturar relações complexas entre recompensa e covariáveis. Propomos um algoritmo de bandido contextual baseado em um modelo de recompensa aditiva esparsa que aborda ambos os desafios.

Fonte: arXiv stat.ML

RLScore 96

Uma Análise Comparativa de Métodos de Machine Learning Interpretabéis

Nos últimos anos, o Machine Learning (ML) tem sido amplamente adotado em diversos setores, incluindo áreas críticas como saúde, finanças e direito. Essa dependência crescente levantou preocupações sobre a interpretabilidade e a responsabilidade dos modelos, especialmente com a imposição de restrições legais e regulatórias sobre o uso de modelos black-box. Este estudo apresenta uma avaliação comparativa de 16 métodos inerentemente interpretabéis, abrangendo 216 conjuntos de dados tabulares do mundo real.

Fonte: arXiv cs.LG

RLScore 96

Dominação Quântica King-Ring no Xadrez: Uma Abordagem QAOA

O Quantum Approximate Optimization Algorithm (QAOA) é amplamente testado em instâncias aleatórias sintéticas, mas carece de estrutura semântica e interpretabilidade humana. Apresentamos a Dominação Quântica King-Ring (QKRD), um benchmark em escala NISQ derivado de posições táticas de xadrez, oferecendo 5.000 instâncias estruturadas. Usando QKRD, avaliamos escolhas de design do QAOA e mostramos que técnicas informadas por problemas revelam vantagens ocultas em instâncias aleatórias.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

Ouça o Batimento em Fases: Biometria de ECG Consciente de Fases Fundamentada Fisiologicamente

A eletrocardiografia (ECG) é utilizada para autenticação de identidade em dispositivos vestíveis devido às suas características específicas de cada indivíduo e à sua natureza inerente de vivacidade. Propomos um framework Hierarchical Phase-Aware Fusion (HPAF) que evita explicitamente o entrelaçamento de características através de um design em três estágios.

Fonte: arXiv cs.AI

VisionScore 96

Atribuição de Conteúdo Gerado por IA Desconhecida e Consciente

O avanço rápido de modelos generativos fotorealistas tornou crucial atribuir a origem do conteúdo sintético, passando da detecção binária de real ou falso para identificar o modelo específico que produziu uma imagem. Este estudo investiga a distinção de saídas de um modelo gerador-alvo (ex: OpenAI Dalle 3) em relação a outras fontes.

Fonte: arXiv cs.LG

RLScore 96

Computação de Reservatório Sequencial para Previsão Espacial e Temporal de Alta Dimensionalidade de Forma Eficiente

A previsão de sistemas espaciais e temporais de alta dimensionalidade continua sendo um desafio computacional para redes neurais recorrentes (RNNs) e modelos de memória de longo e curto prazo (LSTM). Introduzimos uma arquitetura de Computação de Reservatório Sequencial (Sequential RC) que decompõe um grande reservatório em uma série de reservatórios menores e interconectados, melhorando a eficiência e reduzindo custos computacionais.

Fonte: arXiv cs.LG

RLScore 96

O Transporte Ótimo Pode Melhorar o Aprendizado por Reforço Inverso Federado?

Neste artigo, introduzimos uma abordagem baseada em transporte ótimo para o aprendizado por reforço inverso federado (IRL). Cada cliente realiza localmente um IRL de Máxima Entropia, respeitando suas limitações computacionais e de privacidade. As funções de recompensa resultantes são fundidas via um barycenter de Wasserstein, que considera sua estrutura geométrica subjacente. Este trabalho oferece um framework eficiente em comunicação para derivar uma recompensa compartilhada que se generaliza entre agentes e ambientes heterogêneos.

Fonte: arXiv cs.LG

VisionScore 96

Engenharia de Recursos Híbridos Otimizada para Detecção de Arritmias Eficiente em Recursos em Sinais de ECG: Um Framework de Otimização

As doenças cardiovasculares, especialmente as arritmias, continuam a ser uma das principais causas de mortalidade global, exigindo monitoramento contínuo via Internet das Coisas Médicas (IoMT). Este estudo propõe um framework centrado em dados e eficiente em recursos que prioriza a engenharia de características em vez da complexidade, alcançando alta precisão diagnóstica com um modelo leve.

Fonte: arXiv cs.LG

RLScore 90

Um Framework Agente para Programação Neuro-Simbólica

Integrar restrições simbólicas em modelos de deep learning pode torná-los mais robustos, interpretáveis e eficientes em termos de dados. No entanto, essa tarefa continua sendo desafiadora e demorada. Propomos o AgenticDomiKnowS (ADS) para eliminar essa dependência, permitindo que usuários construam rapidamente programas neuro-simbólicos.

Fonte: arXiv cs.AI

RLScore 93

Proteção de Erro Desigual Aprendida por Reforço para Embeddings Semânticos Quantizados

Este artigo aborda o desafio premente de preservar o significado semântico em sistemas de comunicação com largura de banda limitada. Introduzimos um novo framework de aprendizado por reforço que alcança proteção desigual por dimensão via codificação de repetição adaptativa, utilizando uma métrica de distorção semântica composta que equilibra a similaridade global de embeddings com a preservação em nível de entidade.

Fonte: arXiv cs.LG

RLScore 96

Predição Precoce de Cirrose Hepática com Antecedência de Até Três Anos: Um Estudo de Machine Learning Comparando com o FIB-4

Objetivo: Desenvolver e avaliar modelos de machine learning (ML) para prever a cirrose hepática incidente um, dois e três anos antes do diagnóstico, utilizando dados de registros eletrônicos de saúde (EHR) coletados rotineiramente, e comparar seu desempenho com o escore FIB-4. Métodos: Realizamos um estudo de coorte retrospectivo usando dados de EHR desidentificados de um grande sistema de saúde acadêmico.

Fonte: arXiv cs.LG

RLScore 93

Exploração nos Limites

No problema de identificação do melhor braço (BAI) com confiança fixa, o objetivo é identificar rapidamente a opção ótima enquanto controla a probabilidade de erro abaixo de um limite desejado. Introduzimos uma formulação relaxada que requer controle de erro válido assintoticamente em relação a um tamanho mínimo de amostra, permitindo uma melhor adequação a cenários do mundo real.

Fonte: arXiv cs.LG

RLScore 90

Bandit Kernelizado Laplaciano

Estudamos bandits contextuais multiusuário onde os usuários estão relacionados por um grafo e suas funções de recompensa apresentam comportamento não linear e homofilia gráfica. Introduzimos uma penalidade conjunta fundamentada para a coleção de funções de recompensa dos usuários, combinando um termo de suavidade gráfica com uma penalidade de rugosidade individual.

Fonte: arXiv cs.LG

VisionScore 96

IMBWatch -- uma abordagem de Rede Neural Gráfica Espacial-Temporal para detectar Negócios de Massagem Ilícitos

Os Negócios de Massagem Ilícitos (IMBs) são uma forma encoberta e persistente de exploração organizada que operam sob a fachada de serviços de bem-estar legítimos. A detecção de IMBs é difícil devido a anúncios digitais codificados e mudanças frequentes de pessoal e locais. Apresentamos o IMBWatch, um framework de rede neural gráfica espacial-temporal (ST-GNN) para a detecção em larga escala de IMBs, que combina operações de convolução gráfica com mecanismos de atenção temporal.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

Framework Auto-reparador Agente Bio-inspirado para Sistemas de Computação Distribuída Resilientes

Este artigo apresenta o ReCiSt, um framework auto-reparador bio-inspirado projetado para alcançar resiliência em Sistemas de Computação Distribuída (DCCS). O ReCiSt reconstrói fases biológicas em camadas computacionais que realizam isolamento autônomo de falhas, diagnóstico causal, recuperação adaptativa e consolidação de conhecimento a partir de agentes impulsionados por Language Model (LM).

Fonte: arXiv cs.AI

RLScore 95

AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Mulitmodal Models

arXiv:2601.00561v1 Announce Type: new Abstract: The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (\emph{i.e.}, \textbf{A}ssessing \textbf{E}diting, \textbf{G}eneration, \textbf{I}nterpretation-Understanding for \textbf{S}uper-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually-annotated questions spanning 21 topics (including STEM, humanities, daily life, etc.) and 6 reasoning types. To concretely evaluate the performance of UMMs in world knowledge scope without ambiguous metrics, we further propose Deterministic Checklist-based Evaluation (DCE), a protocol that replaces ambiguous prompt-based scoring with atomic ``Y/N'' judgments, to enhance evaluation reliability. Our extensive experiments reveal that most UMMs exhibit severe world knowledge deficits and that performance degrades significantly with complex reasoning. Additionally, simple plug-in reasoning modules can partially mitigate these vulnerabilities, highlighting a promising direction for future research. These results highlight the importance of world-knowledge-based reasoning as a critical frontier for UMMs.

Fonte: arXiv cs.CV

NLP/LLMsScore 95

Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers

arXiv:2601.00359v1 Announce Type: new Abstract: In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer - an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text-based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time requirements, operating at 26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin. Additionally, we show qualitative results that highlight the effectiveness and possible use cases in real-world applications. Overall, our method serves as a drop-in replacement for traditional segmentation approaches while enabling flexible natural-language querying and seamless integration into 3D mapping pipelines for mobile robotics.

Fonte: arXiv cs.CV

NLP/LLMsScore 95

MotionPhysics: Learnable Motion Distillation for Text-Guided Simulation

arXiv:2601.00504v1 Announce Type: new Abstract: Accurately simulating existing 3D objects and a wide variety of materials often demands expert knowledge and time-consuming physical parameter tuning to achieve the desired dynamic behavior. We introduce MotionPhysics, an end-to-end differentiable framework that infers plausible physical parameters from a user-provided natural language prompt for a chosen 3D scene of interest, removing the need for guidance from ground-truth trajectories or annotated videos. Our approach first utilizes a multimodal large language model to estimate material parameter values, which are constrained to lie within plausible ranges. We further propose a learnable motion distillation loss that extracts robust motion priors from pretrained video diffusion models while minimizing appearance and geometry inductive biases to guide the simulation. We evaluate MotionPhysics across more than thirty scenarios, including real-world, human-designed, and AI-generated 3D objects, spanning a wide range of materials such as elastic solids, metals, foams, sand, and both Newtonian and non-Newtonian fluids. We demonstrate that MotionPhysics produces visually realistic dynamic simulations guided by natural language, surpassing the state of the art while automatically determining physically plausible parameters. The code and project page are available at: https://wangmiaowei.github.io/MotionPhysics.github.io/.

Fonte: arXiv cs.CV

RLScore 95

VisNet: Efficient Person Re-Identification via Alpha-Divergence Loss, Feature Fusion and Dynamic Multi-Task Learning

arXiv:2601.00307v1 Announce Type: new Abstract: Person re-identification (ReID) is an extremely important area in both surveillance and mobile applications, requiring strong accuracy with minimal computational cost. State-of-the-art methods give good accuracy but with high computational budgets. To remedy this, this paper proposes VisNet, a computationally efficient and effective re-identification model suitable for real-world scenarios. It is the culmination of conceptual contributions, including feature fusion at multiple scales with automatic attention on each, semantic clustering with anatomical body partitioning, a dynamic weight averaging technique to balance classification semantic regularization, and the use of loss function FIDI for improved metric learning tasks. The multiple scales fuse ResNet50's stages 1 through 4 without the use of parallel paths, with semantic clustering introducing spatial constraints through the use of rule-based pseudo-labeling. VisNet achieves 87.05% Rank-1 and 77.65% mAP on the Market-1501 dataset, having 32.41M parameters and 4.601 GFLOPs, hence, proposing a practical approach for real-time deployment in surveillance and mobile applications where computational resources are limited.

Fonte: arXiv cs.CV

NLP/LLMsScore 95

TotalFM: An Organ-Separated Framework for 3D-CT Vision Foundation Models

arXiv:2601.00260v1 Announce Type: new Abstract: While foundation models in radiology are expected to be applied to various clinical tasks, computational cost constraints remain a major challenge when training on 3D-CT volumetric data. In this study, we propose TotalFM, a radiological foundation model that efficiently learns the correspondence between 3D-CT images and linguistic expressions based on the concept of organ separation, utilizing a large-scale dataset of 140,000 series. By automating the creation of organ volume and finding-sentence pairs through segmentation techniques and Large Language Model (LLM)-based radiology report processing, and by combining self-supervised pre-training via VideoMAE with contrastive learning using volume-text pairs, we aimed to balance computational efficiency and representation capability. In zero-shot organ-wise lesion classification tasks, the proposed model achieved higher F1 scores in 83% (5/6) of organs compared to CT-CLIP and 64% (9/14) of organs compared to Merlin. These results suggest that the proposed model exhibits high generalization performance in a clinical evaluation setting using actual radiology report sentences. Furthermore, in zero-shot finding-wise lesion classification tasks, our model achieved a higher AUROC in 83% (25/30) of finding categories compared to Merlin. We also confirmed performance comparable to existing Vision-Language Models (VLMs) in radiology report generation tasks. Our results demonstrate that the organ-separated learning framework can serve as a realistic and effective design guideline for the practical implementation of 3D-CT foundation models.

Fonte: arXiv cs.CV

VisionScore 95

TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model

arXiv:2601.00051v1 Announce Type: new Abstract: World models aim to endow AI systems with the ability to represent, generate, and interact with dynamic environments in a coherent and temporally consistent manner. While recent video generation models have demonstrated impressive visual quality, they remain limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, hindering their evolution into practical world models. In this report, we present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. TeleWorld introduces a novel generation-reconstruction-guidance paradigm, where generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to maintain spatial, temporal, and physical consistency. To support long-horizon generation with low latency, we employ an autoregressive diffusion-based video model enhanced with Macro-from-Micro Planning (MMPL)--a hierarchical planning method that reduces error accumulation from frame-level to segment-level-alongside efficient Distribution Matching Distillation (DMD), enabling real-time synthesis under practical computational budgets. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible systems. Extensive experiments demonstrate that TeleWorld achieves strong performance in both static and dynamic world understanding, long-term consistency, and real-time generation efficiency, positioning it as a practical step toward interactive, memory-enabled world models for multimodal generation and embodied intelligence.

Fonte: arXiv cs.CV

VisionScore 95

All-in-One Video Restoration under Smoothly Evolving Unknown Weather Degradations

arXiv:2601.00533v1 Announce Type: new Abstract: All-in-one image restoration aims to recover clean images from diverse unknown degradations using a single model. But extending this task to videos faces unique challenges. Existing approaches primarily focus on frame-wise degradation variation, overlooking the temporal continuity that naturally exists in real-world degradation processes. In practice, degradation types and intensities evolve smoothly over time, and multiple degradations may coexist or transition gradually. In this paper, we introduce the Smoothly Evolving Unknown Degradations (SEUD) scenario, where both the active degradation set and degradation intensity change continuously over time. To support this scenario, we design a flexible synthesis pipeline that generates temporally coherent videos with single, compound, and evolving degradations. To address the challenges in the SEUD scenario, we propose an all-in-One Recurrent Conditional and Adaptive prompting Network (ORCANet). First, a Coarse Intensity Estimation Dehazing (CIED) module estimates haze intensity using physical priors and provides coarse dehazed features as initialization. Second, a Flow Prompt Generation (FPG) module extracts degradation features. FPG generates both static prompts that capture segment-level degradation types and dynamic prompts that adapt to frame-level intensity variations. Furthermore, a label-aware supervision mechanism improves the discriminability of static prompt representations under different degradations. Extensive experiments show that ORCANet achieves superior restoration quality, temporal consistency, and robustness over image and video-based baselines. Code is available at https://github.com/Friskknight/ORCANet-SEUD.

Fonte: arXiv cs.CV

NLP/LLMsScore 95

Memory Bank Compression for Continual Adaptation of Large Language Models

arXiv:2601.00756v1 Announce Type: cross Abstract: Large Language Models (LLMs) have become a mainstay for many everyday applications. However, as data evolve their knowledge quickly becomes outdated. Continual learning aims to update LLMs with new information without erasing previously acquired knowledge. Although methods such as full fine-tuning can incorporate new data, they are computationally expensive and prone to catastrophic forgetting, where prior knowledge is overwritten. Memory-augmented approaches address this by equipping LLMs with a memory bank, that is an external memory module which stores information for future use. However, these methods face a critical limitation, in particular, the memory bank constantly grows in the real-world scenario when large-scale data streams arrive. In this paper, we propose MBC, a model that compresses the memory bank through a codebook optimization strategy during online adaptation learning. To ensure stable learning, we also introduce an online resetting mechanism that prevents codebook collapse. In addition, we employ Key-Value Low-Rank Adaptation in the attention layers of the LLM, enabling efficient utilization of the compressed memory representations. Experiments with benchmark question-answering datasets demonstrate that MBC reduces the memory bank size to 0.3% when compared against the most competitive baseline, while maintaining high retention accuracy during online adaptation learning. Our code is publicly available at https://github.com/Thomkat/MBC.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Toward Better Temporal Structures for Geopolitical Events Forecasting

arXiv:2601.00430v1 Announce Type: new Abstract: Forecasting on geopolitical temporal knowledge graphs (TKGs) through the lens of large language models (LLMs) has recently gained traction. While TKGs and their generalization, hyper-relational temporal knowledge graphs (HTKGs), offer a straightforward structure to represent simple temporal relationships, they lack the expressive power to convey complex facts efficiently. One of the critical limitations of HTKGs is a lack of support for more than two primary entities in temporal facts, which commonly occur in real-world events. To address this limitation, in this work, we study a generalization of HTKGs, Hyper-Relational Temporal Knowledge Generalized Hypergraphs (HTKGHs). We first derive a formalization for HTKGHs, demonstrating their backward compatibility while supporting two complex types of facts commonly found in geopolitical incidents. Then, utilizing this formalization, we introduce the htkgh-polecat dataset, built upon the global event database POLECAT. Finally, we benchmark and analyze popular LLMs on the relation prediction task, providing insights into their adaptability and capabilities in complex forecasting scenarios.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Robust Uncertainty Quantification for Factual Generation of Large Language Models

arXiv:2601.00348v1 Announce Type: new Abstract: The rapid advancement of large language model(LLM) technology has facilitated its integration into various domains of professional and daily life. However, the persistent challenge of LLM hallucination has emerged as a critical limitation, significantly compromising the reliability and trustworthiness of AI-generated content. This challenge has garnered significant attention within the scientific community, prompting extensive research efforts in hallucination detection and mitigation strategies. Current methodological frameworks reveal a critical limitation: traditional uncertainty quantification approaches demonstrate effectiveness primarily within conventional question-answering paradigms, yet exhibit notable deficiencies when confronted with non-canonical or adversarial questioning strategies. This performance gap raises substantial concerns regarding the dependability of LLM responses in real-world applications requiring robust critical thinking capabilities. This study aims to fill this gap by proposing an uncertainty quantification scenario in the task of generating with multiple facts. We have meticulously constructed a set of trap questions contained with fake names. Based on this scenario, we innovatively propose a novel and robust uncertainty quantification method(RU). A series of experiments have been conducted to verify its effectiveness. The results show that the constructed set of trap questions performs excellently. Moreover, when compared with the baseline methods on four different models, our proposed method has demonstrated great performance, with an average increase of 0.1-0.2 in ROCAUC values compared to the best performing baseline method, providing new sights and methods for addressing the hallucination issue of LLMs.

Fonte: arXiv cs.CL

RLScore 93

Modelos de Gargalo de Conceito Controláveis

Os Modelos de Gargalo de Conceito (CBMs) têm atraído atenção por sua capacidade de esclarecer o processo de previsão através de uma camada de conceito compreensível para humanos. No entanto, a maioria dos estudos anteriores focou em cenários estáticos. Propomos os Modelos de Gargalo de Conceito Controláveis (CCBMs), que suportam edições em diferentes granularidades, permitindo manutenção contínua sem a necessidade de re-treinamento.

Fonte: arXiv cs.LG

VisionScore 95

Boosting Segment Anything Model to Generalize Visually Non-Salient Scenarios

arXiv:2601.00537v1 Announce Type: new Abstract: Segment Anything Model (SAM), known for its remarkable zero-shot segmentation capabilities, has garnered significant attention in the community. Nevertheless, its performance is challenged when dealing with what we refer to as visually non-salient scenarios, where there is low contrast between the foreground and background. In these cases, existing methods often cannot capture accurate contours and fail to produce promising segmentation results. In this paper, we propose Visually Non-Salient SAM (VNS-SAM), aiming to enhance SAM's perception of visually non-salient scenarios while preserving its original zero-shot generalizability. We achieve this by effectively exploiting SAM's low-level features through two designs: Mask-Edge Token Interactive decoder and Non-Salient Feature Mining module. These designs help the SAM decoder gain a deeper understanding of non-salient characteristics with only marginal parameter increments and computational requirements. The additional parameters of VNS-SAM can be optimized within 4 hours, demonstrating its feasibility and practicality. In terms of data, we established VNS-SEG, a unified dataset for various VNS scenarios, with more than 35K images, in contrast to previous single-task adaptations. It is designed to make the model learn more robust VNS features and comprehensively benchmark the model's segmentation performance and generalizability on VNS scenarios. Extensive experiments across various VNS segmentation tasks demonstrate the superior performance of VNS-SAM, particularly under zero-shot settings, highlighting its potential for broad real-world applications. Codes and datasets are publicly available at https://guangqian-guo.github.io/VNS-SAM.

Fonte: arXiv cs.CV

NLP/LLMsScore 96

TraCeR: Análise de Risco Competitivo Baseada em Transformer com Covariáveis Longitudinais

A análise de sobrevivência é uma ferramenta crítica para modelar dados de tempo até o evento. Modelos recentes baseados em deep learning reduziram várias suposições de modelagem, mas ainda há desafios na incorporação de covariáveis longitudinais. Apresentamos o TraCeR, um framework de análise de sobrevivência baseado em transformer que lida com covariáveis longitudinais e melhora a calibração do modelo.

Fonte: arXiv cs.LG

RLScore 89

Perspectivas para vantagem quântica em machine learning a partir da representabilidade de funções

Demonstrar vantagem quântica em tarefas de machine learning requer navegar por um complexo cenário de modelos e algoritmos propostos. Introduzimos um framework que conecta a estrutura de circuitos quânticos parametrizados à natureza matemática das funções que eles podem realmente aprender, revelando distinções críticas entre modelos simuláveis e aqueles que permanecem robustamente quânticos.

Fonte: arXiv stat.ML

RLScore 89

Sobre Interpolação Estocástica Condicional para Redução de Dimensão Suficiente Não Linear Generativa

Identificar estruturas suficientes de baixa dimensão na redução de dimensão suficiente (SDR) não linear é um problema fundamental e desafiador. Propomos um novo método, generative sufficient dimension reduction (GenSDR), que utiliza modelos generativos modernos e demonstra a capacidade de recuperar completamente a informação contida no campo central $ au$-field em níveis populacional e amostral.

Fonte: arXiv stat.ML

RLScore 93

Qual é o Preço da Monotonicidade? Um Benchmark Multi-Conjunto de Dados do Gradient Boosting com Restrições Monotônicas para PD de Crédito

Instituições financeiras enfrentam um trade-off entre precisão preditiva e interpretabilidade ao implantar modelos de machine learning para risco de crédito. Este artigo avalia modelos de gradient boosting com e sem restrições monotônicas em cinco conjuntos de dados públicos e três bibliotecas, definindo o Preço da Monotonicidade (PoM) como a mudança relativa em métricas de desempenho padrão.

Fonte: arXiv cs.LG

NLP/LLMsScore 92

Um Framework de IA Agente para Treinamento de Habilidades de Estudantes de Medicina Geral

Avanços em grandes modelos de linguagem oferecem forte potencial para aprimorar pacientes simulados virtuais (VSPs) na educação médica, proporcionando alternativas escaláveis a métodos tradicionais que consomem muitos recursos. Apresentamos um framework agente para treinar habilidades de estudantes de medicina geral que unifica geração de vinhetas configuráveis, diálogo controlado com pacientes e feedback estruturado baseado em padrões.

Fonte: arXiv cs.CL

RLScore 92

Garantias de Convergência Teórica para Autoencoders Variacionais

Os Autoencoders Variacionais (VAE) são modelos generativos populares usados para amostrar de distribuições de dados complexas. Este artigo busca preencher lacunas na compreensão das propriedades teóricas dos VAE, fornecendo garantias de convergência não assintótica para VAE treinados com os algoritmos Stochastic Gradient Descent e Adam, derivando uma taxa de convergência de $ ext{O}( rac{ ext{log} n}{ ext{sqrt}(n)})$.

Fonte: arXiv stat.ML

NLP/LLMsScore 95

CoPE: A Small Language Model for Steerable and Scalable Content Labeling

arXiv:2512.18027v1 Announce Type: new Abstract: This paper details the methodology behind CoPE, a policy-steerable small language model capable of fast and accurate content labeling. We present a novel training curricula called Contradictory Example Training that enables the model to learn policy interpretation rather than mere policy memorization. We also present a novel method for generating content policies, called Binocular Labeling, which enables rapid construction of unambiguous training datasets. When evaluated across seven different harm areas, CoPE exhibits equal or superior accuracy to frontier models at only 1% of their size. We openly release a 9 billion parameter version of the model that can be run on a single consumer-grade GPU. Models like CoPE represent a paradigm shift for classifier systems. By turning an ML task into a policy writing task, CoPE opens up new design possibilities for the governance of online platforms.

Fonte: arXiv cs.CL

RLScore 95

Defesa Certificada sobre a Justiça das Redes Neurais Gráficas

As Redes Neurais Gráficas (GNNs) se destacaram como um modelo proeminente de aprendizado em grafos, mas são vulneráveis a ataques que podem corromper a justiça de suas previsões. Neste artigo, propomos um framework chamado ELEGANT, que oferece uma análise teórica detalhada para certificar a justiça das GNNs, sem exigir re-treinamento e sem suposições sobre a estrutura ou parâmetros das GNNs.

Fonte: arXiv stat.ML

RLScore 93

Redes Neurais Variacionais Baseadas em Microestrutura para Quantificação Robusta de Incertezas em Gêmeos Digitais de Materiais

As incertezas aleatórias - variabilidade irremovível na morfologia da microestrutura, comportamento dos constituintes e condições de processamento - representam um grande desafio para o desenvolvimento de gêmeos digitais robustos em relação à incerteza. Apresentamos a Variational Deep Material Network (VDMN), um modelo substituto informado pela física que permite previsões eficientes e probabilísticas do comportamento dos materiais.

Fonte: arXiv cs.LG

RLScore 96

Aprendizado por Reforço Contínuo Guiado por Demonstração em Ambientes Dinâmicos

O aprendizado por reforço (RL) se destaca em várias aplicações, mas enfrenta dificuldades em ambientes dinâmicos. O aprendizado por reforço contínuo (CRL) permite que agentes de RL aprendam e se adaptem continuamente, mas equilibrar estabilidade e plasticidade continua sendo um desafio. Propomos o aprendizado por reforço contínuo guiado por demonstração (DGCRL), que utiliza um repositório de demonstrações externas para guiar a exploração e adaptação do RL.

Fonte: arXiv cs.LG

RLScore 96

LeJOT: Uma Solução Inteligente de Orquestração de Custos de Trabalho para a Plataforma Databricks

Com os avanços rápidos em tecnologias de big data, a plataforma Databricks tornou-se fundamental para empresas e instituições de pesquisa. No entanto, gerenciar os custos operacionais crescentes associados à execução de trabalhos é um desafio crítico. Apresentamos o LeJOT, um framework de orquestração de custos de trabalho que utiliza machine learning para previsão de tempo de execução e um modelo de otimização baseado em solver para alocação de recursos em tempo real.

Fonte: arXiv cs.LG

NLP/LLMsScore 95

Mantenha Leve! Simplificando o Agrupamento de Imagens Através de Adaptadores Sem Texto

No contexto de modelos pré-treinados, a classificação eficaz pode ser alcançada com camadas de leitura leves. Este trabalho demonstra que, no agrupamento profundo, é possível obter desempenho competitivo com métodos mais complexos utilizando um pipeline de treinamento altamente simplificado e sem texto. Nossa abordagem, Simple Clustering via Pre-trained models (SCP), utiliza representações de características de modelos de visão pré-treinados e pares de dados positivos.

Fonte: arXiv stat.ML

RLScore 92

A Single-Loop First-Order Algorithm for Linearly Constrained Bilevel Optimization

arXiv:2510.24710v2 Announce Type: replace-cross Abstract: We study bilevel optimization problems where the lower-level problems are strongly convex and have coupled linear constraints. To overcome the potential non-smoothness of the hyper-objective and the computational challenges associated with the Hessian matrix, we utilize penalty and augmented Lagrangian methods to reformulate the original problem as a single-level one. Especially, we establish a strong theoretical connection between the reformulated function and the original hyper-objective by characterizing the closeness of their values and derivatives. Based on this reformulation, we propose a single-loop, first-order algorithm for linearly constrained bilevel optimization (SFLCB). We provide rigorous analyses of its non-asymptotic convergence rates, showing an improvement over prior double-loop algorithms -- form $O(\epsilon^{-3}\log(\epsilon^{-1}))$ to $O(\epsilon^{-3})$. The experiments corroborate our theoretical findings and demonstrate the practical efficiency of the proposed SFLCB algorithm. Simulation code is provided at https://github.com/ShenGroup/SFLCB.

Fonte: arXiv stat.ML

VisionScore 93

Planejamento de Trajetória para Agricultura Inteligente Baseada em UAV Usando Aprendizado Profundo Triplo Q com Imitacão

Veículos aéreos não tripulados (UAVs) surgiram como uma plataforma auxiliar promissora para a agricultura inteligente, realizando simultaneamente detecção de ervas daninhas, reconhecimento e coleta de dados de sensores sem fio. No entanto, o planejamento de trajetória é desafiador devido à alta incerteza do ambiente e à capacidade limitada da bateria dos UAVs. Propomos um algoritmo inovador de aprendizado profundo triplo Q com imitação (ITDQN) para resolver esses problemas.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

MoE-TransMov: Um Modelo Baseado em Transformer para Previsão do Próximo Ponto de Interesse (POI) em Movimentos Familiares e Não Familiares

A previsão precisa do próximo ponto de interesse (POI) nas trajetórias de mobilidade humana é crucial para serviços baseados em localização, permitindo recomendações mais oportunas e personalizadas. Propomos o MoE-TransMov, um modelo baseado em Transformer com arquitetura Mixture-of-Experts (MoE) que captura padrões de mobilidade distintos em diferentes contextos de movimento, melhorando a precisão das previsões.

Fonte: arXiv cs.LG

NLP/LLMsScore 95

Research on a hybrid LSTM-CNN-Attention model for text-based web content classification

arXiv:2512.18475v1 Announce Type: new Abstract: This study presents a hybrid deep learning architecture that integrates LSTM, CNN, and an Attention mechanism to enhance the classification of web content based on text. Pretrained GloVe embeddings are used to represent words as dense vectors that preserve semantic similarity. The CNN layer extracts local n-gram patterns and lexical features, while the LSTM layer models long-range dependencies and sequential structure. The integrated Attention mechanism enables the model to focus selectively on the most informative parts of the input sequence. A 5-fold cross-validation setup was used to assess the robustness and generalizability of the proposed solution. Experimental results show that the hybrid LSTM-CNN-Attention model achieved outstanding performance, with an accuracy of 0.98, precision of 0.94, recall of 0.92, and F1-score of 0.93. These results surpass the performance of baseline models based solely on CNNs, LSTMs, or transformer-based classifiers such as BERT. The combination of neural network components enabled the model to effectively capture both fine-grained text structures and broader semantic context. Furthermore, the use of GloVe embeddings provided an efficient and effective representation of textual data, making the model suitable for integration into systems with real-time or near-real-time requirements. The proposed hybrid architecture demonstrates high effectiveness in text-based web content classification, particularly in tasks requiring both syntactic feature extraction and semantic interpretation. By combining presented mechanisms, the model addresses the limitations of individual architectures and achieves improved generalization. These findings support the broader use of hybrid deep learning approaches in NLP applications, especially where complex, unstructured textual data must be processed and classified with high reliability.

Fonte: arXiv cs.CL

VisionScore 95

Teste para estrutura latente via a matriz aleatória de Wilcoxon--Wigner de estatísticas de classificação normalizadas

Este artigo considera o problema de testar a estrutura latente em grandes matrizes de dados simétricas. O objetivo é desenvolver uma metodologia estatisticamente fundamentada que seja flexível em sua aplicabilidade, computacionalmente eficiente e insensível a variações extremas nos dados, superando assim as limitações das abordagens existentes.

Fonte: arXiv stat.ML

RLScore 96

Extensão de Base Contrafactual e Geometria Representacional: Um Modelo de Crescimento Conceitual com Restrições de MDL

O aprendizado de conceitos se torna possível apenas quando representações existentes falham em contabilizar a experiência. Este artigo propõe um framework geométrico onde o crescimento conceitual é modelado como extensão de base admissível avaliada sob um critério de Minimum Description Length (MDL). A experiência é representada como vetores em relação a um subespaço conceitual atual.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Redes Neurais Gráficas Aprimoradas por Recursos para Classificação de Modelos Geradores de Grafos Sintéticos: Um Estudo de Benchmarking

A capacidade de discriminar entre modelos geradores de grafos é fundamental para entender padrões estruturais complexos em grafos sintéticos e nas estruturas do mundo real que eles emulam. Este trabalho investiga a classificação de famílias de grafos sintéticos usando uma abordagem híbrida que combina Graph Neural Networks (GNNs) com recursos teóricos de grafos.

Fonte: arXiv cs.LG

RLScore 93

Família FedSUM: Métodos Eficientes de Aprendizado Federado sob Participação Arbitrária de Clientes

Os métodos de Aprendizado Federado (FL) são frequentemente projetados para padrões específicos de participação de clientes, limitando sua aplicabilidade em implementações práticas. Apresentamos a família de algoritmos FedSUM, que suporta participação arbitrária de clientes sem suposições adicionais sobre a heterogeneidade dos dados. Nosso framework modela a variabilidade de participação com duas métricas de atraso: o atraso máximo $ au_{ ext{max}}$ e o atraso médio $ au_{ ext{avg}}$.

Fonte: arXiv cs.LG

RLScore 96

SD2AIL: Aprendizado por Imitação Adversarial a partir de Demonstrações Sintéticas via Modelos de Difusão

O Aprendizado por Imitação Adversarial (AIL) é um framework dominante que infere recompensas a partir de demonstrações de especialistas para guiar a otimização de políticas. Inspirados pelo sucesso dos modelos de difusão, propomos o SD2AIL, que utiliza demonstrações sintéticas para aumentar as demonstrações de especialistas, introduzindo também uma estratégia de replay priorizado para maximizar a eficácia das demonstrações.

Fonte: arXiv cs.LG

NLP/LLMsScore 92

Consolidação Narrativa: Formulando uma Nova Tarefa para Unificar Relatos de Múltiplas Perspectivas

O processamento de documentos narrativos sobrepostos, como testemunhos legais ou relatos históricos, visa não a compressão, mas um texto unificado, coerente e cronologicamente sólido. Este artigo define formalmente esse desafio como uma nova tarefa de NLP: Consolidação Narrativa, com foco na integridade cronológica, completude e fusão de detalhes complementares.

Fonte: arXiv cs.CL

RLScore 96

O Challenger: Quando Novas Fontes de Dados Justificam a Troca de Modelos de Machine Learning?

Estudamos o problema de decidir se e quando uma organização deve substituir um modelo incumbente treinado por um challenger que utiliza novas características disponíveis. Desenvolvemos um framework econômico e estatístico unificado que relaciona dinâmicas de curva de aprendizado, custos de aquisição de dados e re-treinamento, e desconto de ganhos futuros.

Fonte: arXiv cs.LG

RLScore 89

Estimativa de densidade via discrepância de mistura e momentos

Com o objetivo de generalizar estatísticas de histogramas para casos de alta dimensão, a estimativa de densidade via partição sequencial baseada em discrepância (DSP) foi proposta para aprender uma aproximação adaptativa constante por partes. A discrepância de mistura e a comparação de momentos são utilizadas como substitutos da discrepância estrela, resultando em DSP-mix e MSP, que são computacionalmente viáveis e exibem invariância de reflexão e rotação.

Fonte: arXiv stat.ML

RLScore 96

Previsão e Prognóstico dos Impactos de Secas de Curto Prazo Usando Machine Learning para Apoiar Esforços de Mitigação e Adaptação

A seca é um risco natural complexo que afeta sistemas ecológicos e humanos, resultando em perdas ambientais e econômicas significativas. Este estudo aplica técnicas de machine learning para vincular índices de seca a registros históricos de impactos, gerando previsões de curto prazo. Os resultados indicam que os impactos de incêndios e alívio foram previstos com maior precisão, apoiando o desenvolvimento de um Sistema de Comunicação de Informação Ecológica sobre Secas (EcoDri) para o Novo México.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

NL2CA: Auto-formalização da Tomada de Decisão Cognitiva a partir da Linguagem Natural Usando um Framework Unsupervised CriticNL2LTL

arXiv:2512.18189v1 Tipo de Anúncio: novo Resumo: Modelos de computação cognitiva oferecem uma maneira formal e interpretável de caracterizar a deliberação e a tomada de decisão humanas, no entanto, seu desenvolvimento continua sendo intensivo em mão de obra. Neste artigo, propomos o NL2CA, um método inovador para auto-formalizar regras de tomada de decisão cognitiva a partir de descrições em linguagem natural da experiência humana. Diferente da maioria dos trabalhos relacionados que exploram modelagem puramente manual ou guiada por humanos, nosso método é totalmente automatizado, sem qualquer intervenção humana. A abordagem primeiro traduz o texto em Lógica Temporal Linear (LTL) usando um modelo de linguagem grande (LLM) ajustado, depois refina a lógica através de uma Critic Tree não supervisionada e, finalmente, transforma a saída em regras de produção executáveis compatíveis com frameworks cognitivos simbólicos. Com base nas regras resultantes, um agente cognitivo é posteriormente construído e otimizado através de aprendizado de reforço cognitivo de acordo com os dados comportamentais do mundo real. Nosso método é validado em dois domínios: (1) tradução NL-para-LTL, onde nosso módulo CriticNL2LTL alcança desempenho consistente em benchmarks tanto de especialistas quanto em larga escala, sem feedbacks de humanos, e (2) simulação de direção cognitiva, onde agentes automaticamente construídos a partir de entrevistas humanas aprenderam com sucesso os diversos padrões de decisão em cerca de 70 testes em diferentes cenários críticos. Resultados experimentais demonstram que o NL2CA possibilita modelagem cognitiva escalável, interpretável e alinhada ao humano a partir de dados textuais não estruturados, oferecendo um novo paradigma para projetar automaticamente agentes cognitivos simbólicos.

Fonte: arXiv cs.AI

RLScore 93

Sobre a Taxa de Convergência do Gradiente Descendente LoRA

O algoritmo de adaptação de baixa rank (LoRA) para ajuste fino de grandes modelos ganhou popularidade nos últimos anos devido ao seu desempenho notável e baixos requisitos computacionais. Este trabalho apresenta pela primeira vez uma análise de convergência não assintótica do algoritmo de gradiente descendente LoRA original, sem pressupostos que limitam a compreensão da convergência.

Fonte: arXiv cs.LG

RLScore 95

Amostragem de distribuições multimodais com pontos de partida aquecidos: Limites não assintóticos para o Reweighted Annealed Leap-Point Sampler

A amostragem de distribuições multimodais é um desafio central na inferência bayesiana e machine learning. Este trabalho introduz o Reweighted ALPS (Re-ALPS), uma versão modificada do Annealed Leap-Point Sampler (ALPS), que elimina a suposição de aproximação gaussiana e apresenta um limite de tempo polinomial em um contexto geral.

Fonte: arXiv stat.ML

NLP/LLMsScore 95

Q-KVComm: Comunicação Multi-Agente Eficiente Via Compressão Adaptativa de Cache KV

Sistemas multi-agente de Large Language Model (LLM) enfrentam um gargalo crítico: a transmissão redundante de informações contextuais entre agentes consome largura de banda e recursos computacionais excessivos. Apresentamos o Q-KVComm, um novo protocolo que permite a transmissão direta de representações comprimidas de cache key-value (KV) entre agentes LLM.

Fonte: arXiv cs.CL

RLScore 95

Neural CDEs como Corretores para Modelos de Séries Temporais Aprendidos

Modelos de séries temporais aprendidos, sejam contínuos ou discretos, são amplamente utilizados para prever os estados de um sistema dinâmico. Propomos um mecanismo de Predictor-Corrector, onde o Predictor é um modelo de série temporal aprendido e o Corrector é uma equação diferencial controlada neuralmente. A adição de erros às previsões melhora o desempenho das previsões.

Fonte: arXiv stat.ML

RLScore 95

ReGal: A First Look at PPO-based Legal AI for Judgment Prediction and Summarization in India

arXiv:2512.18014v1 Announce Type: new Abstract: This paper presents an early exploration of reinforcement learning methodologies for legal AI in the Indian context. We introduce Reinforcement Learning-based Legal Reasoning (ReGal), a framework that integrates Multi-Task Instruction Tuning with Reinforcement Learning from AI Feedback (RLAIF) using Proximal Policy Optimization (PPO). Our approach is evaluated across two critical legal tasks: (i) Court Judgment Prediction and Explanation (CJPE), and (ii) Legal Document Summarization. Although the framework underperforms on standard evaluation metrics compared to supervised and proprietary models, it provides valuable insights into the challenges of applying RL to legal texts. These challenges include reward model alignment, legal language complexity, and domain-specific adaptation. Through empirical and qualitative analysis, we demonstrate how RL can be repurposed for high-stakes, long-document tasks in law. Our findings establish a foundation for future work on optimizing legal reasoning pipelines using reinforcement learning, with broader implications for building interpretable and adaptive legal AI systems.

Fonte: arXiv cs.CL

RLScore 92

Modelos Aditivos Generalizados Baseados em Cluster Informados por Recursos Aleatórios de Fourier

A aprendizagem de máquina explicável busca equilibrar a precisão da previsão e a transparência do modelo, especialmente em contextos onde modelos preditivos de caixa-preta, como redes neurais profundas ou métodos baseados em kernel, apresentam forte desempenho empírico, mas são difíceis de interpretar. Este trabalho introduz uma mistura de modelos aditivos generalizados (GAMs) que utilizam representações de recursos aleatórios de Fourier (RFF) para revelar estruturas localmente adaptativas nos dados.

Fonte: arXiv stat.ML

NLP/LLMsScore 94

LLM Agents Implement an NLG System from Scratch: Building Interpretable Rule-Based RDF-to-Text Generators

arXiv:2512.18360v1 Announce Type: new Abstract: We present a novel neurosymbolic framework for RDF-to-text generation, in which the model is "trained" through collaborative interactions among multiple LLM agents rather than traditional backpropagation. The LLM agents produce rule-based Python code for a generator for the given domain, based on RDF triples only, with no in-domain human reference texts. The resulting system is fully interpretable, requires no supervised training data, and generates text nearly instantaneously using only a single CPU. Our experiments on the WebNLG and OpenDialKG data show that outputs produced by our approach reduce hallucination, with only slight fluency penalties compared to finetuned or prompted language models

Fonte: arXiv cs.CL

RLScore 93

Serviço Eficiente de Mistura de Agentes via Roteamento em Estruturas de Árvore, Poda Adaptativa e Sobreposição de Pré-preenchimento e Decodificação Consciente de Dependências

arXiv:2512.18126v1 Tipo de Anúncio: novo Resumo: A inferência de Mistura de Agentes (MoA) pode sofrer com comunicação densa entre agentes e baixa utilização de hardware, o que, em conjunto, aumenta a latência de serviço. Apresentamos um design de serviço que visa esses gargalos por meio de um co-design de algoritmo e sistema. Primeiro, substituímos gráficos de interação densa entre agentes por uma topologia hierárquica em árvore que induz esparsidade estruturada na comunicação entre agentes. Em segundo lugar, introduzimos um mecanismo adaptativo em tempo de execução que termina ou ignora seletivamente invocações de agentes a jusante usando sinais de acordo semântico e confiança a partir de saídas intermediárias. Por fim, realizamos a execução de agentes em pipeline, sobrepondo o pré-preenchimento incremental com a decodificação entre agentes relacionados a dependências, melhorando a utilização e reduzindo a latência de inferência. Em tarefas representativas, essa abordagem reduz substancialmente a latência de ponta a ponta (em até 90%) enquanto mantém uma precisão comparável (dentro de $m{ ext{ extpm}}$1%) em relação às bases de MoA de conectividade densa e pode melhorar a precisão em certos contextos.

Fonte: arXiv cs.AI

NLP/LLMsScore 93

Misturas secretas de especialistas dentro do seu LLM

arXiv:2512.18452v1 Tipo de Anúncio: novo. Este artigo investiga as camadas MLP em modelos LLM densos, propondo que essas camadas realizam secretamente uma computação esparsa, sendo bem aproximadas por camadas de Mixture of Experts (MoE) com ativação esparsa. Validamos empiricamente essa hipótese em LLMs pré-treinados, mostrando que a distribuição de ativação é crucial para os resultados.

Fonte: arXiv cs.LG

NLP/LLMsScore 95

LLM-based Few-Shot Early Rumor Detection with Imitation Agent

arXiv:2512.18352v1 Announce Type: new Abstract: Early Rumor Detection (EARD) aims to identify the earliest point at which a claim can be accurately classified based on a sequence of social media posts. This is especially challenging in data-scarce settings. While Large Language Models (LLMs) perform well in few-shot NLP tasks, they are not well-suited for time-series data and are computationally expensive for both training and inference. In this work, we propose a novel EARD framework that combines an autonomous agent and an LLM-based detection model, where the agent acts as a reliable decision-maker for \textit{early time point determination}, while the LLM serves as a powerful \textit{rumor detector}. This approach offers the first solution for few-shot EARD, necessitating only the training of a lightweight agent and allowing the LLM to remain training-free. Extensive experiments on four real-world datasets show our approach boosts performance across LLMs and surpasses existing EARD methods in accuracy and earliness.

Fonte: arXiv cs.CL

RLScore 96

ARC: Aproveitando Representações Composicionais para Aprendizado entre Problemas em VRPs

Os Problemas de Roteamento de Veículos (VRPs) com atributos diversos do mundo real têm gerado interesse recente em abordagens de aprendizado entre problemas que generalizam eficientemente entre variantes. Propomos o ARC (Attribute Representation via Compositional Learning), um framework de aprendizado entre problemas que aprende representações de atributos desentranhadas, decompondo-as em dois componentes complementares.

Fonte: arXiv cs.LG

NLP/LLMsScore 95

Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?

arXiv:2512.19117v1 Announce Type: new Abstract: This paper proposes an epistemological shift in the analysis of large generative models, replacing the category ''Large Language Models'' (LLM) with that of ''Large Discourse Models'' (LDM), and then with that of Artificial Discursive Agent (ADA). The theoretical framework is based on an ontological triad distinguishing three regulatory instances: the apprehension of the phenomenal regularities of the referential world, the structuring of embodied cognition, and the structural-linguistic sedimentation of the utterance within a socio-historical context. LDMs, operating on the product of these three instances (the document), model the discursive projection of a portion of human experience reified by the learning corpus. The proposed program aims to replace the ''fascination/fear'' dichotomy with public trials and procedures that make the place, uses, and limits of artificial discursive agents in contemporary social space decipherable, situating this approach within a perspective of governance and co-regulation involving the State, industry, civil society, and academia.

Fonte: arXiv cs.CL

RLScore 92

Deep Learning para Extração do Modo $B$ Primordial

A busca por ondas gravitacionais primordiais é um objetivo central das pesquisas sobre o fundo cósmico de micro-ondas (CMB). Isolar o sinal de polarização característico do modo $B$ gerado por ondas gravitacionais primordiais é desafiador devido a vários fatores, incluindo a pequena amplitude do sinal e a contaminação por foregrounds astrofísicos. Este trabalho demonstra como redes de deep learning podem ser aplicadas para estimar e remover múltiplas fontes de polarização do modo $B$ secundário.

Fonte: arXiv stat.ML

VisionScore 96

EIA-SEC: Framework Melhorado de Actor-Critic para Controle Colaborativo de Multi-UAV na Agricultura Inteligente

A aplicação generalizada da tecnologia de comunicação sem fio tem promovido o desenvolvimento da agricultura inteligente, onde veículos aéreos não tripulados (UAVs) desempenham um papel multifuncional. Neste trabalho, modelamos um processo de decisão de Markov para resolver o problema de planejamento de trajetória de multi-UAV e propomos o novo framework Elite Imitation Actor-Shared Ensemble Critic (EIA-SEC). Resultados experimentais mostram que o EIA-SEC supera as referências de ponta em desempenho de recompensa, estabilidade de treinamento e velocidade de convergência.

Fonte: arXiv cs.LG

RLScore 95

Teorema Central do Limite para médias ergódicas de cadeias de Markov e a comparação de algoritmos de amostragem para distribuições com cauda pesada

Estabelecer teoremas do limite central (CLTs) para médias ergódicas de cadeias de Markov é um problema fundamental em probabilidade e suas aplicações. Este artigo fornece condições necessárias verificáveis para CLTs de médias ergódicas em espaços de estados gerais, com foco em condições de drift que também oferecem limites inferiores para as taxas de convergência à estacionaridade.

Fonte: arXiv stat.ML

NLP/LLMsScore 96

Benchmarking de substitutos neurais em fluxos multifísicos espaço-temporais realistas

Prever dinâmicas multifísicas é computacionalmente caro e desafiador devido ao acoplamento severo de processos físicos heterogêneos e multiescala. Apresentamos o REALM (REalistic AI Learning for Multiphysics), um framework rigoroso de benchmarking para testar substitutos neurais em fluxos reativos desafiadores, com 11 conjuntos de dados de alta fidelidade e um protocolo padronizado de treinamento e avaliação.

Fonte: arXiv cs.LG

RLScore 96

AL-GNN: Aprendizado Contínuo de Grafos Preservando a Privacidade e Livre de Replay via Aprendizado Analítico

O aprendizado contínuo de grafos (CGL) permite que redes neurais de grafos aprendam incrementalmente a partir de dados estruturados em grafos sem esquecer o conhecimento previamente adquirido. O AL-GNN é um novo framework que elimina a necessidade de retropropagação e buffers de replay, utilizando princípios da teoria do aprendizado analítico para otimizar o aprendizado.

Fonte: arXiv cs.LG

RLScore 96

Seleção de Dados Comportamentais Offline

O comportamento de clonagem é uma abordagem amplamente adotada para aprendizado de políticas offline a partir de demonstrações de especialistas. Este artigo revela a saturação de dados em conjuntos de dados comportamentais offline, onde o desempenho da política rapidamente se estabiliza com uma pequena fração do conjunto. Propomos um método eficaz, Stepwise Dual Ranking (SDR), que extrai um subconjunto compacto e informativo de grandes conjuntos de dados comportamentais offline.

Fonte: arXiv cs.LG

NLP/LLMsScore 95

Learning to Prioritize IT Tickets: A Comparative Evaluation of Embedding-based Approaches and Fine-Tuned Transformer Models

arXiv:2512.17916v1 Announce Type: new Abstract: Prioritizing service tickets in IT Service Management (ITSM) is critical for operational efficiency but remains challenging due to noisy textual inputs, subjective writing styles, and pronounced class imbalance. We evaluate two families of approaches for ticket prioritization: embedding-based pipelines that combine dimensionality reduction, clustering, and classical classifiers, and a fine-tuned multilingual transformer that processes both textual and numerical features. Embedding-based methods exhibit limited generalization across a wide range of thirty configurations, with clustering failing to uncover meaningful structures and supervised models highly sensitive to embedding quality. In contrast, the proposed transformer model achieves substantially higher performance, with an average F1-score of 78.5% and weighted Cohen's kappa values of nearly 0.80, indicating strong alignment with true labels. These results highlight the limitations of generic embeddings for ITSM data and demonstrate the effectiveness of domain-adapted transformer architectures for operational ticket prioritization.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Ensinando e Criticando a Conceituação e Operacionalização em NLP

Pesquisadores de NLP frequentemente invocam conceitos abstratos como 'interpretabilidade', 'viés', 'raciocínio' e 'estereótipos' sem defini-los. Este artigo descreve um seminário criado para estudantes explorarem questões de conceituação e operacionalização, com uma lista de leitura interdisciplinar e ênfase em discussão e crítica.

Fonte: arXiv cs.CL

NLP/LLMsScore 92

DACE For Railway Acronym Disambiguation

arXiv:2512.18357v1 Announce Type: new Abstract: Acronym Disambiguation (AD) is a fundamental challenge in technical text processing, particularly in specialized sectors where high ambiguity complicates automated analysis. This paper addresses AD within the context of the TextMine'26 competition on French railway documentation. We present DACE (Dynamic Prompting, Retrieval Augmented Generation, Contextual Selection, and Ensemble Aggregation), a framework that enhances Large Language Models through adaptive in-context learning and external domain knowledge injection. By dynamically tailoring prompts to acronym ambiguity and aggregating ensemble predictions, DACE mitigates hallucination and effectively handles low-resource scenarios. Our approach secured the top rank in the competition with an F1 score of 0.9069.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Rumo ao Desaprendizado que Preserva o Raciocínio em Modelos de Linguagem Grande Multimodal

O desaprendizado de máquinas visa apagar dados solicitados de modelos treinados sem re-treinamento completo. Para Modelos de Linguagem Grande Multimodal com Raciocínio (RMLLMs), isso é desafiador, pois etapas intermediárias podem vazar informações sensíveis. Apresentamos o RMLLMU-Bench, o primeiro benchmark para desaprendizado de RMLLM que avalia o vazamento de raciocínio e a retenção de raciocínio.

Fonte: arXiv cs.CL

RLScore 95

Stochastic Optimization with Optimal Importance Sampling

arXiv:2504.03560v2 Announce Type: replace-cross Abstract: Importance Sampling (IS) is a widely used variance reduction technique for enhancing the efficiency of Monte Carlo methods, particularly in rare-event simulation and related applications. Despite its effectiveness, the performance of IS is highly sensitive to the choice of the proposal distribution and often requires stochastic calibration. While the design and analysis of IS have been extensively studied in estimation settings, applying IS within stochastic optimization introduces a fundamental challenge: the decision variable and the importance sampling distribution are mutually dependent, creating a circular optimization structure. This interdependence complicates both convergence analysis and variance control. We consider convex stochastic optimization problems with linear constraints and propose a single-loop stochastic approximation algorithm, based on a joint variant of Nesterov's dual averaging, that jointly updates the decision variable and the importance sampling distribution, without time-scale separation or nested optimization. The method is globally convergent and achieves minimal asymptotic variance among stochastic gradient schemes, matching the performance of an oracle sampler adapted to the optimal solution.

Fonte: arXiv stat.ML

VisionScore 96

Detecção de Out-of-Distribution em Complexos Moleculares via Modelos de Difusão para Grafos Irregulares

Modelos preditivos de machine learning geralmente se destacam em dados in-distribution, mas seu desempenho se degrada em entradas out-of-distribution (OOD). Apresentamos um framework probabilístico de detecção OOD para dados complexos de grafos 3D, construído sobre um modelo de difusão que aprende a densidade da distribuição de treinamento de maneira totalmente não supervisionada.

Fonte: arXiv cs.LG

NLP/LLMsScore 95

Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital

arXiv:2512.18658v1 Announce Type: new Abstract: Before closing venture capital financing rounds, lawyers conduct diligence that includes tying out the capitalization table: verifying that every security (for example, shares, options, warrants) and issuance term (for example, vesting schedules, acceleration triggers, transfer restrictions) is supported by large sets of underlying legal documentation. While LLMs continue to improve on legal benchmarks, specialized legal workflows, such as capitalization tie-out, remain out of reach even for strong agentic systems. The task requires multi-document reasoning, strict evidence traceability, and deterministic outputs that current approaches fail to reliably deliver. We characterize capitalization tie-out as an instance of a real-world benchmark for legal AI, analyze and compare the performance of existing agentic systems, and propose a world model architecture toward tie-out automation-and more broadly as a foundation for applied legal intelligence.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts

arXiv:2512.18608v1 Announce Type: new Abstract: Automated masking of Personally Identifiable Information (PII) is critical for privacy-preserving conversational systems. While current frontier large language models demonstrate strong PII masking capabilities, concerns about data handling and computational costs motivate exploration of whether lightweight models can achieve comparable performance. We compare encoder-decoder and decoder-only architectures by fine-tuning T5-small and Mistral-Instruct-v0.3 on English datasets constructed from the AI4Privacy benchmark. We create different dataset variants to study label standardization and PII representation, covering 24 standardized PII categories and higher-granularity settings. Evaluation using entity-level and character-level metrics, type accuracy, and exact match shows that both lightweight models achieve performance comparable to frontier LLMs for PII masking tasks. Label normalization consistently improves performance across architectures. Mistral achieves higher F1 and recall with greater robustness across PII types but incurs significantly higher generation latency. T5, while less robust in conversational text, offers more controllable structured outputs and lower inference cost, motivating its use in a real-time Discord bot for real-world PII redaction. Evaluation on live messages reveals performance degradation under informal inputs. These results clarify trade-offs between accuracy, robustness, and computational efficiency, demonstrating that lightweight models can provide effective PII masking while addressing data handling concerns associated with frontier LLMs.

Fonte: arXiv cs.CL

RLScore 96

Inteligência Alinhada à Segurança Embutida via Embeddings de Alinhamento Interno Diferenciáveis

Apresentamos a Inteligência Alinhada à Segurança Embutida (ESAI), um framework teórico para aprendizado por reforço multi-agente que incorpora restrições de alinhamento diretamente nas representações internas dos agentes usando embeddings de alinhamento interno diferenciáveis. Este trabalho analisa condições de estabilidade e propriedades teóricas, posicionando o ESAI como uma contribuição conceitual para mecanismos de alinhamento diferenciáveis em sistemas multi-agente.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

Vibe Reasoning: Extraindo Capacidades Matemáticas de IA de Fronteira -- Um Estudo de Caso sobre o Problema 6 do IMO 2025

Apresentamos o Vibe Reasoning, um paradigma colaborativo entre humanos e IA para resolver problemas matemáticos complexos. Nossa principal percepção é que modelos de IA de fronteira já possuem o conhecimento necessário para resolver problemas desafiadores, mas não sabem como, o que ou quando aplicá-lo. Este trabalho demonstra essa abordagem através do Problema 6 do IMO 2025.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Compreendendo a Cadeia de Pensamento em Grandes Modelos de Linguagem via Análise de Dados Topológicos

Com o desenvolvimento de grandes modelos de linguagem (LLMs) e a introdução da técnica de cadeia de raciocínio longa, a capacidade de raciocínio dos LLMs em resolução de problemas complexos foi significativamente aprimorada. Este trabalho analisa a qualidade da cadeia de raciocínio a partir de uma perspectiva estrutural, utilizando homologia persistente da Análise de Dados Topológicos (TDA) para mapear etapas de raciocínio e extrair características topológicas.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

$eta(3,4)$ 'Atenção' em Agentes Cognitivos: Representações de Conhecimento Sem Ontologia com Semântica Teórica de Promessa

A semântica e a dinâmica da 'atenção' estão intimamente relacionadas a noções teóricas de promessa desenvolvidas para agentes autônomos e podem ser facilmente expressas em um framework de promessas. Isso permite estabelecer uma ponte entre Machine Learning vetorizado e representações de Knowledge Graph sem depender implicitamente de modelos de linguagem.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Recontextualização Mitiga o Jogo de Especificação sem Modificar a Especificação

Os desenvolvedores frequentemente enfrentam dificuldades para especificar rótulos de treinamento e recompensas corretas. Propomos a recontextualização, que reduz a frequência com que modelos de linguagem 'jogam' com sinais de treinamento, realizando comportamentos inadequados que esses sinais reforçam erroneamente.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Raciocínio Híbrido Aumentado por Ferramentas com Destilação para Resolução Bilingue de Problemas Matemáticos

A resolução bilíngue de problemas matemáticos requer uma ligação clara entre raciocínio linguístico e cálculo simbólico. Este artigo apresenta o HERALD (Hybrid Ensemble Reasoning with Adaptive Learning and Distillation), um framework que integra raciocínio e cálculo utilizando NuminaMath-7B-TIR, GPT-4o e Mistral-7B, oferecendo uma solução prática para raciocínio matemático multilíngue com melhor precisão e clareza.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

CORE: Reforço Orientado a Conceitos para Reduzir a Lacuna entre Definição e Aplicação em Raciocínio Matemático

Modelos de linguagem grandes (LLMs) frequentemente resolvem exercícios matemáticos desafiadores, mas falham em aplicar o conceito quando o problema exige compreensão genuína. Apresentamos o CORE (Reforço Orientado a Conceitos), um framework de treinamento de RL que transforma conceitos explícitos em um sinal de supervisão controlável.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Adaptação Automática à Complexidade de Conceitos e Conceitos Naturais Subjetivos: Um Modelo Cognitivo Baseado em Chunking

Um problema central na ciência cognitiva diz respeito aos processos psicológicos fundamentais que sustentam a formação e recuperação de múltiplos tipos de conceitos na memória de curto e longo prazo (STM e LTM, respectivamente). Propomos que os mecanismos de chunking desempenham um papel essencial e mostramos como o modelo computacional CogAct fundamenta o aprendizado de conceitos em processos e estruturas cognitivas fundamentais.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

IntelliCode: Um Sistema de Tutoria com LLM Multi-Agente e Modelagem Centralizada do Aprendiz

Os tutores baseados em LLM geralmente são assistentes de turno único que não possuem representações persistentes do conhecimento do aprendiz, dificultando o suporte pedagógico a longo prazo. Apresentamos o IntelliCode, um sistema de tutoria LLM multi-agente que integra estimativas de domínio, equívocos, cronogramas de revisão e sinais de engajamento em um estado de aprendiz centralizado e versionado.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Confiança Reflexiva: Corrigindo Falhas de Raciocínio via Auto-Correção Online

Modelos de linguagem grandes (LLMs) demonstraram forte desempenho em tarefas complexas de raciocínio utilizando técnicas como chain-of-thought e self-consistency. No entanto, abordagens baseadas em ensembles, especialmente self-consistency, frequentemente acarretam um overhead computacional substancial. Propomos a confiança reflexiva, um novo framework de raciocínio que transforma sinais de baixa confiança em gatilhos de reflexão.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Vox Deorum: Uma Arquitetura Híbrida de LLM para IA em Jogos 4X / de Grande Estratégia -- Lições de Civilization V

A capacidade dos Large Language Models de raciocinar em linguagem natural os torna promissores para jogos 4X e de grande estratégia, facilitando interações mais naturais entre humanos e IA. No entanto, a complexidade desses jogos e fatores como latência e custo podem dificultar a implementação real dos LLMs. Apresentamos Vox Deorum, uma arquitetura híbrida LLM+X, validada por meio de 2.327 jogos completos.

Fonte: arXiv cs.AI

VisionScore 96

Detecção de Drift de Saída Baseada em Agentes para Previsão de Resposta ao Câncer de Mama em um Sistema de Suporte à Decisão Clínica Multissite

Os sistemas modernos de suporte à decisão clínica podem atender simultaneamente várias instituições independentes de imagem médica, mas seu desempenho preditivo pode se degradar entre os sites devido a variações nas populações de pacientes, hardware de imagem e protocolos de aquisição. Propomos um framework baseado em agentes para detectar drift e avaliar sua gravidade em sistemas de IA clínica multissite.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Parceria Inteligente Homem-Máquina para Manufatura: Aprimorando o Planejamento de Armazém através de Grafos de Conhecimento Baseados em Simulação e Colaboração de LLM

Os planejadores de manufatura enfrentam desafios operacionais complexos que exigem colaboração entre a expertise humana e sistemas inteligentes. Nosso framework integra Grafos de Conhecimento e agentes baseados em Large Language Models (LLMs) para capacitar profissionais da manufatura, permitindo interações em linguagem natural com dados operacionais, melhorando a análise e a tomada de decisões.

Fonte: arXiv cs.AI

VisionScore 96

NOVA: Descobrindo Transformações Winograd Bem Condicionadas através da Otimização Numérica da Aritmética de Vandermonde

A convolução Winograd é o algoritmo padrão para inferência eficiente, reduzindo a complexidade aritmética em 2,25x para núcleos 3x3. No entanto, enfrenta uma barreira crítica na era moderna da computação de baixa precisão: a instabilidade numérica. Apresentamos o NOVA, um framework de descoberta que otimiza a seleção de pontos Winograd como um problema de otimização contínua.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

Propor, Resolver, Verificar: Auto-Jogo Através da Verificação Formal

O treinamento de modelos apenas através de auto-jogo (sem dados humanos) tem sido um objetivo de longa data em IA, mas sua eficácia para treinar grandes modelos de linguagem permanece incerta, especialmente na geração de código. Estudamos o auto-jogo no contexto de geração de código verificado, onde a verificação formal fornece sinais de correção confiáveis.

Fonte: arXiv cs.AI

RLScore 96

Explicações Fiéis e Estáveis de Neurônios para Interpretabilidade Mecanística Confiável

A identificação de neurônios é uma ferramenta popular na interpretabilidade mecanística, visando descobrir os conceitos interpretáveis por humanos representados por neurônios individuais em redes profundas. Embora algoritmos como Network Dissection e CLIP-Dissect tenham alcançado grande sucesso empírico, uma base teórica rigorosa ainda está ausente, o que é crucial para permitir explicações confiáveis. Este trabalho apresenta a primeira análise teórica de desafios fundamentais relacionados à fidelidade e estabilidade das explicações de neurônios.

Fonte: arXiv cs.AI

VisionScore 96

Mapas auto-organizáveis para avaliação da qualidade da água em reservatórios e lagos: Uma revisão sistemática da literatura

A qualidade da água sustentável é fundamental para o equilíbrio ecológico e a segurança hídrica. Esta revisão examina a aplicação do Self-Organizing Map (SOM), uma técnica de IA não supervisionada, na avaliação da qualidade da água, abordando seleção de parâmetros, estratégias de amostragem e abordagens de agrupamento.

Fonte: arXiv cs.LG

NLP/LLMsScore 95

MemEvolve: Meta-Evolution of Agent Memory Systems

arXiv:2512.18746v1 Announce Type: new Abstract: Self-evolving memory systems are unprecedentedly reshaping the evolutionary paradigm of large language model (LLM)-based agents. Prior work has predominantly relied on manually engineered memory architectures to store trajectories, distill experience, and synthesize reusable tools, enabling agents to evolve on the fly within environment interactions. However, this paradigm is fundamentally constrained by the staticity of the memory system itself: while memory facilitates agent-level evolving, the underlying memory architecture cannot be meta-adapted to diverse task contexts. To address this gap, we propose MemEvolve, a meta-evolutionary framework that jointly evolves agents' experiential knowledge and their memory architecture, allowing agent systems not only to accumulate experience but also to progressively refine how they learn from it. To ground MemEvolve in prior research and foster openness in future self-evolving systems, we introduce EvolveLab, a unified self-evolving memory codebase that distills twelve representative memory systems into a modular design space (encode, store, retrieve, manage), providing both a standardized implementation substrate and a fair experimental arena. Extensive evaluations on four challenging agentic benchmarks demonstrate that MemEvolve achieves (I) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to $17.06\%$; and (II) strong cross-task and cross-LLM generalization, designing memory architectures that transfer effectively across diverse benchmarks and backbone models.

Fonte: arXiv cs.CL

RLScore 96

Treinamento de Modelos de Raciocínio Multimodal Grandes Necessita de Melhores Ideias: Um Framework de Três Estágios para Síntese e Seleção de Longas Cadeias de Pensamento

Modelos de Raciocínio Grandes (LRMs) demonstraram desempenho notável em tarefas complexas de raciocínio por meio de longas Cadeias de Pensamento (CoT). A extensão desses sucessos para raciocínio multimodal é desafiadora devido à complexidade de integrar diversas modalidades de entrada e à escassez de dados de treinamento de alta qualidade. Neste artigo, propomos o SynSelect, um novo framework de Síntese-Seleção em três estágios para gerar dados longos de CoT de alta qualidade voltados para tarefas de raciocínio multimodal.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

LLM-CAS: Perturbação Dinâmica de Neurônios para Correção de Alucinações em Tempo Real

Modelos de linguagem grandes (LLMs) frequentemente geram conteúdo alucinado que carece de fundamentação factual ou contextual, limitando sua confiabilidade em aplicações críticas. O LLM-CAS propõe uma abordagem de correção de alucinações em tempo real como um problema de aprendizado por reforço hierárquico, permitindo correções adaptativas sem modificação permanente de parâmetros.

Fonte: arXiv cs.CL

NLP/LLMsScore 92

Graph-O1 : Monte Carlo Tree Search with Reinforcement Learning for Text-Attributed Graph Reasoning

arXiv:2512.17912v1 Announce Type: new Abstract: ChatGPT said: Text-attributed graphs, where nodes and edges contain rich textual information, are widely used across diverse domains. A central challenge in this setting is question answering, which requires jointly leveraging unstructured text and the structured relational signals within the graph. Although Large Language Models (LLMs) have made significant advances in natural language understanding, their direct use for reasoning over text-attributed graphs remains limited. Retrieval-augmented generation methods that operate purely on text often treat passages as isolated units, ignoring the interconnected structure of the graph. Conversely, graph-based RAG methods that serialize large subgraphs into long textual sequences quickly become infeasible due to LLM context-length constraints, resulting in fragmented reasoning and degraded accuracy. To overcome these limitations, we introduce Graph-O1, an agentic GraphRAG framework that enables LLMs to conduct stepwise, interactive reasoning over graphs. Our approach integrates Monte Carlo Tree Search (MCTS) with end-to-end reinforcement learning, allowing the model to selectively explore and retrieve only the most informative subgraph components. The reasoning procedure is framed as a multi-turn interaction between the agent and the graph environment, and the agent is trained through a unified reward mechanism. Extensive experiments across multiple LLM backbones demonstrate that Graph-O1 consistently surpasses state-of-the-art baselines, producing answers that are more accurate, reliable, and interpretable.

Fonte: arXiv cs.CL

RLScore 96

Unificando Aprendizado por Reforço Causal: Revisão, Taxonomia, Algoritmos e Aplicações

Integrar inferência causal (CI) com aprendizado por reforço (RL) se tornou um paradigma poderoso para abordar limitações críticas no RL clássico, como baixa explicabilidade e falta de robustez. Este trabalho revisa avanços recentes na interseção entre CI e RL, categorizando abordagens existentes e discutindo desafios, sucessos empíricos e direções futuras de pesquisa.

Fonte: arXiv cs.AI

VisionScore 96

Grad: Geração de Difusão de Relações Guiadas para Aumento de Grafos na Detecção de Fraude em Grafos

Atualmente, a Detecção de Fraude em Grafos (GFD) em cenários financeiros tornou-se um tópico de pesquisa urgente para proteger a segurança de pagamentos online. Com a evolução das estratégias de camuflagem dos fraudadores, propomos o modelo Grad, que utiliza um módulo de aprendizado contrastivo supervisionado para melhorar a diferença entre fraudes e usuários benignos, gerando relações homofílicas auxiliares.

Fonte: arXiv cs.LG

RLScore 96

Monitoramento da Monitorabilidade

A observabilidade na tomada de decisões de sistemas de IA modernos pode ser necessária para implantar com segurança agentes cada vez mais capazes. Monitorar a cadeia de pensamento (CoT) dos modelos de raciocínio atuais tem se mostrado eficaz na detecção de comportamentos inadequados. No entanto, essa 'monitorabilidade' pode ser frágil sob diferentes procedimentos de treinamento e fontes de dados.

Fonte: arXiv cs.AI

RLScore 93

A Geometria da Abstração: Aprendizado Contínuo via Quotienting Recursivo

Sistemas de aprendizado contínuo que operam em espaços de dimensão fixa enfrentam uma barreira geométrica fundamental: o problema do manifold plano. Neste trabalho, propomos uma resolução geométrica para esse paradoxo com base na Contração Métrica Recursiva, formalizando a abstração como uma deformação topológica.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

Uma Rede Híbrida Indutiva-Transdutiva para Imputação de Fluxo de Tráfego em Locais Não Amostrados

Imputar com precisão o fluxo de tráfego em locais não sensorizados é desafiador. Propomos a HINT, uma Rede Híbrida Indutiva-Transdutiva, que utiliza uma estratégia de treinamento INDU-TRANSDUTIVA para tratar a velocidade como um sinal transdutivo, enquanto aprende o fluxo indutivamente. HINT supera consistentemente as linhas de base indutivas em três conjuntos de dados do mundo real.

Fonte: arXiv cs.LG

NLP/LLMsScore 95

A Large Language Model Based Method for Complex Logical Reasoning over Knowledge Graphs

arXiv:2512.19092v1 Announce Type: new Abstract: Reasoning over knowledge graphs (KGs) with first-order logic (FOL) queries is challenging due to the inherent incompleteness of real-world KGs and the compositional complexity of logical query structures. Most existing methods rely on embedding entities and relations into continuous geometric spaces and answer queries via differentiable set operations. While effective for simple query patterns, these approaches often struggle to generalize to complex queries involving multiple operators, deeper reasoning chains, or heterogeneous KG schemas. We propose ROG (Reasoning Over knowledge Graphs with large language models), an ensemble-style framework that combines query-aware KG neighborhood retrieval with large language model (LLM)-based chain-of-thought reasoning. ROG decomposes complex FOL queries into sequences of simpler sub-queries, retrieves compact, query-relevant subgraphs as contextual evidence, and performs step-by-step logical inference using an LLM, avoiding the need for task-specific embedding optimization. Experiments on standard KG reasoning benchmarks demonstrate that ROG consistently outperforms strong embedding-based baselines in terms of mean reciprocal rank (MRR), with particularly notable gains on high-complexity query types. These results suggest that integrating structured KG retrieval with LLM-driven logical reasoning offers a robust and effective alternative for complex KG reasoning tasks.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations

arXiv:2512.18906v1 Announce Type: new Abstract: Over the years, automatic MT metrics have hillclimbed benchmarks and presented strong and sometimes human-level agreement with human ratings. Yet they remain black-box, offering little insight into their decision-making and often failing under real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R's evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R's reasoning captures translation-relevant information and is practically useful.

Fonte: arXiv cs.CL

RLScore 95

From Natural Language to Control Signals: A Conceptual Framework for Semantic Channel Finding in Complex Experimental Infrastructure

arXiv:2512.18779v1 Announce Type: new Abstract: Modern experimental platforms such as particle accelerators, fusion devices, telescopes, and industrial process control systems expose tens to hundreds of thousands of control and diagnostic channels accumulated over decades of evolution. Operators and AI systems rely on informal expert knowledge, inconsistent naming conventions, and fragmented documentation to locate signals for monitoring, troubleshooting, and automated control, creating a persistent bottleneck for reliability, scalability, and language-model-driven interfaces. We formalize semantic channel finding-mapping natural-language intent to concrete control-system signals-as a general problem in complex experimental infrastructure, and introduce a four-paradigm framework to guide architecture selection across facility-specific data regimes. The paradigms span (i) direct in-context lookup over curated channel dictionaries, (ii) constrained hierarchical navigation through structured trees, (iii) interactive agent exploration using iterative reasoning and tool-based database queries, and (iv) ontology-grounded semantic search that decouples channel meaning from facility-specific naming conventions. We demonstrate each paradigm through proof-of-concept implementations at four operational facilities spanning two orders of magnitude in scale-from compact free-electron lasers to large synchrotron light sources-and diverse control-system architectures, from clean hierarchies to legacy environments. These implementations achieve 90-97% accuracy on expert-curated operational queries.

Fonte: arXiv cs.CL

NLP/LLMsScore 96

HARBOR: Modelo Holístico e Adaptativo de Avaliação de Risco para Cuidados de Saúde Comportamental

A avaliação de risco em cuidados de saúde comportamental continua sendo um desafio devido à natureza multimodal dos dados dos pacientes e à dinâmica temporal dos transtornos de humor e afetivos. Neste trabalho, apresentamos o HARBOR, um modelo de linguagem consciente da saúde comportamental projetado para prever um escore de humor e risco discreto, denominado Harbor Risk Score (HRS).

Fonte: arXiv cs.AI

RLScore 92

Algoritmo Garantido de Regret a Qualquer Momento para Controle de Sistemas Lineares Quadráticos

Propomos um algoritmo computacionalmente eficiente que alcança um regret a qualquer momento de ordem $ extmath{O}( extmath{ extsqrt{t}})$, com dependência explícita nas dimensões do sistema e na solução da Equação de Riccati Algébrica Discreta (DARE). Nossa abordagem utiliza uma regularização adequadamente ajustada e uma estimativa inicial suficientemente precisa para construir elipsóides de confiança para o design de controle.

Fonte: arXiv stat.ML

RLScore 93

Rumo ao Descida Guiada: Algoritmos de Otimização para Treinamento de Redes Neurais em Larga Escala

A otimização de redes neurais continua sendo um dos desafios mais significativos e mal compreendidos na pesquisa de IA moderna. Melhorias em algoritmos de treinamento podem levar a um aprendizado de características aprimorado em modelos fundamentais, reduções significativas no tempo de treinamento e uma melhor interpretabilidade de como as redes aprendem. Esta tese investiga a evolução dos algoritmos de otimização, revelando como um design algorítmico fundamentado pode desmistificar o processo de treinamento.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

DeliveryBench: Agentes Podem Lucrar no Mundo Real?

arXiv:2512.19234v1 Tipo de Anúncio: novo. Resumo: LLMs e VLMs estão sendo cada vez mais utilizados como agentes incorporados, mas os benchmarks existentes se concentram em tarefas simples de curto prazo e têm dificuldade em capturar as ricas restrições realistas que moldam a tomada de decisão no mundo real. Para fechar essa lacuna, propomos o DeliveryBench, um benchmark incorporado em escala de cidade baseado na profissão real de entrega de alimentos.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

Evaluating the Challenges of LLMs in Real-world Medical Follow-up: A Comparative Study and An Optimized Framework

arXiv:2512.18999v1 Announce Type: new Abstract: When applied directly in an end-to-end manner to medical follow-up tasks, Large Language Models (LLMs) often suffer from uncontrolled dialog flow and inaccurate information extraction due to the complexity of follow-up forms. To address this limitation, we designed and compared two follow-up chatbot systems: an end-to-end LLM-based system (control group) and a modular pipeline with structured process control (experimental group). Experimental results show that while the end-to-end approach frequently fails on lengthy and complex forms, our modular method-built on task decomposition, semantic clustering, and flow management-substantially improves dialog stability and extraction accuracy. Moreover, it reduces the number of dialogue turns by 46.73% and lowers token consumption by 80% to 87.5%. These findings highlight the necessity of integrating external control mechanisms when deploying LLMs in high-stakes medical follow-up scenarios.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset

arXiv:2512.17915v1 Announce Type: new Abstract: The recently published Loquacious dataset aims to be a replacement for established English automatic speech recognition (ASR) datasets such as LibriSpeech or TED-Lium. The main goal of the Loquacious dataset is to provide properly defined training and test partitions across many acoustic and language domains, with an open license suitable for both academia and industry. To further promote the benchmarking and usability of this new dataset, we present additional resources in the form of n-gram language models (LMs), a grapheme-to-phoneme (G2P) model and pronunciation lexica, with open and public access. Utilizing those additional resources we show experimental results across a wide range of ASR architectures with different label units and topologies. Our initial experimental results indicate that the Loquacious dataset offers a valuable study case for a variety of common challenges in ASR.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models

arXiv:2512.19004v1 Announce Type: new Abstract: Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization. We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35\% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression

arXiv:2512.17920v1 Announce Type: new Abstract: Large language models (LLMs) exhibit degraded performance under prompt compression, but the mechanisms remain poorly understood. We introduce the Compression-Decay Comprehension Test (CDCT), a benchmark that independently measures constraint compliance (CC) and semantic accuracy (SA) across compression levels. We evaluate 9 frontier LLMs across 8 concepts using 5 compression levels from extreme (c=0.0, ~2 words) to none (c=1.0, ~135 words). A three-judge LLM jury achieves almost perfect inter-rater agreement on CC (Fleiss' \k{appa}=0.90). We observe a universal U-curve pattern in constraint compliance (97.2% prevalence), with violations peaking at medium compression (c=0.5, ~27 words). Counterintuitively, models perform better at extreme compression than medium lengths. The dimensions are statistically orthogonal (r=0.193, p=0.084), with constraint effects 2.9x larger than semantic effects. Experimental validation via RLHF ablation confirms our constraint salience hypothesis: removing "helpfulness" signals improves CC by 598% on average (71/72 trials, p<0.001), with 79% achieving perfect compliance. This demonstrates that RLHF-trained helpfulness behaviors are the dominant cause of constraint violations at medium compression. Reasoning models outperform efficient models by 27.5% (Cohen's d=0.96). Our findings reveal a fundamental tension between RLHF alignment and instruction-following, providing actionable guidelines for improving deployed systems.

Fonte: arXiv cs.CL

RLScore 95

Seleção de Recursos Não Supervisionada via Autoencoder Robusto e Aprendizado Adaptativo de Grafo

A seleção eficaz de recursos é essencial para a análise de dados de alta dimensão e machine learning. A seleção de recursos não supervisionada (UFS) visa agrupar dados e identificar as características mais discriminativas. Propomos o modelo Robust Autoencoder-based Unsupervised Feature Selection (RAEUFS), que utiliza um autoencoder profundo para aprender representações de recursos não lineares, melhorando a robustez contra outliers.

Fonte: arXiv stat.ML

NLP/LLMsScore 96

Geração de Regras Programáticas para Detecção de Falsificação de Documentos Usando Modelos de Linguagem de Grande Escala

A falsificação de documentos representa uma ameaça crescente a processos legais, econômicos e governamentais, exigindo mecanismos de verificação cada vez mais sofisticados. Este trabalho investiga como modelos de linguagem de grande escala (LLMs) podem ser adaptados para gerar verificações de plausibilidade baseadas em regras para detecção de falsificações, utilizando recursos de hardware limitados.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Grandes Modelos de Linguagem como Filtros Bayesianos Descontados

Os Grandes Modelos de Linguagem (LLMs) demonstram forte generalização em poucos exemplos através do aprendizado em contexto, mas seu raciocínio em ambientes dinâmicos e estocásticos permanece opaco. Introduzimos um framework de filtragem bayesiana para avaliar a inferência online em LLMs, revelando como as atualizações de crença se comportam como filtros de esquecimento exponencial.

Fonte: arXiv cs.AI

RLScore 93

ORPR: Um Modelo de Aprendizado Guiado por OR para Gestão de Estoque com Pré-treinamento e Reforço

À medida que a busca por sinergia entre Inteligência Artificial (AI) e Pesquisa Operacional (OR) avança na gestão de sistemas de estoque complexos, um desafio crítico permanece: como reconciliar efetivamente a percepção adaptativa da AI com o rigor estrutural da OR. Propomos um novo framework 'Pré-treinamento e Reforço' guiado por OR.

Fonte: arXiv cs.AI

RLScore 96

Podemos Testar Teorias da Consciência em IA? Ablations, Marcadores e Robustez

A busca por indicadores confiáveis de consciência se fragmentou em campos teóricos concorrentes (Global Workspace Theory (GWT), Integrated Information Theory (IIT) e Higher-Order Theories (HOT)), cada um propondo assinaturas neurais distintas. Adotamos uma abordagem de neuro-fenomenologia sintética, construindo agentes artificiais para testar as consequências funcionais dessas teorias através de ablações arquitetônicas precisas. Relatamos dissociações que sugerem que essas teorias descrevem camadas funcionais complementares.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

FC-MIR: Um Framework de Consciência de Tela Móvel para Recomendação Consciente de Intenção Baseada em Raciocínio Multimodal de Trajetória Comprimida por Frame

Identificar a intenção do usuário a partir de trajetórias de operação da interface móvel é crucial para avançar na compreensão da UI e habilitar agentes de automação de tarefas. Propomos o framework FC-MIR, que utiliza amostragem de keyframes e concatenação adaptativa para reduzir a redundância visual e aumentar a eficiência da inferência, integrando MLLMs de última geração para sumarização de trajetórias e previsão de intenção.

Fonte: arXiv cs.AI

VisionScore 96

Rede Bayesiana Multimodal para Avaliação Robusta de Vítimas em Triagem Autônoma

Incidentes de Múltiplas Vítimas podem sobrecarregar sistemas médicos de emergência, e atrasos ou erros na avaliação das vítimas podem resultar em mortes evitáveis. Apresentamos um framework de suporte à decisão que combina saídas de múltiplos modelos de computer vision, estimando sinais de hemorragia severa, dificuldade respiratória, alerta físico ou trauma visível, em uma rede Bayesiana construída inteiramente a partir de regras definidas por especialistas.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

ESearch-R1: Aprendendo Agentes MLLM Conscientes de Custo para Busca Embodida Interativa via Aprendizado por Reforço

Modelos de Linguagem de Grande Escala Multimodal (MLLMs) capacitaram agentes embodidos com habilidades notáveis em planejamento e raciocínio. No entanto, ao enfrentar instruções ambíguas em linguagem natural, os agentes atuais frequentemente falham em equilibrar o alto custo da exploração física com o custo cognitivo da interação humana. Para preencher essa lacuna, propomos o ESearch-R1, um framework de raciocínio embodido consciente de custo.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

NEURO-GUARD: Generalização Neuro-Simbólica e Roteamento Adaptativo Imparcial para Diagnósticos -- IA Médica Explicável

O diagnóstico baseado em imagem, preciso e interpretável, continua sendo um desafio central na IA médica, especialmente em ambientes com dados limitados e decisões clínicas críticas. Apresentamos o NEURO-GUARD, um novo framework guiado por conhecimento que integra Vision Transformers (ViTs) com raciocínio orientado por linguagem, melhorando desempenho e robustez em diferentes domínios.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

ASTIF: Integração Semântica-Temporal Adaptativa para Previsão de Preços de Criptomoedas

A previsão de séries temporais financeiras é um desafio de fusão de informações, mas a maioria dos modelos existentes depende de arquiteturas estáticas que têm dificuldade em integrar fontes de conhecimento heterogêneas. Propomos o ASTIF, um sistema híbrido inteligente que adapta sua estratégia de previsão em tempo real através de meta-aprendizado baseado em confiança, integrando componentes complementares para melhorar a precisão das previsões.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Repensando a Inteligência Multi-Agente Através da Lente de Redes de Pequeno Mundo

Modelos de linguagem grandes (LLMs) possibilitaram sistemas multi-agente (MAS) onde múltiplos agentes argumentam, criticam e coordenam para resolver tarefas complexas, tornando a topologia de comunicação uma escolha de design fundamental. Neste trabalho, revisitamos a teoria clássica sobre redes de pequeno mundo (SW) e investigamos como a conectividade SW pode ser utilizada como um princípio de design para MAS.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Sophia: Um Framework de Agente Persistente de Vida Artificial

O desenvolvimento de LLMs elevou os agentes de IA de ferramentas específicas para entidades de tomada de decisão de longa duração. No entanto, a maioria das arquiteturas permanece estática e reativa, limitadas a cenários definidos manualmente. Propomos um terceiro estrato, o Sistema 3, que supervisiona a identidade narrativa do agente e a adaptação a longo prazo, culminando em Sophia, um wrapper de 'Agente Persistente' que integra um ciclo contínuo de autoaperfeiçoamento.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

ChronoDreamer: Modelo de Mundo Condicionado por Ação como um Simulador Online para Planejamento Robótico

Apresentamos o ChronoDreamer, um modelo de mundo condicionado por ação para manipulação robótica rica em contatos. Com base em um histórico de quadros RGB egocêntricos, mapas de contato, ações e estados das juntas, o ChronoDreamer prevê quadros de vídeo futuros, distribuições de contato e ângulos das juntas através de um transformer espaço-temporal treinado com previsão mascarada no estilo MaskGIT.

Fonte: arXiv cs.AI

RLScore 96

FairExpand: Justiça Individual em Grafos com Informações de Similaridade Parcial

A justiça individual, que exige que indivíduos semelhantes sejam tratados de forma semelhante por sistemas algorítmicos, é um princípio central em machine learning justo. Este trabalho apresenta o FairExpand, um framework flexível que promove a justiça individual em cenários de informações parciais, superando a limitação de métodos existentes que requerem informações de similaridade pré-definidas para todos os pares de nós.

Fonte: arXiv cs.LG

NLP/LLMsScore 95

Rumo a Agentes Eficientes: Um Co-Design da Arquitetura de Inferência e do Sistema

O rápido desenvolvimento de agentes baseados em large language models (LLMs) abriu novas possibilidades para raciocínio autônomo em múltiplas interações e tomada de decisão com ferramentas. No entanto, sua implementação no mundo real é dificultada por ineficiências severas que surgem não da inferência isolada do modelo, mas da latência sistêmica acumulada ao longo dos ciclos de raciocínio, crescimento de contexto e interações heterogêneas com ferramentas.

Fonte: arXiv cs.CL

NLP/LLMsScore 96

MoE Pathfinder: Poda de Especialistas Orientada por Trajetória

As arquiteturas de Mixture-of-experts (MoE) em grandes modelos de linguagem (LLMs) alcançam desempenho de ponta em diversas tarefas, mas enfrentam desafios práticos como complexidade de implantação e baixa eficiência de ativação. A poda de especialistas surgiu como uma solução promissora para reduzir a sobrecarga computacional e simplificar a implantação de modelos MoE.

Fonte: arXiv cs.LG

RLScore 92

Aprendizado de atratores para sistemas dinâmicos caóticos espaciotemporais usando redes de estado de eco com aprendizado por transferência

Neste artigo, exploramos as capacidades preditivas das redes de estado de eco (ESNs) para a equação de Kuramoto-Sivashinsky generalizada (gKS), uma PDE não linear arquetípica que exibe caos espaciotemporal. Nossa pesquisa foca na previsão de mudanças em padrões estatísticos de longo prazo do modelo gKS resultantes da variação da relação de dispersão ou do comprimento do domínio espacial.

Fonte: arXiv stat.ML

NLP/LLMsScore 96

RL de Lançamento Único Estável e Eficiente para Raciocínio Multimodal

O Reinforcement Learning com Recompensas Verificáveis (RLVR) se tornou um paradigma chave para melhorar as capacidades de raciocínio de Modelos de Linguagem Grande Multimodal (MLLMs). Introduzimos o $ extbf{MSSR}$ (Multimodal Stabilized Single-Rollout), um framework RLVR sem grupos que alcança otimização estável e desempenho eficaz em raciocínio multimodal, utilizando um mecanismo de modelagem de vantagem baseado em entropia.

Fonte: arXiv cs.LG

RLScore 93

Quando a Aprendizagem Renormaliza? Condições Suficientes para Dinâmicas Espectrais em Lei de Potência

A escalabilidade empírica em lei de potência tem sido amplamente observada em sistemas modernos de deep learning, mas suas origens teóricas e escopo de validade permanecem incompletamente compreendidos. O framework Generalized Resolution-Shell Dynamics (GRSD) modela a aprendizagem como transporte de energia espectral através de camadas de resolução logarítmica, oferecendo uma descrição dinâmica da formação.

Fonte: arXiv cs.LG

RLScore 96

Aprendizado Profundo de Reforço Confiável e Explicável para Controle de Processos Seguro e Eficiente em Energia: Um Caso de Uso em Sistemas Industriais de Ar Comprimido

Este artigo apresenta uma abordagem confiável de aprendizado por reforço para o controle de sistemas industriais de ar comprimido. Desenvolvemos um framework que possibilita operação segura e eficiente em energia sob condições de contorno realistas e introduzimos um pipeline de explicabilidade em múltiplos níveis. Uma avaliação empírica mostra que a política aprendida é fisicamente plausível e respeita consistentemente os limites do sistema.

Fonte: arXiv cs.LG

RLScore 92

Por que a Maioria dos Algoritmos de Bandit de Otimismo Tem a Mesma Análise de Arrependimento: Um Teorema Unificador Simples

Vários algoritmos de bandit estocásticos baseados em otimismo -- incluindo UCB, UCB-V, linear UCB e GP-UCB de braço finito -- alcançam arrependimento logarítmico usando provas que, apesar de diferenças superficiais, seguem essencialmente a mesma estrutura. Este artigo isola os ingredientes mínimos por trás dessas análises.

Fonte: arXiv cs.LG

NLP/LLMsScore 95

Sobre a Universalidade das Arquiteturas Transformer: Quanto de Atenção é Suficiente?

Transformers são cruciais em diversos campos da IA, como modelos de linguagem de grande escala, visão computacional e aprendizado por reforço. Este trabalho examina a universalidade nos Transformers, revisa avanços recentes e identifica direções-chave para futuras pesquisas teóricas.

Fonte: arXiv cs.LG

NLP/LLMsScore 95

FASTRIC: Linguagem de Especificação de Prompt para Interações LLM Verificáveis

Modelos de Linguagem de Grande Escala (LLMs) executam protocolos complexos de interação, mas carecem de especificações formais para verificar a execução em relação à intenção do designer. Apresentamos o FASTRIC, uma Linguagem de Especificação de Prompt que torna Máquinas de Estados Finitos (FSMs) implícitas explícitas em prompts de linguagem natural, permitindo a verificação de conformidade por meio da análise de rastreamento de execução.

Fonte: arXiv cs.CL

NLP/LLMsScore 96

Rumo à Avaliação de Vulnerabilidades de Privacidade no Esquecimento Seletivo com Modelos de Linguagem de Grande Escala

Os avanços rápidos em inteligência artificial (IA) têm se concentrado no aprendizado a partir de dados para desenvolver sistemas de aprendizado informados. Com a implementação desses sistemas em áreas críticas, garantir sua privacidade e alinhamento com valores humanos é essencial. O esquecimento seletivo, ou machine unlearning, surge como uma abordagem promissora, mas também levanta preocupações significativas de privacidade, especialmente em domínios sensíveis.

Fonte: arXiv cs.LG

RLScore 95

AraMix: Recycling, Refiltering, and Deduplicating to Deliver the Largest Arabic Pretraining Corpus

arXiv:2512.18834v1 Announce Type: new Abstract: We present AraMix, a deduplicated Arabic pretraining corpus containing approximately 178 billion tokens across 179 million documents. Rather than scraping the web again, AraMix demonstrates that substantial value lies in systematically reusing and curating existing pretraining datasets: we combine seven publicly available Arabic web datasets, apply quality filtering designed specifically for Arabic text to re-filter some datasets, and perform cross-dataset deduplication, both MinHash and sentence-level. This approach reveals that nearly 60% of tokens across these independently collected corpora are duplicates, redundancy that any new scraping efforts will reproduce. Our work suggests that for lower resource languages, investment in curation pipelines for existing data yields greater returns than additional web crawls, an approach that allowed us to curate the largest heavily filtered publicly available Arabic pretraining corpus.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Rumo a Testes de Independência Condicional Escaláveis e Válidos com Representações Espectrais

A independência condicional (CI) é central para inferência causal, seleção de características e modelagem gráfica, mas é muitas vezes impossível de testar sem suposições adicionais. Testes existentes de CI dependem de condições estruturais restritivas, limitando sua validade em dados do mundo real. Este trabalho explora se o aprendizado de representações pode ajudar a superar essas limitações.

Fonte: arXiv stat.ML

NLP/LLMsScore 95

De Palavra a Mundo: Podem Modelos de Linguagem Grande Servir como Modelos de Mundo Baseados em Texto Implicitamente?

O aprendizado por reforço agente depende cada vez mais de escalabilidade orientada pela experiência, mas ambientes do mundo real continuam sendo não adaptativos e difíceis de escalar. Este estudo investiga se modelos de linguagem grande (LLMs) podem melhorar a eficiência do aprendizado em ambientes baseados em texto, apresentando um framework de três níveis para avaliação de modelos de mundo baseados em LLMs.

Fonte: arXiv cs.CL

RLScore 96

APC-GNN++: Uma GNN Adaptativa Centrada no Paciente com Atenção Consciente do Contexto e Explicabilidade de Mini-Grafos para Classificação de Diabetes

Propomos o APC-GNN++, uma Rede Neural Gráfica centrada no paciente para classificação de diabetes. Nosso modelo integra atenção de arestas consciente do contexto, mistura guiada por confiança de características de nós e representações gráficas, e regularização de consistência de vizinhança para capturar melhor relações clinicamente significativas entre pacientes.

Fonte: arXiv cs.LG

RLScore 95

Método de Direção Alternada de Multiplicadores para Decomposições de Matrizes Não Lineares

Apresentamos um algoritmo baseado no método de direção alternada de multiplicadores (ADMM) para resolver decomposições de matrizes não lineares (NMD). Dada uma matriz de entrada $X \in \mathbb{R}^{m \times n}$ e um rank de fatoração $r \ll \min(m, n)$, NMD busca matrizes $W \in \mathbb{R}^{m \times r}$ e $H \in \mathbb{R}^{r \times n}$ de forma que $X \approx f(WH)$, onde $f$ é uma função não linear elemento a elemento.

Fonte: arXiv stat.ML

RLScore 93

O Design de IA Humanóide Aumenta o Antropomorfismo, mas Gera Resultados Divergentes em Engajamento e Confiança Globalmente

arXiv:2512.17898v1 Tipo de Anúncio: novo Resumo: Mais de um bilhão de usuários em todo o mundo interagem com sistemas de IA projetados com uma sofisticação crescente para imitar características humanas. Essa mudança desencadeou um debate urgente sobre o Antropomorfismo, a atribuição de características humanas a agentes sintéticos, e seu potencial para induzir confiança mal colocada ou dependência emocional. No entanto, a relação causal entre um design de IA mais humanóide e os efeitos subsequentes sobre engajamento e confiança não foi testada em interações realistas entre humanos e IA com um pool de usuários global. As estruturas de segurança predominantes continuam a depender de suposições teóricas derivadas de populações ocidentais, negligenciando a diversidade global de usuários de IA. Aqui, abordamos essas lacunas por meio de dois experimentos transnacionais em larga escala (N=3.500) em 10 nações diversas, envolvendo interações em tempo real e abertas com um sistema de IA. Descobrimos que, ao avaliar a humanidade de uma IA, os usuários se concentram menos nos aspectos teóricos frequentemente citados em políticas (por exemplo, sentiência ou consciência), mas sim em pistas interacionais aplicadas, como o fluxo da conversa ou a compreensão da perspectiva do usuário. Também demonstramos experimentalmente que alavancas de design humanóide podem aumentar causalmente o antropomorfismo entre os usuários; no entanto, não encontramos que o design humanóide aumente universalmente as medidas comportamentais de engajamento e confiança do usuário, como sugere o trabalho teórico anterior. Em vez disso, parte da conexão entre humanidade e resultados comportamentais é fragmentada pela cultura: escolhas de design específicas que promovem a confiança autorrelatada em sistemas de IA em algumas populações (por exemplo, Brasil) podem desencadear o resultado oposto em outras (por exemplo, Japão). Nossas descobertas desafiam narrativas predominantes de risco inerente no design de IA humanóide. Em vez disso, identificamos um cenário sutil e mediado culturalmente de interação humano-IA, que exige que avancemos além de uma abordagem única para todos na governança de IA.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

A percepção realista de ameaças impulsiona o conflito entre grupos: uma análise causal e dinâmica usando simulações de agentes generativos

arXiv:2512.17066v1 Tipo de Anúncio: novo Resumo: O conflito humano é frequentemente atribuído a ameaças contra condições materiais e valores simbólicos, mas ainda não está claro como eles interagem e qual deles domina. O progresso é limitado pelo controle causal fraco, restrições éticas e escassez de dados temporais. Abordamos essas barreiras usando simulações de agentes impulsionados por modelos de linguagem de grande escala (LLM) em sociedades virtuais, variando independentemente a ameaça realista e simbólica enquanto rastreamos ações, linguagem e atitudes. Análises representacionais mostram que o LLM subjacente codifica a ameaça realista, a ameaça simbólica e a hostilidade como estados internos distintos, que nossas manipulações mapeiam sobre eles, e que direcionar esses estados muda causalmente o comportamento. Nossas simulações fornecem uma explicação causal do conflito impulsionado por ameaças ao longo do tempo: a ameaça realista aumenta diretamente a hostilidade, enquanto os efeitos da ameaça simbólica são mais fracos, totalmente mediados pelo viés de grupo, e aumentam a hostilidade apenas quando a ameaça realista está ausente. O contato intergrupal não hostil amortiza a escalada, e as assimetrias estruturais concentram a hostilidade entre grupos majoritários.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

MMRAG-RFT: Ajuste Fino de Reforço em Duas Etapas para Geração Aumentada por Recuperação Multi-modal Explicável

arXiv:2512.17194v1 Tipo de Anúncio: novo Resumo: A Geração Aumentada por Recuperação Multi-modal (MMRAG) possibilita uma geração altamente credível ao integrar conhecimento externo multi-modal, demonstrando assim um desempenho impressionante em cenários complexos multi-modais. No entanto, os métodos MMRAG existentes falham em esclarecer a lógica de raciocínio por trás da recuperação e geração de respostas, o que limita a explicabilidade dos resultados. Para abordar essa lacuna, propomos introduzir o aprendizado por reforço na geração aumentada por recuperação multi-modal, aprimorando as capacidades de raciocínio de modelos de linguagem grandes multi-modais por meio de uma estrutura de ajuste fino de reforço em duas etapas para alcançar uma geração aumentada por recuperação multi-modal explicável. Especificamente, na primeira etapa, o ajuste fino de reforço baseado em regras é empregado para realizar uma classificação ponto a ponto de granulação grossa de documentos multi-modais, filtrando efetivamente aqueles que são significativamente irrelevantes. Na segunda etapa, o ajuste fino de reforço baseado em raciocínio é utilizado para otimizar conjuntamente a classificação lista a lista de granulação fina e a geração de respostas, orientando modelos de linguagem grandes multi-modais a produzir uma lógica de raciocínio explicável no processo MMRAG. Nosso método alcança resultados de ponta no WebQA e MultimodalQA, dois conjuntos de dados de referência para geração aumentada por recuperação multi-modal, e sua eficácia é validada por meio de experimentos de ablação abrangentes.

Fonte: arXiv cs.AI

VisionScore 96

Dialética para Inteligência Artificial

arXiv:2512.17373v1 Tipo de Anúncio: novo Resumo: A inteligência artificial pode descobrir, a partir de experiências brutas e sem supervisão humana, conceitos que os humanos já descobriram? Um desafio é que os próprios conceitos humanos são fluidos: as fronteiras conceituais podem mudar, se dividir e se fundir à medida que a investigação avança (por exemplo, Plutão não é mais considerado um planeta). Para avançar, precisamos de uma definição de 'conceito' que não seja meramente um rótulo de dicionário, mas uma estrutura que possa ser revisada, comparada e alinhada entre agentes. Propomos um ponto de vista de informação algorítmica que trata um conceito como um objeto de informação definido apenas através de sua relação estrutural com a experiência total de um agente. A restrição central é a determinação: um conjunto de partes forma uma relação de consistência reversível se qualquer parte ausente puder ser recuperada a partir das outras (até o padrão de folga logarítmica em identidades no estilo Kolmogorov). Essa reversibilidade impede que 'conceitos' flutuem livres da experiência e transforma a existência de conceitos em uma afirmação estrutural verificável. Para julgar se uma decomposição é natural, definimos informação excessiva, medindo a sobrecarga de redundância introduzida pela divisão da experiência em várias partes descritas separadamente. Com base nessas definições, formulamos a dialética como uma dinâmica de otimização: à medida que novos fragmentos de informação aparecem (ou se tornam contestados), conceitos concorrentes competem para explicá-los por meio de descrições condicionais mais curtas, impulsionando a expansão, contração, divisão e fusão sistemáticas. Finalmente, formalizamos a transmissão de conceitos de baixo custo e o alinhamento multi-agente usando pequenos fundamentos/sementes que permitem que outro agente reconstrua o mesmo conceito sob um protocolo compartilhado, tornando a comunicação uma troca concreta de bits computacionais.

Fonte: arXiv cs.AI

RLScore 92

Text-Conditioned Background Generation for Editable Multi-Layer Documents

arXiv:2512.17151v1 Announce Type: new Abstract: We present a framework for document-centric background generation with multi-page editing and thematic continuity. To ensure text regions remain readable, we employ a \emph{latent masking} formulation that softly attenuates updates in the diffusion space, inspired by smooth barrier functions in physics and numerical optimization. In addition, we introduce \emph{Automated Readability Optimization (ARO)}, which automatically places semi-transparent, rounded backing shapes behind text regions. ARO determines the minimal opacity needed to satisfy perceptual contrast standards (WCAG 2.2) relative to the underlying background, ensuring readability while maintaining aesthetic harmony without human intervention. Multi-page consistency is maintained through a summarization-and-instruction process, where each page is distilled into a compact representation that recursively guides subsequent generations. This design reflects how humans build continuity by retaining prior context, ensuring that visual motifs evolve coherently across an entire document. Our method further treats a document as a structured composition in which text, figures, and backgrounds are preserved or regenerated as separate layers, allowing targeted background editing without compromising readability. Finally, user-provided prompts allow stylistic adjustments in color and texture, balancing automated consistency with flexible customization. Our training-free framework produces visually coherent, text-preserving, and thematically aligned documents, bridging generative modeling with natural design workflows.

Fonte: arXiv cs.CV

NLP/LLMsScore 95

Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs

arXiv:2512.17189v1 Announce Type: new Abstract: Medical Vision-Language Models (MedVLMs) show immense promise in clinical applicability. However, their reliability is hindered by hallucinations, where models often fail to derive answers from visual evidence, instead relying on learned textual priors. Existing mitigation strategies for MedVLMs have distinct limitations: training-based methods rely on costly expert annotations, limiting scalability, while training-free interventions like contrastive decoding, though data-efficient, apply a global, untargeted correction whose effects in complex real-world clinical settings can be unreliable. To address these challenges, we introduce Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our module leverages an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model's focus onto specified regions, reinforcing anatomical understanding and suppressing factually incorrect outputs. Extensive experiments across diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, demonstrate our method's effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.

Fonte: arXiv cs.CV

VisionScore 95

AnyCXR: Segmentação da Anatomia Humana em Radiografias de Tórax em Qualquer Posição de Aquisição Usando Dados Sintéticos Randomizados de Domínio em Múltiplas Etapas com Anotações Imperfeitas e Aprendizado de Regularização de Anotação Conjunta Condicional

A segmentação anatômica robusta de radiografias de tórax (CXRs) continua desafiadora devido à escassez de anotações abrangentes e à variabilidade substancial das condições de aquisição no mundo real. Propomos o AnyCXR, um framework unificado que permite a segmentação multi-orgânica generalizável em ângulos de projeção arbitrários de CXR usando apenas supervisão sintética.

Fonte: arXiv cs.CV

NLP/LLMsScore 96

V-Agent: Um Sistema de Busca de Vídeo Interativo Usando Modelos de Visão-Linguagem

Apresentamos o V-Agent, uma nova plataforma multi-agente projetada para busca avançada de vídeos e conversas interativas entre usuário e sistema. Ao ajustar um modelo de visão-linguagem (VLM) com um pequeno conjunto de dados de preferência de vídeo e aprimorá-lo com um vetor de recuperação de um modelo de recuperação de imagem-texto, superamos as limitações dos sistemas tradicionais de recuperação baseados em texto em cenários multimodais.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

Globally Optimal Solution to the Generalized Relative Pose Estimation Problem using Affine Correspondences

arXiv:2512.17188v1 Announce Type: new Abstract: Mobile devices equipped with a multi-camera system and an inertial measurement unit (IMU) are widely used nowadays, such as self-driving cars. The task of relative pose estimation using visual and inertial information has important applications in various fields. To improve the accuracy of relative pose estimation of multi-camera systems, we propose a globally optimal solver using affine correspondences to estimate the generalized relative pose with a known vertical direction. First, a cost function about the relative rotation angle is established after decoupling the rotation matrix and translation vector, which minimizes the algebraic error of geometric constraints from affine correspondences. Then, the global optimization problem is converted into two polynomials with two unknowns based on the characteristic equation and its first derivative is zero. Finally, the relative rotation angle can be solved using the polynomial eigenvalue solver, and the translation vector can be obtained from the eigenvector. Besides, a new linear solution is proposed when the relative rotation is small. The proposed solver is evaluated on synthetic data and real-world datasets. The experiment results demonstrate that our method outperforms comparable state-of-the-art methods in accuracy.

Fonte: arXiv cs.CV

NLP/LLMsScore 96

Uma Estrutura Solver-in-the-Loop para Melhorar LLMs em Programação de Conjuntos de Respostas para Resolução de Quebra-Cabeças Lógicos

arXiv:2512.17093v1 Tipo de Anúncio: novo Resumo: O surgimento de grandes modelos de linguagem (LLMs) despertou interesse em assistentes de codificação. Embora linguagens de programação de uso geral sejam bem suportadas, gerar código para linguagens específicas de domínio continua sendo um problema desafiador para LLMs. Neste artigo, focamos na geração de código baseada em LLM para Programação de Conjuntos de Respostas (ASP), uma abordagem particularmente eficaz para encontrar soluções para problemas de busca combinatória. A eficácia dos LLMs na geração de código ASP é atualmente limitada pelo número reduzido de exemplos vistos durante sua fase inicial de pré-treinamento. Neste artigo, introduzimos uma nova abordagem ASP-solver-in-the-loop para ajuste de instruções guiado por solver de LLMs, abordando a tarefa de análise semântica altamente complexa inerente à geração de código ASP. Nosso método requer apenas especificações de problemas em linguagem natural e suas soluções. Especificamente, amostramos declarações ASP para continuações de programas a partir de LLMs para desvendar quebra-cabeças lógicos. Aproveitando a propriedade especial da programação declarativa ASP, onde codificações parciais restringem cada vez mais o espaço de soluções, categorizamos essas instâncias em escolhidas e rejeitadas com base no feedback do solver. Em seguida, aplicamos ajuste fino supervisionado para treinar LLMs nos dados selecionados e melhorar ainda mais a robustez usando uma busca guiada por solver que inclui amostragem best-of-N. Nossos experimentos demonstram melhorias consistentes em dois cenários de prompting distintos em dois conjuntos de dados.

Fonte: arXiv cs.AI

VisionScore 95

Robust Scene Coordinate Regression via Geometrically-Consistent Global Descriptors

arXiv:2512.17226v1 Announce Type: new Abstract: Recent learning-based visual localization methods use global descriptors to disambiguate visually similar places, but existing approaches often derive these descriptors from geometric cues alone (e.g., covisibility graphs), limiting their discriminative power and reducing robustness in the presence of noisy geometric constraints. We propose an aggregator module that learns global descriptors consistent with both geometrical structure and visual similarity, ensuring that images are close in descriptor space only when they are visually similar and spatially connected. This corrects erroneous associations caused by unreliable overlap scores. Using a batch-mining strategy based solely on the overlap scores and a modified contrastive loss, our method trains without manual place labels and generalizes across diverse environments. Experiments on challenging benchmarks show substantial localization gains in large-scale environments while preserving computational and memory efficiency. Code is available at \href{https://github.com/sontung/robust\_scr}{github.com/sontung/robust\_scr}.

Fonte: arXiv cs.CV

NLP/LLMsScore 95

ResSVD: Residual Compensated SVD for Large Language Model Compression

arXiv:2505.20112v3 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated impressive capabilities in a wide range of downstream natural language processing tasks. Nevertheless, their considerable sizes and memory demands hinder practical deployment, underscoring the importance of developing efficient compression strategies. Singular value decomposition (SVD) decomposes a matrix into orthogonal components, enabling efficient low-rank approximation. This is particularly suitable for LLM compression, where weight matrices often exhibit significant redundancy. However, current SVD-based methods neglect the residual matrix from truncation, resulting in significant truncation loss. Additionally, compressing all layers of the model results in severe performance degradation. To overcome these limitations, we propose ResSVD, a new post-training SVD-based LLM compression method. Specifically, we leverage the residual matrix generated during the truncation process to reduce truncation loss. Moreover, under a fixed overall compression ratio, we selectively compress the last few layers of the model, which mitigates error propagation and significantly improves the performance of compressed models. Comprehensive evaluations of ResSVD on diverse LLM families and multiple benchmark datasets indicate that ResSVD consistently achieves superior performance over existing counterpart methods, demonstrating its practical effectiveness.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Towards Safer Chatbots: Automated Policy Compliance Evaluation of Custom GPTs

arXiv:2502.01436v3 Announce Type: replace Abstract: User-configured chatbots built on top of large language models are increasingly available through centralized marketplaces such as OpenAI's GPT Store. While these platforms enforce usage policies intended to prevent harmful or inappropriate behavior, the scale and opacity of customized chatbots make systematic policy enforcement challenging. As a result, policy-violating chatbots continue to remain publicly accessible despite existing review processes. This paper presents a fully automated method for evaluating the compliance of Custom GPTs with its marketplace usage policy using black-box interaction. The method combines large-scale GPT discovery, policy-driven red-teaming prompts, and automated compliance assessment using an LLM-as-a-judge. We focus on three policy-relevant domains explicitly addressed in OpenAI's usage policies: Romantic, Cybersecurity, and Academic GPTs. We validate our compliance assessment component against a human-annotated ground-truth dataset, achieving an F1 score of 0.975 for binary policy violation detection. We then apply the method in a large-scale empirical study of 782 Custom GPTs retrieved from the GPT Store. The results show that 58.7% of the evaluated GPTs exhibit at least one policy-violating response, with substantial variation across policy domains. A comparison with the base models (GPT-4 and GPT-4o) indicates that most violations originate from model-level behavior, while customization tends to amplify these tendencies rather than create new failure modes. Our findings reveal limitations in current review mechanisms for user-configured chatbots and demonstrate the feasibility of scalable, behavior-based policy compliance evaluation.

Fonte: arXiv cs.CL

RLScore 92

Computational analysis reveals historical trajectory of East-Polynesian lunar calendars

arXiv:2512.17525v1 Announce Type: cross Abstract: We investigate a type of lunar calendar known as lists of the 'nights of the moon', found throughout East Polynesia, including Rapa Nui (Easter Island). Using computational methods, we analyzed the lexical and structural divergence of 49 calendric lists from all major archipelagos, each containing about 30 night names. Our results, presented as a rooted phylogenetic tree, show a clear split into two main groups: one including lists from Rapa Nui, Mangareva, and the Marquesas; the other comprising lists from New Zealand, Hawaii, the Cook Islands, the Austral Islands, Tahiti, and the Tuamotu. This pattern aligns with a recent alternative classification of East Polynesian languages into 'Distal' (Marquesan, Mangarevan, Rapanui) and 'Proximal' (Maori, Hawaiian, Tahitian, etc.) subgroups. Since both language and lunar calendars are symbolic systems passed down and changed within communities - and given the geographic isolation of many archipelagos - we interpret this correspondence as evidence that the early divergence of East Polynesian lunar calendars mirrors early population movements and language splits in the region.

Fonte: arXiv cs.CL

NLP/LLMsScore 96

Acelerando o Desempenho de Jogos Multi-modal LLM via Previsão de Entrada e Correção de Erros

arXiv:2512.17250v1 Tipo de Anúncio: novo Resumo: Agentes de controle sequencial em tempo real frequentemente enfrentam gargalos devido à latência de inferência. Mesmo atrasos modestos no planejamento por etapa podem desestabilizar o controle e degradar o desempenho geral. Propomos uma estrutura de especulação e correção que adapta a filosofia de prever-então-verificar da execução especulativa ao controle baseado em modelo com TD-MPC2. Em cada etapa, um modelo de mundo pré-treinado e um planejador MPC no espaço latente geram uma fila de ações de curto prazo juntamente com previsões de rollouts latentes, permitindo que o agente execute múltiplas ações planejadas sem replanejamento imediato. Quando uma nova observação chega, o sistema mede a discrepância entre o estado latente real codificado e o latente previsto na fila. Para discrepâncias pequenas a moderadas, um corretor leve aprendido aplica uma atualização residual à ação especulativa, destilada offline de um professor de replanejamento. Para discrepâncias grandes, o agente retorna de forma segura ao replanejamento completo e limpa as filas de ações obsoletas. Estudamos tanto um corretor MLP de duas torres com controle por porta quanto um corretor Transformer temporal para abordar erros locais e deriva sistemática. Experimentos na tarefa DMC Humanoid-Walk mostram que nosso método reduz o número de inferências de planejamento de 500 para 282, melhora a latência de etapa de ponta a ponta em 25 por cento e mantém um desempenho de controle forte com apenas uma redução de retorno de 7,1 por cento. Resultados de ablação demonstram que a execução especulativa sem correção é pouco confiável em horizontes mais longos, destacando a necessidade de correção ciente da discrepância para uma redução robusta da latência.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering

arXiv:2512.17396v1 Announce Type: cross Abstract: In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents

arXiv:2503.10689v2 Announce Type: replace Abstract: Recent advances in large language models (LLMs) have led to a growing interest in developing LLM-based agents for automating web tasks. However, these agents often struggle with even simple tasks on real-world websites due to their limited capability to understand and process complex web page structures. In this work, we introduce LCoW, a framework for Learning language models to Contextualize complex Web pages into a more comprehensible form, thereby enhancing decision making by LLM agents. LCoW decouples web page understanding from decision making by training a separate contextualization module to transform complex web pages into comprehensible format, which are then utilized by the decision-making agent. We demonstrate that our contextualization module effectively integrates with LLM agents of various scales to significantly enhance their decision-making capabilities in web automation tasks. Notably, LCoW improves the success rates of closed-source LLMs (e.g., Gemini-1.5-flash, GPT-4o, Claude-3.5-Sonnet) by an average of 15.6%, and demonstrates a 23.7% average improvement in success rates for open-source LMs (e.g., Llama-3.1-8B, Llama-3.1-70B) on the WorkArena benchmark. Moreover, the Gemini-1.5-flash agent with LCoW achieves state-of-the-art results on the WebShop benchmark, outperforming human experts. The relevant code materials are available at our project page: https://lcowiclr2025.github.io.

Fonte: arXiv cs.CL

RLScore 96

Quando o Raciocínio Encontra Suas Leis

arXiv:2512.17901v1 Tipo de Anúncio: novo. Este artigo apresenta as Leis do Raciocínio (LoRe), um framework unificado que caracteriza padrões intrínsecos de raciocínio em Large Reasoning Models (LRMs). Propomos a lei de computação e uma lei de precisão suplementar, introduzindo o LoRe-Bench para medir essas propriedades em modelos de raciocínio. Avaliações mostram que a maioria dos modelos apresenta monotonicidade razoável, mas carece de composicionalidade.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens

arXiv:2512.17375v1 Announce Type: new Abstract: Reward models and LLM-as-a-Judge systems are central to modern post-training pipelines such as RLHF, DPO, and RLAIF, where they provide scalar feedback and binary decisions that guide model selection and RL-based fine-tuning. We show that these judge systems exhibit a recurring vulnerability: short sequences of low-perplexity control tokens can flip many binary evaluations from correct ``No'' judgments to incorrect ``Yes'' judgments by steering the last-layer logit gap. These control tokens are patterns that a policy model could plausibly generate during post-training, and thus represent realistic reward-hacking risks rather than worst-case adversarial strings. Our method, AdvJudge-Zero, uses the model's next-token distribution and beam-search exploration to discover diverse control-token sequences from scratch, and our analysis shows that the induced hidden-state perturbations concentrate in a low-rank ``soft mode'' that is anti-aligned with the judge's refusal direction. Empirically, these tokens cause very high false positive rates when large open-weight and specialized judge models score incorrect answers on math and reasoning benchmarks. Finally, we show that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives while preserving evaluation quality.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

PAACE: Um Framework de Engenharia de Contexto Automatizado e Consciente de Planejamento

Modelos de Linguagem de Grande Escala (LLM) estão sendo cada vez mais utilizados em fluxos de trabalho complexos que envolvem planejamento, uso de ferramentas, reflexão e interação com sistemas de conhecimento externos. Este trabalho apresenta o PAACE, um framework unificado para otimizar o estado evolutivo de agentes LLM através de modelagem de relevância de próximas tarefas, análise de estrutura de planejamento, co-refinamento de instruções e compressão que preserva funções.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters

arXiv:2505.14886v2 Announce Type: replace Abstract: Winning competitive debates requires sophisticated reasoning and argument skills. There are unique challenges in the competitive debate: (1) The time constraints force debaters to make strategic choices about which points to pursue rather than covering all possible arguments; (2) The persuasiveness of the debate relies on the back-and-forth interaction between arguments, which a single final game status cannot evaluate. To address these challenges, we propose TreeDebater, a novel debate framework that excels in competitive debate. We introduce two tree structures: the Rehearsal Tree and Debate Flow Tree. The Rehearsal Tree anticipates the attack and defenses to evaluate the strength of the claim, while the Debate Flow Tree tracks the debate status to identify the active actions. TreeDebater allocates its time budget among candidate actions and uses the speech time controller and feedback from the simulated audience to revise its statement. The human evaluation on both the stage-level and the debate-level comparison shows that our TreeDebater outperforms the state-of-the-art multi-agent debate system, with a +15.6% improvement in stage-level persuasiveness with DeepSeek and +10% debate-level opinion shift win. Further investigation shows that TreeDebater shows better strategies in limiting time to important debate actions, aligning with the strategies of human debate experts.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Generating Completions for Broca's Aphasic Sentences Using Large Language Models

arXiv:2412.17669v2 Announce Type: replace Abstract: Broca's aphasia is a type of aphasia characterized by non-fluent, effortful and agrammatic speech production with relatively good comprehension. Since traditional aphasia treatment methods are often time-consuming, labour-intensive, and do not reflect real-world conversations, applying natural language processing based approaches such as Large Language Models (LLMs) could potentially contribute to improving existing treatment approaches. To address this issue, we explore the use of sequence-to-sequence LLMs for completing Broca's aphasic sentences. We first generate synthetic Broca's aphasic data using a rule-based system designed to mirror the linguistic characteristics of Broca's aphasic speech. Using this synthetic data (without authentic aphasic samples), we then fine-tune four pre-trained LLMs on the task of completing agrammatic sentences. We evaluate our fine-tuned models on both synthetic and authentic Broca's aphasic data. We demonstrate LLMs' capability for reconstructing agrammatic sentences, with the models showing improved performance with longer input utterances. Our result highlights the LLMs' potential in advancing communication aids for individuals with Broca's aphasia and possibly other clinical populations.

Fonte: arXiv cs.CL

NLP/LLMsScore 96

Large Language Models as Pok\'emon Battle Agents: Strategic Play and Content Generation

arXiv:2512.17308v1 Announce Type: new Abstract: Strategic decision-making in Pok\'emon battles presents a unique testbed for evaluating large language models. Pok\'emon battles demand reasoning about type matchups, statistical trade-offs, and risk assessment, skills that mirror human strategic thinking. This work examines whether Large Language Models (LLMs) can serve as competent battle agents, capable of both making tactically sound decisions and generating novel, balanced game content. We developed a turn-based Pok\'emon battle system where LLMs select moves based on battle state rather than pre-programmed logic. The framework captures essential Pok\'emon mechanics: type effectiveness multipliers, stat-based damage calculations, and multi-Pok\'emon team management. Through systematic evaluation across multiple model architectures we measured win rates, decision latency, type-alignment accuracy, and token efficiency. These results suggest LLMs can function as dynamic game opponents without domain-specific training, offering a practical alternative to reinforcement learning for turn-based strategic games. The dual capability of tactical reasoning and content creation, positions LLMs as both players and designers, with implications for procedural generation and adaptive difficulty systems in interactive entertainment.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

ShareChat: A Dataset of Chatbot Conversations in the Wild

arXiv:2512.17843v1 Announce Type: new Abstract: While Large Language Models (LLMs) have evolved into distinct platforms with unique interface designs and capabilities, existing public datasets treat models as generic text generators, stripping away the interface context that actively shapes user interaction. To address this limitation, we present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms: ChatGPT, Claude, Gemini, Perplexity, and Grok. ShareChat distinguishes itself by preserving native platform affordances often lost in standard logs, including reasoning traces, source links, and code artifacts, while spanning 101 languages over the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. We demonstrate the dataset's multifaceted utility through three representative analyses: (1) analyzing conversation completeness to measure user intent satisfaction; (2) evaluating source citation behaviors in content generation; and (3) conducting temporal analysis to track evolving usage patterns. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild.

Fonte: arXiv cs.CL

NLP/LLMsScore 96

Investigando a Inteligência Geral Científica de LLMs com Fluxos de Trabalho Alinhados a Cientistas

Apesar dos avanços em IA científica, falta um framework coerente para Inteligência Geral Científica (SGI) — a capacidade de conceber, investigar e raciocinar autonomamente em domínios científicos. Apresentamos uma definição operacional de SGI baseada no Modelo de Investigação Prática (PIM) e a operacionalizamos através de quatro tarefas alinhadas a cientistas: pesquisa profunda, geração de ideias, experimentos secos/molhados e raciocínio experimental.

Fonte: arXiv cs.AI

RLScore 95

Affect, Body, Cognition, Demographics, and Emotion: The ABCDE of Text Features for Computational Affective Science

arXiv:2512.17752v1 Announce Type: new Abstract: Work in Computational Affective Science and Computational Social Science explores a wide variety of research questions about people, emotions, behavior, and health. Such work often relies on language data that is first labeled with relevant information, such as the use of emotion words or the age of the speaker. Although many resources and algorithms exist to enable this type of labeling, discovering, accessing, and using them remains a substantial impediment, particularly for practitioners outside of computer science. Here, we present the ABCDE dataset (Affect, Body, Cognition, Demographics, and Emotion), a large-scale collection of over 400 million text utterances drawn from social media, blogs, books, and AI-generated sources. The dataset is annotated with a wide range of features relevant to computational affective and social science. ABCDE facilitates interdisciplinary research across numerous fields, including affective science, cognitive science, the digital humanities, sociology, political science, and computational linguistics.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?

arXiv:2512.17394v1 Announce Type: new Abstract: Theory of Mind (ToM) -- the ability to attribute beliefs, desires, and emotions to others -- is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, where human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assist in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse theory of mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

CIFE: Code Instruction-Following Evaluation

arXiv:2512.17387v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment, developers also expect adherence to explicit requirements for robustness, formatting, and security. Existing benchmarks primarily assess correctness through test-case execution, offering limited insight into how reliably models follow such constraints. We introduce a benchmark of 1,000 Python tasks, each paired with an average of 7 developer-specified constraints spanning 13 categories. Constraints are curated through a four-stage human-LLM pipeline to ensure they are atomic, relevant, and objective. We evaluate 14 open- and closed-source models using complementary adherence metrics and propose the C2A Score, a composite measure that jointly captures correctness and constraint compliance. Results reveal a substantial gap between partial and strict satisfaction, while strong models achieve over 90% partial adherence, strict adherence remains between 39-66%. These findings highlight that trustworthy code generation requires not only correctness but also consistent adherence to developer intent.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models

arXiv:2512.17385v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness heavily relies on supervised training with extensive labeled (e.g., question-answering pairs) or unlabeled datasets (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces a method IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, even unlabeled code snippets. We introduce the problem space probing, test understanding probing, solution space probing, and knowledge consolidation and reinforcement to probe the internal knowledge and confidence patterns existing in LLMs. Further, IPC identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation to train UCoder (coder with unsupervised learning). We validate the proposed approach across multiple code benchmarks, demonstrating that unsupervised methods can achieve competitive performance compared to supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation tasks, opening new directions for training code LLMs in resource-constrained scenarios.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

When the Gold Standard isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content

arXiv:2512.17738v1 Announce Type: new Abstract: User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation particularly challenging: what counts as a "good" translation depends on the level of standardness desired in the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. Through a case study on large language models (LLMs), we show that translation scores are highly sensitive to prompts with explicit translation instructions for UGC, and that they improve when these align with the dataset's guidelines. We argue that when preserving UGC style is important, fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.

Fonte: arXiv cs.CL

NLP/LLMsScore 92

Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers

arXiv:2512.17351v1 Announce Type: new Abstract: Understanding architectural differences in language models is challenging, especially at academic-scale pretraining (e.g., 1.3B parameters, 100B tokens), where results are often dominated by noise and randomness. To overcome this, we introduce controlled synthetic pretraining tasks that isolate and evaluate core model capabilities. Within this framework, we discover CANON LAYERS: lightweight architectural components -- named after the musical term "canon" -- that promote horizontal information flow across neighboring tokens. Canon layers compute weighted sums of nearby token representations and integrate seamlessly into Transformers, linear attention, state-space models, or any sequence architecture. We present 12 key results. This includes how Canon layers enhance reasoning depth (e.g., by $2\times$), reasoning breadth, knowledge manipulation, etc. They lift weak architectures like NoPE to match RoPE, and linear attention to rival SOTA linear models like Mamba2/GDN -- validated both through synthetic tasks and real-world academic-scale pretraining. This synthetic playground offers an economical, principled path to isolate core model capabilities often obscured at academic scales. Equipped with infinite high-quality data, it may even PREDICT how future architectures will behave as training pipelines improve -- e.g., through better data curation or RL-based post-training -- unlocking deeper reasoning and hierarchical inference.

Fonte: arXiv cs.CL

NLP/LLMsScore 95

Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience

arXiv:2512.17260v1 Announce Type: new Abstract: Large language models have recently made significant progress to generate rigorous mathematical proofs. In contrast, utilizing LLMs for theorem proving in formal languages (such as Lean) remains challenging and computationally expensive, particularly when addressing problems at the undergraduate level and beyond. In this work, we present \textbf{Seed-Prover 1.5}, a formal theorem-proving model trained via large-scale agentic reinforcement learning, alongside an efficient test-time scaling (TTS) workflow. Through extensive interactions with Lean and other tools, the model continuously accumulates experience during the RL process, substantially enhancing the capability and efficiency of formal theorem proving. Furthermore, leveraging recent advancements in natural language proving, our TTS workflow efficiently bridges the gap between natural and formal languages. Compared to state-of-the-art methods, Seed-Prover 1.5 achieves superior performance with a smaller compute budget. It solves \textbf{88\% of PutnamBench} (undergraduate-level), \textbf{80\% of Fate-H} (graduate-level), and \textbf{33\% of Fate-X} (PhD-level) problems. Notably, using our system, we solved \textbf{11 out of 12 problems} from Putnam 2025 within 9 hours. Our findings suggest that scaling learning from experience, driven by high-quality formal feedback, holds immense potential for the future of formal mathematical reasoning.

Fonte: arXiv cs.CL

VisionScore 95

ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching

arXiv:2512.17178v1 Announce Type: new Abstract: Contrastive Language-Image Pretraining (CLIP) has achieved remarkable performance in various multimodal tasks. However, it still struggles with compositional image-text matching, particularly in accurately associating objects with their corresponding attributes, because its inherent global representation often overlooks fine-grained semantics for attribute binding. Existing methods often require additional training or extensive hard negative sampling, yet they frequently show limited generalization to novel compositional concepts and fail to fundamentally address the drawbacks of global representations. In this paper, we propose ABE-CLIP, a novel training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. Specifically, we employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in the text, thereby mitigating attribute confusion and improving semantic precision. We further introduce a Local Token-Patch Alignment strategy that computes similarity scores between refined textual tokens and their most relevant image patches. By aggregating localized similarity scores, ABE-CLIP computes the final image-text similarity. Experiments on multiple datasets demonstrate that ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training.

Fonte: arXiv cs.CV

VisionScore 95

Nem sempre é mais verde do outro lado: Percepção de áreas verdes entre diferentes demografias e personalidades em várias cidades

Quantificar e avaliar a vegetação urbana é crucial para o planejamento e desenvolvimento, refletindo a importância contínua dos espaços verdes para múltiplas dimensões climáticas e de bem-estar nas cidades. Este trabalho mede as diferenças entre percepções subjetivas e medidas objetivas de áreas verdes, utilizando dados de uma pesquisa abrangente com 1.000 pessoas em cinco países.

Fonte: arXiv cs.CV

VisionScore 95

Endo-SemiS: Rumo à Segmentação de Imagem Semi-Supervisionada Robusta para Vídeo Endoscópico

Neste artigo, apresentamos o Endo-SemiS, um framework de segmentação semi-supervisionada para fornecer segmentação confiável de quadros de vídeo endoscópico com anotação limitada. O Endo-SemiS utiliza 4 estratégias para melhorar o desempenho, aproveitando efetivamente todos os dados disponíveis, especialmente dados não rotulados.

Fonte: arXiv cs.CV

NLP/LLMsScore 95

Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs

arXiv:2512.17206v1 Announce Type: new Abstract: Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection for diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable control over the (vision-) language model's strategic behavior, thereby achieving consistent performance gains over standard RL methods.

Fonte: arXiv cs.CV

NLP/LLMsScore 95

Sample, Don't Search: Rethinking Test-Time Alignment for Language Models

arXiv:2504.03790v2 Announce Type: replace-cross Abstract: Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). A practical solution to aligning language models at test time using additional computation without degradation, our approach expands the limits of the capability that can be obtained from off-the-shelf language models without further training.

Fonte: arXiv stat.ML

VisionScore 95

Towards Pixel-Wise Anomaly Location for High-Resolution PCBA \\ via Self-Supervised Image Reconstruction

arXiv:2512.17296v1 Announce Type: new Abstract: Automated defect inspection of assembled Printed Circuit Board Assemblies (PCBA) is quite challenging due to the insufficient labeled data, micro-defects with just a few pixels in visually-complex and high-resolution images. To address these challenges, we present HiSIR-Net, a High resolution, Self-supervised Reconstruction framework for pixel-wise PCBA localization. Our design combines two lightweight modules that make this practical on real 4K-resolution boards: (i) a Selective Input-Reconstruction Gate (SIR-Gate) that lets the model decide where to trust reconstruction versus the original input, thereby reducing irrelevant reconstruction artifacts and false alarms; and (ii) a Region-level Optimized Patch Selection (ROPS) scheme with positional cues to select overlapping patch reconstructions coherently across arbitrary resolutions. Organically integrating these mechanisms yields clean, high-resolution anomaly maps with low false positive (FP) rate. To bridge the gap in high-resolution PCBA datasets, we further contribute a self-collected dataset named SIPCBA-500 of 500 images. We conduct extensive experiments on our SIPCBA-500 as well as public benchmarks, demonstrating the superior localization performance of our method while running at practical speed. Full code and dataset will be made available upon acceptance.

Fonte: arXiv cs.CV

RLScore 89

Incremental Generation is Necessary and Sufficient for Universality in Flow-Based Modelling

arXiv:2511.09902v2 Announce Type: replace-cross Abstract: Incremental flow-based denoising models have reshaped generative modelling, but their empirical advantage still lacks a rigorous approximation-theoretic foundation. We show that incremental generation is necessary and sufficient for universal flow-based generation on the largest natural class of self-maps of $[0,1]^d$ compatible with denoising pipelines, namely the orientation-preserving homeomorphisms of $[0,1]^d$. All our guarantees are uniform on the underlying maps and hence imply approximation both samplewise and in distribution. Using a new topological-dynamical argument, we first prove an impossibility theorem: the class of all single-step autonomous flows, independently of the architecture, width, depth, or Lipschitz activation of the underlying neural network, is meagre and therefore not universal in the space of orientation-preserving homeomorphisms of $[0,1]^d$. By exploiting algebraic properties of autonomous flows, we conversely show that every orientation-preserving Lipschitz homeomorphism on $[0,1]^d$ can be approximated at rate $O(n^{-1/d})$ by a composition of at most $K_d$ such flows, where $K_d$ depends only on the dimension. Under additional smoothness assumptions, the approximation rate can be made dimension-free, and $K_d$ can be chosen uniformly over the class being approximated. Finally, by linearly lifting the domain into one higher dimension, we obtain structured universal approximation results for continuous functions and for probability measures on $[0,1]^d$, the latter realized as pushforwards of empirical measures with vanishing $1$-Wasserstein error.

Fonte: arXiv stat.ML

RLScore 95

Safeguarded Stochastic Polyak Step Sizes for Non-smooth Optimization: Robust Performance Without Small (Sub)Gradients

arXiv:2512.02342v2 Announce Type: replace-cross Abstract: The stochastic Polyak step size (SPS) has proven to be a promising choice for stochastic gradient descent (SGD), delivering competitive performance relative to state-of-the-art methods on smooth convex and non-convex optimization problems, including deep neural network training. However, extensions of this approach to non-smooth settings remain in their early stages, often relying on interpolation assumptions or requiring knowledge of the optimal solution. In this work, we propose a novel SPS variant, Safeguarded SPS (SPS$_{safe}$), for the stochastic subgradient method, and provide rigorous convergence guarantees for non-smooth convex optimization with no need for strong assumptions. We further incorporate momentum into the update rule, yielding equally tight theoretical results. On non-smooth convex benchmarks, our experiments are consistent with the theoretical predictions on how the safeguard affects the convergence neighborhood. On deep neural networks the proposed step size achieves competitive performance to existing adaptive baselines and exhibits stable behavior across a wide range of problem settings. Moreover, in these experiments, the gradient norms under our step size do not collapse to (near) zero, indicating robustness to vanishing gradients.

Fonte: arXiv stat.ML

RLScore 95

A Certified Unlearning Approach without Access to Source Data

arXiv:2506.06486v3 Announce Type: replace-cross Abstract: With the growing adoption of data privacy regulations, the ability to erase private or copyrighted information from trained models has become a crucial requirement. Traditional unlearning methods often assume access to the complete training dataset, which is unrealistic in scenarios where the source data is no longer available. To address this challenge, we propose a certified unlearning framework that enables effective data removal \final{without access to the original training data samples}. Our approach utilizes a surrogate dataset that approximates the statistical properties of the source data, allowing for controlled noise scaling based on the statistical distance between the two. \updated{While our theoretical guarantees assume knowledge of the exact statistical distance, practical implementations typically approximate this distance, resulting in potentially weaker but still meaningful privacy guarantees.} This ensures strong guarantees on the model's behavior post-unlearning while maintaining its overall utility. We establish theoretical bounds, introduce practical noise calibration techniques, and validate our method through extensive experiments on both synthetic and real-world datasets. The results demonstrate the effectiveness and reliability of our approach in privacy-sensitive settings.

Fonte: arXiv stat.ML

NLP/LLMsScore 95

Towards Human-Guided, Data-Centric LLM Co-Pilots

arXiv:2501.10321v3 Announce Type: replace-cross Abstract: Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the disconnect between the needs of domain experts and translating these needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in complex real-world settings where raw data often contains complex issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel, multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets we demonstrate CliMB-DC's ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines for handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains -- healthcare, finance, social sciences and more -- to actively participate in driving real-world impact using ML.

Fonte: arXiv stat.ML

RLScore 95

A Survey on Archetypal Analysis

arXiv:2504.12392v2 Announce Type: replace-cross Abstract: Archetypal analysis (AA) was originally proposed in 1994 by Adele Cutler and Leo Breiman as a computational procedure for extracting distinct aspects, so-called archetypes, from observations, with each observational record approximated as a mixture (i.e., convex combination) of these archetypes. AA thereby provides straightforward, interpretable, and explainable representations for feature extraction and dimensionality reduction, facilitating the understanding of the structure of high-dimensional data and enabling wide applications across the sciences. However, AA also faces challenges, particularly as the associated optimization problem is non-convex. This is the first survey that provides researchers and data mining practitioners with an overview of the methodologies and opportunities that AA offers, surveying the many applications of AA across disparate fields of science, as well as best practices for modeling data with AA and its limitations. The survey concludes by explaining crucial future research directions concerning AA.

Fonte: arXiv stat.ML

RLScore 95

Fairness via Independence: A (Conditional) Distance Covariance Framework

arXiv:2412.00720v2 Announce Type: replace-cross Abstract: We explore fairness from a statistical perspective by selectively utilizing either conditional distance covariance or distance covariance statistics as measures to assess the independence between predictions and sensitive attributes. We boost fairness with independence by adding a distance covariance-based penalty to the model's training. Additionally, we present the matrix form of empirical (conditional) distance covariance for parallel calculations to enhance computational efficiency. Theoretically, we provide a proof for the convergence between empirical and population (conditional) distance covariance, establishing necessary guarantees for batch computations. Through experiments conducted on a range of real-world datasets, we have demonstrated that our method effectively bridges the fairness gap in machine learning. Our code is available at \url{https://github.com/liuhaixias1/Fair_dc/}.

Fonte: arXiv stat.ML

RLScore 92

Quantifying Uncertainty in the Presence of Distribution Shifts

arXiv:2506.18283v2 Announce Type: replace Abstract: Neural networks make accurate predictions but often fail to provide reliable uncertainty estimates, especially under covariate distribution shifts between training and testing. To address this problem, we propose a Bayesian framework for uncertainty estimation that explicitly accounts for covariate shifts. While conventional approaches rely on fixed priors, the key idea of our method is an adaptive prior, conditioned on both training and new covariates. This prior naturally increases uncertainty for inputs that lie far from the training distribution in regions where predictive performance is likely to degrade. To efficiently approximate the resulting posterior predictive distribution, we employ amortized variational inference. Finally, we construct synthetic environments by drawing small bootstrap samples from the training data, simulating a range of plausible covariate shift using only the original dataset. We evaluate our method on both synthetic and real-world data. It yields substantially improved uncertainty estimates under distribution shifts.

Fonte: arXiv stat.ML

RLScore 91

Refined Analysis of Federated Averaging and Federated Richardson-Romberg

arXiv:2412.01389v2 Announce Type: replace Abstract: In this paper, we present a novel analysis of \FedAvg with constant step size, relying on the Markov property of the underlying process. We demonstrate that the global iterates of the algorithm converge to a stationary distribution and analyze its resulting bias and variance relative to the problem's solution. We provide a first-order bias expansion in both homogeneous and heterogeneous settings. Interestingly, this bias decomposes into two distinct components: one that depends solely on stochastic gradient noise and another on client heterogeneity. Finally, we introduce a new algorithm based on the Richardson-Romberg extrapolation technique to mitigate this bias.

Fonte: arXiv stat.ML

RLScore 92

Regularized Langevin Dynamics for Combinatorial Optimization

arXiv:2502.00277v4 Announce Type: replace-cross Abstract: This work proposes a simple yet effective sampling framework for combinatorial optimization (CO). Our method builds on discrete Langevin dynamics (LD), an efficient gradient-guided generative paradigm. However, we observe that directly applying LD often leads to limited exploration. To overcome this limitation, we propose the Regularized Langevin Dynamics (RLD), which enforces an expected distance between the sampled and current solutions, effectively avoiding local minima. We develop two CO solvers on top of RLD, one based on simulated annealing (SA), and the other one based on neural network (NN). Empirical results on three classic CO problems demonstrate that both of our methods can achieve comparable or better performance against the previous state-of-the-art (SOTA) SA- and NN-based solvers. In particular, our SA algorithm reduces the runtime of the previous SOTA SA method by up to 80\%, while achieving equal or superior performance. In summary, RLD offers a promising framework for enhancing both traditional heuristics and NN models to solve CO problems. Our code is available at https://github.com/Shengyu-Feng/RLD4CO.

Fonte: arXiv stat.ML

NLP/LLMsScore 95

The Stochastic Occupation Kernel (SOCK) Method for Learning Stochastic Differential Equations

arXiv:2505.11622v2 Announce Type: replace Abstract: We present a novel kernel-based method for learning multivariate stochastic differential equations (SDEs). The method follows a two-step procedure: we first estimate the drift term function, then the (matrix-valued) diffusion function given the drift. Occupation kernels are integral functionals on a reproducing kernel Hilbert space (RKHS) that aggregate information over a trajectory. Our approach leverages vector-valued occupation kernels for estimating the drift component of the stochastic process. For diffusion estimation, we extend this framework by introducing operator-valued occupation kernels, enabling the estimation of an auxiliary matrix-valued function as a positive semi-definite operator, from which we readily derive the diffusion estimate. This enables us to avoid common challenges in SDE learning, such as intractable likelihoods, by optimizing a reconstruction-error-based objective. We propose a simple learning procedure that retains strong predictive accuracy while using Fenchel duality to promote efficiency. We validate the method on simulated benchmarks and a real-world dataset of Amyloid imaging in healthy and Alzheimer's disease subjects.

Fonte: arXiv stat.ML

RLScore 89

SCAFFLSA: Taming Heterogeneity in Federated Linear Stochastic Approximation and TD Learning

arXiv:2402.04114v3 Announce Type: replace Abstract: In this paper, we analyze the sample and communication complexity of the federated linear stochastic approximation (FedLSA) algorithm. We explicitly quantify the effects of local training with agent heterogeneity. We show that the communication complexity of FedLSA scales polynomially with the inverse of the desired accuracy $\epsilon$. To overcome this, we propose SCAFFLSA a new variant of FedLSA that uses control variates to correct for client drift, and establish its sample and communication complexities. We show that for statistically heterogeneous agents, its communication complexity scales logarithmically with the desired accuracy, similar to Scaffnew. An important finding is that, compared to the existing results for Scaffnew, the sample complexity scales with the inverse of the number of agents, a property referred to as linear speed-up. Achieving this linear speed-up requires completely new theoretical arguments. We apply the proposed method to federated temporal difference learning with linear function approximation and analyze the corresponding complexity improvements.

Fonte: arXiv stat.ML

RLScore 89

On the performance of multi-fidelity and reduced-dimensional neural emulators for inference of physiological boundary conditions

arXiv:2506.11683v2 Announce Type: replace Abstract: Solving inverse problems in cardiovascular modeling is particularly challenging due to the high computational cost of running high-fidelity simulations. In this work, we focus on Bayesian parameter estimation and explore different methods to reduce the computational cost of sampling from the posterior distribution by leveraging low-fidelity approximations. A common approach is to construct a surrogate model for the high-fidelity simulation itself. Another is to build a surrogate for the discrepancy between high- and low-fidelity models. This discrepancy, which is often easier to approximate, is modeled with either a fully connected neural network or a nonlinear dimensionality reduction technique that enables surrogate construction in a lower-dimensional space. A third possible approach is to treat the discrepancy between the high-fidelity and surrogate models as random noise and estimate its distribution using normalizing flows. This allows us to incorporate the approximation error into the Bayesian inverse problem by modifying the likelihood function. We validate five different methods which are variations of the above on analytical test cases by comparing them to posterior distributions derived solely from high-fidelity models, assessing both accuracy and computational cost. Finally, we demonstrate our approaches on two cardiovascular examples of increasing complexity: a lumped-parameter Windkessel model and a patient-specific three-dimensional anatomy.

Fonte: arXiv stat.ML

RLScore 92

Targeted Learning for Variable Importance

arXiv:2411.02221v2 Announce Type: replace Abstract: Variable importance is one of the most widely used measures for interpreting machine learning with significant interest from both statistics and machine learning communities. Recently, increasing attention has been directed toward uncertainty quantification in these metrics. Current approaches largely rely on one-step procedures, which, while asymptotically efficient, can present higher sensitivity and instability in finite sample settings. To address these limitations, we propose a novel method by employing the targeted learning (TL) framework, designed to enhance robustness in inference for variable importance metrics. Our approach is particularly suited for conditional permutation variable importance. We show that it (i) retains the asymptotic efficiency of traditional methods, (ii) maintains comparable computational complexity, and (iii) delivers improved accuracy, especially in finite sample contexts. We further support these findings with numerical experiments that illustrate the practical advantages of our method and validate the theoretical results.

Fonte: arXiv stat.ML

RLScore 92

On the Identification of Temporally Causal Representation with Instantaneous Dependence

arXiv:2405.15325v3 Announce Type: replace-cross Abstract: Temporally causal representation learning aims to identify the latent causal process from time series observations, but most methods require the assumption that the latent causal processes do not have instantaneous relations. Although some recent methods achieve identifiability in the instantaneous causality case, they require either interventions on the latent variables or grouping of the observations, which are in general difficult to obtain in real-world scenarios. To fill this gap, we propose an \textbf{ID}entification framework for instantane\textbf{O}us \textbf{L}atent dynamics (\textbf{IDOL}) by imposing a sparse influence constraint that the latent causal processes have sparse time-delayed and instantaneous relations. Specifically, we establish identifiability results of the latent causal process based on sufficient variability and the sparse influence constraint by employing contextual information of time series data. Based on these theories, we incorporate a temporally variational inference architecture to estimate the latent variables and a gradient-based sparsity regularization to identify the latent causal process. Experimental results on simulation datasets illustrate that our method can identify the latent causal process. Furthermore, evaluations on multiple human motion forecasting benchmarks with instantaneous dependencies indicate the effectiveness of our method in real-world settings.

Fonte: arXiv stat.ML

RLScore 89

Clustering and Pruning in Causal Data Fusion

arXiv:2505.15215v2 Announce Type: replace Abstract: Data fusion, the process of combining observational and experimental data, can enable the identification of causal effects that would otherwise remain non-identifiable. Although identification algorithms have been developed for specific scenarios, do-calculus remains the only general-purpose tool for causal data fusion, particularly when variables are present in some data sources but not others. However, approaches based on do-calculus may encounter computational challenges as the number of variables increases and the causal graph grows in complexity. Consequently, there exists a need to reduce the size of such models while preserving the essential features. For this purpose, we propose pruning (removing unnecessary variables) and clustering (combining variables) as preprocessing operations for causal data fusion. We generalize earlier results on a single data source and derive conditions for applying pruning and clustering in the case of multiple data sources. We give sufficient conditions for inferring the identifiability or non-identifiability of a causal effect in a larger graph based on a smaller graph and show how to obtain the corresponding identifying functional for identifiable causal effects. Examples from epidemiology and social science demonstrate the use of the results.

Fonte: arXiv stat.ML

NLP/LLMsScore 95

Generalized infinite dimensional Alpha-Procrustes based geometries

arXiv:2511.09801v2 Announce Type: replace Abstract: This work extends the recently introduced Alpha-Procrustes family of Riemannian metrics for symmetric positive definite (SPD) matrices by incorporating generalized versions of the Bures-Wasserstein (GBW), Log-Euclidean, and Wasserstein distances. While the Alpha-Procrustes framework has unified many classical metrics in both finite- and infinite- dimensional settings, it previously lacked the structural components necessary to realize these generalized forms. We introduce a formalism based on unitized Hilbert-Schmidt operators and an extended Mahalanobis norm that allows the construction of robust, infinite-dimensional generalizations of GBW and Log-Hilbert-Schmidt distances. Our approach also incorporates a learnable regularization parameter that enhances geometric stability in high-dimensional comparisons. Preliminary experiments reproducing benchmarks from the literature demonstrate the improved performance of our generalized metrics, particularly in scenarios involving comparisons between datasets of varying dimension and scale. This work lays a theoretical and computational foundation for advancing robust geometric methods in machine learning, statistical inference, and functional data analysis.

Fonte: arXiv stat.ML

RLScore 92

Constraint-based causal discovery with tiered background knowledge and latent variables in single or overlapping datasets

arXiv:2503.21526v3 Announce Type: replace Abstract: In this paper we consider the use of tiered background knowledge within constraint based causal discovery. Our focus is on settings relaxing causal sufficiency, i.e. allowing for latent variables which may arise because relevant information could not be measured at all, or not jointly, as in the case of multiple overlapping datasets. We first present novel insights into the properties of the 'tiered FCI' (tFCI) algorithm. Building on this, we introduce a new extension of the IOD (integrating overlapping datasets) algorithm incorporating tiered background knowledge, the 'tiered IOD' (tIOD) algorithm. We show that under full usage of the tiered background knowledge tFCI and tIOD are sound, while simple versions of the tIOD and tFCI are sound and complete. We further show that the tIOD algorithm can often be expected to be considerably more efficient and informative than the IOD algorithm even beyond the obvious restriction of the Markov equivalence classes. We provide a formal result on the conditions for this gain in efficiency and informativeness. Our results are accompanied by a series of examples illustrating the exact role and usefulness of tiered background knowledge.

Fonte: arXiv stat.ML

NLP/LLMsScore 95

Spatially-informed transformers: Injecting geostatistical covariance biases into self-attention for spatio-temporal forecasting

arXiv:2512.17696v1 Announce Type: cross Abstract: The modeling of high-dimensional spatio-temporal processes presents a fundamental dichotomy between the probabilistic rigor of classical geostatistics and the flexible, high-capacity representations of deep learning. While Gaussian processes offer theoretical consistency and exact uncertainty quantification, their prohibitive computational scaling renders them impractical for massive sensor networks. Conversely, modern transformer architectures excel at sequence modeling but inherently lack a geometric inductive bias, treating spatial sensors as permutation-invariant tokens without a native understanding of distance. In this work, we propose a spatially-informed transformer, a hybrid architecture that injects a geostatistical inductive bias directly into the self-attention mechanism via a learnable covariance kernel. By formally decomposing the attention structure into a stationary physical prior and a non-stationary data-driven residual, we impose a soft topological constraint that favors spatially proximal interactions while retaining the capacity to model complex dynamics. We demonstrate the phenomenon of ``Deep Variography'', where the network successfully recovers the true spatial decay parameters of the underlying process end-to-end via backpropagation. Extensive experiments on synthetic Gaussian random fields and real-world traffic benchmarks confirm that our method outperforms state-of-the-art graph neural networks. Furthermore, rigorous statistical validation confirms that the proposed method delivers not only superior predictive accuracy but also well-calibrated probabilistic forecasts, effectively bridging the gap between physics-aware modeling and data-driven learning.

Fonte: arXiv stat.ML

NLP/LLMsScore 95

Rotterdam artery-vein segmentation (RAV) dataset

arXiv:2512.17322v1 Announce Type: new Abstract: Purpose: To provide a diverse, high-quality dataset of color fundus images (CFIs) with detailed artery-vein (A/V) segmentation annotations, supporting the development and evaluation of machine learning algorithms for vascular analysis in ophthalmology. Methods: CFIs were sampled from the longitudinal Rotterdam Study (RS), encompassing a wide range of ages, devices, and capture conditions. Images were annotated using a custom interface that allowed graders to label arteries, veins, and unknown vessels on separate layers, starting from an initial vessel segmentation mask. Connectivity was explicitly verified and corrected using connected component visualization tools. Results: The dataset includes 1024x1024-pixel PNG images in three modalities: original RGB fundus images, contrast-enhanced versions, and RGB-encoded A/V masks. Image quality varied widely, including challenging samples typically excluded by automated quality assessment systems, but judged to contain valuable vascular information. Conclusion: This dataset offers a rich and heterogeneous source of CFIs with high-quality segmentations. It supports robust benchmarking and training of machine learning models under real-world variability in image quality and acquisition settings. Translational Relevance: By including connectivity-validated A/V masks and diverse image conditions, this dataset enables the development of clinically applicable, generalizable machine learning tools for retinal vascular analysis, potentially improving automated screening and diagnosis of systemic and ocular diseases.

Fonte: arXiv cs.CV

RLScore 89

Perfect reconstruction of sparse signals using nonconvexity control and one-step RSB message passing

arXiv:2512.17426v1 Announce Type: new Abstract: We consider sparse signal reconstruction via minimization of the smoothly clipped absolute deviation (SCAD) penalty, and develop one-step replica-symmetry-breaking (1RSB) extensions of approximate message passing (AMP), termed 1RSB-AMP. Starting from the 1RSB formulation of belief propagation, we derive explicit update rules of 1RSB-AMP together with the corresponding state evolution (1RSB-SE) equations. A detailed comparison shows that 1RSB-AMP and 1RSB-SE agree remarkably well at the macroscopic level, even in parameter regions where replica-symmetric (RS) AMP, termed RS-AMP, diverges and where the 1RSB description itself is not expected to be thermodynamically exact. Fixed-point analysis of 1RSB-SE reveals a phase diagram consisting of success, failure, and diverging phases, as in the RS case. However, the diverging-region boundary now depends on the Parisi parameter due to the 1RSB ansatz, and we propose a new criterion -- minimizing the size of the diverging region -- rather than the conventional zero-complexity condition, to determine its value. Combining this criterion with the nonconvexity-control (NCC) protocol proposed in a previous RS study improves the algorithmic limit of perfect reconstruction compared with RS-AMP. Numerical solutions of 1RSB-SE and experiments with 1RSB-AMP confirm that this improved limit is achieved in practice, though the gain is modest and remains slightly inferior to the Bayes-optimal threshold. We also report the behavior of thermodynamic quantities -- overlaps, free entropy, complexity, and the non-self-averaging susceptibility -- that characterize the 1RSB phase in this problem.

Fonte: arXiv stat.ML

RLScore 95

Generative Multi-Objective Bayesian Optimization with Scalable Batch Evaluations for Sample-Efficient De Novo Molecular Design

arXiv:2512.17659v1 Announce Type: new Abstract: Designing molecules that must satisfy multiple, often conflicting objectives is a central challenge in molecular discovery. The enormous size of chemical space and the cost of high-fidelity simulations have driven the development of machine learning-guided strategies for accelerating design with limited data. Among these, Bayesian optimization (BO) offers a principled framework for sample-efficient search, while generative models provide a mechanism to propose novel, diverse candidates beyond fixed libraries. However, existing methods that couple the two often rely on continuous latent spaces, which introduces both architectural entanglement and scalability challenges. This work introduces an alternative, modular "generate-then-optimize" framework for de novo multi-objective molecular design/discovery. At each iteration, a generative model is used to construct a large, diverse pool of candidate molecules, after which a novel acquisition function, qPMHI (multi-point Probability of Maximum Hypervolume Improvement), is used to optimally select a batch of candidates most likely to induce the largest Pareto front expansion. The key insight is that qPMHI decomposes additively, enabling exact, scalable batch selection via only simple ranking of probabilities that can be easily estimated with Monte Carlo sampling. We benchmark the framework against state-of-the-art latent-space and discrete molecular optimization methods, demonstrating significant improvements across synthetic benchmarks and application-driven tasks. Specifically, in a case study related to sustainable energy storage, we show that our approach quickly uncovers novel, diverse, and high-performing organic (quinone-based) cathode materials for aqueous redox flow battery applications.

Fonte: arXiv stat.ML

VisionScore 95

Disentangled representations via score-based variational autoencoders

arXiv:2512.17127v1 Announce Type: new Abstract: We present the Score-based Autoencoder for Multiscale Inference (SAMI), a method for unsupervised representation learning that combines the theoretical frameworks of diffusion models and VAEs. By unifying their respective evidence lower bounds, SAMI formulates a principled objective that learns representations through score-based guidance of the underlying diffusion process. The resulting representations automatically capture meaningful structure in the data: it recovers ground truth generative factors in our synthetic dataset, learns factorized, semantic latent dimensions from complex natural images, and encodes video sequences into latent trajectories that are straighter than those of alternative encoders, despite training exclusively on static images. Furthermore, SAMI can extract useful representations from pre-trained diffusion models with minimal additional training. Finally, the explicitly probabilistic formulation provides new ways to identify semantically meaningful axes in the absence of supervised labels, and its mathematical exactness allows us to make formal statements about the nature of the learned representation. Overall, these results indicate that implicit structural information in diffusion models can be made explicit and interpretable through synergistic combination with a variational autoencoder.

Fonte: arXiv stat.ML

RLScore 96

Sharing Knowledge without Sharing Data: Stitches can improve ensembles of disjointly trained models

arXiv:2512.17592v1 Announce Type: new Abstract: Deep learning has been shown to be very capable at performing many real-world tasks. However, this performance is often dependent on the presence of large and varied datasets. In some settings, like in the medical domain, data is often fragmented across parties, and cannot be readily shared. While federated learning addresses this situation, it is a solution that requires synchronicity of parties training a single model together, exchanging information about model weights. We investigate how asynchronous collaboration, where only already trained models are shared (e.g. as part of a publication), affects performance, and propose to use stitching as a method for combining models. Through taking a multi-objective perspective, where performance on each parties' data is viewed independently, we find that training solely on a single parties' data results in similar performance when merging with another parties' data, when considering performance on that single parties' data, while performance on other parties' data is notably worse. Moreover, while an ensemble of such individually trained networks generalizes better, performance on each parties' own dataset suffers. We find that combining intermediate representations in individually trained models with a well placed pair of stitching layers allows this performance to recover to a competitive degree while maintaining improved generalization, showing that asynchronous collaboration can yield competitive results.

Fonte: arXiv cs.LG

RLScore 96

Assessing Long-Term Electricity Market Design for Ambitious Decarbonization Targets using Multi-Agent Reinforcement Learning

arXiv:2512.17444v1 Announce Type: new Abstract: Electricity systems are key to transforming today's society into a carbon-free economy. Long-term electricity market mechanisms, including auctions, support schemes, and other policy instruments, are critical in shaping the electricity generation mix. In light of the need for more advanced tools to support policymakers and other stakeholders in designing, testing, and evaluating long-term markets, this work presents a multi-agent reinforcement learning model capable of capturing the key features of decarbonizing energy systems. Profit-maximizing generation companies make investment decisions in the wholesale electricity market, responding to system needs, competitive dynamics, and policy signals. The model employs independent proximal policy optimization, which was selected for suitability to the decentralized and competitive environment. Nevertheless, given the inherent challenges of independent learning in multi-agent settings, an extensive hyperparameter search ensures that decentralized training yields market outcomes consistent with competitive behavior. The model is applied to a stylized version of the Italian electricity system and tested under varying levels of competition, market designs, and policy scenarios. Results highlight the critical role of market design for decarbonizing the electricity sector and avoiding price volatility. The proposed framework allows assessing long-term electricity markets in which multiple policy and market mechanisms interact simultaneously, with market participants responding and adapting to decarbonization pathways.

Fonte: arXiv cs.LG

VisionScore 95

Pro-Pose: Unpaired Full-Body Portrait Synthesis via Canonical UV Maps

arXiv:2512.17143v1 Announce Type: new Abstract: Photographs of people taken by professional photographers typically present the person in beautiful lighting, with an interesting pose, and flattering quality. This is unlike common photos people can take of themselves. In this paper, we explore how to create a ``professional'' version of a person's photograph, i.e., in a chosen pose, in a simple environment, with good lighting, and standard black top/bottom clothing. A key challenge is to preserve the person's unique identity, face and body features while transforming the photo. If there would exist a large paired dataset of the same person photographed both ``in the wild'' and by a professional photographer, the problem would potentially be easier to solve. However, such data does not exist, especially for a large variety of identities. To that end, we propose two key insights: 1) Our method transforms the input photo and person's face to a canonical UV space, which is further coupled with reposing methodology to model occlusions and novel view synthesis. Operating in UV space allows us to leverage existing unpaired datasets. 2) We personalize the output photo via multi image finetuning. Our approach yields high-quality, reposed portraits and achieves strong qualitative and quantitative performance on real-world imagery.

Fonte: arXiv cs.CV

NLP/LLMsScore 96

Understanding Generalization in Role-Playing Models via Information Theory

arXiv:2512.17270v1 Announce Type: new Abstract: Role-playing models (RPMs) are widely used in real-world applications but underperform when deployed in the wild. This degradation can be attributed to distribution shifts, including user, character, and dialogue compositional shifts. Existing methods like LLM-as-a-judge fall short in providing a fine-grained diagnosis of how these shifts affect RPM generalization, and thus there lack formal frameworks to characterize RPM generalization behaviors. To bridge these gaps, we introduce an information-theoretic metric, named reasoning-based effective mutual information difference (R-EMID), to measure RPM performance degradation in an interpretable way. We also derive an upper bound on R-EMID to predict the worst-case generalization performance of RPMs and theoretically reveal how various shifts contribute to the RPM performance degradation. Moreover, we propose a co-evolving reinforcement learning framework to adaptively model the connection among user, character, and dialogue context and thus enhance the estimation of dialogue response generation probability, which is critical for calculating R-EMID. Finally, we evaluate the generalization performance of various RPMs using R-EMID, finding that user shift poses the highest risk among all shifts and reinforcement learning is the most effective approach for enhancing RPM generalization.

Fonte: arXiv cs.LG

RLScore 96

Learning Safe Autonomous Driving Policies Using Predictive Safety Representations

arXiv:2512.17586v1 Announce Type: new Abstract: Safe reinforcement learning (SafeRL) is a prominent paradigm for autonomous driving, where agents are required to optimize performance under strict safety requirements. This dual objective creates a fundamental tension, as overly conservative policies limit driving efficiency while aggressive exploration risks safety violations. The Safety Representations for Safer Policy Learning (SRPL) framework addresses this challenge by equipping agents with a predictive model of future constraint violations and has shown promise in controlled environments. This paper investigates whether SRPL extends to real-world autonomous driving scenarios. Systematic experiments on the Waymo Open Motion Dataset (WOMD) and NuPlan demonstrate that SRPL can improve the reward-safety tradeoff, achieving statistically significant improvements in success rate (effect sizes r = 0.65-0.86) and cost reduction (effect sizes r = 0.70-0.83), with p < 0.05 for observed improvements. However, its effectiveness depends on the underlying policy optimizer and the dataset distribution. The results further show that predictive safety representations play a critical role in improving robustness to observation noise. Additionally, in zero-shot cross-dataset evaluation, SRPL-augmented agents demonstrate improved generalization compared to non-SRPL methods. These findings collectively demonstrate the potential of predictive safety representations to strengthen SafeRL for autonomous driving.

Fonte: arXiv cs.LG

RLScore 96

SafeBench-Seq: A Homology-Clustered, CPU-Only Baseline for Protein Hazard Screening with Physicochemical/Composition Features and Cluster-Aware Confidence Intervals

arXiv:2512.17527v1 Announce Type: new Abstract: Foundation models for protein design raise concrete biosecurity risks, yet the community lacks a simple, reproducible baseline for sequence-level hazard screening that is explicitly evaluated under homology control and runs on commodity CPUs. We introduce SafeBench-Seq, a metadata-only, reproducible benchmark and baseline classifier built entirely from public data (SafeProtein hazards and UniProt benigns) and interpretable features (global physicochemical descriptors and amino-acid composition). To approximate "never-before-seen" threats, we homology-cluster the combined dataset at <=40% identity and perform cluster-level holdouts (no cluster overlap between train/test). We report discrimination (AUROC/AUPRC) and screening-operating points (TPR@1% FPR; FPR@95% TPR) with 95% bootstrap confidence intervals (n=200), and we provide calibrated probabilities via CalibratedClassifierCV (isotonic for Logistic Regression / Random Forest; Platt sigmoid for Linear SVM). We quantify probability quality using Brier score, Expected Calibration Error (ECE; 15 bins), and reliability diagrams. Shortcut susceptibility is probed via composition-preserving residue shuffles and length-/composition-only ablations. Empirically, random splits substantially overestimate robustness relative to homology-clustered evaluation; calibrated linear models exhibit comparatively good calibration, while tree ensembles retain slightly higher Brier/ECE. SafeBench-Seq is CPU-only, reproducible, and releases metadata only (accessions, cluster IDs, split labels), enabling rigorous evaluation without distributing hazardous sequences.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

Deep Learning-Based Surrogate Creep Modelling in Inconel 625: A High-Temperature Alloy Study

arXiv:2512.17477v1 Announce Type: new Abstract: Time-dependent deformation, particularly creep, in high-temperature alloys such as Inconel 625 is a key factor in the long-term reliability of components used in aerospace and energy systems. Although Inconel 625 shows excellent creep resistance, finite-element creep simulations in tools such as ANSYS remain computationally expensive, often requiring tens of minutes for a single 10,000-hour run. This work proposes deep learning based surrogate models to provide fast and accurate replacements for such simulations. Creep strain data was generated in ANSYS using the Norton law under uniaxial stresses of 50 to 150 MPa and temperatures of 700 to 1000 $^\circ$C, and this temporal dataset was used to train two architectures: a BiLSTM Variational Autoencoder for uncertainty-aware and generative predictions, and a BiLSTM Transformer hybrid that employs self-attention to capture long-range temporal behavior. Both models act as surrogate predictors, with the BiLSTM-VAE offering probabilistic output and the BiLSTM-Transformer delivering high deterministic accuracy. Performance is evaluated using RMSE, MAE, and $R^2$. Results show that the BiLSTM-VAE provides stable and reliable creep strain forecasts, while the BiLSTM-Transformer achieves strong accuracy across the full time range. Latency tests indicate substantial speedup: while each ANSYS simulation requires 30 to 40 minutes for a given stress-temperature condition, the surrogate models produce predictions within seconds. The proposed framework enables rapid creep assessment for design optimization and structural health monitoring, and provides a scalable solution for high-temperature alloy applications.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

Rumo à IA Conversacional Explicável para Diagnóstico Precoce com Modelos de Linguagem de Grande Escala

arXiv:2512.17559v1 Tipo de Anúncio: novo Resumo: Sistemas de saúde ao redor do mundo estão enfrentando problemas como diagnósticos ineficientes, aumento de custos e acesso limitado a especialistas. Esses problemas frequentemente levam a atrasos no tratamento e a resultados de saúde insatisfatórios. A maioria dos sistemas de diagnóstico baseados em IA e deep learning atuais não é muito interativa ou transparente, tornando-os menos eficazes em ambientes centrados no paciente. Esta pesquisa introduz um chatbot de diagnóstico alimentado por um Large Language Model (LLM), utilizando GPT-4o, Geração Aumentada por Recuperação e técnicas de IA explicável. O chatbot envolve os pacientes em uma conversa dinâmica, ajudando a extrair e normalizar sintomas enquanto prioriza diagnósticos potenciais por meio de correspondência de similaridade e questionamento adaptativo. Com o prompting Chain-of-Thought, o sistema também oferece um raciocínio mais transparente por trás de seus diagnósticos. Quando testado em comparação com modelos tradicionais de machine learning como Naive Bayes, Regressão Logística, SVM, Random Forest e KNN, o sistema baseado em LLM apresentou resultados impressionantes, alcançando uma precisão de 90% e uma precisão Top-3 de 100%. Esses achados oferecem uma perspectiva promissora para uma IA mais transparente, interativa e clinicamente relevante na saúde.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping

arXiv:2512.17570v1 Announce Type: new Abstract: SSD-offloaded training offers a practical and promising approach to making LLM training cost-effective. Building on gradient accumulation with micro-batches, this paper introduces GreedySnake, a new SSD-offloaded training system that employs vertical scheduling, which executes all microbatches of a layer before proceeding to the next. Compared to existing systems that use horizontal scheduling (i.e., executing micro-batches sequentially), GreedySnake achieves higher training throughput with smaller batch sizes, bringing the system much closer to the ideal scenario predicted by the roofline model. To further mitigate the I/O bottleneck, GreedySnake overlaps part of the optimization step with the forward pass of the next iteration. Experimental results on A100 GPUs show that GreedySnake achieves saturated training throughput improvements over ZeRO-Infinity: 1.96x on 1 GPU and 1.93x on 4 GPUs for GPT-65B, and 2.53x on 1 GPU for GPT-175B. The code is open-sourced at https://github.com/npz7yyk/GreedySnake

Fonte: arXiv cs.LG

NLP/LLMsScore 95

Data Augmentation Supporting a Conversational Agent Designed for Smoking Cessation Support Groups

arXiv:2512.17092v1 Announce Type: new Abstract: Online support groups for smoking cessation are economical and accessible, yet they often face challenges with low user engagement and stigma. The use of an automatic conversational agent would improve engagement by ensuring that all user comments receive a timely response.). We address the challenge of insufficient high-quality data by employing a two-level data augmentation strategy: synthetic data augmentation and real data augmentation. First, we fine-tuned an open source LLM to classify posts from our existing smoking cessation support groups and identify intents with low F1 (precision+recall) scores. Then, for these intents, we generate additional synthetic data using prompt engineering with the GPT model, with an average of 87\% of the generated synthetic posts deemed high quality by human annotators. Overall, the synthetic augmentation process resulted in 43\% of the original posts being selected for augmentation, followed by 140\% synthetic expansion of these posts. Additionally, we scraped more than 10,000 real posts from a related online support context, of which 73\% were validated as good quality by human annotators. Each synthetic or scraped post underwent rigorous validation involving human reviewers to ensure quality and relevance. The validated new data, combined with the original support group posts, formed an augmented dataset used to retrain the intent classifier. Performance evaluation of the retrained model demonstrated a 32\% improvement in F1, confirming the effectiveness of our data augmentation approach. Synthetic and real post augmentation led to similar performance improvements. This study provides a replicable framework for enhancing conversational agent performance in domains where data scarcity is a critical issue.

Fonte: arXiv cs.CL

NLP/LLMsScore 96

Learning What to Write: Write-Gated KV for Efficient Long-Context Inference

arXiv:2512.17452v1 Announce Type: new Abstract: Long-context LLM inference is bottlenecked by the quadratic attention complexity and linear KV cache growth. Prior approaches mitigate this via post-hoc selection or eviction but overlook the root inefficiency: indiscriminate writing to persistent memory. In this paper, we formalize KV cache management as a causal system of three primitives: KV Admission, Selection, and Eviction. We instantiate KV Admission via Write-Gated KV, a lightweight mechanism that learns to predict token utility before it enters the cache. By filtering out low-utility states early to maintain a compact global cache alongside a sliding local cache, Write-Gated KV reduces memory usage by 46-57% and delivers 3.03-3.45$\times$ prefill and 1.89-2.56$\times$ decode speedups on Llama model with negligible accuracy loss, all while remaining compatible with FlashAttention and paged-KV systems. These results demonstrate that learning what to write, is a principled and practical recipe for efficient long-context inference. Code is available at https://github.com/EMCLab-Sinica/WG-KV .

Fonte: arXiv cs.LG

RLScore 96

Adaptive Graph Pruning with Sudden-Events Evaluation for Traffic Prediction using Online Semi-Decentralized ST-GNNs

arXiv:2512.17352v1 Announce Type: new Abstract: Spatio-Temporal Graph Neural Networks (ST-GNNs) are well-suited for processing high-frequency data streams from geographically distributed sensors in smart mobility systems. However, their deployment at the edge across distributed compute nodes (cloudlets) createssubstantial communication overhead due to repeated transmission of overlapping node features between neighbouring cloudlets. To address this, we propose an adaptive pruning algorithm that dynamically filters redundant neighbour features while preserving the most informative spatial context for prediction. The algorithm adjusts pruning rates based on recent model performance, allowing each cloudlet to focus on regions experiencing traffic changes without compromising accuracy. Additionally, we introduce the Sudden Event Prediction Accuracy (SEPA), a novel event-focused metric designed to measure responsiveness to traffic slowdowns and recoveries, which are often missed by standard error metrics. We evaluate our approach in an online semi-decentralized setting with traditional FL, server-free FL, and Gossip Learning on two large-scale traffic datasets, PeMS-BAY and PeMSD7-M, across short-, mid-, and long-term prediction horizons. Experiments show that, in contrast to standard metrics, SEPA exposes the true value of spatial connectivity in predicting dynamic and irregular traffic. Our adaptive pruning algorithm maintains prediction accuracy while significantly lowering communication cost in all online semi-decentralized settings, demonstrating that communication can be reduced without compromising responsiveness to critical traffic events.

Fonte: arXiv cs.LG

RLScore 96

Alzheimer's Disease Brain Network Mining

arXiv:2512.17276v1 Announce Type: new Abstract: Machine learning approaches for Alzheimer's disease (AD) diagnosis face a fundamental challenges. Clinical assessments are expensive and invasive, leaving ground truth labels available for only a fraction of neuroimaging datasets. We introduce Multi view Adaptive Transport Clustering for Heterogeneous Alzheimer's Disease (MATCH-AD), a semi supervised framework that integrates deep representation learning, graph-based label propagation, and optimal transport theory to address this limitation. The framework leverages manifold structure in neuroimaging data to propagate diagnostic information from limited labeled samples to larger unlabeled populations, while using Wasserstein distances to quantify disease progression between cognitive states. Evaluated on nearly five thousand subjects from the National Alzheimer's Coordinating Center, encompassing structural MRI measurements from hundreds of brain regions, cerebrospinal fluid biomarkers, and clinical variables MATCHAD achieves near-perfect diagnostic accuracy despite ground truth labels for less than one-third of subjects. The framework substantially outperforms all baseline methods, achieving kappa indicating almost perfect agreement compared to weak agreement for the best baseline, a qualitative transformation in diagnostic reliability. Performance remains clinically useful even under severe label scarcity, and we provide theoretical convergence guarantees with proven bounds on label propagation error and transport stability. These results demonstrate that principled semi-supervised learning can unlock the diagnostic potential of the vast repositories of partially annotated neuroimaging data accumulating worldwide, substantially reducing annotation burden while maintaining accuracy suitable for clinical deployment.

Fonte: arXiv cs.LG

RLScore 93

A Theoretical Analysis of State Similarity Between Markov Decision Processes

arXiv:2512.17265v1 Announce Type: new Abstract: The bisimulation metric (BSM) is a powerful tool for analyzing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to state similarity between multiple MDPs remains challenging. Prior work has attempted to extend BSM to pairs of MDPs, but a lack of well-established mathematical properties has limited further theoretical analysis between MDPs. In this work, we formally establish a generalized bisimulation metric (GBSM) for measuring state similarity between arbitrary pairs of MDPs, which is rigorously proven with three fundamental metric properties, i.e., GBSM symmetry, inter-MDP triangle inequality, and a distance bound on identical spaces. Leveraging these properties, we theoretically analyze policy transfer, state aggregation, and sampling-based estimation across MDPs, obtaining explicit bounds that are strictly tighter than existing ones derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.

Fonte: arXiv cs.LG

VisionScore 95

EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories

arXiv:2512.17320v1 Announce Type: new Abstract: The widespread adoption of text-to-image (T2I) generation has raised concerns about privacy, bias, and copyright violations. Concept erasure techniques offer a promising solution by selectively removing undesired concepts from pre-trained models without requiring full retraining. However, these methods are often evaluated on a limited set of concepts, relying on overly simplistic and direct prompts. To test the boundaries of concept erasure techniques, and assess whether they truly remove targeted concepts from model representations, we introduce EMMA, a benchmark that evaluates five key dimensions of concept erasure over 12 metrics. EMMA goes beyond standard metrics like image quality and time efficiency, testing robustness under challenging conditions, including indirect descriptions, visually similar non-target concepts, and potential gender and ethnicity bias, providing a socially aware analysis of method behavior. Using EMMA, we analyze five concept erasure methods across five domains (objects, celebrities, art styles, NSFW, and copyright). Our results show that existing methods struggle with implicit prompts (i.e., generating the erased concept when it is indirectly referenced) and visually similar non-target concepts (i.e., failing to generate non-targeted concepts resembling the erased one), while some amplify gender and ethnicity bias compared to the original model.

Fonte: arXiv cs.CV

RLScore 96

Electric Vehicle Charging Load Forecasting: An Experimental Comparison of Machine Learning Methods

arXiv:2512.17257v1 Announce Type: new Abstract: With the growing popularity of electric vehicles as a means of addressing climate change, concerns have emerged regarding their impact on electric grid management. As a result, predicting EV charging demand has become a timely and important research problem. While substantial research has addressed energy load forecasting in transportation, relatively few studies systematically compare multiple forecasting methods across different temporal horizons and spatial aggregation levels in diverse urban settings. This work investigates the effectiveness of five time series forecasting models, ranging from traditional statistical approaches to machine learning and deep learning methods. Forecasting performance is evaluated for short-, mid-, and long-term horizons (on the order of minutes, hours, and days, respectively), and across spatial scales ranging from individual charging stations to regional and city-level aggregations. The analysis is conducted on four publicly available real-world datasets, with results reported independently for each dataset. To the best of our knowledge, this is the first work to systematically evaluate EV charging demand forecasting across such a wide range of temporal horizons and spatial aggregation levels using multiple real-world datasets.

Fonte: arXiv cs.LG

RLScore 93

Learning solution operator of dynamical systems with diffusion maps kernel ridge regression

arXiv:2512.17203v1 Announce Type: new Abstract: Many scientific and engineering systems exhibit complex nonlinear dynamics that are difficult to predict accurately over long time horizons. Although data-driven models have shown promise, their performance often deteriorates when the geometric structures governing long-term behavior are unknown or poorly represented. We demonstrate that a simple kernel ridge regression (KRR) framework, when combined with a dynamics-aware validation strategy, provides a strong baseline for long-term prediction of complex dynamical systems. By employing a data-driven kernel derived from diffusion maps, the proposed Diffusion Maps Kernel Ridge Regression (DM-KRR) method implicitly adapts to the intrinsic geometry of the system's invariant set, without requiring explicit manifold reconstruction or attractor modeling, procedures that often limit predictive performance. Across a broad range of systems, including smooth manifolds, chaotic attractors, and high-dimensional spatiotemporal flows, DM-KRR consistently outperforms state-of-the-art random feature, neural-network and operator-learning methods in both accuracy and data efficiency. These findings underscore that long-term predictive skill depends not only on model expressiveness, but critically on respecting the geometric constraints encoded in the data through dynamically consistent model selection. Together, simplicity, geometry awareness, and strong empirical performance point to a promising path for reliable and efficient learning of complex dynamical systems.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

The Effect of Negation on CLIP in Medical Imaging: Limitations of Contrastive Language-Image Pretraining

arXiv:2512.17121v1 Announce Type: new Abstract: Large vision-language models like CLIP are increasingly used in medical imaging tasks due to their ability to align images and text without the need for extensive labeled data. This makes them particularly useful for applications like image retrieval, report generation, and classification in clinical settings. A potential issue to this approach is that CLIP-based models often under perform when interpreting negated phrases, which is especially problematic in the context of medical diagnosing. In this study, we evaluate the Stanford AIMI CheXagent model on its ability to correctly retrieve chest X-ray images using prompts with and without negation. The goal of this project is to understand where this model fails and then use it as a base model to improve its retrieval accuracy by fine tuning methods outlined in previous work. Results from this study show improvement in handling of negation in the CLIP model with a slight decrease in accuracy of positive prompt evaluation. Alongside retrieval accuracy, we examined internal model behavior through token attribution, t-SNE projection, and attention-head ablation to better characterize how each fine tuning approach reshaped the text encoders representation of negated clinical language. Through this work, we hope to better understand the internal behavior of CLIP and improve its handling of negation using clinically relevant language for improving its reliability in medical AI devices.

Fonte: arXiv cs.LG

RLScore 93

UniCoMTE: A Universal Counterfactual Framework for Explaining Time-Series Classifiers on ECG Data

arXiv:2512.17100v1 Announce Type: cross Abstract: Machine learning models, particularly deep neural networks, have demonstrated strong performance in classifying complex time series data. However, their black-box nature limits trust and adoption, especially in high-stakes domains such as healthcare. To address this challenge, we introduce UniCoMTE, a model-agnostic framework for generating counterfactual explanations for multivariate time series classifiers. The framework identifies temporal features that most heavily influence a model's prediction by modifying the input sample and assessing its impact on the model's prediction. UniCoMTE is compatible with a wide range of model architectures and operates directly on raw time series inputs. In this study, we evaluate UniCoMTE's explanations on a time series ECG classifier. We quantify explanation quality by comparing our explanations' comprehensibility to comprehensibility of established techniques (LIME and SHAP) and assessing their generalizability to similar samples. Furthermore, clinical utility is assessed through a questionnaire completed by medical experts who review counterfactual explanations presented alongside original ECG samples. Results show that our approach produces concise, stable, and human-aligned explanations that outperform existing methods in both clarity and applicability. By linking model predictions to meaningful signal patterns, the framework advances the interpretability of deep learning models for real-world time series applications.

Fonte: arXiv cs.AI

RLScore 96

Learning to Plan, Planning to Learn: Adaptive Hierarchical RL-MPC for Sample-Efficient Decision Making

arXiv:2512.17091v1 Announce Type: cross Abstract: We propose a new approach for solving planning problems with a hierarchical structure, fusing reinforcement learning and MPC planning. Our formulation tightly and elegantly couples the two planning paradigms. It leverages reinforcement learning actions to inform the MPPI sampler, and adaptively aggregates MPPI samples to inform the value estimation. The resulting adaptive process leverages further MPPI exploration where value estimates are uncertain, and improves training robustness and the overall resulting policies. This results in a robust planning approach that can handle complex planning problems and easily adapts to different applications, as demonstrated over several domains, including race driving, modified Acrobot, and Lunar Lander with added obstacles. Our results in these domains show better data efficiency and overall performance in terms of both rewards and task success, with up to a 72% increase in success rate compared to existing approaches, as well as accelerated convergence (x2.1) compared to non-adaptive sampling.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Dynamic Tool Dependency Retrieval for Efficient Function Calling

arXiv:2512.17052v1 Announce Type: new Abstract: Function calling agents powered by Large Language Models (LLMs) select external tools to automate complex tasks. On-device agents typically use a retrieval module to select relevant tools, improving performance and reducing context length. However, existing retrieval methods rely on static and limited inputs, failing to capture multi-step tool dependencies and evolving task context. This limitation often introduces irrelevant tools that mislead the agent, degrading efficiency and accuracy. We propose Dynamic Tool Dependency Retrieval (DTDR), a lightweight retrieval method that conditions on both the initial query and the evolving execution context. DTDR models tool dependencies from function calling demonstrations, enabling adaptive retrieval as plans unfold. We benchmark DTDR against state-of-the-art retrieval methods across multiple datasets and LLM backbones, evaluating retrieval precision, downstream task accuracy, and computational efficiency. Additionally, we explore strategies to integrate retrieved tools into prompts. Our results show that dynamic tool retrieval improves function calling success rates between $23\%$ and $104\%$ compared to state-of-the-art static retrievers.

Fonte: arXiv cs.LG

RLScore 96

Distributed Learning in Markovian Restless Bandits over Interference Graphs for Stable Spectrum Sharing

arXiv:2512.17161v1 Announce Type: new Abstract: We study distributed learning for spectrum access and sharing among multiple cognitive communication entities, such as cells, subnetworks, or cognitive radio users (collectively referred to as cells), in communication-constrained wireless networks modeled by interference graphs. Our goal is to achieve a globally stable and interference-aware channel allocation. Stability is defined through a generalized Gale-Shapley multi-to-one matching, a well-established solution concept in wireless resource allocation. We consider wireless networks where L cells share S orthogonal channels and cannot simultaneously use the same channel as their neighbors. Each channel evolves as an unknown restless Markov process with cell-dependent rewards, making this the first work to establish global Gale-Shapley stability for channel allocation in a stochastic, temporally varying restless environment. To address this challenge, we develop SMILE (Stable Multi-matching with Interference-aware LEarning), a communication-efficient distributed learning algorithm that integrates restless bandit learning with graph-constrained coordination. SMILE enables cells to distributedly balance exploration of unknown channels with exploitation of learned information. We prove that SMILE converges to the optimal stable allocation and achieves logarithmic regret relative to a genie with full knowledge of expected utilities. Simulations validate the theoretical guarantees and demonstrate SMILE's robustness, scalability, and efficiency across diverse spectrum-sharing scenarios.

Fonte: arXiv cs.LG

RLScore 93

SFBD-OMNI: Bridge models for lossy measurement restoration with limited clean samples

arXiv:2512.17051v1 Announce Type: new Abstract: In many real-world scenarios, obtaining fully observed samples is prohibitively expensive or even infeasible, while partial and noisy observations are comparatively easy to collect. In this work, we study distribution restoration with abundant noisy samples, assuming the corruption process is available as a black-box generator. We show that this task can be framed as a one-sided entropic optimal transport problem and solved via an EM-like algorithm. We further provide a test criterion to determine whether the true underlying distribution is recoverable under per-sample information loss, and show that in otherwise unrecoverable cases, a small number of clean samples can render the distribution largely recoverable. Building on these insights, we introduce SFBD-OMNI, a bridge model-based framework that maps corrupted sample distributions to the ground-truth distribution. Our method generalizes Stochastic Forward-Backward Deconvolution (SFBD; Lu et al., 2025) to handle arbitrary measurement models beyond Gaussian corruption. Experiments across benchmark datasets and diverse measurement settings demonstrate significant improvements in both qualitative and quantitative performance.

Fonte: arXiv cs.LG

RLScore 96

SHARP-QoS: Sparsely-gated Hierarchical Adaptive Routing for joint Prediction of QoS

arXiv:2512.17262v1 Announce Type: new Abstract: Dependable service-oriented computing relies on multiple Quality of Service (QoS) parameters that are essential to assess service optimality. However, real-world QoS data are extremely sparse, noisy, and shaped by hierarchical dependencies arising from QoS interactions, and geographical and network-level factors, making accurate QoS prediction challenging. Existing methods often predict each QoS parameter separately, requiring multiple similar models, which increases computational cost and leads to poor generalization. Although recent joint QoS prediction studies have explored shared architectures, they suffer from negative transfer due to loss-scaling caused by inconsistent numerical ranges across QoS parameters and further struggle with inadequate representation learning, resulting in degraded accuracy. This paper presents an unified strategy for joint QoS prediction, called SHARP-QoS, that addresses these issues using three components. First, we introduce a dual mechanism to extract the hierarchical features from both QoS and contextual structures via hyperbolic convolution formulated in the Poincar\'e ball. Second, we propose an adaptive feature-sharing mechanism that allows feature exchange across informative QoS and contextual signals. A gated feature fusion module is employed to support dynamic feature selection among structural and shared representations. Third, we design an EMA-based loss balancing strategy that allows stable joint optimization, thereby mitigating the negative transfer. Evaluations on three datasets with two, three, and four QoS parameters demonstrate that SHARP-QoS outperforms both single- and multi-task baselines. Extensive study shows that our model effectively addresses major challenges, including sparsity, robustness to outliers, and cold-start, while maintaining moderate computational overhead, underscoring its capability for reliable joint QoS prediction.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

GB-DQN: Gradient Boosted DQN Models for Non-stationary Reinforcement Learning

arXiv:2512.17034v1 Announce Type: new Abstract: Non-stationary environments pose a fundamental challenge for deep reinforcement learning, as changes in dynamics or rewards invalidate learned value functions and cause catastrophic forgetting. We propose \emph{Gradient-Boosted Deep Q-Networks (GB-DQN)}, an adaptive ensemble method that addresses model drift through incremental residual learning. Instead of retraining a single Q-network, GB-DQN constructs an additive ensemble in which each new learner is trained to approximate the Bellman residual of the current ensemble after drift. We provide theoretical results showing that each boosting step reduces the empirical Bellman residual and that the ensemble converges to the post-drift optimal value function under standard assumptions. Experiments across a diverse set of control tasks with controlled dynamics changes demonstrate faster recovery, improved stability, and greater robustness compared to DQN and common non-stationary baselines.

Fonte: arXiv cs.LG

RLScore 92

Convergence Guarantees for Federated SARSA with Local Training and Heterogeneous Agents

arXiv:2512.17688v1 Announce Type: cross Abstract: We present a novel theoretical analysis of Federated SARSA (FedSARSA) with linear function approximation and local training. We establish convergence guarantees for FedSARSA in the presence of heterogeneity, both in local transitions and rewards, providing the first sample and communication complexity bounds in this setting. At the core of our analysis is a new, exact multi-step error expansion for single-agent SARSA, which is of independent interest. Our analysis precisely quantifies the impact of heterogeneity, demonstrating the convergence of FedSARSA with multiple local updates. Crucially, we show that FedSARSA achieves linear speed-up with respect to the number of agents, up to higher-order terms due to Markovian sampling. Numerical experiments support our theoretical findings.

Fonte: arXiv stat.ML

RLScore 96

DiffeoMorph: Aprendendo a Morfose de Formas 3D Usando Simulações Baseadas em Agentes Diferenciáveis

Sistemas biológicos podem formar estruturas tridimensionais complexas através do comportamento coletivo de agentes idênticos. Neste trabalho, apresentamos o DiffeoMorph, um framework diferenciável para aprender um protocolo de morfogênese que orienta uma população de agentes a se transformar em uma forma 3D alvo, utilizando uma rede neural gráfica SE(3)-equivariant baseada em atenção.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

Turn-PPO: Estimativa de Vantagem em Nível de Turno com PPO para Melhorar o RL Multi-Turno em LLMs Agentes

O aprendizado por reforço (RL) voltou a ser uma abordagem natural para treinar agentes LLM interativos em ambientes do mundo real. No entanto, a aplicação direta do algoritmo Group Relative Policy Optimization (GRPO) em tarefas multi-turno revela limitações significativas, especialmente em cenários que exigem raciocínio de longo prazo. Investigamos estratégias de estimativa de vantagem mais estáveis e eficazes, introduzindo o turn-PPO como uma variante que opera em uma formulação MDP em nível de turno.

Fonte: arXiv cs.LG

VisionScore 96

Machine Learning Leve Informado por Física para Previsão de Visibilidade na Aviação em Vários Regimes Climáticos

A previsão de curto prazo (nowcasting) de eventos de baixa visibilidade e precipitação é crítica para a segurança e eficiência operacional da aviação. Este estudo apresenta um framework leve de gradient boosting (XGBoost) treinado exclusivamente com dados de observação de superfície (METAR) e aprimorado por meio de engenharia de características guiada por princípios termodinâmicos.

Fonte: arXiv cs.LG

NLP/LLMsScore 96

QSMOTE-PGM/kPGM: Classificação de Conjuntos de Dados Desequilibrados com PGM e kPGM Baseados em QSMOTE

O aprendizado de máquina inspirado em quantum (QiML) utiliza estruturas matemáticas da teoria quântica para aprimorar algoritmos clássicos, focando em produtos internos em espaços de características de alta dimensão. Este trabalho apresenta uma comparação teórica e empírica dos classificadores PGM e KPGM, mostrando que ambos superam consistentemente um baseline de random forest clássico, especialmente com múltiplas cópias quânticas.

Fonte: arXiv cs.LG

RLScore 96

O Papel da Ética Islâmica na Prevenção do Abuso de Deepfakes Baseados em Inteligência Artificial (AI)

arXiv:2512.17218v1 Tipo de Anúncio: cruzado Resumo: O desenvolvimento significativo da tecnologia de deepfake impulsionada por inteligência artificial (AI) gerou preocupações em todo o mundo sobre a alteração de informações falsas, a usurpação de identidades online e a diminuição da confiança pública na autenticidade do conteúdo online. Esses incidentes não apenas levantam questões técnicas, mas também carregam implicações morais complexas, tornando inadequados os métodos convencionais de gestão, impulsionados pela tecnologia e reativos, para abordar as causas subjacentes do problema, incluindo intenção, moralidade e potenciais impactos sociais intangíveis. Com base nessas questões, este estudo visa formular uma estrutura ética islâmica abrangente que possa servir como uma ferramenta preventiva mais completa para mitigar os riscos de uso indevido de deepfakes. O estudo empregou uma Revisão Sistemática da Literatura (SLR) guiada pelo PRISMA, selecionando dez fontes primárias publicadas entre 2018 e 2025 para identificar deficiências éticas, necessidades regulatórias e soluções normativas apropriadas. A análise mostra que a integração dos princípios de (Maqasid al-Shariah), particularmente (hifz al-ird) proteção da honra e (hifz al-nafs) proteção do eu, fornece uma base normativa sólida para regular o uso responsável da tecnologia. Este estudo gera três recomendações estratégicas: mudanças regulatórias que reconheçam o dano intangível e psicológico causado pelo dano à reputação; melhor gestão da tecnologia por meio de escrutínio moral que defenda os valores de justiça (adl), confiança e transparência; e aumento da alfabetização digital pública com base no princípio de (tabayyun) exame e cautela. No geral, este estudo conclui que a aplicação da ética islâmica oferece uma mudança de pensamento de mecanismos punitivos para abordagens preventivas que se concentram na proteção da dignidade humana, na prevenção de danos e no fortalecimento do bem comum na era digital.

Fonte: arXiv cs.AI

RLScore 96

Viés Conservador na Aprendizagem Multi-Professor: Por Que Agentes Preferem Consultores de Baixa Recompensa

arXiv:2512.17180v1 Tipo de Anúncio: cruzado Resumo: A aprendizagem por reforço interativa (IRL) mostrou-se promissora em permitir que agentes e robôs autônomos aprendam comportamentos complexos com professores humanos, no entanto, a dinâmica de seleção de professores permanece pouco compreendida. Este artigo revela um fenômeno inesperado na IRL: ao serem apresentados a uma escolha entre professores com diferentes estruturas de recompensa, os agentes de aprendizagem preferem esmagadoramente professores conservadores e de baixa recompensa (taxa de seleção de 93,16%) em comparação àqueles que oferecem recompensas 20 vezes maiores. Através de 1.250 execuções experimentais em tarefas de navegação com múltiplos professores especialistas, descobrimos: (1) O viés conservador domina a seleção de professores: os agentes escolhem sistematicamente o professor de menor recompensa, priorizando a consistência em vez da optimalidade; (2) Existem limiares críticos de desempenho na disponibilidade de professores rho >= 0.6 e precisão omega >= 0.6, abaixo dos quais a estrutura falha de forma catastrófica; (3) A estrutura alcança uma melhoria de 159% em relação ao Q-learning de base sob desvio de conceito. Essas descobertas desafiam suposições fundamentais sobre o ensino ótimo em RL e sugerem implicações potenciais para a colaboração humano-robô, onde as preferências humanas por segurança e consistência podem se alinhar ao comportamento de seleção observado nos agentes, potencialmente informando paradigmas de treinamento para aplicações robóticas críticas em segurança.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

PILAR: Personalizando Interações de Realidade Aumentada com Explicações Centricas no Humano e Confiáveis Baseadas em LLM para Casos de Uso Diários

arXiv:2512.17172v1 Tipo de Anúncio: cruzado Resumo: Sistemas de realidade aumentada (AR) impulsionados por inteligência artificial (IA) estão se tornando cada vez mais integrados à vida cotidiana, e com esse crescimento vem uma maior necessidade de explicabilidade nas interações do usuário em tempo real. Métodos tradicionais de inteligência artificial explicável (XAI), que frequentemente dependem de explicações baseadas em características ou exemplos, lutam para fornecer insights dinâmicos, específicos ao contexto, personalizados e centrados no humano para usuários diários de AR. Esses métodos normalmente abordam dimensões de explicabilidade separadas (por exemplo, quando, o que, como) com diferentes técnicas de explicação, resultando em experiências fragmentadas e irreais para interações AR sem costura. Para enfrentar esse desafio, propomos o PILAR, uma nova estrutura que aproveita um modelo de linguagem grande (LLM) pré-treinado para gerar explicações personalizadas e conscientes do contexto, oferecendo uma experiência mais intuitiva e confiável em sistemas AR impulsionados por IA em tempo real. Ao contrário dos métodos tradicionais, que dependem de múltiplas técnicas para diferentes aspectos da explicação, o PILAR emprega uma abordagem unificada baseada em LLM que adapta dinamicamente as explicações às necessidades do usuário, promovendo maior confiança e engajamento. Implementamos o conceito PILAR em uma aplicação AR do mundo real (por exemplo, recomendações de receitas personalizadas), um protótipo de código aberto que integra detecção de objetos em tempo real, recomendação de receitas e explicações personalizadas baseadas em LLM das receitas recomendadas com base nas preferências dietéticas dos usuários. Avaliamos a eficácia do PILAR por meio de um estudo com 16 participantes realizando tarefas de recomendação de receitas baseadas em AR, comparando uma interface de explicação baseada em LLM com uma interface tradicional baseada em templates. Os resultados mostram que a interface baseada em LLM melhora significativamente o desempenho e a experiência do usuário, com os participantes completando as tarefas 40% mais rápido e relatando maior satisfação, facilidade de uso e transparência percebida.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Incorporando Embedding de Ruído de Nível de Erro para Melhorar a Robustez Assistida por LLM no Reconhecimento de Fala Persa

arXiv:2512.17247v1 Tipo de Anúncio: cruzado Resumo: Sistemas de Reconhecimento Automático de Fala (ASR) sofrem uma degradação significativa de desempenho em ambientes ruidosos, um desafio que é especialmente severo para idiomas de baixo recurso, como o persa. Mesmo modelos de ponta, como o Whisper, lutam para manter a precisão sob diferentes razões sinal-ruído (SNRs). Este estudo apresenta uma estrutura robusta de correção de erros de ASR sensível ao ruído que combina múltiplas hipóteses e modelagem consciente do ruído. Usando fala persa ruidosa, geramos 5 melhores hipóteses a partir de um decodificador Whisper-large modificado. O Ruído de Nível de Erro (ELN) é introduzido como uma representação que captura desacordos em nível semântico e de token entre as hipóteses, quantificando as distorções linguísticas causadas pelo ruído. Assim, o ELN fornece uma medida direta da incerteza induzida pelo ruído, permitindo que o LLM raciocine sobre a confiabilidade de cada hipótese durante a correção. Três modelos são avaliados: (1) um modelo base LLaMA-2-7B sem ajuste fino, (2) uma variante ajustada treinada apenas em hipóteses de texto, e (3) um modelo condicionado ao ruído integrando embeddings de ELN em níveis de sentença e palavra. Resultados experimentais demonstram que o modelo condicionado ao ELN alcança reduções substanciais na Taxa de Erro de Palavra (WER). Especificamente, no desafiador conjunto de testes Mixed Noise, o modelo proposto Fine-tuned + ELN (Nosso) reduz a WER de uma linha de base de 31,10% (Raw Whisper) para 24,84%, superando significativamente a linha de base Fine-tuned (Sem ELN) de 30,79%, enquanto o modelo original LLaMA-2-7B aumentou a WER para 64,58%, demonstrando que não é capaz de corrigir erros persas por conta própria. Isso confirma a eficácia de combinar múltiplas hipóteses com embeddings conscientes do ruído para um ASR persa robusto em cenários reais ruidosos.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images

arXiv:2512.17306v1 Announce Type: new Abstract: Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.

Fonte: arXiv cs.CV

NLP/LLMsScore 96

Modelos de Raciocínio Grande Podem Melhorar a Precisão em Tarefas Matemáticas Usando Pensamento Falho?

arXiv:2512.17079v1 Tipo de Anúncio: cruzado. Resumo: O prompting de cadeia de pensamento (CoT) tornou-se central para o raciocínio matemático em grandes modelos de linguagem, no entanto, os modelos continuam frágeis a erros iniciais: um único deslize aritmético ou uma inferência injustificada normalmente se propaga sem correção até uma resposta final incorreta. Investigamos se o treinamento em rastros de raciocínio intencionalmente falhos pode ensinar os modelos a detectar e se recuperar de tais erros sem degradar a capacidade padrão de resolução de problemas. Usando problemas de nível de competição do MATH-lighteval, geramos prefixos de CoT contendo exatamente um erro controlado, seja um erro de cálculo (mudanças de sinal, termos descartados) ou um erro de raciocínio (regras mal aplicadas, passos lógicos injustificados), e ajustamos finamente o Qwen3-4B com GRPO usando uma recompensa binária de resposta final. Nosso modelo Mixed-CoT-RL iguala o RL padrão em problemas limpos (41% vs 41%) enquanto supera substancialmente em problemas preenchidos com raciocínio falho (24% vs 19%). Notavelmente, o ajuste fino apenas com RL limpo degrada a robustez abaixo da linha de base não ajustada (19% vs 20%), indicando que o treinamento convencional aumenta a suscetibilidade a preenchimentos enganosos. Entre os tipos de erro, o treinamento em erros de raciocínio gera ganhos de robustez maiores do que apenas erros de cálculo, com o treinamento misto apresentando o melhor desempenho. Essas descobertas demonstram que a exposição a rastros falhos durante o treinamento pode melhorar o comportamento de recuperação de erros sem sacrificar a precisão, sugerindo um caminho em direção a um raciocínio matemático mais robusto em LLMs.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Adversarial VR: Um Testbed Open-Source para Avaliação da Robustez Adversarial na Detecção e Mitigação de Ciberdoença em VR

arXiv:2512.17029v1 Tipo de Anúncio: cruzado Resumo: Métodos automatizados de detecção de ciberdoença baseados em deep learning (DL), juntamente com técnicas de mitigação adaptativas, podem melhorar o conforto e a interação do usuário. No entanto, estudos recentes mostram que esses sistemas baseados em DL são suscetíveis a ataques adversariais; pequenas perturbações nas entradas dos sensores podem degradar o desempenho do modelo, acionar mitigações incorretas e interromper a experiência imersiva do usuário (UIX). Além disso, há uma falta de testbeds open-source dedicados que avaliem a robustez desses sistemas em condições adversariais, limitando a capacidade de avaliar sua eficácia no mundo real. Para abordar essa lacuna, este artigo introduz o Adversarial-VR, um novo testbed de VR em tempo real para avaliar estratégias de detecção e mitigação de ciberdoença baseadas em DL sob condições adversariais. Desenvolvido em Unity, o testbed integra dois modelos DL de última geração (SOTA): DeepTCN e Transformer, que são treinados no conjunto de dados open-source MazeSick, para detecção em tempo real da gravidade da ciberdoença e aplica um mecanismo dinâmico de tunelamento visual que ajusta o campo de visão com base nas saídas do modelo. Para avaliar a robustez, incorporamos três ataques adversariais SOTA: MI-FGSM, PGD e C&W, que conseguem prevenir a mitigação da ciberdoença enganando os resultados dos modelos de ciberdoença baseados em DL. Implementamos esses ataques usando um testbed com uma simulação de labirinto VR personalizada e um headset HTC Vive Pro Eye, e abrimos nossa implementação para adoção ampla por desenvolvedores e pesquisadores de VR. Os resultados mostram que esses ataques adversariais são capazes de enganar o sistema com sucesso. Por exemplo, o ataque C&W resulta em uma diminuição de $5.94x na precisão do modelo de ciberdoença baseado em Transformer em comparação com a precisão sem o ataque.

Fonte: arXiv cs.AI

RLScore 96

Conjunto de Dados Sintético que Preserva a Privacidade de Trajetórias Diárias Individuais para Análises de Mobilidade em Escala Urbana

arXiv:2512.17239v1 Tipo de Anúncio: cruzado Resumo: Dados de mobilidade urbana são indispensáveis para planejamento urbano, previsão de demanda de transporte, modelagem de pandemias e muitas outras aplicações; no entanto, rastros do Sistema de Posicionamento Global (GPS) derivados de telefones móveis individuais não podem geralmente ser compartilhados com terceiros devido a riscos severos de reidentificação. Registros agregados, como matrizes de origem-destino (OD), oferecem insights parciais, mas falham em capturar as principais propriedades comportamentais do movimento humano diário, limitando análises realistas em escala urbana. Este estudo apresenta um conjunto de dados sintético de mobilidade que preserva a privacidade e que reconstrói trajetórias diárias a partir de entradas agregadas. O método proposto integra fluxos OD com duas restrições comportamentais complementares: (1) quantis de tempo de permanência-viagem que estão disponíveis apenas como estatísticas resumidas grosseiras e (2) a lei universal para a distribuição diária do número de locais visitados. A incorporação desses elementos em uma estrutura de otimização multiobjetivo permite a reprodução de distribuições realistas de mobilidade humana, garantindo que nenhum identificador pessoal seja necessário. A estrutura proposta é validada em duas regiões contrastantes do Japão: (1) os 23 distritos especiais de Tóquio, representando um ambiente metropolitano denso; e (2) a Prefeitura de Fukuoka, onde padrões de mobilidade urbana e suburbana coexistem. Os dados sintéticos de mobilidade resultantes reproduzem distribuições de tempo de permanência-viagem e frequência de visitas com alta fidelidade, enquanto as variações na consistência de OD permanecem dentro da faixa natural de flutuações diárias. Os resultados deste estudo estabelecem um caminho prático de síntese sob restrições do mundo real, proporcionando a governos, planejadores urbanos e indústrias acesso escalável a dados de mobilidade de alta resolução para análises confiáveis, sem a necessidade de registros pessoais sensíveis, e apoiando implementações práticas em domínios de políticas e comerciais.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

CheXPO-v2: Otimização de Preferências para VLMs de Raio-X de Tórax com Consistência de Gráfico de Conhecimento

Modelos de Visão-Linguagem Médica (VLMs) são propensos a alucinações, comprometendo a confiabilidade clínica. Embora métodos de aprendizado por reforço como Group Relative Policy Optimization (GRPO) ofereçam uma solução de alinhamento de baixo custo, sua dependência de recompensas esparsas e baseadas em resultados encoraja inadvertidamente os modelos a 'pensar demais'. Propomos o CheXPO-v2, um novo framework de alinhamento que muda de supervisão baseada em resultados para supervisão de processos.

Fonte: arXiv cs.CV

VisionScore 95

Any-Optical-Model: Um Modelo de Fundação Universal para Sensoriamento Remoto Óptico

Os satélites ópticos, com suas diversas configurações de bandas e distâncias de amostragem no solo, fornecem evidências indispensáveis para tarefas que vão desde a vigilância de ecossistemas até a resposta a emergências. No entanto, discrepâncias significativas na composição de bandas e na resolução espacial entre diferentes sensores ópticos apresentam grandes desafios para os Modelos de Fundação de Sensoriamento Remoto (RSFMs) existentes.

Fonte: arXiv cs.CV

RLScore 96

Aprimorando a Classificação de Espécies de Árvores: Insights do YOLOv8 e IA Explicável Aplicados a Projeções de Nuvem de Pontos TLS

Classificar espécies de árvores é uma área de pesquisa central em sensoriamento remoto florestal há décadas. Novos sensores e abordagens de classificação, como TLS e deep learning, alcançam precisão de ponta, mas seus processos de decisão permanecem obscuros. Propomos um método inovador que liga explicações do Finer-CAM a segmentos de projeções TLS, avaliando sistematicamente quais características impulsionam a discriminação de espécies.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv:2512.16978v1 Announce Type: new Abstract: Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both and while some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and graded rubric for interpretable, and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in our LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: https://github.com/mbzuai-oryx/longshot.

Fonte: arXiv cs.CV

NLP/LLMsScore 96

Radar de Risco Sistêmico: Uma Estrutura de Grafo Multicamadas para Aviso Precoce de Colapso de Mercado

arXiv:2512.17185v1 Tipo de Anúncio: cruzado Resumo: Crises financeiras surgem quando vulnerabilidades estruturais se acumulam entre setores, mercados e comportamentos dos investidores. Prever essas transições sistêmicas é desafiador porque elas surgem de interações em evolução entre os participantes do mercado, e não apenas de movimentos de preços isolados. Apresentamos o Radar de Risco Sistêmico (SRR), uma estrutura que modela os mercados financeiros como grafos multicamadas para detectar sinais precoces de fragilidade sistêmica e transições de regime de colapso. Avaliamos o SRR em três crises principais: o colapso da Dot-com, a Crise Financeira Global e o choque da COVID-19. Nossos experimentos comparam GNNs instantâneas, um protótipo GNN temporal simplificado e linhas de base padrão (regressão logística e Random Forest). Os resultados mostram que informações estruturais da rede fornecem sinais de alerta precoce úteis em comparação com modelos baseados em características isoladamente. Esta instância baseada em correlação do SRR demonstra que características derivadas de grafos capturam mudanças significativas na estrutura do mercado durante eventos de estresse. As descobertas motivam a extensão do SRR com camadas de grafo adicionais (exposição setorial/fatorial, sentimento) e arquiteturas temporais mais expressivas (codificadores LSTM/GRU ou Transformer) para lidar melhor com diversos tipos de crise.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning

arXiv:2512.17312v1 Announce Type: new Abstract: Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a reward for balanced and adaptive tool-call, which balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism of executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models.

Fonte: arXiv cs.CV

NLP/LLMsScore 95

Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos

arXiv:2512.17229v1 Announce Type: new Abstract: Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which could also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending model's context length, they may miss useful information or take considerable computation. In fact, when answering given questions, only a small amount of crucial information is required. Therefore, we propose an efficient question-aware memory mechanism, enabling MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies this task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy is employed by introducing a few special memory tokens to achieve purposefully compression. This allows models to effectively seek critical clues while reducing visual tokens. Then, due to history context could have a significant impact, we recurrently aggregate and store these memory tokens to update history context, which would be reused for subsequent sub-segments. Furthermore, to more effectively measure model's long video understanding ability, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset, which features grounding critical and concrete clues scattered throughout entire videos. Experimental results demonstrate our method enables MLLMs with limited context length of 32K to efficiently process 100K tokens (3600 frames, an hour-long video sampled at 1fps), requiring only 2 minutes and 37GB GPU memory usage. Evaluation results across multiple long video benchmarks illustrate our method can more effectively seek critical clues from massive information.

Fonte: arXiv cs.CV

NLP/LLMsScore 95

Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning

arXiv:2512.17227v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) demonstrate significant potential but remain brittle in complex, long-chain visual reasoning tasks. A critical failure mode is "visual forgetting", where models progressively lose visual grounding as reasoning extends, a phenomenon aptly described as "think longer, see less". We posit this failure stems from current training paradigms prematurely entangling two distinct cognitive skills: (1) abstract logical reasoning "how-to-think") and (2) strategic visual perception ("when-to-look"). This creates a foundational cold-start deficiency -- weakening abstract reasoning -- and a strategic perception deficit, as models lack a policy for when to perceive. In this paper, we propose a novel curriculum-based framework to disentangle these skills. First, we introduce a disentangled Supervised Fine-Tuning (SFT) curriculum that builds a robust abstract reasoning backbone on text-only data before anchoring it to vision with a novel Perception-Grounded Chain-of-Thought (PG-CoT) paradigm. Second, we resolve the strategic perception deficit by formulating timing as a reinforcement learning problem. We design a Pivotal Perception Reward that teaches the model when to look by coupling perceptual actions to linguistic markers of cognitive uncertainty (e.g., "wait", "verify"), thereby learning an autonomous grounding policy. Our contributions include the formalization of these two deficiencies and the development of a principled, two-stage framework to address them, transforming the model from a heuristic-driven observer to a strategic, grounded reasoner. \textbf{Code}: \url{https://github.com/gaozilve-max/learning-when-to-look}.

Fonte: arXiv cs.CV

VisionScore 96

Bots Não Ficam Parados: Um Estudo Longitudinal sobre Mudança de Comportamento de Bots, Deriva Temporal e Evolução da Estrutura de Recursos

arXiv:2512.17067v1 Tipo de Anúncio: cruzado Resumo: Bots sociais estão agora profundamente integrados em plataformas online para promoção, persuasão e manipulação. A maioria dos sistemas de detecção de bots ainda trata as características comportamentais como estáticas, assumindo implicitamente que os bots se comportam de maneira estacionária ao longo do tempo. Testamos essa suposição para bots promocionais do Twitter, analisando a mudança tanto em sinais comportamentais individuais quanto nas relações entre eles. Usando 2.615 contas de bots promocionais e 2,8 milhões de tweets, construímos séries temporais anuais para dez meta-recursos baseados em conteúdo. Testes de Augmented Dickey-Fuller e KPSS, além de tendências lineares, mostram que todos os dez são não estacionários: nove aumentam ao longo do tempo, enquanto a diversidade linguística diminui ligeiramente. Estratificando por geração de ativação e idade da conta, revelamos diferenças sistemáticas: bots de segunda geração são os mais ativos e pesados em links; bots de curta duração mostram atividade intensa e repetitiva com uso intenso de hashtags/URLs; bots de longa duração são menos ativos, mas mais linguisticamente diversos e usam emojis de forma mais variável. Em seguida, analisamos a coocorrência entre gerações usando 18 características binárias interpretáveis que abrangem ações, similaridade de tópicos, URLs, hashtags, sentimento, emojis e mídia (153 pares). Testes qui-quadrado indicam que quase todos os pares são dependentes. As correlações de Spearman mudam em força e, às vezes, polaridade: muitos links (por exemplo, múltiplas hashtags com mídia; sentimento com URLs) se fortalecem, enquanto outros mudam de fracos positivos para fracos ou moderadamente negativos. Gerações posteriores mostram combinações mais estruturadas de sinais. Juntas, essas análises fornecem evidências de que bots sociais promocionais se adaptam ao longo do tempo tanto no nível de meta-recursos individuais quanto no nível de interdependências de recursos, com implicações diretas para o design e a avaliação de sistemas de detecção de bots treinados em características comportamentais históricas.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Um Referencial de Saúde da Mulher para Modelos de Linguagem de Grande Escala

arXiv:2512.17028v1 Tipo de Anúncio: cruzado. Resumo: À medida que modelos de linguagem de grande escala (LLMs) se tornam fontes primárias de informação em saúde para milhões, sua precisão na saúde da mulher permanece criticamente inexplorada. Apresentamos o Women's Health Benchmark (WHB), o primeiro referencial que avalia o desempenho de LLMs especificamente na saúde da mulher. Nosso referencial compreende 96 modelos rigorosamente validados cobrindo cinco especialidades médicas (obstetrícia e ginecologia, medicina de emergência, cuidados primários, oncologia e neurologia), três tipos de consulta (consulta de paciente, consulta de clínico e consulta de evidência/política) e oito tipos de erro (erros de dosagem/medicação, falta de informações críticas, diretrizes/recomendações de tratamento desatualizadas, conselhos de tratamento incorretos, informações factuais incorretas, diagnóstico diferencial ausente/incorreto, urgência perdida e recomendações inadequadas). Avaliamos 13 LLMs de última geração e revelamos lacunas alarmantes: os modelos atuais apresentam taxas de falha de aproximadamente 60% no referencial de saúde da mulher, com desempenho variando dramaticamente entre especialidades e tipos de erro. Notavelmente, os modelos universalmente têm dificuldades com indicadores de 'urgência perdida', enquanto modelos mais novos como o GPT-5 mostram melhorias significativas em evitar recomendações inadequadas. Nossos achados ressaltam que chatbots de IA ainda não são totalmente capazes de fornecer conselhos confiáveis em saúde da mulher.

Fonte: arXiv cs.AI

RLScore 96

Navegando Expansões Taxonômicas de Conjuntos de Entidades Impulsionadas por Bases de Conhecimento

arXiv:2512.16953v1 Tipo de Anúncio: novo Resumo: Reconhecer semelhanças entre entidades é central tanto para a cognição humana quanto para a inteligência computacional. Dentro desse cenário mais amplo, a Expansão de Conjuntos de Entidades é uma tarefa proeminente que visa pegar um conjunto inicial de (tuplas de) entidades e identificar outras que compartilham propriedades semânticas relevantes com as anteriores -- potencialmente repetindo o processo para formar conjuntos cada vez mais amplos. No entanto, essa abordagem ``linear'' não revela as estruturas ``taxonômicas'' mais ricas presentes em recursos de conhecimento. Uma estrutura recente baseada em lógica introduz a noção de um grafo de expansão: um grafo acíclico dirigido enraizado onde cada nó representa uma generalização semântica rotulada por uma fórmula lógica, e as arestas codificam inclusão semântica estrita. Essa estrutura apoia expansões taxonômicas de conjuntos de entidades impulsionadas por bases de conhecimento. No entanto, o tamanho potencialmente grande de tais grafos pode tornar a materialização completa impraticável em cenários do mundo real. Para superar isso, formalizamos tarefas de raciocínio que verificam se duas tuplas pertencem a nós comparáveis, incomparáveis ou ao mesmo nó no grafo. Nossos resultados mostram que, sob suposições realistas -- como limitar a entrada ou restringir descrições de entidades -- essas tarefas podem ser implementadas de forma eficiente. Isso permite a navegação local e incremental de grafos de expansão, apoiando aplicações práticas sem exigir a construção completa do grafo.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

MemoryGraft: Comprometimento Persistente de Agentes LLM via Recuperação de Experiências Envenenadas

arXiv:2512.16962v1 Tipo de Anúncio: cruzado Resumo: Agentes de Modelos de Linguagem de Grande Escala (LLM) dependem cada vez mais de memória de longo prazo e Geração Aumentada por Recuperação (RAG) para persistir experiências e refinar o desempenho futuro. Embora essa capacidade de aprendizado de experiências aumente a autonomia dos agentes, ela introduz uma superfície de ataque crítica e inexplorada, ou seja, a fronteira de confiança entre o núcleo de raciocínio de um agente e seu próprio passado. Neste artigo, apresentamos o MemoryGraft. É um novo ataque de injeção indireta que compromete o comportamento do agente não por meio de jailbreaks imediatos, mas implantando experiências maliciosas bem-sucedidas na memória de longo prazo do agente. Ao contrário das injeções de prompt tradicionais que são transitórias, ou da contaminação padrão de RAG que visa o conhecimento factual, o MemoryGraft explora a heurística de imitação semântica do agente, que é a tendência de replicar padrões de tarefas bem-sucedidas recuperadas. Demonstramos que um atacante que pode fornecer artefatos benignos em nível de ingestão que o agente lê durante a execução pode induzi-lo a construir um repositório RAG envenenado onde um pequeno conjunto de templates de procedimentos maliciosos é persistido ao lado de experiências benignas. Quando o agente encontra posteriormente tarefas semanticamente semelhantes, a recuperação de união sobre similaridade lexical e de embedding revela de forma confiável essas memórias enxertadas, e o agente adota os padrões inseguros incorporados, levando a uma deriva comportamental persistente ao longo das sessões. Validamos o MemoryGraft no agente DataInterpreter da MetaGPT com GPT-4o e descobrimos que um pequeno número de registros envenenados pode representar uma grande fração das experiências recuperadas em cargas de trabalho benignas, transformando a auto-melhoria baseada em experiências em um vetor para comprometimento furtivo e durável. Para facilitar a reprodutibilidade e futuras pesquisas, nosso código e dados de avaliação estão disponíveis em https://github.com/Jacobhhy/Agent-Memory-Poisoning.

Fonte: arXiv cs.AI

RLScore 93

Traduzindo o Efeito Rashomon para Tarefas de Decisão Sequencial

arXiv:2512.17470v1 Tipo de Anúncio: novo Resumo: O efeito Rashomon descreve o fenômeno onde múltiplos modelos treinados nos mesmos dados produzem previsões idênticas, enquanto diferem nas características que utilizam internamente. Este efeito tem sido amplamente estudado em tarefas de classificação, mas não em decisão sequencial, onde um agente aprende uma política para alcançar um objetivo ao tomar ações em um ambiente. Neste artigo, traduzimos o efeito Rashomon para a decisão sequencial. Definimos isso como múltiplas políticas que exibem comportamento idêntico, visitando os mesmos estados e selecionando as mesmas ações, enquanto diferem em sua estrutura interna, como atribuições de características. Verificar comportamento idêntico em decisão sequencial difere da classificação. Na classificação, previsões podem ser comparadas diretamente com rótulos verdadeiros. Na decisão sequencial com transições estocásticas, a mesma política pode ter sucesso ou falhar em qualquer trajetória única devido à aleatoriedade. Abordamos isso usando métodos de verificação formal que constroem e comparam o comportamento probabilístico completo de cada política no ambiente. Nossos experimentos demonstram que o efeito Rashomon existe na decisão sequencial. Além disso, mostramos que conjuntos construídos a partir do conjunto Rashomon exibem maior robustez a mudanças de distribuição do que políticas individuais. Adicionalmente, políticas permissivas derivadas do conjunto Rashomon reduzem os requisitos computacionais para verificação enquanto mantêm desempenho ótimo.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

DAVE: A VLM Vision Encoder for Document Understanding and Web Agents

arXiv:2512.17221v1 Announce Type: new Abstract: While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.

Fonte: arXiv cs.CV

NLP/LLMsScore 96

Sobre o Papel da Informação Contextual e Estados do Ego no Comportamento de Agentes LLM para Diálogos de Análise Transacional

arXiv:2512.17060v1 Tipo de Anúncio: cruzado. Resumo: Agentes impulsionados por LLM agora são utilizados em muitas áreas, desde suporte ao cliente até educação, e há um interesse crescente em sua capacidade de agir mais como humanos. Isso inclui campos como pesquisa social, política e psicológica, onde o objetivo é modelar dinâmicas de grupo e comportamento social. No entanto, os atuais agentes LLM frequentemente carecem da profundidade psicológica e consistência necessárias para capturar os padrões reais do pensamento humano. Eles geralmente fornecem respostas diretas ou estatisticamente prováveis, mas perdem os objetivos mais profundos, conflitos emocionais e motivações que impulsionam as interações humanas reais. Este artigo propõe um Sistema Multi-Agente (MAS) inspirado na teoria da Análise Transacional (TA). No sistema proposto, cada agente é dividido em três estados do ego - Pai, Adulto e Criança. Os estados do ego são tratados como estruturas de conhecimento separadas com suas próprias perspectivas e estilos de raciocínio. Para enriquecer seu processo de resposta, eles têm acesso a um mecanismo de recuperação de informações que lhes permite recuperar informações contextuais relevantes de seus repositórios vetoriais. Esta arquitetura é avaliada por meio de testes de ablação em um cenário de diálogo simulado, comparando agentes com e sem recuperação de informações. Os resultados são promissores e abrem novas direções para explorar como estruturas fundamentadas psicologicamente podem enriquecer o comportamento dos agentes. A contribuição é uma arquitetura de agente que integra a teoria da Análise Transacional com a recuperação de informações contextuais para aprimorar o realismo de simulações multi-agente baseadas em LLM.

Fonte: arXiv cs.AI

RLScore 96

Conhecimento Inesperado: Auditoria das Recomendações de Busca do Wikipedia e Grokipedia

arXiv:2512.17027v1 Tipo de Anúncio: cruzado. Resumo: Plataformas de conhecimento enciclopédico são portas de entrada essenciais pelas quais os usuários exploram informações online. O recente lançamento do Grokipedia, uma enciclopédia totalmente gerada por IA, introduz uma nova alternativa às plataformas tradicionais e bem estabelecidas como o Wikipedia. Nesse contexto, os mecanismos de busca desempenham um papel importante em guiar os caminhos exploratórios dos usuários, no entanto, seu comportamento em diferentes sistemas enciclopédicos permanece pouco explorado. Neste trabalho, abordamos essa lacuna fornecendo a primeira análise comparativa dos mecanismos de busca no Wikipedia e Grokipedia. Usando quase 10.000 palavras neutras em inglês e suas substrings como consultas, coletamos mais de 70.000 resultados de mecanismos de busca e examinamos seu alinhamento semântico, sobreposição e estrutura tópica. Descobrimos que ambas as plataformas frequentemente geram resultados que estão fraca ou moderadamente relacionados à consulta original e, em muitos casos, apresentam conteúdos inesperados a partir de consultas inócuas. Apesar dessas propriedades compartilhadas, os dois sistemas frequentemente produzem conjuntos de recomendações substancialmente diferentes para a mesma consulta. Através da anotação tópica e da análise de trajetórias, identificamos ainda diferenças sistemáticas em como as categorias de conteúdo são apresentadas e como os resultados dos mecanismos de busca evoluem ao longo de múltiplas etapas de exploração. No geral, nossas descobertas mostram que resultados inesperados de mecanismos de busca são uma característica comum de ambas as plataformas, mesmo que apresentem discrepâncias em termos de distribuição tópica e sugestões de consulta.

Fonte: arXiv cs.AI

VisionScore 95

WDFFU-Mamba: A Wavelet-guided Dual-attention Feature Fusion Mamba for Breast Tumor Segmentation in Ultrasound Images

arXiv:2512.17278v1 Announce Type: new Abstract: Breast ultrasound (BUS) image segmentation plays a vital role in assisting clinical diagnosis and early tumor screening. However, challenges such as speckle noise, imaging artifacts, irregular lesion morphology, and blurred boundaries severely hinder accurate segmentation. To address these challenges, this work aims to design a robust and efficient model capable of automatically segmenting breast tumors in BUS images.We propose a novel segmentation network named WDFFU-Mamba, which integrates wavelet-guided enhancement and dual-attention feature fusion within a U-shaped Mamba architecture. A Wavelet-denoised High-Frequency-guided Feature (WHF) module is employed to enhance low-level representations through noise-suppressed high-frequency cues. A Dual Attention Feature Fusion (DAFF) module is also introduced to effectively merge skip-connected and semantic features, improving contextual consistency.Extensive experiments on two public BUS datasets demonstrate that WDFFU-Mamba achieves superior segmentation accuracy, significantly outperforming existing methods in terms of Dice coefficient and 95th percentile Hausdorff Distance (HD95).The combination of wavelet-domain enhancement and attention-based fusion greatly improves both the accuracy and robustness of BUS image segmentation, while maintaining computational efficiency.The proposed WDFFU-Mamba model not only delivers strong segmentation performance but also exhibits desirable generalization ability across datasets, making it a promising solution for real-world clinical applications in breast tumor ultrasound analysis.

Fonte: arXiv cs.CV

RLScore 96

Sobre Tempo: Aprendizado por Reforço Sem Modelo com Máquinas de Recompensa Temporizadas

arXiv:2512.17637v1 Tipo de Anúncio: novo Resumo: A especificação de recompensas desempenha um papel central no aprendizado por reforço (RL), orientando o comportamento do agente. Para expressar recompensas não-Markovianas, formalismos como máquinas de recompensa foram introduzidos para capturar dependências em histórias. No entanto, máquinas de recompensa tradicionais carecem da capacidade de modelar restrições de tempo precisas, limitando seu uso em aplicações sensíveis ao tempo. Neste artigo, propomos máquinas de recompensa temporizadas (TRMs), que são uma extensão das máquinas de recompensa que incorporam restrições de tempo na estrutura de recompensa. As TRMs permitem especificações mais expressivas com lógica de recompensa ajustável, por exemplo, impondo custos por atrasos e concedendo recompensas por ações pontuais. Estudamos estruturas de RL sem modelo (ou seja, Q-learning tabular) para aprender políticas ótimas com TRMs sob semânticas digitais e em tempo real. Nossos algoritmos integram a TRM ao aprendizado por meio de abstrações de autômatos temporizados e empregam heurísticas de imaginação contrafactual que exploram a estrutura da TRM para melhorar a busca. Experimentalmente, demonstramos que nosso algoritmo aprende políticas que alcançam altas recompensas enquanto satisfazem as restrições de tempo especificadas pela TRM em benchmarks populares de RL. Além disso, realizamos estudos comparativos de desempenho sob diferentes semânticas de TRM, juntamente com ablações que destacam os benefícios da imaginação contrafactual.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Aprendizado por Reforço para Agente Autoaperfeiçoável com Biblioteca de Habilidades

arXiv:2512.17102v1 Tipo de Anúncio: novo Resumo: Agentes baseados em Modelos de Linguagem Grande (LLM) demonstraram capacidades notáveis em raciocínio complexo e interações de múltiplas etapas, mas enfrentam dificuldades para melhorar e se adaptar continuamente quando implantados em novos ambientes. Uma abordagem promissora é a implementação de bibliotecas de habilidades que permitem aos agentes aprender, validar e aplicar novas habilidades. No entanto, as abordagens atuais de bibliotecas de habilidades dependem principalmente da solicitação de LLM, tornando a implementação consistente da biblioteca de habilidades desafiadora. Para superar esses desafios, propomos uma abordagem baseada em Aprendizado por Reforço (RL) para aprimorar as capacidades de autoaperfeiçoamento dos agentes com uma biblioteca de habilidades. Especificamente, introduzimos o Skill Augmented GRPO for self-Evolution (SAGE), uma nova estrutura de RL que incorpora sistematicamente habilidades ao aprendizado. O componente chave da estrutura, Sequential Rollout, implanta iterativamente os agentes em uma cadeia de tarefas similares para cada rollout. À medida que os agentes navegam pela cadeia de tarefas, as habilidades geradas a partir de tarefas anteriores se acumulam na biblioteca e se tornam disponíveis para tarefas subsequentes. Além disso, a estrutura aprimora a geração e utilização de habilidades através de uma Recompensa Integrada de Habilidade que complementa as recompensas baseadas em resultados originais. Resultados experimentais em AppWorld demonstram que o SAGE, quando aplicado a um modelo ajustado supervisionado com experiência de especialistas, alcança 8,9% a mais na Conclusão de Objetivos de Cenário, enquanto requer 26% menos etapas de interação e gera 59% menos tokens, superando substancialmente as abordagens existentes em precisão e eficiência.

Fonte: arXiv cs.AI

NLP/LLMsScore 95

FORMSpoT: A Decade of Tree-Level, Country-Scale Forest Monitoring

arXiv:2512.17021v1 Announce Type: new Abstract: The recent decline of the European forest carbon sink highlights the need for spatially explicit and frequently updated forest monitoring tools. Yet, existing satellite-based disturbance products remain too coarse to detect changes at the scale of individual trees, typically below 100 m$^{2}$. Here, we introduce FORMSpoT (Forest Mapping with SPOT Time series), a decade-long (2014-2024) nationwide mapping of forest canopy height at 1.5 m resolution, together with annual disturbance polygons (FORMSpoT-$\Delta$) covering mainland France. Canopy heights were derived from annual SPOT-6/7 composites using a hierarchical transformer model (PVTv2) trained on high-resolution airborne laser scanning (ALS) data. To enable robust change detection across heterogeneous acquisitions, we developed a dedicated post-processing pipeline combining co-registration and spatio-temporal total variation denoising. Validation against ALS revisits across 19 sites and 5,087 National Forest Inventory plots shows that FORMSpoT-$\Delta$ substantially outperforms existing disturbance products. In mountainous forests, where disturbances are small and spatially fragmented, FORMSpoT-$\Delta$ achieves an F1-score of 0.44, representing an order of magnitude higher than existing benchmarks. By enabling tree-level monitoring of forest dynamics at national scale, FORMSpoT-$\Delta$ provides a unique tool to analyze management practices, detect early signals of forest decline, and better quantify carbon losses from subtle disturbances such as thinning or selective logging. These results underscore the critical importance of sustaining very high-resolution satellite missions like SPOT and open-data initiatives such as DINAMIS for monitoring forests under climate change.

Fonte: arXiv cs.CV

VisionScore 95

PhysFire-WM: A Physics-Informed World Model for Emulating Fire Spread Dynamics

arXiv:2512.17152v1 Announce Type: new Abstract: Fine-grained fire prediction plays a crucial role in emergency response. Infrared images and fire masks provide complementary thermal and boundary information, yet current methods are predominantly limited to binary mask modeling with inherent signal sparsity, failing to capture the complex dynamics of fire. While world models show promise in video generation, their physical inconsistencies pose significant challenges for fire forecasting. This paper introduces PhysFire-WM, a Physics-informed World Model for emulating Fire spread dynamics. Our approach internalizes combustion dynamics by encoding structured priors from a Physical Simulator to rectify physical discrepancies, coupled with a Cross-task Collaborative Training strategy (CC-Train) that alleviates the issue of limited information in mask-based modeling. Through parameter sharing and gradient coordination, CC-Train effectively integrates thermal radiation dynamics and spatial boundary delineation, enhancing both physical realism and geometric accuracy. Extensive experiments on a fine-grained multimodal fire dataset demonstrate the superior accuracy of PhysFire-WM in fire spread prediction. Validation underscores the importance of physical priors and cross-task collaboration, providing new insights for applying physics-informed world models to disaster prediction.

Fonte: arXiv cs.CV

NLP/LLMsScore 95

AVM: Towards Structure-Preserving Neural Response Modeling in the Visual Cortex Across Stimuli and Individuals

arXiv:2512.16948v1 Announce Type: new Abstract: While deep learning models have shown strong performance in simulating neural responses, they often fail to clearly separate stable visual encoding from condition-specific adaptation, which limits their ability to generalize across stimuli and individuals. We introduce the Adaptive Visual Model (AVM), a structure-preserving framework that enables condition-aware adaptation through modular subnetworks, without modifying the core representation. AVM keeps a Vision Transformer-based encoder frozen to capture consistent visual features, while independently trained modulation paths account for neural response variations driven by stimulus content and subject identity. We evaluate AVM in three experimental settings, including stimulus-level variation, cross-subject generalization, and cross-dataset adaptation, all of which involve structured changes in inputs and individuals. Across two large-scale mouse V1 datasets, AVM outperforms the state-of-the-art V1T model by approximately 2% in predictive correlation, demonstrating robust generalization, interpretable condition-wise modulation, and high architectural efficiency. Specifically, AVM achieves a 9.1% improvement in explained variance (FEVE) under the cross-dataset adaptation setting. These results suggest that AVM provides a unified framework for adaptive neural modeling across biological and experimental conditions, offering a scalable solution under structural constraints. Its design may inform future approaches to cortical modeling in both neuroscience and biologically inspired AI systems.

Fonte: arXiv cs.CV

RLScore 96

Riscos de Segurança de Veículos Agentes: Uma Análise Sistemática de Ameaças Cognitivas e Intercamadas

arXiv:2512.17041v1 Tipo de Anúncio: novo Resumo: A IA Agente está sendo cada vez mais explorada e introduzida em veículos tanto manuais quanto autônomos, levando à noção de Veículos Agentes (AgVs), com capacidades como personalização baseada em memória, interpretação de objetivos, raciocínio estratégico e assistência mediada por ferramentas. Embora estruturas como os Riscos de Segurança da IA Agente da OWASP destaquem vulnerabilidades em sistemas de IA baseados em raciocínio, elas não são projetadas para plataformas ciberfísicas críticas para a segurança, como veículos, nem levam em conta interações com outras camadas, como camadas de percepção, comunicação e controle. Este artigo investiga ameaças de segurança em AgVs, incluindo riscos no estilo OWASP e ciberataques de outras camadas que afetam a camada agente. Ao introduzir uma arquitetura baseada em papéis para veículos agentes, consistindo em um Agente Pessoal e um Agente de Estratégia de Direção, investigaremos vulnerabilidades tanto na camada de IA agente quanto em riscos intercamadas, incluindo riscos originados de camadas superiores (por exemplo, camada de percepção, camada de controle, etc.). Uma matriz de severidade e uma análise de cadeia de ataque ilustram como pequenas distorções podem escalar para comportamentos desalinhados ou inseguros em veículos dirigidos por humanos e autônomos. A estrutura resultante fornece a primeira base estruturada para analisar riscos de segurança da IA agente em plataformas de veículos atuais e emergentes.

Fonte: arXiv cs.AI

VisionScore 95

Homografia Infinita como Condicionamento Robusto para Geração de Vídeo Controlada por Câmera

O progresso recente em modelos de difusão de vídeo gerou interesse crescente na geração de vídeo de nova perspectiva controlada por câmera para cenas dinâmicas. Um desafio chave é garantir a fidelidade à pose da câmera especificada, mantendo a consistência de visualização e raciocinando sobre a geometria ocluída a partir de observações limitadas. Apresentamos o InfCam, um framework de geração de vídeo para vídeo controlado por câmera com alta fidelidade de pose.

Fonte: arXiv cs.CV

RLScore 93

Valor sob Ignorância na Inteligência Artificial Universal

arXiv:2512.17086v1 Tipo de Anúncio: novo Resumo: Generalizamos o agente de aprendizado por reforço AIXI para admitir uma classe mais ampla de funções de utilidade. Atribuir uma utilidade a cada possível histórico de interação nos força a confrontar a ambiguidade de que algumas hipóteses na distribuição de crenças do agente apenas preveem um prefixo finito da história, o que às vezes é interpretado como implicando uma chance de morte igual a uma quantidade chamada perda semimedida. Essa interpretação de morte sugere uma maneira de atribuir utilidades a tais prefixos históricos. Argumentamos que é tão natural ver as distribuições de crença como distribuições de probabilidade imprecisas, com a perda semimedida representando total ignorância. Isso nos motiva a considerar as consequências de calcular utilidades esperadas com integrais de Choquet a partir da teoria da probabilidade imprecisa, incluindo uma investigação de seu nível de computabilidade. Recuperamos a função de valor recursiva padrão como um caso especial. No entanto, nossas utilidades esperadas mais gerais sob a interpretação de morte não podem ser caracterizadas como tais integrais de Choquet.

Fonte: arXiv cs.AI

NLP/LLMsScore 93

UniRel-R1: Raciocínio LLM ajustado por RL para Respostas a Perguntas Relacionais em Grafos de Conhecimento

arXiv:2512.17043v1 Tipo de Anúncio: novo Resumo: A Resposta a Perguntas em Grafos de Conhecimento (KGQA) tradicionalmente se concentrou em consultas centradas em entidades que retornam uma única entidade de resposta. No entanto, consultas do mundo real são frequentemente relacionais, buscando entender como as entidades estão associadas. Neste trabalho, introduzimos o KGQA centrado em relações, um cenário complementar onde a resposta é um subgrafo que captura as conexões semânticas entre entidades, em vez de uma entidade individual. O principal desafio reside na abundância de subgrafos candidatos, onde conexões triviais ou excessivamente comuns muitas vezes obscurecem a identificação de respostas únicas e informativas. Para enfrentar isso, propomos o UniRel-R1, uma estrutura unificada que integra seleção de subgrafos, poda de grafo em múltiplas etapas e um LLM ajustado com aprendizado por reforço. A função de recompensa é projetada para incentivar subgrafos compactos e específicos com relações mais informativas e entidades intermediárias de menor grau. Experimentos extensivos mostram que o UniRel-R1 alcança ganhos significativos em conectividade e recompensa em relação a linhas de base Vanilla e generaliza efetivamente para entidades e relações não vistas.

Fonte: arXiv cs.AI

NLP/LLMsScore 96

Classificação de Hipóteses Inspirada em Solomonoff com LLMs para Previsão sob Incerteza

arXiv:2512.17145v1 Tipo de Anúncio: novo Resumo: Raciocinar sob incerteza é um desafio fundamental em IA, especialmente para tarefas do mundo real, onde problemas com dados escassos exigem generalização sistemática. Abordagens existentes lutam para equilibrar precisão e simplicidade ao avaliar múltiplas soluções candidatas. Propomos um método inspirado em Solomonoff que pondera hipóteses geradas por LLM com base na simplicidade e no ajuste preditivo. Aplicado a tarefas de benchmark (Mini-ARC), nosso método produz misturas ponderadas por Solomonoff para previsões por célula, resultando em saídas conservadoras e cientes da incerteza, mesmo quando as hipóteses são ruidosas ou parcialmente incorretas. Comparado ao Bayesian Model Averaging (BMA), a pontuação de Solomonoff distribui a probabilidade de forma mais uniforme entre as hipóteses concorrentes, enquanto o BMA concentra peso nas candidatas mais prováveis, mas potencialmente falhas. Em diversas tarefas, isso destaca o valor de priors teóricos da informação algorítmica para um raciocínio multi-hipótese interpretável e confiável sob incerteza.

Fonte: arXiv cs.AI