Reinforcement Learning

Browse the papers under this label, with translations into PT-BR.

Articles

🎯 Reinforcement Learning • 1000 articles found

NLP/LLMs • Score 85

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

arXiv:2605.21988v1. Abstract: Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL framework for improving spatiotemporal sensitivity. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a Counterfactual Relation Reward (CRR) between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. This cross-branch constraint makes it difficult for shortcut policies to be consistently rewarded across both branches. To evaluate this property, we introduce DyBench, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts. The project website can be found at https://ddz16.github.io/crpo.github.io/.

Source: arXiv cs.CV
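
The abstract above describes three mechanisms: building counterfactual videos by flipping and reversing, rewarding the relation between the two branches' answers, and scoring with a strict pair accuracy. The sketch below illustrates those ideas only and is not the authors' implementation. The function names, the (T, H, W, C) array layout, the binary 0/1 reward, and the exact-match answer comparison are all assumptions made for illustration; the abstract does not specify the actual CRR formula or the DyBench scoring code.

```python
import numpy as np


def horizontal_flip(video):
    """Mirror every frame left-to-right. Assumes shape (T, H, W, C)."""
    return video[:, :, ::-1, :]


def temporal_reverse(video):
    """Play the clip backwards by reversing the time axis."""
    return video[::-1]


def relation_reward(answer_orig, answer_cf, is_dynamic):
    """Hypothetical binary stand-in for a counterfactual relation reward.

    Dynamic questions should get a different answer on the counterfactual
    video; static questions should get the same answer.
    """
    changed = answer_orig != answer_cf
    return 1.0 if changed == is_dynamic else 0.0


def pair_accuracy(pairs):
    """Fraction of pairs where BOTH the original and counterfactual
    predictions match their gold answers.

    pairs: list of (pred_orig, gold_orig, pred_cf, gold_cf).
    """
    if not pairs:
        return 0.0
    hits = sum(po == go and pc == gc for po, go, pc, gc in pairs)
    return hits / len(pairs)
```

Under this assumed scoring, a policy that always gives the same answer earns no pair-accuracy credit on any pair whose two gold answers differ, which is the kind of fixed-answer shortcut the abstract says the metric is meant to block.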

Articles

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space

Mind the Sim-to-Real Gap & Think Like a Scientist

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

Value-Gradient Hypothesis of RL for LLMs

Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

A Tale of Two Cities: Pessimism and Opportunism in Offline Dynamic Pricing

Scalable On-Policy Reinforcement Learning via Adaptive Batch Scaling

Uncertainty quantification for Markov chain induced martingales with application to temporal difference learning

Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration

Algorithm Design and Stronger Guarantees for the Improving Multi-Armed Bandits Problem

ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

Do Not Trust The Auctioneer: Learning to Bid in Feedback-Manipulated Auctions

For How Long Should We Be Punching? Learning Action Duration in Fighting Games

Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents

Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning

FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

Spectral Souping: A Unified Framework for Online Preference Alignment

Batched Single-Index Global Multi-Armed Bandits with Covariates

Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

STRIDE: Learnable Stepwise Linguistic Feedback for LLM Reasoning

Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning

Metric-Gradient Projection for Stable Multi-Agent Policy Learning

Graph-Driven Cross-Industry Real-Time Monitoring Framework for Anti-Money Laundering Detection in Converged Mobility-Energy Supply Chain Networks

Composition of Memory Experts for Diffusion World Models

Generative Auto-Bidding with Unified Modeling and Exploration

Projecting Latent RL Actions: Towards Generalizable and Scalable Graph Combinatorial Optimization

PROWL: Prioritized Regret-Driven Optimization for World Model Learning

Precision Physical Activity Prescription via Reinforcement Learning for Functional Actions

Memory-Augmented Reinforcement Learning Agent for CAD Generation

Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

Online Market Making and the Value of Observing the Order Book

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

Not all uncertainty is alike: volatility, stochasticity, and exploration

StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video

Investigating Action Encodings in Recurrent Neural Networks in Reinforcement Learning

Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models

REC-RL: Referring expression counting via Gaussian and range-based reward optimization

Nested Spatio-Temporal Time Series Forecasting

On Gaussian approximation for entropy-regularized Q-learning with function approximation

Taming "Zombie" Agents: A Markov State-Aware Framework for Resilient Multi-Agent Evolution

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

A Structural Threshold in Decision Capacity Governs Collapse in Self-Play Reinforcement Learning

QuantFPFlow: Quantum Amplitude Estimation for Fokker-Planck Policy Optimisation in Continuous Reinforcement Learning

From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

Language Game: Talking to Non-Human Systems

NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning

Identifiable Token Correspondence for World Models

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

Learning in Position-Aware Multinomial Logit Bandits: From Multiplicative to General Position Effects

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning