Academic Papers

See all open-access papers (AI/ML) and their PT-BR translations, and browse by label.

Labels

All articles

611 article(s) in the collection

Vision • Score 92

TimeColor: Flexible Reference Colorization via Temporal Concatenation

arXiv:2601.00296v1 Abstract: Most colorization models condition only on a single reference, typically the first frame of the scene. However, this approach ignores other sources of conditional data, such as character sheets, background images, or arbitrary colorized frames. We propose TimeColor, a sketch-based video colorization model that supports heterogeneous, variable-count references with the use of explicit per-reference region assignment. TimeColor encodes references as additional latent frames which are concatenated temporally, permitting them to be processed concurrently in each diffusion step while keeping the model's parameter count fixed. TimeColor also uses spatiotemporal correspondence-masked attention to enforce subject-reference binding in addition to modality-disjoint RoPE indexing. These mechanisms mitigate shortcutting and cross-identity palette leakage. Experiments on SAKUGA-42M under both single- and multi-reference protocols show that TimeColor improves color fidelity, identity consistency, and temporal stability over prior baselines.
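
A minimal sketch of the temporal-concatenation idea described above, assuming a latent video diffusion backbone; the tensor names, shapes, and the negative-index positional scheme are illustrative assumptions, not the paper's code.

```python
import torch

B, C, T, H, W = 1, 4, 16, 32, 32                 # batch, latent channels, frames, spatial dims
video_latents = torch.randn(B, C, T, H, W)       # noisy video latents being denoised
refs = [torch.randn(B, C, 1, H, W) for _ in range(3)]  # variable-count reference latents

# References become extra "frames": concatenating along the temporal axis lets a
# fixed-parameter video backbone process them jointly at every diffusion step.
x = torch.cat(refs + [video_latents], dim=2)     # (B, C, 3 + T, H, W)

# Modality-disjoint positional indexing could give references a separate index
# range, e.g. negative time positions, so they are never confused with video
# frames (an assumption about how the disjoint RoPE ranges might be laid out).
time_ids = torch.cat([torch.arange(-len(refs), 0), torch.arange(T)])
print(x.shape, time_ids.tolist())
```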

Source: arXiv cs.CV

NLP/LLMs • Score 96

The Trojan Horse in the Vocabulary: Subtle Sabotage of LLM Composition

The open-weight LLM ecosystem is increasingly defined by model-composition techniques that remix capabilities from diverse sources. A critical prerequisite for applying these methods is tokenizer transplantation, which aligns incompatible vocabularies into a shared embedding space. We demonstrate that this interoperability step introduces a supply-chain vulnerability.

Source: arXiv cs.LG

RL • Score 93

Progressive Ideation Using an Agentic AI Framework for Human-AI Co-Creation

Generating truly novel and diverse ideas is crucial for contemporary engineering design, but it remains a significant cognitive challenge for novice designers. We propose MIDAS (Meta-cognitive Ideation through Distributed Agentic AI System), a novel framework that replaces the single-AI paradigm with a distributed 'team' of specialized AI agents designed to emulate the human meta-cognitive ideation workflow.

Source: arXiv cs.AI

RL • Score 93

Next-Generation Intelligent Low-Altitude Economy Deployments: An O-RAN Perspective

Despite growing interest in low-altitude economy (LAE) applications such as UAV-based logistics and emergency response, fundamental challenges remain in orchestrating these missions in complex, signal-constrained environments. This paper presents an O-RAN-enabled LAE framework that optimizes critical operations through coordination between the disaggregated RAN architecture and intelligent controllers.

Source: arXiv cs.AI

NLP/LLMs • Score 96

Toward Large-Scale Photonics-Powered AI Systems: From Physical Design Automation to System-Algorithm Co-Exploration

In this work, we identify three essential considerations for realizing practical photonic AI systems at scale: (1) support for the dynamic tensor operations required by modern models; (2) systematic management of conversion, control, and data-movement overheads; and (3) robustness under hardware non-idealities. We develop a photonic AI design-support tool that spans early exploration through physical realization.

Source: arXiv cs.AI

NLP/LLMs • Score 95

Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers

arXiv:2601.00359v1 Abstract: In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer - an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text-based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time requirements, operating at 26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin. Additionally, we show qualitative results that highlight the effectiveness and possible use cases in real-world applications. Overall, our method serves as a drop-in replacement for traditional segmentation approaches while enabling flexible natural-language querying and seamless integration into 3D mapping pipelines for mobile robotics.

Source: arXiv cs.CV

NLP/LLMs • Score 95

Pat-DEVAL: Chain-of-Legal-Thought Evaluation for Patent Description

arXiv:2601.00166v1 Abstract: Patent descriptions must deliver comprehensive technical disclosure while meeting strict legal standards such as enablement and written description requirements. Although large language models have enabled end-to-end automated patent drafting, existing evaluation approaches fail to assess long-form structural coherence and statutory compliance specific to descriptions. We propose Pat-DEVAL, the first multi-dimensional evaluation framework dedicated to patent description bodies. Leveraging the LLM-as-a-judge paradigm, Pat-DEVAL introduces Chain-of-Legal-Thought (CoLT), a legally-constrained reasoning mechanism that enforces sequential patent-law-specific analysis. Experiments validated by patent experts on our Pap2Pat-EvalGold dataset demonstrate that Pat-DEVAL achieves a Pearson correlation of 0.69, significantly outperforming baseline metrics and existing LLM evaluators. Notably, the framework exhibits a superior correlation of 0.73 in Legal-Professional Compliance, proving that the explicit injection of statutory constraints is essential for capturing nuanced legal validity. By establishing a new standard for ensuring both technical soundness and legal compliance, Pat-DEVAL provides a robust methodological foundation for the practical deployment of automated patent drafting systems.

Source: arXiv cs.CL

Evaluation/Benchmarks • Score 92

Generative Conditional Missing-Value Imputation Networks

In this study, we present a sophisticated conditional generative strategy for imputing missing values in datasets, an area of major importance in statistical analysis. We clarify the theoretical foundations of Generative Conditional Missing-value Imputation networks (GCMI) and demonstrate their robust properties under both Missing Completely at Random (MCAR) and Missing at Random (MAR) settings.
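
A hedged sketch of conditional generative imputation in the spirit of this summary; the generator architecture, mask convention, and dimensions below are assumptions, not the GCMI design or losses.

```python
import torch
import torch.nn as nn

d = 8
generator = nn.Sequential(nn.Linear(2 * d, 64), nn.ReLU(), nn.Linear(64, d))

x = torch.randn(32, d)                      # complete data, used only to simulate missingness
mask = (torch.rand(32, d) > 0.3).float()    # 1 = observed, 0 = missing (an MCAR pattern)
x_obs = x * mask

# Condition the generator on observed values plus the mask, then keep observed
# entries and fill only the missing ones with generated values.
x_gen = generator(torch.cat([x_obs, mask], dim=1))
imputed = mask * x_obs + (1 - mask) * x_gen
print(imputed.shape)
```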

Source: arXiv stat.ML

NLP/LLMs • Score 93

Neural Minimum-Weight Perfect Matching for Quantum Error-Correcting Codes

Realizing the full potential of quantum computing requires Quantum Error Correction (QEC), which reduces error rates by encoding logical information into redundant physical qubits. In this work, we propose a data-driven decoder called Neural Minimum-Weight Perfect Matching (NMWPM), which uses a hybrid architecture to predict dynamic edge weights, demonstrating a significant reduction in the Logical Error Rate (LER) compared to standard baselines.
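
To make the matching stage concrete, here is a hedged sketch where a learned model would supply edge weights between syndrome defects and a minimum-weight perfect matching pairs them up. The toy graph and weights are assumptions; since networkx maximizes weight, weights are negated.

```python
import networkx as nx

defects = [0, 1, 2, 3]                           # detection events on the decoding graph
predicted_weight = {(0, 1): 1.2, (0, 2): 2.5, (0, 3): 2.1,
                    (1, 2): 1.9, (1, 3): 2.4, (2, 3): 0.8}  # stand-in for NN outputs

G = nx.Graph()
G.add_nodes_from(defects)
for (u, v), w in predicted_weight.items():
    G.add_edge(u, v, weight=-w)                  # negate: max-weight matching => min-weight

matching = nx.max_weight_matching(G, maxcardinality=True)
print(matching)                                  # defect pairs decoded as error chains
```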

Source: arXiv cs.AI

NLP/LLMs • Score 96

Hear the Beat in Phases: Physiologically Grounded Phase-Aware ECG Biometrics

Electrocardiography (ECG) is used for identity authentication on wearable devices thanks to its individual-specific characteristics and inherent liveness. We propose a Hierarchical Phase-Aware Fusion (HPAF) framework that explicitly avoids feature entanglement through a three-stage design.

Source: arXiv cs.AI

NLP/LLMs • Score 96

DA-DPO: Difficulty-Aware, Cost-Efficient Preference Optimization for Reducing Hallucinations in MLLMs

Direct Preference Optimization (DPO) has shown great potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing approaches often suffer from overfitting due to difficulty imbalance in the preference data. We propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework that rebalances the learning process.
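
For reference, the standard DPO objective that a difficulty-aware variant would reweight; this is the published DPO loss, not DA-DPO's modified formulation, and the per-sample difficulty weighting is the paper's own contribution:

```latex
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```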

Source: arXiv cs.AI

Vision • Score 96

Neural Brain Fields: A NeRF-Inspired Approach to Generating Nonexistent EEG Electrodes

Electroencephalography (EEG) data pose unique modeling challenges due to varying lengths, very low signal-to-noise ratios, and large differences across participants. This work presents a new method inspired by Neural Radiance Fields (NeRF) for processing EEG signals, enabling continuous visualization of brain activity and the simulation of data from nonexistent electrodes.

Source: arXiv cs.AI

Vision • Score 96

Evaluating Anomaly Detectors on Simulated, Highly Imbalanced Industrial Classification Problems

Machine learning offers potential solutions to current problems in industrial systems, such as quality control and predictive maintenance, but faces unique barriers in industrial applications. This paper presents a comprehensive evaluation of anomaly detection algorithms on a simulated dataset that reflects real-world engineering constraints.

Source: arXiv cs.AI

Evaluation/Benchmarks • Score 90

The Weather Paradox: Why Precipitation Fails to Predict Traffic Accident Severity in Large-Scale US Data

This study investigates the predictive power of environmental, temporal, and spatial factors for traffic accident severity in the United States. Using a dataset of 500,000 traffic accidents from 2016 to 2023, we train an XGBoost classifier tuned via randomized-search cross-validation. The final model reaches 78% overall accuracy, with strong performance on the majority class (Severity 2).
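
A minimal sketch of the described setup (an XGBoost classifier tuned by randomized-search cross-validation); the hyperparameter grid and synthetic stand-in data below are assumptions, not the study's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# synthetic stand-in for the accident table (4 severity classes)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, n_classes=4)

search = RandomizedSearchCV(
    XGBClassifier(objective="multi:softprob"),
    param_distributions={
        "n_estimators": [100, 300, 500],
        "max_depth": [4, 6, 8],
        "learning_rate": [0.05, 0.1, 0.3],
    },
    n_iter=10, cv=3, scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```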

Source: arXiv cs.LG

Evaluation/Benchmarks • Score 89

A Gaussian Process View of Observation Noise and Initialization in Wide Neural Networks

Performing gradient descent on a wide neural network is equivalent to computing the posterior mean of a Gaussian Process with the Neural Tangent Kernel (NTK-GP), for a specific prior mean and zero observation noise. However, existing formulations have two limitations: (i) the NTK-GP assumes noiseless targets, leading to misspecification on noisy data; (ii) the equivalence does not extend to arbitrary prior means, which are essential for well-specified models.
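
For context, the standard GP posterior mean with a prior mean m and observation noise variance sigma^2, the two ingredients the summary says existing NTK-GP formulations lack:

```latex
\mu(x_*) \;=\; m(x_*) \;+\; K_{*X}\left(K_{XX} + \sigma^2 I\right)^{-1}\bigl(y - m(X)\bigr)
```

Setting m = 0 and sigma^2 = 0 recovers the classical noiseless, zero-prior-mean case of the NTK-GP correspondence.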

Source: arXiv stat.ML

RL • Score 93

Imitation from Observations with Trajectory-Level Generative Embeddings

We consider offline imitation learning from observations (LfO) where expert demonstrations are scarce and the available offline data are suboptimal and far from expert behavior. We propose TGE, a trajectory-level generative embedding that builds a dense, smooth surrogate reward by estimating the expert's state density with a temporal diffusion model trained on offline trajectory data.

Source: arXiv cs.LG

Evaluation/Benchmarks • Score 93

Optimizing LSTM Neural Networks for Retail Sales Forecasting under Limited Resources: A Model Compression Study

Standard Long Short-Term Memory (LSTM) neural networks deliver accurate forecasts for retail sales data but demand substantial computational power, which can be challenging for small- and medium-sized retailers. This paper investigates LSTM model compression by gradually reducing the number of hidden units from 128 to 16.
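
An illustrative look at the compression axis studied, shrinking the hidden size from 128 to 16 and comparing parameter counts. PyTorch and the feature count are assumptions; the paper's framework is not stated in this summary.

```python
import torch.nn as nn

n_features = 10                                     # assumed number of sales features per time step
for units in (128, 64, 32, 16):
    lstm = nn.LSTM(n_features, units, batch_first=True)
    n_params = sum(p.numel() for p in lstm.parameters())
    print(f"hidden={units:4d}  params={n_params}")  # count shrinks roughly quadratically in units
```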

Source: arXiv cs.LG

RL • Score 93

Controllable Concept Bottleneck Models

Concept Bottleneck Models (CBMs) have attracted attention for their ability to clarify the prediction process through a human-understandable concept layer. However, most prior studies have focused on static scenarios. We propose Controllable Concept Bottleneck Models (CCBMs), which support edits at different granularities, enabling continuous maintenance without retraining.
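
A minimal concept-bottleneck sketch showing the kind of post-hoc edit CCBMs generalize: intervene on one concept activation and recompute the prediction without retraining. The architecture and dimensions are toy assumptions, and CCBMs support richer edit granularities than this single-concept intervention.

```python
import torch
import torch.nn as nn

concept_net = nn.Linear(32, 5)     # input -> 5 human-readable concept logits
label_head = nn.Linear(5, 3)       # concepts -> class logits

x = torch.randn(1, 32)
c = torch.sigmoid(concept_net(x))  # predicted concept activations in [0, 1]

c_edited = c.clone()
c_edited[0, 2] = 0.0               # edit: force concept 2 off, no retraining needed
print(label_head(c), label_head(c_edited))  # prediction before vs. after the edit
```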

Source: arXiv cs.LG

Vision • Score 96

Real-Time Human Detection for Aerially Captured Video Sequences via Deep Models

Human detection in video plays an important role in many real-world applications. Traditional approaches rely on hand-crafted features, which are problem- and task-specific. This work uses automatic feature-learning methods, combining optical flow with three different deep models for human detection in videos captured by a non-static camera on aerial platforms.

Source: arXiv cs.LG

NLP/LLMs • Score 96

An Empirical Evaluation of LLM-Based Approaches to Code Vulnerability Detection: RAG, SFT, and Dual-Agent Systems

The rapid advance of Large Language Models (LLMs) presents new opportunities for automated software vulnerability detection, a crucial task for securing modern codebases. This paper presents a comparative study of the effectiveness of LLM-based techniques for detecting software vulnerabilities, evaluating three approaches: Retrieval-Augmented Generation (RAG), Supervised Fine-Tuning (SFT), and a Dual-Agent LLM framework.

Source: arXiv cs.AI

Vision • Score 95

Boosting Segment Anything Model to Generalize Visually Non-Salient Scenarios

arXiv:2601.00537v1 Abstract: Segment Anything Model (SAM), known for its remarkable zero-shot segmentation capabilities, has garnered significant attention in the community. Nevertheless, its performance is challenged when dealing with what we refer to as visually non-salient scenarios, where there is low contrast between the foreground and background. In these cases, existing methods often cannot capture accurate contours and fail to produce promising segmentation results. In this paper, we propose Visually Non-Salient SAM (VNS-SAM), aiming to enhance SAM's perception of visually non-salient scenarios while preserving its original zero-shot generalizability. We achieve this by effectively exploiting SAM's low-level features through two designs: Mask-Edge Token Interactive decoder and Non-Salient Feature Mining module. These designs help the SAM decoder gain a deeper understanding of non-salient characteristics with only marginal parameter increments and computational requirements. The additional parameters of VNS-SAM can be optimized within 4 hours, demonstrating its feasibility and practicality. In terms of data, we established VNS-SEG, a unified dataset for various VNS scenarios, with more than 35K images, in contrast to previous single-task adaptations. It is designed to make the model learn more robust VNS features and comprehensively benchmark the model's segmentation performance and generalizability on VNS scenarios. Extensive experiments across various VNS segmentation tasks demonstrate the superior performance of VNS-SAM, particularly under zero-shot settings, highlighting its potential for broad real-world applications. Codes and datasets are publicly available at https://guangqian-guo.github.io/VNS-SAM.

Source: arXiv cs.CV

NLP/LLMs • Score 95

From Transformers to LLMs: A Systematic Survey of Efficiency Considerations in NLP

arXiv:2406.16893v2 Abstract: The emergence of Transformer-based Large Language Models (LLMs) has substantially augmented the capabilities of Natural Language Processing (NLP), thereby intensifying the demand for computational resources. Therefore, enhancing efficiency based on factors like computational requirements, energy consumption, carbon footprint and financial cost has become a vital area of research. This motivates us to conduct a systematic literature review on Transformer-based LLMs in NLP from the perspective of efficiency. In this survey of 312 articles published between the years 2011 and 2025, efficiency-improvement endeavors have been systematically discussed targeting various aspects such as data curation, model design, model downsizing, and dynamic inferencing. This has been augmented with efficiency considerations in model adaptation strategies like pre-training, fine-tuning, prompt-engineering, and Retrieval-Augmented Generation (RAG). Furthermore, a statistical analysis of the articles has been performed, followed by an in-depth evaluation of the efficiency and efficacy of more than 30 renowned NLP models on 13 evaluation benchmarks. This paper offers valuable insights for researchers, professionals, and scholars, and explores the trend of research toward sustainable practices in NLP.

Source: arXiv cs.CL

NLP/LLMs • Score 96

SSI-GAN: Swin-Inspired Semi-Supervised Generative Adversarial Networks for Neural Spike Classification

Mosquitoes are the main vectors of arboviral diseases. Manually classifying their neural spike patterns is laborious and expensive. To address the scarcity of labeled data, we propose a new Generative Adversarial Network (GAN) architecture called SSI-GAN, which achieved 99.93% classification accuracy with only 3% labeled data.

Source: arXiv cs.AI

Theory/Optimization • Score 90

Stronger Approximation Guarantees for Non-Monotone Maximization of $\eta$-Weakly DR-Submodular Functions

Maximizing submodular objectives under constraints is a fundamental problem in machine learning and optimization. We study the maximization of a non-negative, non-monotone $\eta$-weakly DR-submodular function over a down-closed convex body. Our main result is an approximation algorithm whose guarantee depends smoothly on $\eta$.
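
For context, one common form of the weak DR-submodularity condition (the exact definition used in the paper may differ): for all coordinate-wise comparable points $x \le y$ and every coordinate $i$,

```latex
[\nabla f(x)]_i \;\ge\; \eta\, [\nabla f(y)]_i, \qquad \eta \in (0, 1].
```

Here $\eta = 1$ recovers ordinary DR-submodularity, and smaller $\eta$ relaxes the diminishing-returns requirement, consistent with a guarantee that degrades smoothly in $\eta$.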

Source: arXiv cs.LG

NLP/LLMs • Score 95

A Chain-of-Thought Approach to Semantic Query Categorization in e-Commerce Taxonomies

arXiv:2601.00510v1 Abstract: Search in e-Commerce is powered at the core by a structured representation of the inventory, often formulated as a category taxonomy. An important capability in e-Commerce with hierarchical taxonomies is to select a set of relevant leaf categories that are semantically aligned with a given user query. In this scope, we address a fundamental problem of search query categorization in real-world e-Commerce taxonomies. A correct categorization of a query not only provides a way to zoom into the correct inventory space, but opens the door to multiple intent understanding capabilities for a query. A practical and accurate solution to this problem has many applications in e-commerce, including constraining retrieved items and improving the relevance of the search results. For this task, we explore a novel Chain-of-Thought (CoT) paradigm that combines simple tree-search with LLM semantic scoring. Assessing its classification performance on human-judged query-category pairs, relevance tests, and LLM-based reference methods, we find that the CoT approach performs better than a benchmark that uses embedding-based query category predictions. We show how the CoT approach can detect problems within a hierarchical taxonomy. Finally, we also propose LLM-based approaches for query-categorization of the same spirit, but which scale better at the range of millions of queries.
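
A hedged sketch of the tree-search-plus-LLM-scoring loop the abstract describes: descend the taxonomy, prune children the LLM scores as irrelevant, and return the leaves reached. `llm_score` is a placeholder for an actual LLM relevance call, and the toy taxonomy is an assumption.

```python
taxonomy = {
    "Electronics": {"Phones": {}, "Cameras": {"Lenses": {}, "Tripods": {}}},
    "Fashion": {"Shoes": {}, "Watches": {}},
}

def llm_score(query: str, category: str) -> float:
    # stand-in for an LLM semantic-relevance call returning a score in [0, 1]
    return 1.0 if category.lower() in query.lower() else 0.2

def categorize(query, tree, path=(), threshold=0.5):
    leaves = []
    for name, children in tree.items():
        if llm_score(query, name) < threshold:
            continue                                # prune the whole subtree
        if not children:
            leaves.append(path + (name,))           # relevant leaf category reached
        else:
            leaves.extend(categorize(query, children, path + (name,), threshold))
    return leaves

print(categorize("electronics cameras tripods", taxonomy))
# [('Electronics', 'Cameras', 'Tripods')]
```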

Source: arXiv cs.CL

NLP/LLMs • Score 96

MethConvTransformer: A Deep Learning Framework for Multi-Tissue Alzheimer's Disease Detection

Alzheimer's disease (AD) is a multifactorial neurodegenerative disorder characterized by progressive cognitive decline. MethConvTransformer is a transformer-based deep learning framework that integrates DNA methylation profiles from brain and peripheral tissues, enabling biomarker discovery. The model outperforms conventional machine learning approaches, offering robust epigenetic biomarkers and multi-resolution interpretability.

Source: arXiv cs.AI

NLP/LLMs • Score 96

Democratizing Electronic-Photonic AI Systems: An Open-Source, AI-Infused Co-Design and Design-Automation Toolflow

Photonics is becoming a foundational technology for high-performance AI systems and scientific computing, offering unmatched speed, parallelism, and energy efficiency. However, designing and deploying electronic-photonic AI systems remains challenging due to a steep learning curve. We present a multi-layer co-design and automation framework to democratize the development of photonic AI systems.

Source: arXiv cs.AI

Vision • Score 94

Task-Driven Kernel Flows: Label Rank Compression and Laplacian Spectral Filtering

We present a theory of feature learning in L2-regularized wide networks, showing that supervised learning is inherently compressive. We derive a kernel ODE that predicts a 'water-filling' spectral evolution and prove that, for any stable stationary state, the rank of the kernel is bounded by the number of classes ($C$).

Source: arXiv cs.LG

NLP/LLMs • Score 95

CPPO: Contrastive Perception for Vision Language Policy Optimization

arXiv:2601.00501v1 Abstract: We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult, requiring extra LLMs, ground-truth data, forced separation of perception from reasoning by policy model, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods, while avoiding extra models, making training more efficient and scalable.
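
A sketch of the entropy-shift detection idea, assuming access to per-token logits for the same output sequence under clean and perturbed images; the threshold heuristic and shapes are assumptions, not CPPO's actual criterion.

```python
import torch

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)
    return -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)

logits_clean = torch.randn(20, 32000)        # (seq_len, vocab) under the clean image
logits_perturbed = torch.randn(20, 32000)    # same sequence under a perturbed image

shift = (token_entropy(logits_perturbed) - token_entropy(logits_clean)).abs()
perception_mask = shift > shift.mean() + shift.std()   # heuristic cutoff (assumption)
print(perception_mask.nonzero().squeeze(-1))           # candidate perception tokens
```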

Source: arXiv cs.CV

NLP/LLMs • Score 93

When Small Models Are Right for the Wrong Reasons: Process Verification for Reliable Agents

Deploying small language models (7-9B parameters) as autonomous agents requires trusting their reasoning, not just their outputs. We reveal a critical reliability crisis: 50-69% of these models' correct answers contain fundamentally flawed reasoning, a 'Right for the Wrong Reasons' phenomenon invisible to standard accuracy metrics.

Source: arXiv cs.LG

RL • Score 95

AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models

arXiv:2601.00561v1 Abstract: The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (i.e., Assessing Editing, Generation, Interpretation-Understanding for Super-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually-annotated questions spanning 21 topics (including STEM, humanities, daily life, etc.) and 6 reasoning types. To concretely evaluate the performance of UMMs in world knowledge scope without ambiguous metrics, we further propose Deterministic Checklist-based Evaluation (DCE), a protocol that replaces ambiguous prompt-based scoring with atomic "Y/N" judgments, to enhance evaluation reliability. Our extensive experiments reveal that most UMMs exhibit severe world knowledge deficits and that performance degrades significantly with complex reasoning. Additionally, simple plug-in reasoning modules can partially mitigate these vulnerabilities, highlighting a promising direction for future research. These results highlight the importance of world-knowledge-based reasoning as a critical frontier for UMMs.

Source: arXiv cs.CV

RL • Score 93

E-GRPO: High-Entropy Steps Drive Effective Reinforcement Learning for Flow Models

Recent reinforcement learning has improved flow-matching models for human preference alignment. We observe that high-entropy steps enable more efficient exploration, while low-entropy steps produce indistinct roll-outs. We propose E-GRPO, an entropy-aware group relative policy optimization that raises the entropy of SDE sampling steps.

Source: arXiv cs.LG

NLP/LLMs • Score 95

MotionPhysics: Learnable Motion Distillation for Text-Guided Simulation

arXiv:2601.00504v1 Abstract: Accurately simulating existing 3D objects and a wide variety of materials often demands expert knowledge and time-consuming physical parameter tuning to achieve the desired dynamic behavior. We introduce MotionPhysics, an end-to-end differentiable framework that infers plausible physical parameters from a user-provided natural language prompt for a chosen 3D scene of interest, removing the need for guidance from ground-truth trajectories or annotated videos. Our approach first utilizes a multimodal large language model to estimate material parameter values, which are constrained to lie within plausible ranges. We further propose a learnable motion distillation loss that extracts robust motion priors from pretrained video diffusion models while minimizing appearance and geometry inductive biases to guide the simulation. We evaluate MotionPhysics across more than thirty scenarios, including real-world, human-designed, and AI-generated 3D objects, spanning a wide range of materials such as elastic solids, metals, foams, sand, and both Newtonian and non-Newtonian fluids. We demonstrate that MotionPhysics produces visually realistic dynamic simulations guided by natural language, surpassing the state of the art while automatically determining physically plausible parameters. The code and project page are available at: https://wangmiaowei.github.io/MotionPhysics.github.io/.

Source: arXiv cs.CV

RL • Score 95

Sparse Tucker Decomposition and Graph Regularization for High-Dimensional Time Series Forecasting

Existing vector autoregressive methods for multivariate time series analysis use low-rank matrix approximation or Tucker decomposition to reduce the dimension of the overparameterized problem. This paper proposes a graph-regularized sparse Tucker decomposition method for high-dimensional vector autoregressive time series.

Source: arXiv stat.ML

NLP/LLMs • Score 96

Detecting Spike-Wave Discharges (SWD) Using a 1-Dimensional Residual UNet

Manually labeling events in electroencephalography (EEG) recordings is time-consuming, especially when recordings run continuously for weeks to months. A method for automatically labeling relevant EEG events reduces the manual workload. In this study, we compare the performance of 14 machine learning classifiers on a manually annotated dataset, finding that a 1D UNet is the most effective for labeling SWDs.

Source: arXiv cs.LG

Vision • Score 96

A Comparative Study of Adaptation Strategies for Time Series Foundation Models in Anomaly Detection

Time series anomaly detection is essential for the reliable operation of complex systems, yet most existing methods require extensive task-specific training. This study investigates whether time series foundation models (TSFMs), pre-trained on large heterogeneous data, can serve as universal backbones for anomaly detection.

Source: arXiv cs.LG

RL • Score 93

GRL-SNAM: Geometric Reinforcement Learning with Path-Differential Hamiltonians for Simultaneous Navigation and Mapping in Unknown Environments

We present GRL-SNAM, a geometric reinforcement learning framework for Simultaneous Navigation and Mapping (SNAM) in unknown environments. The SNAM problem is challenging because it requires designing hierarchical or joint multi-agent policies that control a robot's motion in a map-free environment while acquiring information through sensors.

Source: arXiv cs.LG

NLP/LLMs • Score 93

Mortar: Evolving Mechanics for Automated Game Design

We present Mortar, a system for autonomously evolving game mechanics for automated design. Game mechanics define the rules and interactions that govern gameplay. Mortar combines a quality-diversity algorithm with a large language model to explore a diverse set of mechanics, which are evaluated by synthesizing complete games.

Source: arXiv cs.AI

NLP/LLMs • Score 93

GRIT -- Geometry-Aware PEFT with K-FAC Preconditioning, Fisher-Guided Reprojection, and Dynamic Rank Adaptation

Parameter-efficient fine-tuning (PEFT) is the standard approach for adapting LLMs, but LoRA and QLoRA are largely geometry-agnostic. We introduce GRIT, a dynamic, curvature-aware LoRA procedure that preserves the LoRA parameterization while improving efficiency and reducing drift along weak directions. GRIT reduces trainable parameters by 46% on average while maintaining quality across prompt styles and data mixtures.

Source: arXiv cs.AI

NLP/LLMs • Score 95

Universal Adaptive Constraint Propagation: Scaling Structured Inference for Large Language Models via Meta-Reinforcement Learning

arXiv:2601.00095v1 Abstract: Large language models increasingly require structured inference, from JSON schema enforcement to multi-lingual parsing, where outputs must satisfy complex constraints. We introduce MetaJuLS, a meta-reinforcement learning approach that learns universal constraint propagation policies applicable across languages and tasks without task-specific retraining. By formulating structured inference as adaptive constraint propagation and training a Graph Attention Network with meta-learning, MetaJuLS achieves 1.5-2.0x speedups over GPU-optimized baselines while maintaining within 0.2% accuracy of state-of-the-art parsers. On Universal Dependencies across 10 languages and LLM-constrained generation (LogicBench, GSM8K-Constrained), MetaJuLS demonstrates rapid cross-domain adaptation: a policy trained on English parsing adapts to new languages and tasks with 5-10 gradient steps (5-15 seconds) rather than requiring hours of task-specific training. Mechanistic analysis reveals the policy discovers human-like parsing strategies (easy-first) and novel non-intuitive heuristics. By reducing propagation steps in LLM deployments, MetaJuLS contributes to Green AI by directly reducing inference carbon footprint.

Source: arXiv cs.CL

RL • Score 96

Quantitative Rule-Based Strategy Modeling in Classic Indian Rummy: A Metric Optimization Approach

The 13-card variant of Classic Indian Rummy is a sequential game of incomplete information that demands probabilistic reasoning and combinatorial decision-making. This paper proposes a rule-based framework for strategic play, driven by a novel hand-evaluation metric called MinDist.

Source: arXiv cs.AI

RL • Score 95

VisNet: Efficient Person Re-Identification via Alpha-Divergence Loss, Feature Fusion and Dynamic Multi-Task Learning

arXiv:2601.00307v1 Abstract: Person re-identification (ReID) is an extremely important area in both surveillance and mobile applications, requiring strong accuracy with minimal computational cost. State-of-the-art methods give good accuracy but with high computational budgets. To remedy this, this paper proposes VisNet, a computationally efficient and effective re-identification model suitable for real-world scenarios. It is the culmination of conceptual contributions, including feature fusion at multiple scales with automatic attention on each, semantic clustering with anatomical body partitioning, a dynamic weight averaging technique to balance classification and semantic regularization, and the use of the FIDI loss function for improved metric learning. The multi-scale fusion combines ResNet50's stages 1 through 4 without parallel paths, while semantic clustering introduces spatial constraints through rule-based pseudo-labeling. VisNet achieves 87.05% Rank-1 and 77.65% mAP on the Market-1501 dataset, with 32.41M parameters and 4.601 GFLOPs, offering a practical approach for real-time deployment in surveillance and mobile applications where computational resources are limited.

Source: arXiv cs.CV

RL • Score 96

Cloud-Native Generative AI for Automated Planogram Synthesis: A Diffusion Model Approach to Multi-Store Retail Optimization

Planogram creation is a significant retail challenge, taking 30 hours on average per complex layout. This paper presents a cloud-native architecture that uses diffusion models to automatically generate store-specific planograms, learning from successful shelf arrangements across multiple retail locations.

Source: arXiv cs.LG

NLP/LLMs • Score 96

Fine-Tuning Large Language Models for Automated Depression Screening in Nigerian Pidgin: The GENSCORE Pilot Study

Depression is a major contributor to Nigeria's mental health burden, yet screening coverage is limited by poor access to clinicians, stigma, and language barriers. This study presents a novel approach to automated depression screening using large language models (LLMs) fine-tuned for conversational Nigerian Pidgin.

Source: arXiv cs.AI

Vision • Score 95

A Cascaded Information Interaction Network for Precise Image Segmentation

arXiv:2601.00562v1 Abstract: Visual perception plays a pivotal role in enabling autonomous behavior, offering a cost-effective and efficient alternative to complex multi-sensor systems. However, robust segmentation remains a challenge in complex scenarios. To address this, this paper proposes a cascaded convolutional neural network integrated with a novel Global Information Guidance Module. This module is designed to effectively fuse low-level texture details with high-level semantic features across multiple layers, thereby overcoming the inherent limitations of single-scale feature extraction. This architectural innovation significantly enhances segmentation accuracy, particularly in visually cluttered or blurred environments where traditional methods often fail. Experimental evaluations on benchmark image segmentation datasets demonstrate that the proposed framework achieves superior precision, outperforming existing state-of-the-art methods. The results highlight the effectiveness of the approach and its promising potential for deployment in practical robotic applications.

Source: arXiv cs.CV

Evaluation/Benchmarks • Score 90

Modeling Long-Duration ECG Signals to Predict Heart Failure Risk with Explainable AI

Heart failure (HF) affects 11.8% of adults aged 65 and older, reducing quality of life and longevity. We hypothesized that artificial intelligence (AI) applied to 24-hour electrocardiogram (ECG) data could predict five-year HF risk. We used the Technion-Leumit Holter ECG (TLHE) dataset with 69,663 recordings from 47,729 patients. Our deep learning model, DeepHHF, achieved an area under the curve of 0.80, outperforming previous models.

Source: arXiv cs.AI

Vision • Score 95

Robust Assembly Progress Estimation via Deep Metric Learning

arXiv:2601.00422v1 Abstract: In recent years, the advancement of AI technologies has accelerated the development of smart factories. In particular, the automatic monitoring of product assembly progress is crucial for improving operational efficiency, minimizing the cost of discarded parts, and maximizing factory productivity. However, in cases where assembly tasks are performed manually over multiple days, implementing smart factory systems remains a challenge. Previous work has proposed Anomaly Triplet-Net, which estimates assembly progress by applying deep metric learning to the visual features of products. Nevertheless, when visual changes between consecutive tasks are subtle, misclassification often occurs. To address this issue, this paper proposes a robust system for estimating assembly progress, even in cases of occlusion or minimal visual change, using a small-scale dataset. Our method leverages a Quadruplet Loss-based learning approach for anomaly images and introduces a custom data loader that strategically selects training samples to enhance estimation accuracy. We evaluated our approach using an image dataset captured during desktop PC assembly. The proposed Anomaly Quadruplet-Net outperformed existing methods on this dataset, improving estimation accuracy by 1.3% and reducing misclassification between adjacent tasks by 1.9%, demonstrating the effectiveness of the proposed method.

Source: arXiv cs.CV

Vision • Score 95

A Comprehensive Dataset for Human vs. AI Generated Image Detection

arXiv:2601.00553v1 Abstract: Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI-generated image detection consisting of 96,000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.

Source: arXiv cs.CV

Vision • Score 95

OmniVaT: Single Domain Generalization for Multimodal Visual-Tactile Learning

arXiv:2601.00352v1 Abstract: Visual-tactile learning (VTL) enables embodied agents to perceive the physical world by integrating visual (VIS) and tactile (TAC) sensors. However, VTL still suffers from modality discrepancies between VIS and TAC images, as well as domain gaps caused by non-standardized tactile sensors and inconsistent data collection procedures. We formulate these challenges as a new task, termed single domain generalization for multimodal VTL (SDG-VTL). In this paper, we propose an OmniVaT framework that, for the first time, successfully addresses this task. On the one hand, OmniVaT integrates a multimodal fractional Fourier adapter (MFFA) to map VIS and TAC embeddings into a unified embedding-frequency space, thereby effectively mitigating the modality gap without multi-domain training data or careful cross-modal fusion strategies. On the other hand, it also incorporates a discrete tree generation (DTG) module that obtains diverse and reliable multimodal fractional representations through a hierarchical tree structure, thereby enhancing its adaptivity to fluctuating domain shifts in unseen domains. Extensive experiments demonstrate the superior cross-domain generalization performance of OmniVaT on the SDG-VTL task.

Source: arXiv cs.CV

RL • Score 95

Disentangling Hardness from Noise: An Uncertainty-Driven Model-Agnostic Framework for Long-Tailed Remote Sensing Classification

arXiv:2601.00278v1 Abstract: Long-tailed distributions are pervasive in remote sensing due to the inherently imbalanced occurrence of grounded objects. However, a critical challenge remains largely overlooked, i.e., disentangling hard tail data samples from noisy ambiguous ones. Conventional methods often indiscriminately emphasize all low-confidence samples, leading to overfitting on noisy data. To bridge this gap, building upon Evidential Deep Learning, we propose a model-agnostic uncertainty-aware framework termed DUAL, which dynamically disentangles prediction uncertainty into Epistemic Uncertainty (EU) and Aleatoric Uncertainty (AU). Specifically, we introduce EU as an indicator of sample scarcity to guide a reweighting strategy for hard-to-learn tail samples, while leveraging AU to quantify data ambiguity, employing an adaptive label smoothing mechanism to suppress the impact of noise. Extensive experiments on multiple datasets across various backbones demonstrate the effectiveness and generalization of our framework, surpassing strong baselines such as TGN and SADE. Ablation studies provide further insights into the crucial choices of our design.

Source: arXiv cs.CV

Vision • Score 92

Depth-Synergized Mamba Meets Memory Experts for All-Day Image Reflection Separation

arXiv:2601.00322v1 Abstract: Image reflection separation aims to disentangle the transmission layer and the reflection layer from a blended image. Existing methods rely on limited information from a single image, tending to confuse the two layers when their contrasts are similar, a challenge more severe at night. To address this issue, we propose the Depth-Memory Decoupling Network (DMDNet). It employs the Depth-Aware Scanning (DAScan) to guide Mamba toward salient structures, promoting information flow along semantic coherence to construct stable states. Working in synergy with DAScan, the Depth-Synergized State-Space Model (DS-SSM) modulates the sensitivity of state activations by depth, suppressing the spread of ambiguous features that interfere with layer disentanglement. Furthermore, we introduce the Memory Expert Compensation Module (MECM), leveraging cross-image historical knowledge to guide experts in providing layer-specific compensation. To address the lack of datasets for nighttime reflection separation, we construct the Nighttime Image Reflection Separation (NightIRS) dataset. Extensive experiments demonstrate that DMDNet outperforms state-of-the-art methods in both daytime and nighttime.

Source: arXiv cs.CV

Theory/Optimization • Score 92

SlingBAG Pro: Accelerating point cloud-based iterative reconstruction for 3D photoacoustic imaging under arbitrary array

arXiv:2601.00551v1 Abstract: High-quality three-dimensional (3D) photoacoustic imaging (PAI) is gaining increasing attention in clinical applications. To address the challenges of limited space and high costs, irregular geometric transducer arrays that conform to specific imaging regions are promising for achieving high-quality 3D PAI with fewer transducers. However, traditional iterative reconstruction algorithms struggle with irregular array configurations, suffering from high computational complexity, substantial memory requirements, and lengthy reconstruction times. In this work, we introduce SlingBAG Pro, an advanced reconstruction algorithm based on the point cloud iteration concept of the Sliding ball adaptive growth (SlingBAG) method, while extending its compatibility to arbitrary array geometries. SlingBAG Pro maintains high reconstruction quality, reduces the number of required transducers, and employs a hierarchical optimization strategy that combines zero-gradient filtering with progressively increased temporal sampling rates during iteration. This strategy rapidly removes redundant spatial point clouds, accelerates convergence, and significantly shortens overall reconstruction time. Compared to the original SlingBAG algorithm, SlingBAG Pro achieves up to a 2.2-fold speed improvement in point cloud-based 3D PA reconstruction under irregular array geometries. The proposed method is validated through both simulation and in vivo mouse experiments, and the source code is publicly available at https://github.com/JaegerCQ/SlingBAG_Pro.

Source: arXiv cs.CV

Vision • Score 92

DynaDrag: Dynamic Drag-Style Image Editing by Motion Prediction

arXiv:2601.00542v1 Abstract: To achieve pixel-level image manipulation, drag-style image editing, which edits images using points or trajectories as conditions, is attracting widespread attention. Most previous methods follow the move-and-track framework, in which missed tracking and ambiguous tracking are unavoidable challenges. Other methods under different frameworks suffer from various problems, such as the large gap between the source image and the target edited image, as well as unreasonable intermediate points, which can lead to low editability. To avoid these problems, we propose DynaDrag, the first dragging method under the predict-and-move framework. In DynaDrag, Motion Prediction and Motion Supervision are performed iteratively. In each iteration, Motion Prediction first predicts where the handle points should move, and then Motion Supervision drags them accordingly. We also propose to dynamically adjust the valid handle points to further improve performance. Experiments on face and human datasets showcase the superiority over previous works.

Source: arXiv cs.CV

Evaluation/Benchmarks • Score 89

LooC: Effective Low-Dimensional Codebook for Compositional Vector Quantization

arXiv:2601.00222v1 Abstract: Vector quantization (VQ) is a prevalent and fundamental technique that discretizes continuous feature vectors by approximating them using a codebook. As the diversity and complexity of data and models continue to increase, there is an urgent need for high-capacity, yet more compact VQ methods. This paper aims to reconcile this conflict by presenting a new approach called LooC, which utilizes an effective Low-dimensional codebook for Compositional vector quantization. Firstly, LooC introduces a parameter-efficient codebook by reframing the relationship between codevectors and feature vectors, significantly expanding its solution space. Instead of individually matching codevectors with feature vectors, LooC treats them as lower-dimensional compositional units within feature vectors and combines them, resulting in a more compact codebook with improved performance. Secondly, LooC incorporates a parameter-free extrapolation-by-interpolation mechanism to enhance and smooth features during the VQ process, which allows for better preservation of details and fidelity in feature approximation. The design of LooC leads to full codebook usage, effectively utilizing the compact codebook while avoiding the problem of collapse. Thirdly, LooC can serve as a plug-and-play module for existing methods for different downstream tasks based on VQ. Finally, extensive evaluations on different tasks, datasets, and architectures demonstrate that LooC outperforms existing VQ methods, achieving state-of-the-art performance with a significantly smaller codebook.

Source: arXiv cs.CV

Vision • Score 95

GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval

arXiv:2601.00584v1 Abstract: Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video using a natural language query without relying on task-specific training data. The primary challenge in this setting lies in the mismatch in semantic granularity between textual queries and visual content. Previous studies in ZVMR have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. However, these approaches failed to balance the semantic granularity between the pre-trained knowledge provided by each modality for a given scene. As a result, despite the high quality of each modality's representations, the mismatch in granularity led to inaccurate retrieval. In this paper, we propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges this gap between coarse and fine semantic representations. Our approach introduces two complementary techniques: granularity-based query rewriting to generate varied semantic granularities, and query-aware caption generation to embed query intent into video content. By pairing multi-level queries with both query-agnostic and query-aware captions, we effectively resolve semantic mismatches. As a result, our method sets a new state-of-the-art across all three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), with a notable 3.23% mAP@avg improvement on the challenging QVHighlights dataset.

Source: arXiv cs.CV

NLP/LLMs • Score 95

From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning

arXiv:2601.00215v1 Abstract: Reinforcement learning (RL) has emerged as a promising approach for eliciting reasoning chains before generating final answers. However, multimodal large language models (MLLMs) generate reasoning that lacks integration of visual information. This limits their ability to solve problems that demand accurate visual perception, such as visual puzzles. We show that visual perception is the key bottleneck in such tasks: converting images into textual descriptions significantly improves performance, yielding gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7. To address this, we investigate reward-driven RL as a mechanism to unlock long visual reasoning in open-source MLLMs without requiring costly supervision. We design and evaluate six reward functions targeting different reasoning aspects, including image understanding, thinking steps, and answer accuracy. Using group relative policy optimization (GRPO), our approach explicitly incentivizes longer, structured reasoning and mitigates bypassing of visual information. Experiments on Qwen-2.5-VL-7B achieve 5.56% improvements over the base model, with consistent gains across both in-domain and out-of-domain settings.

Source: arXiv cs.CL

NLP/LLMs • Score 95

Intelligent Traffic Surveillance for Real-Time Vehicle Detection, License Plate Recognition, and Speed Estimation

arXiv:2601.00344v1 Abstract: Speeding is a major contributor to road fatalities, particularly in developing countries such as Uganda, where road safety infrastructure is limited. This study proposes a real-time intelligent traffic surveillance system tailored to such regions, using computer vision techniques to address vehicle detection, license plate recognition, and speed estimation. The study collected a rich dataset using a speed gun, a Canon camera, and a mobile phone to train the models. License plate detection using YOLOv8 achieved a mean average precision (mAP) of 97.9%. For character recognition of the detected license plates, the CNN model achieved a character error rate (CER) of 3.85%, while the transformer model significantly reduced the CER to 1.79%. Speed estimation used source and target regions of interest, yielding a margin of error of 10 km/h. Additionally, a database was established to correlate user information with vehicle detection data, enabling automated ticket issuance via SMS through Africa's Talking API. This system addresses critical traffic management needs in resource-constrained environments and shows potential to reduce road accidents through automated traffic enforcement in developing countries where such interventions are urgently needed.
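
A minimal inference sketch for the plate-detection stage using the Ultralytics YOLOv8 API; the pretrained checkpoint and image path are placeholders, and the study's fine-tuned weights, OCR stage, and speed estimation are not reproduced here.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")             # placeholder for the fine-tuned plate detector
results = model("traffic_frame.jpg")   # run detection on a single video frame

for box in results[0].boxes:
    # each box carries xyxy pixel coordinates and a confidence score;
    # plate crops would then be passed to the character-recognition model
    print(box.xyxy, float(box.conf))
```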

Source: arXiv cs.CV

NLP/LLMs • Score 96

FaithSCAN: Model-Based Single-Pass Hallucination Detection for Faithful Visual Question Answering

Faithfulness hallucinations in VQA occur when vision-language models produce fluent but visually ungrounded answers, undermining their reliability in safety-critical applications. We propose FaithSCAN, a lightweight network that detects hallucinations by exploiting rich internal signals from VLMs, overcoming the efficiency and detection-performance limitations of existing methods.

Source: arXiv cs.AI

Vision • Score 95

NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos

arXiv:2601.00393v1 Abstract: In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks. Our project page is available at https://neoverse-4d.github.io

Source: arXiv cs.CV

Vision • Score 92

CropNeRF: A Neural Radiance Field-Based Framework for Crop Counting

arXiv:2601.00207v1 Abstract: Rigorous crop counting is crucial for effective agricultural management and informed intervention strategies. However, in outdoor field environments, partial occlusions combined with inherent ambiguity in distinguishing clustered crops from individual viewpoints pose an immense challenge for image-based segmentation methods. To address these problems, we introduce a novel crop counting framework designed for exact enumeration via 3D instance segmentation. Our approach utilizes 2D images captured from multiple viewpoints and associates independent instance masks for neural radiance field (NeRF) view synthesis. We introduce crop visibility and mask consistency scores, which are incorporated alongside 3D information from a NeRF model. This results in an effective segmentation of crop instances in 3D and highly-accurate crop counts. Furthermore, our method eliminates the dependence on crop-specific parameter tuning. We validate our framework on three agricultural datasets consisting of cotton bolls, apples, and pears, and demonstrate consistent counting performance despite major variations in crop color, shape, and size. A comparative analysis against the state of the art highlights superior performance on crop counting tasks. Lastly, we contribute a cotton plant dataset to advance further research on this topic.

Source: arXiv cs.CV

NLP/LLMs • Score 95

Application Research of a Deep Learning Model Integrating CycleGAN and YOLO in PCB Infrared Defect Detection

arXiv:2601.00237v1 Announce Type: new Abstract: This paper addresses the critical bottleneck of infrared (IR) data scarcity in Printed Circuit Board (PCB) defect detection by proposing a cross-modal data augmentation framework integrating CycleGAN and YOLOv8. Unlike conventional methods relying on paired supervision, we leverage CycleGAN to perform unpaired image-to-image translation, mapping abundant visible-light PCB images into the infrared domain. This generative process synthesizes high-fidelity pseudo-IR samples that preserve the structural semantics of defects while accurately simulating thermal distribution patterns. Subsequently, we construct a heterogeneous training strategy that fuses generated pseudo-IR data with limited real IR samples to train a lightweight YOLOv8 detector. Experimental results demonstrate that this method effectively enhances feature learning under low-data conditions. The augmented detector significantly outperforms models trained on limited real data alone and approaches the performance benchmarks of fully supervised training, proving the efficacy of pseudo-IR synthesis as a robust augmentation strategy for industrial inspection.

Fonte: arXiv cs.CV
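
For readers who want to see the shape of this pipeline, the sketch below shows the two stages under stated assumptions: a pre-trained CycleGAN generator (here a hypothetical g_vis2ir.pt checkpoint) synthesizes pseudo-IR images, which are then mixed with limited real IR data to train a YOLOv8 detector via the ultralytics API. All paths and the dataset YAML are illustrative, not from the paper.

```python
# Sketch of the augmentation strategy, assuming a CycleGAN generator
# G_vis2ir has already been trained on unpaired visible/IR PCB images.
# Paths, the generator checkpoint, and the dataset YAML are hypothetical.
from pathlib import Path

import torch
from PIL import Image
from torchvision import transforms
from ultralytics import YOLO  # YOLOv8 detector

to_tensor = transforms.Compose([transforms.Resize((256, 256)), transforms.ToTensor()])
to_image = transforms.ToPILImage()

G_vis2ir = torch.jit.load("g_vis2ir.pt").eval()  # hypothetical scripted generator

# 1) Synthesize pseudo-IR samples from abundant visible-light images.
#    Defect bounding-box labels carry over unchanged, since the translation
#    preserves the structural semantics of defects.
out_dir = Path("dataset/images/pseudo_ir")
out_dir.mkdir(parents=True, exist_ok=True)
with torch.no_grad():
    for path in Path("dataset/images/visible").glob("*.png"):
        x = to_tensor(Image.open(path).convert("RGB")).unsqueeze(0)
        fake_ir = G_vis2ir(x).squeeze(0).clamp(0, 1)
        to_image(fake_ir).save(out_dir / path.name)

# 2) Train YOLOv8 on the heterogeneous mix of pseudo-IR and real IR data;
#    "pcb_ir_mixed.yaml" would list both image folders with shared annotations.
model = YOLO("yolov8n.pt")
model.train(data="pcb_ir_mixed.yaml", epochs=100, imgsz=640)
```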

Vision • Score 95

DichroGAN: Towards Restoration of in-air Colours of Seafloor from Satellite Imagery

arXiv:2601.00194v1 Announce Type: new Abstract: Recovering the in-air colours of the seafloor from satellite imagery is a challenging task due to the exponential attenuation of light with depth in the water column. In this study, we present DichroGAN, a conditional generative adversarial network (cGAN) designed for this purpose. DichroGAN employs a two-step simultaneous training scheme: first, two generators utilise a hyperspectral image cube to estimate diffuse and specular reflections, thereby obtaining atmospheric scene radiance. Next, a third generator receives as input the generated scene radiance containing the features of each spectral band, while a fourth generator estimates the underwater light transmission. These generators work together to remove the effects of light absorption and scattering, restoring the in-air colours of the seafloor based on the underwater image formation equation. DichroGAN is trained on a compact dataset derived from PRISMA satellite imagery, comprising RGB images paired with their corresponding spectral bands and masks. Extensive experiments on both satellite and underwater datasets demonstrate that DichroGAN achieves competitive performance compared to state-of-the-art underwater restoration techniques.

Fonte: arXiv cs.CV
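
The restoration step rests on the standard underwater image formation equation, I = J * t + B * (1 - t). A minimal sketch of its inversion, assuming the transmission t and backscatter B have already been estimated (by the third and fourth generators in DichroGAN), might look like this; the variable names are ours:

```python
# A minimal sketch of the final restoration step implied by the underwater
# image formation equation I = J * t + B * (1 - t), where I is the observed
# radiance, J the in-air scene, t the transmission, and B the backscatter.
# DichroGAN estimates these quantities with generators; here they are plain
# numpy arrays, assumed mutually broadcastable (e.g., t of shape (H, W, 1)
# broadcasting over the RGB channels of I).
import numpy as np

def restore_in_air(I: np.ndarray, t: np.ndarray, B: np.ndarray,
                   t_min: float = 0.1) -> np.ndarray:
    """Invert the image formation model: J = (I - B * (1 - t)) / t."""
    t = np.clip(t, t_min, 1.0)          # avoid division blow-up in deep water
    J = (I - B * (1.0 - t)) / t
    return np.clip(J, 0.0, 1.0)
```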

NLP/LLMs • Score 95

TotalFM: An Organ-Separated Framework for 3D-CT Vision Foundation Models

arXiv:2601.00260v1 Announce Type: new Abstract: While foundation models in radiology are expected to be applied to various clinical tasks, computational cost constraints remain a major challenge when training on 3D-CT volumetric data. In this study, we propose TotalFM, a radiological foundation model that efficiently learns the correspondence between 3D-CT images and linguistic expressions based on the concept of organ separation, utilizing a large-scale dataset of 140,000 series. By automating the creation of organ volume and finding-sentence pairs through segmentation techniques and Large Language Model (LLM)-based radiology report processing, and by combining self-supervised pre-training via VideoMAE with contrastive learning using volume-text pairs, we aimed to balance computational efficiency and representation capability. In zero-shot organ-wise lesion classification tasks, the proposed model achieved higher F1 scores in 83% (5/6) of organs compared to CT-CLIP and 64% (9/14) of organs compared to Merlin. These results suggest that the proposed model exhibits high generalization performance in a clinical evaluation setting using actual radiology report sentences. Furthermore, in zero-shot finding-wise lesion classification tasks, our model achieved a higher AUROC in 83% (25/30) of finding categories compared to Merlin. We also confirmed performance comparable to existing Vision-Language Models (VLMs) in radiology report generation tasks. Our results demonstrate that the organ-separated learning framework can serve as a realistic and effective design guideline for the practical implementation of 3D-CT foundation models.

Fonte: arXiv cs.CV
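
A plausible reading of the volume-text contrastive objective is a symmetric, CLIP-style InfoNCE loss between organ-volume embeddings and finding-sentence embeddings. The sketch below assumes the encoder outputs already exist and is not taken from the TotalFM code base:

```python
# Hedged sketch of a symmetric InfoNCE loss over organ-volume / finding-
# sentence pairs, in the spirit of CLIP-style contrastive learning.
import torch
import torch.nn.functional as F

def volume_text_contrastive_loss(vol_emb: torch.Tensor,
                                 txt_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    vol = F.normalize(vol_emb, dim=-1)       # (B, D) organ volume features
    txt = F.normalize(txt_emb, dim=-1)       # (B, D) finding-sentence features
    logits = vol @ txt.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(vol.size(0), device=vol.device)
    # Symmetric cross-entropy: match volumes to texts and texts to volumes.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```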

NLP/LLMs • Score 95

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

arXiv:2601.00264v1 Announce Type: new Abstract: Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts. Technical validation demonstrates that this enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment. S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Focal-RegionFace: Generating Fine-Grained Multi-attribute Descriptions for Arbitrarily Selected Face Focal Regions

arXiv:2601.00156v1 Announce Type: new Abstract: In this paper, we introduce an underexplored problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, covering facial action units (AUs), emotional states, and age estimation, for arbitrarily selected face regions (termed FaceFocalDesc). We argue that the system's ability to focus on individual facial areas leads to better understanding and control. To achieve this capability, we construct a new multi-attribute description dataset for arbitrarily selected face regions, providing rich region-level annotations and natural language descriptions. Further, we propose a fine-tuned vision-language model based on Qwen2.5-VL, called Focal-RegionFace, for facial state analysis, which incrementally refines its focus on localized facial features through multiple progressive fine-tuning stages, resulting in interpretable age estimation, AU and emotion detection. Experimental results show that Focal-RegionFace achieves the best performance on the new benchmark in terms of traditional and widely used metrics, as well as newly proposed metrics. This fully verifies its effectiveness and versatility in fine-grained multi-attribute face region-focal analysis scenarios.

Fonte: arXiv cs.CV

Vision • Score 95

SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting

arXiv:2601.00285v1 Announce Type: new Abstract: Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step. However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object's motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, our model enables smooth motion interpolation while preserving learned geometric details. Experiments on synthetic datasets show that our method outperforms existing approaches under sparse observations by up to 34% in PSNR, and achieves comparable performance to dense monocular video methods on real-world datasets despite using significantly fewer frames. Moreover, we demonstrate that the input initial static reconstruction can be replaced by a diffusion-based generative prior, making our method more practical for real-world scenarios.

Fonte: arXiv cs.CV

Vision • Score 95

ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Patching

arXiv:2601.00267v1 Announce Type: new Abstract: Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant concerns regarding safety, copyright, and ethical implications. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most of them rely on data-intensive and computationally expensive fine-tuning, which poses a critical limitation. To overcome these challenges, inspired by the observation that the model's activations are predominantly composed of generic concepts, with only a minimal component representing the target concept, we propose a novel training-free method (ActErase) for efficient concept erasure. Specifically, the proposed method operates by identifying activation difference regions via prompt-pair analysis, extracting target activations, and dynamically replacing input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrate that our training-free method achieves state-of-the-art (SOTA) erasure performance, while effectively preserving the model's overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.

Fonte: arXiv cs.CV
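
Activation patching of this kind is commonly implemented with forward hooks. The sketch below is a rough approximation of the described mechanism rather than the authors' code; it assumes the differing activation channels (concept_idx) and the replacement activations (neutral_acts) were already identified via prompt-pair analysis:

```python
# A minimal sketch of training-free activation patching, assuming we have
# already located the activation channels that differ between a target-concept
# prompt and a neutral prompt (`concept_idx`) and cached the neutral
# activations (`neutral_acts`). Module names and shapes are illustrative.
import torch

def make_patch_hook(concept_idx: torch.Tensor, neutral_acts: torch.Tensor):
    def hook(module, inputs, output):
        patched = output.clone()
        # Overwrite only the minimal component tied to the target concept,
        # leaving the generic-content activations untouched.
        patched[..., concept_idx] = neutral_acts[..., concept_idx]
        return patched
    return hook

# Usage: register on the layer whose prompt-pair difference was largest.
# handle = unet.mid_block.register_forward_hook(
#     make_patch_hook(concept_idx, neutral_acts))
# ... run the diffusion forward passes ...
# handle.remove()
```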

Vision • Score 95

Towards Syn-to-Real IQA: A Novel Perspective on Reshaping Synthetic Data Distributions

arXiv:2601.00225v1 Announce Type: new Abstract: Blind Image Quality Assessment (BIQA) has advanced significantly through deep learning, but the scarcity of large-scale labeled datasets remains a challenge. While synthetic data offers a promising solution, models trained on existing synthetic datasets often show limited generalization ability. In this work, we make a key observation that representations learned from synthetic datasets often exhibit a discrete and clustered pattern that hinders regression performance: features of high-quality images cluster around reference images, while those of low-quality images cluster based on distortion types. Our analysis reveals that this issue stems from the distribution of synthetic data rather than model architecture. Consequently, we introduce a novel framework, SynDR-IQA, which reshapes the synthetic data distribution to enhance BIQA generalization. Based on theoretical derivations of the impact of sample diversity and redundancy on generalization error, SynDR-IQA employs two strategies: distribution-aware diverse content upsampling, which enhances visual diversity while preserving content distribution, and density-aware redundant cluster downsampling, which balances samples by reducing the density of densely clustered areas. Extensive experiments across three cross-dataset settings (synthetic-to-authentic, synthetic-to-algorithmic, and synthetic-to-synthetic) demonstrate the effectiveness of our method. The code is available at https://github.com/Li-aobo/SynDR-IQA.

Fonte: arXiv cs.CV

Applications • Score 89

MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing

arXiv:2601.00204v1 Announce Type: new Abstract: 3D morphing remains challenging due to the difficulty of generating semantically consistent and temporally smooth deformations, especially across categories. We present MorphAny3D, a training-free framework that leverages Structured Latent (SLAT) representations for high-quality 3D morphing. Our key insight is that intelligently blending source and target SLAT features within the attention mechanisms of 3D generators naturally produces plausible morphing sequences. To this end, we introduce Morphing Cross-Attention (MCA), which fuses source and target information for structural coherence, and Temporal-Fused Self-Attention (TFSA), which enhances temporal consistency by incorporating features from preceding frames. An orientation correction strategy further mitigates the pose ambiguity within the morphing steps. Extensive experiments show that our method generates state-of-the-art morphing sequences, even for challenging cross-category cases. MorphAny3D further supports advanced applications such as decoupled morphing and 3D style transfer, and can be generalized to other SLAT-based generative models. Project page: https://xiaokunsun.github.io/MorphAny3D.github.io/.

Fonte: arXiv cs.CV

Vision • Score 95

A Spatially Masked Adaptive Gated Network for multimodal post-flood water extent mapping using SAR and incomplete multispectral data

arXiv:2601.00123v1 Announce Type: new Abstract: Mapping water extent during a flood event is essential for effective disaster management throughout all phases: mitigation, preparedness, response, and recovery. In particular, during the response stage, when timely and accurate information is important, Synthetic Aperture Radar (SAR) data are primarily employed to produce water extent maps. Recently, leveraging the complementary characteristics of SAR and multispectral imagery (MSI) data through a multimodal approach has emerged as a promising strategy for advancing water extent mapping using deep learning models. This approach is particularly beneficial when timely post-flood observations, acquired during or shortly after the flood peak, are limited, as it enables the use of all available imagery for more accurate post-flood water extent mapping. However, the adaptive integration of partially available MSI data into the SAR-based post-flood water extent mapping process remains underexplored. To bridge this research gap, we propose the Spatially Masked Adaptive Gated Network (SMAGNet), a multimodal deep learning model that utilizes SAR data as the primary input for post-flood water extent mapping and integrates complementary MSI data through feature fusion. In experiments on the C2S-MS Floods dataset, SMAGNet consistently outperformed other multimodal deep learning models in prediction performance across varying levels of MSI data availability. Furthermore, we found that even when MSI data were completely missing, the performance of SMAGNet remained statistically comparable to that of a U-Net model trained solely on SAR data. These findings indicate that SMAGNet enhances model robustness to missing data as well as the applicability of multimodal deep learning in real-world flood management scenarios.

Fonte: arXiv cs.CV
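
One way to realize spatially masked adaptive gating, sketched under our own assumptions about shapes and module design, is a per-pixel gate that sees both feature streams plus the MSI availability mask and is forced to zero wherever MSI is missing:

```python
# Illustrative sketch (not the authors' code) of spatially masked adaptive
# gating: SAR features are always present, MSI features may be partially or
# fully missing, and a learned gate decides per pixel how much MSI to mix in.
import torch
import torch.nn as nn

class MaskedGatedFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels + 1, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, sar_feat, msi_feat, msi_mask):
        # msi_mask: (B, 1, H, W), 1 where MSI pixels were actually observed.
        g = self.gate(torch.cat([sar_feat, msi_feat, msi_mask], dim=1))
        g = g * msi_mask                  # never trust MSI where it is missing
        return sar_feat + g * msi_feat    # SAR remains the primary stream
```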

NLP/LLMs • Score 95

Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark

arXiv:2601.00092v1 Announce Type: new Abstract: 4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal Large Language Models (MLLMs) achieve human-level 4D spatial intelligence? In this work, we present Spatial4D-Bench, a versatile 4D spatial intelligence benchmark designed to comprehensively assess the 4D spatial reasoning abilities of MLLMs. Unlike existing spatial intelligence benchmarks that are often small-scale or limited in diversity, Spatial4D-Bench provides a large-scale, multi-task evaluation benchmark consisting of ~40,000 question-answer pairs covering 18 well-defined tasks. We systematically organize these tasks into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning, and spatiotemporal reasoning. Spatial4D-Bench thereby offers a structured and comprehensive benchmark for evaluating the spatial cognition abilities of MLLMs, covering a broad spectrum of tasks that parallel the versatility of human spatial intelligence. We benchmark various state-of-the-art open-source and proprietary MLLMs on Spatial4D-Bench and reveal their substantial limitations in a wide variety of 4D spatial reasoning aspects, such as route planning, action recognition, and physical plausibility reasoning. We hope that the findings provided in this work offer valuable insights to the community and that our benchmark can facilitate the development of more capable MLLMs toward human-level 4D spatial intelligence. More resources can be found on our project page.

Fonte: arXiv cs.CV

Vision • Score 95

Context-Aware Pesticide Recommendation via Few-Shot Pest Recognition for Precision Agriculture

arXiv:2601.00243v1 Announce Type: new Abstract: Effective pest management is crucial for enhancing agricultural productivity, especially for crops such as sugarcane and wheat that are highly vulnerable to pest infestations. Traditional pest management methods depend heavily on manual field inspections and the use of chemical pesticides. These approaches are often costly, time-consuming, labor-intensive, and can have a negative impact on the environment. To overcome these challenges, this study presents a lightweight framework for pest detection and pesticide recommendation, designed for low-resource devices such as smartphones and drones, making it suitable for use by small and marginal farmers. The proposed framework includes two main components. The first is a Pest Detection Module that uses a compact, lightweight convolutional neural network (CNN) combined with prototypical meta-learning to accurately identify pests even when only a few training samples are available. The second is a Pesticide Recommendation Module that incorporates environmental factors like crop type and growth stage to suggest safe and eco-friendly pesticide recommendations. To train and evaluate our framework, a comprehensive pest image dataset was developed by combining multiple publicly available datasets. The final dataset contains samples with different viewing angles, pest sizes, and background conditions to ensure strong generalization. Experimental results show that the proposed lightweight CNN achieves high accuracy, comparable to state-of-the-art models, while significantly reducing computational complexity. The Decision Support System additionally improves pest management by reducing dependence on traditional chemical pesticides and encouraging sustainable practices, demonstrating its potential for real-time applications in precision agriculture.

Fonte: arXiv cs.CV
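
The prototypical meta-learning component follows the standard ProtoNet recipe: average the embeddings of the few labeled support images per pest class into prototypes, then classify a query by its nearest prototype. A minimal sketch, assuming a CNN backbone has already produced the embeddings:

```python
# Compact sketch of prototypical few-shot classification: class prototypes
# are the mean embeddings of the k support images per class, and a query
# pest image is assigned to the nearest prototype. Shapes are assumptions.
import torch

def prototypical_predict(support_emb: torch.Tensor,   # (n_class, k_shot, D)
                         query_emb: torch.Tensor      # (n_query, D)
                         ) -> torch.Tensor:
    prototypes = support_emb.mean(dim=1)               # (n_class, D)
    dists = torch.cdist(query_emb, prototypes)         # Euclidean distances
    return dists.argmin(dim=1)                         # predicted pest class
```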

NLP/LLMs • Score 95

IntraStyler: Exemplar-based Style Synthesis for Cross-modality Domain Adaptation

arXiv:2601.00212v1 Announce Type: new Abstract: Image-level domain alignment is the de facto approach for unsupervised domain adaptation, where unpaired image translation is used to minimize the domain gap. Prior studies mainly focus on the domain shift between the source and target domains, whereas the intra-domain variability remains under-explored. To address the latter, an effective strategy is to diversify the styles of the synthetic target domain data during image translation. However, previous methods typically require intra-domain variations to be pre-specified for style synthesis, which may be impractical. In this paper, we propose an exemplar-based style synthesis method named IntraStyler, which can capture diverse intra-domain styles without any prior knowledge. Specifically, IntraStyler uses an exemplar image to guide the style synthesis such that the output style matches the exemplar style. To extract the style-only features, we introduce a style encoder to learn styles discriminatively based on contrastive learning. We evaluate the proposed method on the largest public dataset for cross-modality domain adaptation, CrossMoDA 2023. Our experiments show the efficacy of our method in controllable style synthesis and the benefits of diverse synthetic data for downstream segmentation. Code is available at https://github.com/han-liu/IntraStyler.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

FCMBench: A Comprehensive Financial Credit Multimodal Benchmark for Real-World Applications

As multimodal AI becomes widely used for credit risk assessment and document review, a domain-specific benchmark is urgently needed. We present FCMBench-V1.0, a large-scale financial credit multimodal benchmark covering 18 core certificate types, with 4,043 privacy-compliant images and 8,446 QA samples.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

FreeText: Training-Free Text Rendering in Diffusion Transformers via Attention Localization and Spectral Glyph Injection

arXiv:2601.00535v1 Announce Type: new Abstract: Large-scale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically require costly retraining or rigid external layout constraints, which can degrade aesthetics and limit flexibility. We propose FreeText, a training-free, plug-and-play framework that improves text rendering by exploiting intrinsic mechanisms of Diffusion Transformer (DiT) models. FreeText decomposes the problem into "where to write" and "what to write". For "where to write", we localize writing regions by reading token-wise spatial attribution from endogenous image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For "what to write", we introduce Spectral-Modulated Glyph Injection (SGMI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with modest inference overhead.

Fonte: arXiv cs.CV

NLP/LLMs • Score 92

Attention to Detail: Global-Local Attention for High-Resolution AI-Generated Image Detection

arXiv:2601.00141v1 Announce Type: new Abstract: The rapid development of generative AI has made AI-generated images increasingly realistic and high-resolution. Most AI-generated image detection architectures typically downsample images before inputting them into models, risking the loss of fine-grained details. This paper presents GLASS (Global-Local Attention with Stratified Sampling), an architecture that combines a globally resized view with multiple randomly sampled local crops. These crops are original-resolution regions efficiently selected through spatially stratified sampling and aggregated using attention-based scoring. GLASS can be integrated into vision models to leverage both global and local information in images of any size. Vision Transformer, ResNet, and ConvNeXt models are used as backbones, and experiments show that GLASS outperforms standard transfer learning by achieving higher predictive performance within feasible computational constraints.

Fonte: arXiv cs.CV
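
A minimal sketch of the stratified sampling idea, with illustrative grid and crop sizes (the paper's exact parameters and attention-based scoring are not reproduced here): divide the image into a grid and draw one original-resolution crop per cell, so local detail is sampled from the whole image. Cells are assumed to be at least as large as the crop.

```python
# Hedged sketch of spatially stratified crop sampling: one random,
# original-resolution crop per grid cell, later scored by an attention head
# alongside a globally resized view. Parameters are illustrative.
import random
from PIL import Image

def stratified_crops(img: Image.Image, grid: int = 3, crop: int = 224):
    w, h = img.size
    cw, ch = w // grid, h // grid
    crops = []
    for gy in range(grid):
        for gx in range(grid):
            x0 = gx * cw + random.randint(0, max(0, cw - crop))
            y0 = gy * ch + random.randint(0, max(0, ch - crop))
            crops.append(img.crop((x0, y0, x0 + crop, y0 + crop)))
    return crops
```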

Applications • Score 89

Mask-Conditioned Voxel Diffusion for Joint Geometry and Color Inpainting

arXiv:2601.00368v1 Announce Type: new Abstract: We present a lightweight two-stage framework for joint geometry and color inpainting of damaged 3D objects, motivated by the digital restoration of cultural heritage artifacts. The pipeline separates damage localization from reconstruction. In the first stage, a 2D convolutional network predicts damage masks on RGB slices extracted from a voxelized object, and these predictions are aggregated into a volumetric mask. In the second stage, a diffusion-based 3D U-Net performs mask-conditioned inpainting directly on voxel grids, reconstructing geometry and color while preserving observed regions. The model jointly predicts occupancy and color using a composite objective that combines occupancy reconstruction with masked color reconstruction and perceptual regularization. We evaluate the approach on a curated set of textured artifacts with synthetically generated damage using standard geometric and color metrics. Compared to symmetry-based baselines, our method produces more complete geometry and more coherent color reconstructions at a fixed 32^3 resolution. Overall, the results indicate that explicit mask conditioning is a practical way to guide volumetric diffusion models for joint 3D geometry and color inpainting.

Fonte: arXiv cs.CV

NLP/LLMs • Score 94

Explicit Abstention Controls for Predictable Reliability in Video Question Answering

Deploying vision-language models (VLMs) in safety-critical settings requires selective prediction, where systems abstain when uncertain, avoiding costly errors. We investigate whether confidence-based abstention offers reliable control over error rates in video question answering and whether that control remains robust under distribution shift.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

ABFR-KAN: Kolmogorov-Arnold Networks for Functional Brain Analysis

arXiv:2601.00416v1 Announce Type: new Abstract: Functional connectivity (FC) analysis, a valuable tool for computer-aided brain disorder diagnosis, traditionally relies on atlas-based parcellation. However, issues relating to selection bias and a lack of regard for subject specificity can arise as a result of such parcellations. Addressing this, we propose ABFR-KAN, a transformer-based classification network that incorporates novel advanced brain function representation components with the power of Kolmogorov-Arnold Networks (KANs) to mitigate structural bias, improve anatomical conformity, and enhance the reliability of FC estimation. Extensive experiments on the ABIDE I dataset, including cross-site evaluation and ablation studies across varying model backbones and KAN configurations, demonstrate that ABFR-KAN consistently outperforms state-of-the-art baselines for autism spectrum disorder (ASD) classification. Our code is available at https://github.com/tbwa233/ABFR-KAN.

Fonte: arXiv cs.CV

Vision • Score 95

TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model

arXiv:2601.00051v1 Announce Type: new Abstract: World models aim to endow AI systems with the ability to represent, generate, and interact with dynamic environments in a coherent and temporally consistent manner. While recent video generation models have demonstrated impressive visual quality, they remain limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, hindering their evolution into practical world models. In this report, we present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. TeleWorld introduces a novel generation-reconstruction-guidance paradigm, where generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to maintain spatial, temporal, and physical consistency. To support long-horizon generation with low latency, we employ an autoregressive diffusion-based video model enhanced with Macro-from-Micro Planning (MMPL), a hierarchical planning method that reduces error accumulation from the frame level to the segment level, alongside efficient Distribution Matching Distillation (DMD), enabling real-time synthesis under practical computational budgets. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible systems. Extensive experiments demonstrate that TeleWorld achieves strong performance in both static and dynamic world understanding, long-term consistency, and real-time generation efficiency, positioning it as a practical step toward interactive, memory-enabled world models for multimodal generation and embodied intelligence.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

RoLID-11K: A Dashcam Dataset for Small-Object Roadside Litter Detection

arXiv:2601.00398v1 Announce Type: new Abstract: Roadside litter poses environmental, safety and economic challenges, yet current monitoring relies on labour-intensive surveys and public reporting, providing limited spatial coverage. Existing vision datasets for litter detection focus on street-level still images, aerial scenes or aquatic environments, and do not reflect the unique characteristics of dashcam footage, where litter appears extremely small, sparse and embedded in cluttered road-verge backgrounds. We introduce RoLID-11K, the first large-scale dataset for roadside litter detection from dashcams, comprising over 11k annotated images spanning diverse UK driving conditions and exhibiting pronounced long-tail and small-object distributions. We benchmark a broad spectrum of modern detectors, from accuracy-oriented transformer architectures to real-time YOLO models, and analyse their strengths and limitations on this challenging task. Our results show that while CO-DETR and related transformers achieve the best localisation accuracy, real-time models remain constrained by coarse feature hierarchies. RoLID-11K establishes a challenging benchmark for extreme small-object detection in dynamic driving scenes and aims to support the development of scalable, low-cost systems for roadside-litter monitoring. The dataset is available at https://github.com/xq141839/RoLID-11K.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning

arXiv:2601.00791v1 Announce Type: cross Abstract: We present a training-free method for detecting valid mathematical reasoning in large language models through spectral analysis of attention patterns. By treating attention matrices as adjacency matrices of dynamic graphs over tokens, we extract four interpretable spectral diagnostics: the Fiedler value (algebraic connectivity), the high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy. These diagnostics exhibit statistically significant differences between valid and invalid mathematical proofs. Experiments across seven transformer models from four independent architectural families (Meta Llama, Alibaba Qwen, Microsoft Phi, and Mistral AI) demonstrate that this spectral signature produces effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling 85.0-95.6% classification accuracy under rigorous evaluation, with calibrated thresholds reaching 93-95% on the full dataset. The method requires no training data, fine-tuning, or learned classifiers: a single threshold on a spectral metric suffices for high accuracy. Through systematic label correction, we discover that the spectral method detects logical coherence rather than compiler acceptance, identifying mathematically valid proofs that formal verifiers reject due to technical failures. We further identify an architectural dependency: Mistral-7B's Sliding Window Attention shifts the discriminative signal from HFER to late-layer Smoothness ($d = 2.09$, $p_{MW} = 1.16 \times 10^{-48}$), revealing that attention mechanism design affects which spectral features capture reasoning validity. These findings establish spectral graph analysis as a principled framework for reasoning verification with immediate applications to hallucination detection and AI safety monitoring.

Fonte: arXiv cs.CL
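
The four diagnostics can be computed with standard spectral graph machinery. The sketch below is our reading of the method, not the released code; in particular, the node signal used for HFER and smoothness (here the normalized attention degrees) is an assumption:

```python
# Spectral diagnostics of one attention matrix, treated as a weighted
# adjacency matrix. Formulas follow standard spectral graph theory; the
# choice of node signal is our own illustrative assumption.
import numpy as np

def spectral_diagnostics(A: np.ndarray) -> dict:
    """A: (n, n) attention matrix for one head/layer, rows summing to 1."""
    W = 0.5 * (A + A.T)                  # symmetrize into an undirected graph
    d = W.sum(axis=1)
    L = np.diag(d) - W                   # combinatorial graph Laplacian
    evals, evecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    fiedler = float(evals[1])            # algebraic connectivity
    x = d / (d.sum() + 1e-12)            # assumed node signal: degree profile
    x_hat = evecs.T @ x                  # graph Fourier transform
    energy = x_hat ** 2
    hfer = float(energy[len(energy) // 2:].sum() / (energy.sum() + 1e-12))
    smoothness = float(x @ L @ x)        # quadratic form x^T L x
    p = energy / (energy.sum() + 1e-12)
    spec_entropy = float(-(p * np.log(p + 1e-12)).sum())
    return {"fiedler": fiedler, "hfer": hfer,
            "smoothness": smoothness, "spectral_entropy": spec_entropy}
```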

RL • Score 95

TeleDoCTR: Domain-Specific and Contextual Troubleshooting for Telecommunications

arXiv:2601.00691v1 Announce Type: cross Abstract: Ticket troubleshooting refers to the process of analyzing and resolving problems that are reported through a ticketing system. In large organizations offering a wide range of services, this task is highly complex due to the diversity of submitted tickets and the need for specialized domain knowledge. In particular, troubleshooting in telecommunications (telecom) is a very time-consuming task as it requires experts to interpret ticket content, consult documentation, and search historical records to identify appropriate resolutions. This human-intensive approach not only delays issue resolution but also hinders overall operational efficiency. To enhance the effectiveness and efficiency of ticket troubleshooting in telecom, we propose TeleDoCTR, a novel telecom-related, domain-specific, and contextual troubleshooting system tailored for end-to-end ticket resolution in telecom. TeleDoCTR integrates both domain-specific ranking and generative models to automate key steps of the troubleshooting workflow which are: routing tickets to the appropriate expert team responsible for resolving the ticket (classification task), retrieving contextually and semantically similar historical tickets (retrieval task), and generating a detailed fault analysis report outlining the issue, root cause, and potential solutions (generation task). We evaluate TeleDoCTR on a real-world dataset from a telecom infrastructure and demonstrate that it achieves superior performance over existing state-of-the-art methods, significantly enhancing the accuracy and efficiency of the troubleshooting process.

Fonte: arXiv cs.CL

Vision • Score 95

BHaRNet: Reliability-Aware Body-Hand Modality Expertized Networks for Fine-grained Skeleton Action Recognition

arXiv:2601.00369v1 Announce Type: new Abstract: Skeleton-based human action recognition (HAR) has achieved remarkable progress with graph-based architectures. However, most existing methods remain body-centric, focusing on large-scale motions while neglecting subtle hand articulations that are crucial for fine-grained recognition. This work presents a probabilistic dual-stream framework that unifies reliability modeling and multi-modal integration, generalizing expertized learning under uncertainty across both intra-skeleton and cross-modal domains. The framework comprises three key components: (1) a calibration-free preprocessing pipeline that removes canonical-space transformations and learns directly from native coordinates; (2) a probabilistic Noisy-OR fusion that stabilizes reliability-aware dual-stream learning without requiring explicit confidence supervision; and (3) an intra- to cross-modal ensemble that couples four skeleton modalities (Joint, Bone, Joint Motion, and Bone Motion) to RGB representations, bridging structural and visual motion cues in a unified cross-modal formulation. Comprehensive evaluations across multiple benchmarks (NTU RGB+D 60/120, PKU-MMD, N-UCLA) and a newly defined hand-centric benchmark exhibit consistent improvements and robustness under noisy and heterogeneous conditions.

Fonte: arXiv cs.CV
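
The Noisy-OR fusion itself is a one-liner: if the body and hand experts each emit an independent per-class probability, the fused probability that a class is present is one minus the probability that both experts miss it. Tensor shapes below are illustrative:

```python
# Minimal sketch of probabilistic Noisy-OR fusion of body and hand streams.
import torch

def noisy_or_fusion(p_body: torch.Tensor, p_hand: torch.Tensor) -> torch.Tensor:
    # p_*: (B, n_classes) per-stream class probabilities in [0, 1].
    # Fused: class fires unless both independent experts fail to fire.
    return 1.0 - (1.0 - p_body) * (1.0 - p_hand)
```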

Vision • Score 89

Joint Geometry-Appearance Human Reconstruction in a Unified Latent Space via Bridge Diffusion

arXiv:2601.00328v1 Announce Type: new Abstract: Achieving consistent and high-fidelity geometry and appearance reconstruction of 3D digital humans from a single RGB image is inherently a challenging task. Existing studies typically resort to decoupled pipelines for geometry estimation and appearance synthesis, often hindering unified reconstruction and causing inconsistencies. This paper introduces JGA-LBD, a novel framework that unifies the modeling of geometry and appearance into a joint latent representation and formulates the generation process as bridge diffusion. Observing that directly integrating heterogeneous input conditions (e.g., depth maps, SMPL models) leads to substantial training difficulties, we unify all conditions into 3D Gaussian representations, which can be further compressed into a unified latent space through a shared sparse variational autoencoder (VAE). Subsequently, the specialized form of bridge diffusion makes it possible to start from a partial observation of the target latent code and focus solely on inferring the missing components. Finally, a dedicated decoding module extracts the complete 3D human geometric structure and renders novel views from the inferred latent representation. Experiments demonstrate that JGA-LBD outperforms current state-of-the-art approaches in terms of both geometry fidelity and appearance quality, including challenging in-the-wild scenarios. Our code will be made publicly available at https://github.com/haiantyz/JGA-LBD.

Fonte: arXiv cs.CV

Vision • Score 92

HarmoniAD: Harmonizing Local Structures and Global Semantics for Anomaly Detection

arXiv:2601.00327v1 Announce Type: new Abstract: Anomaly detection is crucial in industrial product quality inspection. Failing to detect tiny defects often leads to serious consequences. Existing methods face a structure-semantics trade-off: structure-oriented models (such as frequency-based filters) are noise-sensitive, while semantics-oriented models (such as CLIP-based encoders) often miss fine details. To address this, we propose HarmoniAD, a frequency-guided dual-branch framework. Features are first extracted by the CLIP image encoder, then transformed into the frequency domain, and finally decoupled into high- and low-frequency paths for complementary modeling of structure and semantics. The high-frequency branch is equipped with a fine-grained structural attention module (FSAM) to enhance textures and edges for detecting small anomalies, while the low-frequency branch uses a global structural context module (GSCM) to capture long-range dependencies and preserve semantic consistency. Together, these branches balance fine detail and global semantics. HarmoniAD further adopts a multi-class joint training strategy, and experiments on MVTec-AD, VisA, and BTAD show state-of-the-art performance with both sensitivity and robustness.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

The Agentic Leash: Extracting Fuzzy Cognitive Maps of Causal Feedback with LLMs

We develop a large language model (LLM) agent that extracts fuzzy cognitive maps (FCMs) of causal feedback from raw text. The causal learning or extraction process is agentic both through the LLM's semi-autonomy and through the FCM system dynamics, which steer the LLM agents toward seeking out and processing causal text.

Fonte: arXiv cs.AI

Privacy/Security/Fairness • Score 90

Deep Delta Learning

The paper introduces Deep Delta Learning (DDL), a new architecture that generalizes the standard residual connection by modulating the identity shortcut with a learnable, data-dependent geometric transformation. This transformation, called the Delta Operator, lets the network explicitly control the spectrum of its transition operator, modeling complex, non-monotonic dynamics.

Fonte: arXiv cs.LG
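
Working only from the summary above, a speculative sketch of such a block might modulate the identity shortcut with a rank-1, data-dependent deformation; the parameterization here (a tanh gate and a learned direction u) is entirely our own illustration, not the paper's Delta Operator:

```python
# Speculative sketch of a residual block whose identity shortcut is
# modulated by a learnable, data-dependent transformation:
#   y = (I + beta(x) * u u^T) x + F(x)
# The rank-1 form and gating are illustrative assumptions only.
import torch
import torch.nn as nn

class DeltaResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.gate = nn.Linear(dim, 1)   # data-dependent modulation strength
        self.u = nn.Parameter(torch.randn(dim) / dim ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        beta = torch.tanh(self.gate(x))            # in (-1, 1), per sample
        # Shortcut = (I + beta * u u^T) x: a deformation of the identity whose
        # spectrum the network can steer away from 1 along direction u.
        shortcut = x + beta * (x @ self.u).unsqueeze(-1) * self.u
        return shortcut + self.f(x)
```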

NLP/LLMs • Score 95

StockBot 2.0: Vanilla LSTMs Outperform Transformer-based Forecasting for Stock Prices

arXiv:2601.00197v1 Announce Type: cross Abstract: Accurate forecasting of financial markets remains a long-standing challenge due to complex temporal and often latent dependencies, non-linear dynamics, and high volatility. Building on our earlier recurrent neural network framework, we present an enhanced StockBot architecture that systematically evaluates modern attention-based, convolutional, and recurrent time-series forecasting models within a unified experimental setting. While attention-based and transformer-inspired models offer increased modeling flexibility, extensive empirical evaluation reveals that a carefully constructed vanilla LSTM consistently achieves superior predictive accuracy and more stable buy/sell decision-making when trained under a common set of default hyperparameters. These results highlight the robustness and data efficiency of recurrent sequence models for financial time-series forecasting, particularly in the absence of extensive hyperparameter tuning, or when data discretized to single-day intervals are scarce. Additionally, these results underscore the importance of architectural inductive bias in data-limited market prediction tasks.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Reasoning in Action: MCTS-Guided Knowledge Retrieval for Large Language Models

Large language models (LLMs) usually improve their performance by retrieving semantically similar information or by strengthening their reasoning capabilities. This paper presents a reasoning-aware knowledge retrieval method that enriches LLMs with information aligned to the logical structure of conversations, going beyond surface-level semantic similarity.

Fonte: arXiv cs.AI

NLP/LLMs • Score 92

Probabilistic Guarantees for Reducing Contextual Hallucinations in LLMs

arXiv:2601.00641v1 Announce Type: new Abstract: Large language models (LLMs) frequently produce contextual hallucinations, where generated content contradicts or ignores information explicitly stated in the prompt. Such errors are particularly problematic in deterministic automation workflows, where inputs are fixed and correctness is unambiguous. We introduce a simple and model-agnostic framework that provides explicit probabilistic guarantees for reducing hallucinations in this setting. We formalize the notion of a specific task, defined by a fixed input and a deterministic correctness criterion, and show that issuing the same prompt in independent context windows yields an exponential reduction in the probability that all model outputs are incorrect. To identify a correct answer among repeated runs, we incorporate an LLM-as-a-judge and prove that the probability that the judged pipeline fails decays at a rate determined by the judge's true- and false-positive probabilities. When the judge is imperfect, we strengthen it through majority vote over independent judge calls, obtaining ensemble-level error rates that decrease exponentially in the number of votes. This yields an explicit bound on the probability that the pipeline selects a hallucinated answer. Experiments on controlled extraction tasks with synthetic noisy judges match these predictions exactly: pipeline failure decreases exponentially with the number of repetitions, and hallucination-selection decreases exponentially with the number of judges in the ensemble. Together, these results provide a lightweight, modular, and theoretically grounded method for driving hallucination probabilities arbitrarily low in fixed-input LLM workflows, without modifying model weights, decoding strategies, or prompt engineering.

Fonte: arXiv cs.CL
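
The exponential decay at the heart of the argument is easy to make concrete. Under an independence assumption (our paraphrase of the abstract, not the paper's exact bounds), the failure probabilities of repetition and of a majority-vote judge ensemble are:

```python
# With n independent runs that each fail with probability p, every run is
# wrong with probability p**n; with m independent judges each erring with
# probability q, a majority vote errs with the binomial tail below.
from math import comb

def p_all_runs_wrong(p: float, n: int) -> float:
    return p ** n

def p_majority_vote_wrong(q: float, m: int) -> float:
    # Probability that at least ceil(m/2) of m judges err (m odd).
    k_min = m // 2 + 1
    return sum(comb(m, k) * q**k * (1 - q)**(m - k) for k in range(k_min, m + 1))

# Example: p = 0.2, n = 5 -> 0.00032; q = 0.3, m = 5 -> ~0.163.
print(p_all_runs_wrong(0.2, 5), p_majority_vote_wrong(0.3, 5))
```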

Vision • Score 95

All-in-One Video Restoration under Smoothly Evolving Unknown Weather Degradations

arXiv:2601.00533v1 Announce Type: new Abstract: All-in-one image restoration aims to recover clean images from diverse unknown degradations using a single model. But extending this task to videos faces unique challenges. Existing approaches primarily focus on frame-wise degradation variation, overlooking the temporal continuity that naturally exists in real-world degradation processes. In practice, degradation types and intensities evolve smoothly over time, and multiple degradations may coexist or transition gradually. In this paper, we introduce the Smoothly Evolving Unknown Degradations (SEUD) scenario, where both the active degradation set and degradation intensity change continuously over time. To support this scenario, we design a flexible synthesis pipeline that generates temporally coherent videos with single, compound, and evolving degradations. To address the challenges in the SEUD scenario, we propose an all-in-One Recurrent Conditional and Adaptive prompting Network (ORCANet). First, a Coarse Intensity Estimation Dehazing (CIED) module estimates haze intensity using physical priors and provides coarse dehazed features as initialization. Second, a Flow Prompt Generation (FPG) module extracts degradation features. FPG generates both static prompts that capture segment-level degradation types and dynamic prompts that adapt to frame-level intensity variations. Furthermore, a label-aware supervision mechanism improves the discriminability of static prompt representations under different degradations. Extensive experiments show that ORCANet achieves superior restoration quality, temporal consistency, and robustness over image and video-based baselines. Code is available at https://github.com/Friskknight/ORCANet-SEUD.

Fonte: arXiv cs.CV

Theory/Optimization • Score 92

Nonparametric Instrumental Variables Inference with Many Weak Instruments

We study inference on linear functionals in the nonparametric instrumental variable (NPIV) problem with a discretely valued instrument under a many-weak-instruments asymptotic regime, where the number of instrument values grows with the sample size. A key motivating example is estimating long-term causal effects in a new experiment where only short-term outcomes are available.

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

Adapting Natural Language Processing Models Across Jurisdictions: A pilot Study in Canadian Cancer Registries

arXiv:2601.00787v1 Announce Type: new Abstract: Population-based cancer registries depend on pathology reports as their primary diagnostic source, yet manual abstraction is resource-intensive and contributes to delays in cancer data. While transformer-based NLP systems have improved registry workflows, their ability to generalize across jurisdictions with differing reporting conventions remains poorly understood. We present the first cross-provincial evaluation of adapting BCCRTron, a domain-adapted transformer model developed at the British Columbia Cancer Registry, alongside GatorTron, a biomedical transformer model, for cancer surveillance in Canada. Our training dataset consisted of approximately 104,000 and 22,000 de-identified pathology reports from the Newfoundland & Labrador Cancer Registry (NLCR) for Tier 1 (cancer vs. non-cancer) and Tier 2 (reportable vs. non-reportable) tasks, respectively. Both models were fine-tuned using complementary synoptic and diagnosis-focused report-section input pipelines. Across NLCR test sets, the adapted models maintained high performance, demonstrating that transformers pretrained in one jurisdiction can be localized to another with modest fine-tuning. To improve sensitivity, we combined the two models using a conservative OR-ensemble, achieving a Tier 1 recall of 0.99 and reducing missed cancers to 24, compared with 48 and 54 for the standalone models. For Tier 2, the ensemble achieved 0.99 recall and reduced missed reportable cancers to 33, compared with 54 and 46 for the individual models. These findings demonstrate that an ensemble combining complementary text representations substantially reduces missed cancers and improves error coverage in cancer-registry NLP. We implement a privacy-preserving workflow in which only model weights are shared between provinces, supporting interoperable NLP infrastructure and a future pan-Canadian foundation model for cancer pathology and registry workflows.

Fonte: arXiv cs.CL
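
The conservative OR-ensemble reduces, in sketch form, to flagging a report when either model flags it, trading some precision for recall. The predict_proba API below is a hypothetical stand-in for the two fine-tuned classifiers:

```python
# Sketch of a conservative OR-ensemble over two report classifiers.
# `model_a` and `model_b` are stand-ins for the fine-tuned BCCRTron and
# GatorTron models; the predict_proba method is a hypothetical API.
def or_ensemble(report: str, model_a, model_b, threshold: float = 0.5) -> bool:
    p_a = model_a.predict_proba(report)   # assumed to return P(cancer)
    p_b = model_b.predict_proba(report)
    # Flag if either model is confident: maximizes recall of missed cancers.
    return (p_a >= threshold) or (p_b >= threshold)
```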

NLP/LLMs • Score 89

Sigmoid Head for Quality Estimation under Language Ambiguity

arXiv:2601.00680v1 Announce Type: new Abstract: Language model (LM) probability is not a reliable quality estimator, as natural language is ambiguous. When multiple output options are valid, the model's probability distribution is spread across them, which can misleadingly indicate low output quality. This issue has two causes: (1) LMs' final output activation is softmax, which does not allow multiple correct options to receive high probabilities simultaneously, and (2) LMs are trained on single, one-hot encoded references, which imply that there is only one correct option at each output step. We propose training a module for Quality Estimation (QE) on top of pre-trained LMs to address these limitations. The module, called Sigmoid Head, is an extra unembedding head with sigmoid activation that tackles the first limitation. To tackle the second limitation, during the negative sampling process used to train the Sigmoid Head, we use a heuristic to avoid selecting potentially correct alternative tokens. Our Sigmoid Head is computationally efficient during training and inference. The probability from the Sigmoid Head is a notably better quality signal than that of the original softmax head. As the Sigmoid Head does not rely on human-annotated quality data, it is more robust to out-of-domain settings than supervised QE.

Fonte: arXiv cs.CL
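
Architecturally, the proposal amounts to an extra unembedding layer with a sigmoid instead of softmax, so several valid next tokens can all score high at once. A hedged sketch, with dimensions and the training note drawn only from the abstract:

```python
# Hedged sketch of a sigmoid unembedding head for quality estimation.
# Dimensions and the frozen-backbone setup are assumptions from the abstract.
import torch
import torch.nn as nn

class SigmoidHead(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.unembed = nn.Linear(hidden_dim, vocab_size)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Independent per-token scores in (0, 1); multiple correct options
        # may all score high, unlike mutually exclusive softmax probabilities.
        return torch.sigmoid(self.unembed(hidden))

# Training would use binary labels: 1 for the reference token, 0 for negatives
# sampled with a heuristic that skips likely-correct alternative tokens.
```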

NLP/LLMs • Score 95

CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

arXiv:2601.00588v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in cost-sensitive and on-device scenarios, and safety guardrails have advanced mainly in English. However, real-world Chinese malicious queries typically conceal intent via homophones, pinyin, symbol-based splitting, and other Chinese-specific patterns. These Chinese-specific adversarial patterns create a safety evaluation gap that is not well captured by existing benchmarks focused on English. This gap is particularly concerning for lightweight models, which may be more vulnerable to such specific adversarial perturbations. To bridge this gap, we introduce the Chinese-Specific Safety Benchmark (CSSBench), which emphasizes these adversarial patterns and evaluates the safety of lightweight LLMs in Chinese. Our benchmark covers six domains that are common in real Chinese scenarios, including illegal activities and compliance, privacy leakage, health and medical misinformation, fraud and hate, adult content, and public and political safety, and organizes queries into multiple task types. We evaluate a set of popular lightweight LLMs and measure over-refusal behavior to assess safety-induced performance degradation. Our results show that Chinese-specific adversarial patterns are a critical challenge for lightweight LLMs. This benchmark offers a comprehensive evaluation of LLM safety in Chinese, assisting robust deployments in practice.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

InfoSynth: Information-Guided Benchmark Synthesis for LLMs

arXiv:2601.00575v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity compared to their seed datasets. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/

Fonte: arXiv cs.CL

Applications • Score 90

The Illusion of Insight in Reasoning Models

Reasoning models such as DeepSeek-R1-Zero can have 'Aha!' moments, but the relationship between intrinsic shifts in reasoning strategy and performance gains remains unclear. This study analyzes shifts during reasoning across more than 1 million traces, finding that such shifts are rare and do not improve accuracy, although their effect varies with model uncertainty.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak

arXiv:2601.00213v1 Announce Type: cross Abstract: The widespread deployment of large language models (LLMs) has raised growing concerns about their misuse risks and associated safety issues. While prior studies have examined the safety of LLMs in general usage, code generation, and agent-based applications, their vulnerabilities in automated algorithm design remain underexplored. To fill this gap, this study investigates this overlooked safety vulnerability, with a particular focus on intelligent optimization algorithm design, given its prevalent use in complex decision-making scenarios. We introduce MalOptBench, a benchmark consisting of 60 malicious optimization algorithm requests, and propose MOBjailbreak, a jailbreak method tailored for this scenario. Through extensive evaluation of 13 mainstream LLMs including the latest GPT-5 and DeepSeek-V3.1, we reveal that most models remain highly susceptible to such attacks, with an average attack success rate of 83.59% and an average harmfulness score of 4.28 out of 5 on original harmful prompts, and near-complete failure under MOBjailbreak. Furthermore, we assess state-of-the-art plug-and-play defenses that can be applied to closed-source models, and find that they are only marginally effective against MOBjailbreak and prone to exaggerated safety behaviors. These findings highlight the urgent need for stronger alignment techniques to safeguard LLMs against misuse in algorithm design.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

Fast-weight Product Key Memory

arXiv:2601.00671v1 Announce Type: new Abstract: Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse Product Key Memory (PKM) from a static module into a dynamic, "fast-weight" episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.

Fonte: arXiv cs.CL

Vision • Score 95

ReMA: A Training-Free Plug-and-Play Mixing Augmentation for Video Behavior Recognition

arXiv:2601.00311v1 Announce Type: new Abstract: Video behavior recognition demands stable and discriminative representations under complex spatiotemporal variations. However, prevailing data augmentation strategies for videos remain largely perturbation-driven, often introducing uncontrolled variations that amplify non-discriminative factors, ultimately weakening intra-class distributional structure and causing representation drift, with inconsistent gains across temporal scales. To address these problems, we propose Representation-aware Mixing Augmentation (ReMA), a plug-and-play augmentation strategy that formulates mixing as a controlled replacement process to expand representations while preserving class-conditional stability. ReMA integrates two complementary mechanisms. First, the Representation Alignment Mechanism (RAM) performs structured intra-class mixing under distributional alignment constraints, suppressing irrelevant intra-class drift while enhancing statistical reliability. Then, the Dynamic Selection Mechanism (DSM) generates motion-aware spatiotemporal masks to localize perturbations, guiding them away from discrimination-sensitive regions and promoting temporal coherence. By jointly controlling how and where mixing is applied, ReMA improves representation robustness without additional supervision or trainable parameters. Extensive experiments on diverse video behavior benchmarks demonstrate that ReMA consistently enhances generalization and robustness across different spatiotemporal granularities.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR

arXiv:2601.00557v1 Announce Type: new Abstract: Large-scale multilingual ASR (mASR) models such as Whisper achieve strong performance but incur high computational and latency costs, limiting their deployment on resource-constrained edge devices. In this study, we propose a lightweight and language-agnostic multilingual ASR system based on a CTC architecture with domain adaptation. Specifically, we introduce a Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework integrated into an mHuBERT-CTC model, enabling end-to-end decoding via LID-posterior-driven LoRA routing. The hierarchical design consists of a multilingual shared LoRA for learning language-invariant acoustic representations and language-specific LoRA experts for modeling language-dependent characteristics. The proposed routing mechanism removes the need for prior language identity information or explicit language labels during inference, achieving true language-agnostic decoding. Experiments on MSR-86K and the MLC-SLM 2025 Challenge datasets demonstrate that HLoRA achieves competitive performance with state-of-the-art two-stage inference methods using only single-pass decoding, significantly improving decoding efficiency for low-resource mASR applications.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations

arXiv:2601.00454v1 Announce Type: new Abstract: Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S, a training paradigm that fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations rather than complete dialogue histories. We provide a formal complexity analysis showing that M2S reduces training cost from $O(n^2)$ to $O(n)$ for $n$-turn conversations. Empirically, on our training dataset (779 samples, avg. 10.6 turns), M2S requires only 169K tokens compared to 15.7M tokens for the multi-turn baseline -- a 93$\times$ reduction. We evaluate Defensive M2S across three guardrail model families (LlamaGuard, Nemotron, Qwen3Guard) and three compression templates (hyphenize, numberize, pythonize) on SafeDialBench, a comprehensive multi-turn jailbreak benchmark. Our best configuration, Qwen3Guard with hyphenize compression, achieves 93.8% attack detection recall while reducing inference tokens by 94.6% (from 3,231 to 173 tokens per conversation). This represents a 38.9 percentage point improvement over the baseline while dramatically reducing both training and inference costs. Our findings demonstrate that M2S compression can serve as an effective efficiency technique for guardrail deployment, enabling scalable safety screening of long multi-turn conversations.
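
For intuition, a toy version of one compression template is shown below; the instruction wording is an assumption, but it illustrates how n turns collapse into a single O(n)-token prompt for the guardrail.

```python
def hyphenize(turns):
    """Compress a multi-turn conversation into one single-turn prompt by
    listing each user turn as a hyphen bullet (a sketch of the 'hyphenize'
    M2S template family; the header wording here is an assumption)."""
    bullets = "\n".join(f"- {t}" for t in turns)
    return ("Please answer the following list of requests from a user:\n"
            f"{bullets}")

conversation = [
    "Hi, I'm writing a chemistry paper.",
    "Can you describe energetic reactions in general terms?",
    "Now give step-by-step synthesis instructions.",
]
prompt = hyphenize(conversation)
print(prompt)
# The guardrail model is then trained/run on `prompt` (O(n) tokens)
# instead of the full quadratic multi-turn transcript.
```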

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach

arXiv:2601.00388v1 Announce Type: new Abstract: Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.
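
The reward component is easy to make concrete. Below is a standard Haversine distance plus a distance-decaying reward; the exponential shape and the 500 km scale are assumptions of this sketch, since the abstract only states that rewards are Haversine-based.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def geo_reward(pred, target, scale_km=500.0):
    """Coordinate-aligned reward that decays smoothly with distance."""
    return math.exp(-haversine_km(*pred, *target) / scale_km)

print(geo_reward((48.86, 2.35), (48.85, 2.29)))    # Paris vs. Paris: ~1.0
print(geo_reward((48.86, 2.35), (35.68, 139.69)))  # Paris vs. Tokyo: ~0.0
```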

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Rumo ao Diagnóstico Diferencial Automatizado de Doenças de Pele Usando Deep Learning e Estratégias Conscientes de Desbalanceamento

À medida que as condições dermatológicas se tornam cada vez mais comuns e a disponibilidade de dermatologistas permanece limitada, há uma necessidade crescente de ferramentas inteligentes para apoiar pacientes e clínicos no diagnóstico oportuno e preciso de doenças de pele. Neste projeto, desenvolvemos um modelo baseado em deep learning para a classificação e diagnóstico de condições cutâneas.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Exploring the Performance of Large Language Models on Subjective Span Identification Tasks

arXiv:2601.00736v1 Announce Type: new Abstract: Identifying relevant text spans is important for several downstream tasks in NLP, as it contributes to model explainability. While most span identification approaches rely on relatively smaller pre-trained language models like BERT, a few recent approaches have leveraged the latest generation of Large Language Models (LLMs) for the task. Current work has focused on explicit span identification like Named Entity Recognition (NER), while more subjective span identification with LLMs in tasks like Aspect-based Sentiment Analysis (ABSA) has been underexplored. In this paper, we fill this important gap by presenting an evaluation of the performance of various LLMs on text span identification in three popular tasks, namely sentiment analysis, offensive language identification, and claim verification. We explore several LLM strategies like instruction tuning, in-context learning, and chain of thought. Our results indicate that underlying relationships within text aid LLMs in identifying precise text spans.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Memory Bank Compression for Continual Adaptation of Large Language Models

arXiv:2601.00756v1 Announce Type: cross Abstract: Large Language Models (LLMs) have become a mainstay for many everyday applications. However, as data evolve, their knowledge quickly becomes outdated. Continual learning aims to update LLMs with new information without erasing previously acquired knowledge. Although methods such as full fine-tuning can incorporate new data, they are computationally expensive and prone to catastrophic forgetting, where prior knowledge is overwritten. Memory-augmented approaches address this by equipping LLMs with a memory bank: an external memory module that stores information for future use. However, these methods face a critical limitation: the memory bank grows without bound in real-world scenarios where large-scale data streams arrive. In this paper, we propose MBC, a model that compresses the memory bank through a codebook optimization strategy during online adaptation learning. To ensure stable learning, we also introduce an online resetting mechanism that prevents codebook collapse. In addition, we employ Key-Value Low-Rank Adaptation in the attention layers of the LLM, enabling efficient utilization of the compressed memory representations. Experiments with benchmark question-answering datasets demonstrate that MBC reduces the memory bank size to 0.3% of that of the most competitive baseline, while maintaining high retention accuracy during online adaptation learning. Our code is publicly available at https://github.com/Thomkat/MBC.
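
A toy version of the compression idea: quantize incoming memory vectors against a small codebook, update codes online, and reset codes that fall out of use so the codebook cannot collapse. The EMA update and the usage-based reset rule below are assumptions of this sketch; MBC's actual codebook optimization is more involved.

```python
import numpy as np

class CodebookMemory:
    """Compress a growing memory bank into a fixed codebook plus indices."""
    def __init__(self, n_codes, dim, decay=0.9, seed=0):
        self.rng = np.random.default_rng(seed)
        self.codes = self.rng.normal(size=(n_codes, dim))
        self.usage = np.ones(n_codes)     # start "alive" to avoid mass resets
        self.decay = decay

    def write(self, x):
        """Store x as the index of its nearest code (the compressed form)."""
        k = int(np.linalg.norm(self.codes - x, axis=1).argmin())
        self.codes[k] = self.decay * self.codes[k] + (1 - self.decay) * x
        self.usage *= self.decay
        self.usage[k] += 1.0
        dead = self.usage < 1e-3          # online reset: revive unused codes
        n_dead = int(dead.sum())
        if n_dead:
            self.codes[dead] = x + 0.05 * self.rng.normal(size=(n_dead, len(x)))
            self.usage[dead] = 1.0
        return k

mem = CodebookMemory(n_codes=32, dim=8)
stream = np.random.default_rng(1).normal(size=(1000, 8))
ids = [mem.write(x) for x in stream]      # 1000 vectors -> 32 codes + indices
print(f"{len(set(ids))} of 32 codes in use")
```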

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence

arXiv:2601.00596v1 Announce Type: new Abstract: Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent's capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.

Fonte: arXiv cs.CL

NLP/LLMs • Score 93

Modelos de Linguagem de Grande Escala Ainda Podem Explicar a Si Mesmos? Investigando o Impacto da Quantização nas Autoexplicações

A quantização é amplamente utilizada para acelerar a inferência e otimizar a implementação de modelos de linguagem de grande escala (LLMs), mas seus efeitos nas autoexplicações (SEs) permanecem inexplorados. Este estudo investiga a degradação da qualidade e fidelidade das SEs devido à quantização, analisando explicações em linguagem natural (NLEs) e exemplos contrafactuais gerados por LLMs quantizados com três técnicas comuns.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Comparative Efficiency Analysis of Lightweight Transformer Models: A Multi-Domain Empirical Benchmark for Enterprise NLP Deployment

arXiv:2601.00444v1 Announce Type: new Abstract: In the rapidly evolving landscape of enterprise natural language processing (NLP), the demand for efficient, lightweight models capable of handling multi-domain text automation tasks has intensified. This study conducts a comparative analysis of three prominent lightweight Transformer models - DistilBERT, MiniLM, and ALBERT - across three distinct domains: customer sentiment classification, news topic classification, and toxicity and hate speech detection. Utilizing datasets from IMDB, AG News, and the Measuring Hate Speech corpus, we evaluated performance using accuracy-based metrics including accuracy, precision, recall, and F1-score, as well as efficiency metrics such as model size, inference time, throughput, and memory usage. Key findings reveal that no single model dominates all performance dimensions. ALBERT achieves the highest task-specific accuracy in multiple domains, MiniLM excels in inference speed and throughput, and DistilBERT demonstrates the most consistent accuracy across tasks while maintaining competitive efficiency. All results reflect controlled fine-tuning under fixed enterprise-oriented constraints rather than exhaustive hyperparameter optimization. These results highlight trade-offs between accuracy and efficiency, recommending MiniLM for latency-sensitive enterprise applications, DistilBERT for balanced performance, and ALBERT for resource-constrained environments.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Learning Speech Representations with Variational Predictive Coding

arXiv:2601.00100v1 Announce Type: cross Abstract: Despite being the best-known objective for learning speech representations, the HuBERT objective has not been further developed and improved. We argue that it is the lack of an underlying principle that stalls this development, and, in this paper, we show that predictive coding under a variational view is the principle behind the HuBERT objective. Due to its generality, our formulation provides opportunities to improve parameterization and optimization, and we show two simple modifications that bring immediate improvements to the HuBERT objective. In addition, the predictive coding formulation has tight connections to various other objectives, such as APC, CPC, wav2vec, and BEST-RQ. Empirically, the improved pre-training brings significant gains on four downstream tasks: phone classification, f0 tracking, speaker recognition, and automatic speech recognition, highlighting the importance of the predictive coding interpretation.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Rule-Based Approaches to Atomic Sentence Extraction

arXiv:2601.00506v1 Announce Type: new Abstract: Natural language often combines multiple ideas into complex sentences. Atomic sentence extraction, the task of decomposing complex sentences into simpler sentences that each express a single idea, improves performance in information retrieval, question answering, and automated reasoning systems. Previous work has formalized the "split-and-rephrase" task and established evaluation metrics, and machine learning approaches using large language models have improved extraction accuracy. However, these methods lack interpretability and provide limited insight into which linguistic structures cause extraction failures. Although some studies have explored dependency-based extraction of subject-verb-object triples and clauses, no principled analysis has examined which specific clause structures and dependencies lead to extraction difficulties. This study addresses this gap by analyzing how complex sentence structures, including relative clauses, adverbial clauses, coordination patterns, and passive constructions, affect the performance of rule-based atomic sentence extraction. Using the WikiSplit dataset, we implemented dependency-based extraction rules in spaCy, generated 100 gold-standard atomic sentence sets, and evaluated performance using ROUGE and BERTScore. The system achieved ROUGE-1 F1 = 0.6714, ROUGE-2 F1 = 0.478, ROUGE-L F1 = 0.650, and BERTScore F1 = 0.5898, indicating moderate-to-high lexical, structural, and semantic alignment. Challenging structures included relative clauses, appositions, coordinated predicates, adverbial clauses, and passive constructions. Overall, rule-based extraction is reasonably accurate but sensitive to syntactic complexity.
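
As a flavor of the rule family involved, the sketch below splits one structure the paper lists as challenging, coordinated predicates, using spaCy dependency labels. It assumes the en_core_web_sm model is installed and handles only this single pattern; a full system needs many more rules (relative clauses, appositions, passives, ...).

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def split_coordinated_predicates(sentence):
    """Split 'SUBJ verb ... and verb ...' into one sentence per predicate,
    re-attaching the shared subject."""
    doc = nlp(sentence)
    root = next(t for t in doc if t.dep_ == "ROOT")
    conjs = [t for t in root.children if t.dep_ == "conj" and t.pos_ == "VERB"]
    if not conjs:
        return [sentence]
    subject = [t for t in root.children if t.dep_.startswith("nsubj")]
    results = []
    for verb in [root] + conjs:
        if verb is root:   # root clause = everything minus the other conjuncts
            keep = [t for t in doc if not any(t in c.subtree for c in conjs)]
        else:              # conjunct clause = its subtree plus the shared subject
            keep = subject + list(verb.subtree)
        keep = [t for t in keep if t.dep_ not in ("cc", "punct")]
        results.append(" ".join(t.text for t in
                                sorted(set(keep), key=lambda t: t.i)) + ".")
    return results

print(split_coordinated_predicates(
    "Marie founded the institute and won two Nobel Prizes."))
# expected (model-dependent):
# ['Marie founded the institute.', 'Marie won two Nobel Prizes.']
```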

Fonte: arXiv cs.CL

MLOps/Systems • Score 96

Benchmarking de Métodos de Pré-processamento e Integração em Genômica de Células Únicas

A análise de dados de células únicas tem o potencial de revolucionar a medicina personalizada ao caracterizar mudanças moleculares associadas a doenças em nível celular. Este estudo examina um pipeline geral para análise de dados de células únicas, avaliando diferentes métodos de normalização, redução de dimensionalidade e integração, utilizando seis conjuntos de dados variados.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

FlashInfer-Bench: Construindo o Ciclo Virtuoso para Sistemas LLM Impulsionados por IA

Avanços recentes mostram que modelos de linguagem de grande escala (LLMs) podem atuar como agentes autônomos capazes de gerar kernels de GPU, mas integrar esses kernels gerados por IA em sistemas de inferência do mundo real continua sendo um desafio. O FlashInfer-Bench aborda essa lacuna ao estabelecer um framework padronizado e de ciclo fechado que conecta geração de kernels, benchmarking e implantação.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

ECR: Manifold-Guided Semantic Cues for Compact Language Models

arXiv:2601.00543v1 Announce Type: new Abstract: Compact models often lose the structure of their embedding space. The issue shows up when model capacity is tight or the data spans several languages. Such collapse makes it difficult for downstream tasks to build on the resulting representation. Existing compression methods focus on aligning model outputs at a superficial level but fail to preserve the underlying manifold structure. This mismatch often leads to semantic drift in the compact model, causing both task behavior and linguistic properties to deviate from the reference model. To address these issues, we propose a new framework called Embedding Consistency Regulation (ECR). This framework first derives a set of semantic anchors from teacher embeddings (computed once offline). Then, the compact model learns to maintain consistent geometry around these anchors, without relying on matching logits or internal features. ECR adds only a small projection step at inference, without altering the decoding architecture or its runtime behavior. In experiments on a 100K multilingual corpus, ECR consistently stabilizes training and preserves semantic structure across tasks and languages. It also produces a more compact and task-aligned representation space, enabling low-capacity models to learn cleaner manifolds than conventional baselines. ECR works without teacher outputs and is compatible with, but independent of, distillation. Taken together, our results show that ECR helps compact models better follow task requirements and makes them easier to deploy under strict efficiency or privacy limits.
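
One way to read "consistent geometry around anchors" is a similarity-profile match, sketched below in PyTorch. The cosine-similarity profile, MSE objective, and offline-precomputed targets are assumptions of this sketch rather than ECR's published objective.

```python
import torch
import torch.nn.functional as F

def ecr_loss(student_emb, proj, anchors, target_sim):
    """Anchor-geometry consistency: project the compact model's embedding
    into the anchor space and match its similarity profile to K anchors
    against a target profile precomputed offline from the teacher."""
    z = proj(student_emb)                                       # (B, d_a)
    sim = F.cosine_similarity(z.unsqueeze(1), anchors.unsqueeze(0), dim=-1)
    return F.mse_loss(sim, target_sim)                          # (B, K) match

B, d_s, d_a, K = 8, 64, 128, 16
proj = torch.nn.Linear(d_s, d_a)       # the small projection kept at inference
anchors = torch.randn(K, d_a)          # stand-in for teacher-derived anchors
target = torch.rand(B, K) * 2 - 1      # stand-in for offline similarity targets
loss = ecr_loss(torch.randn(B, d_s), proj, anchors, target)
loss.backward()
print(float(loss))
```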

Fonte: arXiv cs.CL

Vision • Score 93

Aprendendo a Ser Reproduzível: Design de Função de Perda Personalizada para Redes Neurais Robustas

Para melhorar a reprodutibilidade e a confiabilidade de modelos de deep learning, abordamos uma lacuna crítica nas metodologias de treinamento atuais: a falta de mecanismos que garantam desempenho consistente e robusto em diferentes execuções. Nossa análise empírica revela que, mesmo sob condições controladas, a precisão do modelo pode apresentar variabilidade significativa.

Fonte: arXiv cs.LG

NLP/LLMs • Score 88

BERT-JEPA: Reorganizing CLS Embeddings for Language-Invariant Semantics

arXiv:2601.00366v1 Announce Type: new Abstract: Joint Embedding Predictive Architectures (JEPA) constitute a self-supervised training technique that has recently shown promise across domains. We introduce BERT-JEPA (BEPA), a training paradigm that adds a JEPA training objective to BERT-style models, combating a collapsed [CLS] embedding space and turning it into a language-agnostic space. This new structure leads to increased performance across multilingual benchmarks.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Além de APIs Perfeitas: Uma Avaliação Abrangente de Agentes LLM Sob a Complexidade Real de APIs

Apresentamos o WildAGTEval, um benchmark projetado para avaliar as capacidades de chamada de função de agentes de modelos de linguagem grande (LLM) sob a complexidade realista de APIs. Ao contrário de trabalhos anteriores que assumem um sistema de API idealizado, WildAGTEval considera a especificação e a execução de APIs, oferecendo cenários de complexidade variados para avaliar o desempenho dos LLMs.

Fonte: arXiv cs.AI

RL • Score 95

Retrieval-Reasoning Processes for Multi-hop Question Answering: A Four-Axis Design Framework and Empirical Trends

arXiv:2601.00536v1 Announce Type: new Abstract: Multi-hop question answering (QA) requires systems to iteratively retrieve evidence and reason across multiple hops. While recent RAG and agentic methods report strong results, the underlying retrieval-reasoning process is often left implicit, making procedural choices hard to compare across model families. This survey takes the execution procedure as the unit of analysis and introduces a four-axis framework covering (A) overall execution plan, (B) index structure, (C) next-step control (strategies and triggers), and (D) stop/continue criteria. Using this schema, we map representative multi-hop QA systems and synthesize reported ablations and tendencies on standard benchmarks (e.g., HotpotQA, 2WikiMultiHopQA, MuSiQue), highlighting recurring trade-offs among effectiveness, efficiency, and evidence faithfulness. We conclude with open challenges for retrieval-reasoning agents, including structure-aware planning, transferable control policies, and robust stopping under distribution shift.

Fonte: arXiv cs.CL

MLOps/Systems • Score 92

Noise-Aware Named Entity Recognition for Historical VET Documents

arXiv:2601.00488v1 Announce Type: new Abstract: This paper addresses Named Entity Recognition (NER) in the domain of Vocational Education and Training (VET), focusing on historical, digitized documents that suffer from OCR-induced noise. We propose a robust NER approach leveraging Noise-Aware Training (NAT) with synthetically injected OCR errors, transfer learning, and multi-stage fine-tuning. Three complementary strategies, training on noisy, clean, and artificial data, are systematically compared. Our method is one of the first to recognize multiple entity types in VET documents. It is applied to German documents but transferable to arbitrary languages. Experimental results demonstrate that domain-specific and noise-aware fine-tuning substantially increases robustness and accuracy under noisy conditions. We provide publicly available code for reproducible noise-aware NER in domain-specific contexts.
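
Synthetic OCR-error injection of the kind used for noise-aware training can be sketched in a few lines; the confusion table and corruption rate below are illustrative assumptions, not the paper's noise model.

```python
import random

# Common OCR confusions for Latin-script text (illustrative).
OCR_CONFUSIONS = {
    "rn": "m", "m": "rn", "l": "1", "1": "l", "O": "0", "0": "O",
    "e": "c", "c": "e", "ä": "a", "ü": "u", "ß": "B",
}

def inject_ocr_noise(text, rate=0.1, seed=0):
    """Synthetically corrupt clean text with OCR-style errors so an NER
    model can be fine-tuned noise-aware (NAT-style training data)."""
    rng = random.Random(seed)
    out, i = [], 0
    while i < len(text):
        two, one = text[i:i + 2], text[i]
        if two in OCR_CONFUSIONS and rng.random() < rate:
            out.append(OCR_CONFUSIONS[two]); i += 2
        elif one in OCR_CONFUSIONS and rng.random() < rate:
            out.append(OCR_CONFUSIONS[one]); i += 1
        else:
            out.append(one); i += 1
    return "".join(out)

print(inject_ocr_noise("Die moderne Lehrlingsausbildung in Bern", rate=0.5))
```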

Fonte: arXiv cs.CL

Vision • Score 95

DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection

arXiv:2601.00303v1 Announce Type: new Abstract: Speech is a scalable and non-invasive biomarker for early mental health screening. However, widely used depression datasets like DAIC-WOZ exhibit strong coupling between linguistic sentiment and diagnostic labels, encouraging models to learn semantic shortcuts. As a result, model robustness may be compromised in real-world scenarios, such as Camouflaged Depression, where individuals maintain socially positive or neutral language despite underlying depressive states. To mitigate this semantic bias, we propose DepFlow, a three-stage depression-conditioned text-to-speech framework. First, a Depression Acoustic Encoder learns speaker- and content-invariant depression embeddings through adversarial training, achieving effective disentanglement while preserving depression discriminability (ROC-AUC: 0.693). Second, a flow-matching TTS model with FiLM modulation injects these embeddings into synthesis, enabling control over depressive severity while preserving content and speaker identity. Third, a prototype-based severity mapping mechanism provides smooth and interpretable manipulation across the depression continuum. Using DepFlow, we construct a Camouflage Depression-oriented Augmentation (CDoA) dataset that pairs depressed acoustic patterns with positive/neutral content from a sentiment-stratified text bank, creating acoustic-semantic mismatches underrepresented in natural data. Evaluated across three depression detection architectures, CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming conventional augmentation strategies in depression detection. Beyond enhancing robustness, DepFlow provides a controllable synthesis platform for conversational systems and simulation-based evaluation, where real clinical data remains limited by ethical and coverage constraints.

Fonte: arXiv cs.CL

RL • Score 95

Clustering por Denoising: Difusão latente plug-and-play para dados de célula única

O sequenciamento de RNA de célula única (scRNA-seq) permite o estudo da heterogeneidade celular. No entanto, a precisão do clustering e as análises subsequentes baseadas em rótulos celulares ainda são desafiadoras devido ao ruído de medição e à variabilidade biológica. Apresentamos um framework de difusão plug-and-play que separa o espaço de observação e o espaço de denoising.

Fonte: arXiv stat.ML

Evaluation/Benchmarks • Score 93

Correspondência de Fluxo Latente para Síntese de Voz Cantada Expressiva

A síntese de voz cantada baseada em autoencoders variacionais condicionais (cVAE) oferece inferência eficiente e alta qualidade de áudio ao aprender um espaço latente condicionado pela partitura e um espaço latente posterior condicionado por gravações. No entanto, a correspondência imperfeita entre as distribuições pode degradar a expressividade fina, como vibrato e micro-prosódia. Propomos o FM-Singer, que introduz a correspondência de fluxo condicional (CFM) no espaço latente.

Fonte: arXiv cs.AI

Vision • Score 95

Simulação como Supervisão: Pré-treinamento Mecânico para Descoberta Científica

A modelagem científica enfrenta um trade-off entre a interpretabilidade da teoria mecanicista e o poder preditivo do machine learning. Apresentamos as Simulation-Grounded Neural Networks (SGNNs), um framework que incorpora conhecimento de domínio nos dados de treinamento, permitindo que o modelo aprenda padrões amplos de possibilidade física e seja mais robusto a erros de especificação do modelo.

Fonte: arXiv stat.ML

NLP/LLMs • Score 96

Grande Estudo de Caso Empírico: Go-Explore adaptado para Testes de Red Team de IA

Agentes LLM de produção com capacidades de uso de ferramentas requerem testes de segurança, apesar de seu treinamento em segurança. Adaptamos o Go-Explore para avaliar o GPT-4o-mini em 28 execuções experimentais, abordando seis questões de pesquisa. Nossos resultados mostram que a variação de sementes aleatórias domina os parâmetros algorítmicos, resultando em uma variação de 8x nos resultados.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Limite de Largura Infinita de uma Única Camada de Atenção: Análise via Programas Tensor

No presente artigo, identificamos rigorosamente a distribuição do limite de largura infinita de variáveis dentro de uma única camada de atenção, utilizando o framework Tensor Programs. Derivamos a forma exata dessa lei limite, demonstrando que ela se desvia fundamentalmente da Gaussianidade. Nossos experimentos numéricos validam as previsões teóricas, confirmando a eficácia da teoria em largura finita e a descrição precisa da atenção com um número finito de cabeças.

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

Correção Residual Segura Inspirada em Causalidade para Séries Temporais Multivariadas

Embora os preditores multivariados modernos, como Transformers e GNNs, apresentem forte desempenho em benchmarks, eles frequentemente sofrem de erros sistemáticos em variáveis ou horizontes específicos e, criticamente, carecem de garantias contra degradação de desempenho na implementação. Para abordar essa lacuna de segurança, propomos o CRC (Correção Residual Segura Inspirada em Causalidade), um framework projetado para garantir não degradação.

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

PolarGrad: Uma Classe de Otimizadores de Gradiente de Matriz a Partir de uma Perspectiva Unificadora de Pré-condicionamento

O aumento contínuo da escala dos modelos de deep learning e dos dados de treinamento destaca a importância crítica de métodos de otimização eficientes. Neste artigo, introduzimos um framework unificador para analisar métodos de pré-condicionamento 'conscientes de matriz', levando a uma nova classe de métodos de otimização que demonstram convergência mais rápida.

Fonte: arXiv stat.ML

Vision • Score 93

Redes Neurais Pulsadas Personalizadas com Sinapses Ferroelétricas para Processamento de Sinais EEG

As interfaces cérebro-computador (BCIs) baseadas em eletroencefalografia (EEG) são fortemente afetadas por sinais neurais não estacionários, limitando a generalização de modelos independentes de sujeito. Este trabalho demonstra que redes neurais pulsadas (SNNs) podem ser implementadas em dispositivos sinápticos memristivos ferroelétricos para decodificação adaptativa de imagens motoras baseadas em EEG, mesmo sob restrições de dispositivo.

Fonte: arXiv cs.AI

RL • Score 95

Projetando uma Rede de Sensores Ótima Através da Minimização da Perda de Informação

O design experimental ótimo é um tópico clássico em estatística, com muitos problemas e soluções bem estudados. Este trabalho investiga o posicionamento de sensores para monitorar processos espaço-temporais, considerando a dimensão temporal em nossa modelagem e otimização. Apresentamos um novo critério de posicionamento de sensores baseado em modelo, juntamente com um algoritmo de otimização altamente eficiente.

Fonte: arXiv stat.ML

Vision • Score 89

Transporte Ótimo Sliced em Streaming

O transporte ótimo sliced (SOT), ou distância Wasserstein sliced (SW), é amplamente reconhecido por sua escalabilidade estatística e computacional. Neste trabalho, aprimoramos ainda mais a escalabilidade computacional ao propor o primeiro método para estimar SW a partir de fluxos de amostras, chamado streaming sliced Wasserstein (Stream-SW).

Fonte: arXiv stat.ML

NLP/LLMs • Score 96

Um Modelo de Linguagem Grande Aprimorado por Visão e Conhecimento para Inferência Generalizável do Comportamento de Travessia de Pedestres

Os paradigmas existentes para inferir o comportamento de travessia de pedestres apresentam generalização limitada e desempenho inadequado em novos locais. Este estudo introduz o Pedestrian Crossing LLM (PedX-LLM), um framework aprimorado que transforma a inferência de travessia de padrões específicos do local para raciocínio comportamental generalizável, alcançando 82,0% de precisão balanceada e superando métodos tradicionais.

Fonte: arXiv cs.AI

Multimodal • Score 89

uGMM-NN: Rede Neural de Modelo de Mistura Gaussiana Univariada

Este artigo apresenta a Rede Neural de Modelo de Mistura Gaussiana Univariada (uGMM-NN), uma nova arquitetura neural que incorpora raciocínio probabilístico diretamente nas unidades computacionais de redes profundas. Cada nó do uGMM-NN parametriza suas ativações como uma mistura gaussiana univariada, permitindo representações mais ricas e capturando multimodalidade e incerteza.

Fonte: arXiv stat.ML

RL • Score 95

Mitigando o viés otimista na estimativa e otimização de risco entrópico

A medida de risco entrópico é amplamente utilizada em decisões críticas em economia, ciência da gestão, finanças e sistemas de controle críticos, pois captura riscos extremos associados a perdas incertas. Este trabalho apresenta um procedimento de bootstrap paramétrico que corrige o viés do estimador empírico de risco entrópico, melhorando a precisão na tomada de decisões.
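Para concretizar a ideia, segue um esboço mínimo em Python de correção de viés por bootstrap paramétrico para o risco entrópico $\rho_\theta(X) = \theta^{-1} \log \mathbb{E}[e^{\theta X}]$; a família gaussiana ajustada e os hiperparâmetros são suposições deste esboço, não o procedimento exato do artigo.

```python
import numpy as np

def entropic_risk(x, theta):
    """Risco entrópico empírico: (1/theta) * log E[exp(theta * X)],
    calculado via log-sum-exp para estabilidade numérica."""
    m = (theta * x).max()
    return (m + np.log(np.mean(np.exp(theta * x - m)))) / theta

def bootstrap_corrected_risk(x, theta, n_boot=2000, seed=0):
    """Correção de viés por bootstrap paramétrico, assumindo perdas
    gaussianas (suposição deste esboço; o artigo é mais geral)."""
    rng = np.random.default_rng(seed)
    mu, sigma = x.mean(), x.std(ddof=1)
    plug_in = entropic_risk(x, theta)
    # sob o modelo gaussiano ajustado, o valor "verdadeiro" tem forma fechada
    true_fitted = mu + theta * sigma**2 / 2
    boots = [entropic_risk(rng.normal(mu, sigma, size=len(x)), theta)
             for _ in range(n_boot)]
    bias = np.mean(boots) - true_fitted     # viés otimista estimado
    return plug_in - bias

x = np.random.default_rng(1).normal(1.0, 2.0, size=200)  # perdas simuladas
print(entropic_risk(x, theta=1.0), bootstrap_corrected_risk(x, theta=1.0))
```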

Fonte: arXiv stat.ML

NLP/LLMs • Score 96

Construindo um Matemático Neuro-Simbólico a Partir de Princípios Fundamentais

Modelos de Linguagem de Grande Escala (LLMs) apresentam falhas lógicas persistentes em raciocínios complexos devido à falta de um framework axiomático interno. Propomos o Mathesis, uma arquitetura neuro-simbólica que codifica estados matemáticos como hipergrafos de ordem superior e utiliza um Kernel de Raciocínio Simbólico (SRK), um motor lógico diferenciável que mapeia restrições para uma paisagem de energia contínua.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Ajuste Fino Online de Decision Transformers com Gradientes de RL Puro

Os Decision Transformers (DTs) surgiram como um poderoso framework para tomada de decisão sequencial, formulando o aprendizado por reforço offline (RL) como um problema de modelagem de sequência. No entanto, a extensão dos DTs para configurações online com gradientes de RL puro permanece amplamente inexplorada. Identificamos o relabeling de retorno retrospectivo como um obstáculo crítico para o ajuste fino baseado em RL.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Language as Mathematical Structure: Examining Semantic Field Theory Against Language Games

arXiv:2601.00448v1 Announce Type: new Abstract: Large language models (LLMs) offer a new empirical setting in which long-standing theories of linguistic meaning can be examined. This paper contrasts two broad approaches: social constructivist accounts associated with language games, and a mathematically oriented framework we call Semantic Field Theory. Building on earlier work by the author, we formalize the notions of lexical fields (Lexfelder) and linguistic fields (Lingofelder) as interacting structures in a continuous semantic space. We then analyze how core properties of transformer architectures (such as distributed representations, attention mechanisms, and geometric regularities in embedding spaces) relate to these concepts. We argue that the success of LLMs in capturing semantic regularities supports the view that language exhibits an underlying mathematical structure, while their persistent limitations in pragmatic reasoning and context sensitivity are consistent with the importance of social grounding emphasized in philosophical accounts of language use. On this basis, we suggest that mathematical structure and language games can be understood as complementary rather than competing perspectives. The resulting framework clarifies the scope and limits of purely statistical models of language and motivates new directions for theoretically informed AI architectures.

Fonte: arXiv cs.CL

Theory/Optimization • Score 89

Pesquisa sobre Newsvendor Orientado a Dados: Análise Unificada e Espectro de Arrependimentos Alcançáveis

No problema do Newsvendor, o objetivo é prever a demanda retirada de uma distribuição, com consequências assimétricas para palpites acima ou abaixo do valor realizado. Esta pesquisa analisa variantes do Newsvendor orientado a dados, preenchendo lacunas na literatura e simplificando provas, e mostra que todo o espectro de arrependimentos entre $1/\sqrt{n}$ e $1/n$ pode ser alcançado.
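
A regra orientada a dados clássica em torno da qual essa literatura gira é o quantil empírico no fractil crítico, esboçada abaixo; os custos e a distribuição de demanda são ilustrativos.

```python
import numpy as np

def saa_newsvendor(demand_samples, c_under, c_over):
    """Solução orientada a dados (SAA) do Newsvendor: o pedido ótimo é o
    quantil empírico da demanda no fractil crítico c_u / (c_u + c_o)."""
    fractile = c_under / (c_under + c_over)
    return np.quantile(demand_samples, fractile)

rng = np.random.default_rng(0)
demand = rng.lognormal(mean=3.0, sigma=0.5, size=500)
# falta de estoque custa 4/unidade; excesso custa 1/unidade
print(saa_newsvendor(demand, c_under=4.0, c_over=1.0))  # ~quantil 0.8
```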

Fonte: arXiv stat.ML

Privacy/Security/Fairness • Score 90

Personalização Federada de Grandes Modelos: Abordagens, Experimentos e Insights

Neste artigo, exploramos a personalização federada de grandes modelos e destacamos os principais desafios que isso representa dentro do framework de aprendizado federado. Revisamos várias técnicas populares de personalização de grandes modelos e discutimos como essas técnicas podem ser implementadas no contexto do aprendizado federado.

Fonte: arXiv cs.LG

Vision • Score 95

Compressed Map Priors for 3D Perception

arXiv:2601.00139v1 Announce Type: new Abstract: Human drivers rarely travel where no person has gone before. After all, thousands of drivers use busy city roads every day, and only one can claim to be the first. The same holds for autonomous computer vision systems. The vast majority of the deployment area of an autonomous vision system will have been visited before. Yet, most autonomous vehicle vision systems act as if they are encountering each location for the first time. In this work, we present Compressed Map Priors (CMP), a simple but effective framework to learn spatial priors from historic traversals. The map priors use a binarized hashmap that requires only $32\text{KB}/\text{km}^2$, a $20\times$ reduction compared to the dense storage. Compressed Map Priors easily integrate into leading 3D perception systems at little to no extra computational costs, and lead to a significant and consistent improvement in 3D object detection on the nuScenes dataset across several architectures.
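
For intuition about the storage budget, the sketch below packs one bit per hashed grid cell into a 2^18-bit array, which is exactly 32 KB; the cell size and hashing scheme are assumptions of this sketch, not the paper's layout.

```python
import numpy as np

class BinaryMapPrior:
    """One bit per quantized/hashed map cell: 'has this cell been seen in a
    historic traversal?'. A 2**18-bit array occupies exactly 32 KB, matching
    the per-km^2 budget quoted in the abstract."""
    def __init__(self, n_bits=2**18, cell_m=0.5, seed=0):
        self.bits = np.zeros(n_bits // 8, dtype=np.uint8)
        self.n_bits, self.cell = n_bits, cell_m
        rng = np.random.default_rng(seed)
        self.a = [int(v) for v in rng.integers(1, 2**31, size=2)]

    def _index(self, x, y):
        ix, iy = int(x / self.cell), int(y / self.cell)
        return (ix * self.a[0] + iy * self.a[1]) % self.n_bits

    def write(self, x, y):
        i = self._index(x, y)
        self.bits[i >> 3] |= 1 << (i & 7)

    def query(self, x, y):
        i = self._index(x, y)
        return bool((self.bits[i >> 3] >> (i & 7)) & 1)

prior = BinaryMapPrior()
prior.write(12.3, 45.6)
print(prior.query(12.3, 45.6), prior.query(99.0, 1.0))  # True, False (likely)
```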

Fonte: arXiv cs.CV

Vision • Score 96

Do Barro ao Código: Raciocínio Tipológico e Material nas Interpretações de IA das Torres de Pombos Iranianas

Este estudo investiga como sistemas de IA generativa interpretam a inteligência arquitetônica embutida na forma vernacular. Usando a torre de pombos iraniana como estudo de caso, a pesquisa testa três modelos de difusão: Midjourney v6, DALL-E 3 e DreamStudio baseado em Stable Diffusion XL (SDXL), em três estágios de prompt: referencial, adaptativo e especulativo.

Fonte: arXiv cs.AI

Privacy/Security/Fairness • Score 89

Classificação Ajustada por Incerteza para Precificação de Ativos com Machine Learning

O machine learning é central para a precificação empírica de ativos, mas a construção de portfólios ainda se baseia em previsões pontuais e ignora em grande parte a incerteza de estimativa específica de ativos. Propomos uma mudança simples: classificar ativos usando limites de previsão ajustados por incerteza em vez de apenas previsões pontuais.
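
A mudança proposta cabe em poucas linhas: ordenar pelo limite inferior do intervalo de previsão em vez da previsão pontual. No esboço abaixo, o limite gaussiano $\mu - k\sigma$ e $k = 1$ são suposições ilustrativas.

```python
import numpy as np

def rank_by_lower_bound(mu, sigma, k=1.0):
    """Classifica ativos pelo limite inferior de previsão (mu - k*sigma)
    em vez da previsão pontual mu; retorna índices do melhor ao pior."""
    return np.argsort(-(mu - k * sigma))

mu = np.array([0.08, 0.07, 0.05])      # retornos previstos
sigma = np.array([0.06, 0.01, 0.01])   # incerteza de estimativa por ativo
print(rank_by_lower_bound(mu, sigma, k=0.0))  # só ponto:  [0 1 2]
print(rank_by_lower_bound(mu, sigma, k=1.0))  # ajustado:  [1 2 0]
```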

Fonte: arXiv stat.ML

RL • Score 96

Rumo a uma Teoria Física da Inteligência

Apresentamos uma teoria física da inteligência fundamentada no processamento irreversível de informações em sistemas sujeitos a leis de conservação. Um sistema inteligente é modelado como um processo acoplado agente-ambiente, cuja evolução transforma informações em trabalho direcionado a objetivos. Introduzimos o framework Conservation-Congruent Encoding (CCE) para conectar informações ao estado físico.

Fonte: arXiv cs.AI

RL • Score 92

Efeitos da Alocação Estrutural da Diversidade de Tarefas Geométricas em Modelos Lineares de Meta-Aprendizado

O meta-aprendizado busca aproveitar informações de tarefas relacionadas para melhorar a previsão em dados não rotulados para novas tarefas com um número limitado de observações rotuladas. Embora a diversidade de tarefas seja considerada benéfica, estudos recentes mostram que ela pode degradar o desempenho de previsão em meta-aprendizado, dependendo da alocação da variabilidade geométrica das tarefas.

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining

arXiv:2601.00364v1 Announce Type: new Abstract: Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, details of its contributions remain unclear. We investigate this question by pretraining models from scratch under controlled conditions, comparing the standard web corpus with a monolingual-only version that removes all multilingual documents. Despite constituting only 2% of the corpus, removing bilingual data causes translation performance to drop 56% in BLEU, while behaviour on cross-lingual QA and general reasoning tasks remains stable, with training curves largely overlapping the baseline. To understand this asymmetry, we categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous documents (14%) based on the semantic relevance of content in different languages. We then conduct granular ablations by reintroducing parallel or code-switching data into the monolingual-only corpus. Our experiments reveal that parallel data almost fully restores translation performance (91% of the unfiltered baseline), whereas code-switching contributes minimally. Other cross-lingual tasks remain largely unaffected by either type. These findings reveal that translation critically depends on systematic token-level alignments from parallel data, whereas cross-lingual understanding and reasoning appear to be achievable even without bilingual data.

Fonte: arXiv cs.CL

RL • Score 95

Integração de Multi-Armed Bandit, Aprendizado Ativo e Computação Distribuída para Otimização Escalável

Problemas modernos de otimização em domínios científicos e de engenharia frequentemente dependem de avaliações black-box caras. Propomos o ALMAB-DC, um framework modular e unificado para otimização black-box escalável que integra aprendizado ativo, multi-armed bandits e computação distribuída, com aceleração opcional por GPU. Resultados empíricos mostram que ALMAB-DC supera consistentemente otimizadores black-box de última geração.

Fonte: arXiv stat.ML

Vision • Score 95

Estimativa de densidade espectral de séries temporais funcionais em grandes domínios usando deep learning

Derivamos um estimador da densidade espectral de uma série temporal funcional que é a saída de uma rede neural multilayer perceptron. O estimador é motivado por dificuldades na computação de estimadores de densidade espectral existentes para séries temporais de funções definidas em grades muito grandes, como em modelos climáticos e exames médicos.

Fonte: arXiv stat.ML

NLP/LLMs • Score 96

Pergunte, Esclareça, Otimize: Colaboração Humano-LLM para um Controle de Estoque Mais Inteligente

A gestão de estoque continua sendo um desafio para muitas pequenas e médias empresas que carecem de expertise para implementar métodos avançados de otimização. Este artigo investiga se os Large Language Models (LLMs) podem ajudar a preencher essa lacuna, propondo um framework híbrido que separa rigorosamente o raciocínio semântico do cálculo matemático.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Universos Paralelos, Linguagens Paralelas: Um Estudo Abrangente sobre Geração de Exemplos Contrafactuais Multilíngues Baseados em LLM

Os contrafactuais referem-se a entradas minimamente editadas que alteram a previsão de um modelo, servindo como uma abordagem promissora para explicar o comportamento do modelo. Este estudo investiga a eficácia de LLMs na geração de contrafactuais multilíngues, revelando padrões de edição e tipos de erros comuns, além de comparar a eficácia de abordagens baseadas em tradução.

Fonte: arXiv cs.AI

Theory/Optimization • Score 92

Rede Neural de Entrada Esparsa usando Regularização Côncava em Grupo

Neste artigo, investigamos o problema da seleção de características em redes neurais, propondo um framework de redes neurais de entrada esparsa com regularização côncava em grupo. Este método visa evitar a seleção de variáveis irrelevantes, utilizando uma penalização côncava adequada na norma $l_2$ dos pesos, resultando em um modelo que utiliza apenas um subconjunto reduzido das variáveis originais.
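
Um esboço em PyTorch da penalização em grupo sobre os pesos de entrada: cada coluna da primeira camada (uma variável de entrada) forma um grupo, e uma penalidade côncava na norma $l_2$ do grupo zera colunas inteiras. O uso específico de MCP abaixo é uma escolha ilustrativa, não necessariamente a penalidade do artigo.

```python
import torch

def mcp(t, lam=0.1, gamma=3.0):
    """Penalidade MCP (minimax concave penalty) avaliada em t >= 0."""
    return torch.where(t <= gamma * lam,
                       lam * t - t**2 / (2 * gamma),
                       torch.full_like(t, 0.5 * gamma * lam**2))

def group_concave_penalty(first_layer_weight, lam=0.1, gamma=3.0):
    """Penalidade côncava em grupo: um grupo por variável de entrada
    (coluna da matriz de pesos da primeira camada)."""
    group_norms = first_layer_weight.norm(dim=0)   # (d_in,)
    return mcp(group_norms, lam, gamma).sum()

net = torch.nn.Sequential(torch.nn.Linear(20, 16), torch.nn.ReLU(),
                          torch.nn.Linear(16, 1))
x, y = torch.randn(64, 20), torch.randn(64, 1)
loss = torch.nn.functional.mse_loss(net(x), y) \
       + group_concave_penalty(net[0].weight)
loss.backward()
print(float(loss))
```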

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

Detecção de Confundidores Não Observados: Uma Abordagem de Regressão Kernelizada

Detectar confundidores não observados é crucial para inferência causal confiável em estudos observacionais. Métodos existentes exigem suposições de linearidade ou múltiplos ambientes heterogêneos, limitando a aplicabilidade em configurações não lineares de ambiente único. Propomos a Detecção de Confundidores por Regressão Kernelizada (KRCD), um método inovador que utiliza espaços de Hilbert com kernel reprodutor para modelar dependências complexas.

Fonte: arXiv stat.ML

NLP/LLMs • Score 96

Framework Auto-reparador Agente Bio-inspirado para Sistemas de Computação Distribuída Resilientes

Este artigo apresenta o ReCiSt, um framework auto-reparador bio-inspirado projetado para alcançar resiliência em Sistemas de Computação Distribuída (DCCS). O ReCiSt reconstrói fases biológicas em camadas computacionais que realizam isolamento autônomo de falhas, diagnóstico causal, recuperação adaptativa e consolidação de conhecimento a partir de agentes impulsionados por Language Model (LM).

Fonte: arXiv cs.AI

MLOps/Systems • Score 96

Avatar Forcing: Geração Interativa de Avatares de Cabeça em Tempo Real para Conversação Natural

A geração de cabeças falantes cria avatares realistas a partir de retratos estáticos para comunicação virtual e criação de conteúdo. No entanto, modelos atuais não transmitem a sensação de comunicação verdadeiramente interativa, gerando respostas unidimensionais que carecem de engajamento emocional. Propomos o Avatar Forcing, um novo framework para geração de avatares que modela interações em tempo real entre usuários e avatares por meio de diffusion forcing.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Toward Better Temporal Structures for Geopolitical Events Forecasting

arXiv:2601.00430v1 Announce Type: new Abstract: Forecasting on geopolitical temporal knowledge graphs (TKGs) through the lens of large language models (LLMs) has recently gained traction. While TKGs and their generalization, hyper-relational temporal knowledge graphs (HTKGs), offer a straightforward structure to represent simple temporal relationships, they lack the expressive power to convey complex facts efficiently. One of the critical limitations of HTKGs is a lack of support for more than two primary entities in temporal facts, which commonly occur in real-world events. To address this limitation, in this work, we study a generalization of HTKGs, Hyper-Relational Temporal Knowledge Generalized Hypergraphs (HTKGHs). We first derive a formalization for HTKGHs, demonstrating their backward compatibility while supporting two complex types of facts commonly found in geopolitical incidents. Then, utilizing this formalization, we introduce the htkgh-polecat dataset, built upon the global event database POLECAT. Finally, we benchmark and analyze popular LLMs on the relation prediction task, providing insights into their adaptability and capabilities in complex forecasting scenarios.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark

arXiv:2601.00216v1 Announce Type: new Abstract: In medicine, large language models (LLMs) increasingly rely on retrieval-augmented generation (RAG) to ground outputs in up-to-date external evidence. However, current RAG approaches focus primarily on performance improvements while overlooking evidence-based medicine (EBM) principles. This study addresses two key gaps: (1) the lack of PICO alignment between queries and retrieved evidence, and (2) the absence of evidence hierarchy considerations during reranking. We present a generalizable strategy for adapting EBM to graph-based RAG, integrating the PICO framework into knowledge graph construction and retrieval, and proposing a Bayesian-inspired reranking algorithm to calibrate ranking scores by evidence grade without introducing predefined weights. We validated this framework in sports rehabilitation, a literature-rich domain currently lacking RAG systems and benchmarks. We released a knowledge graph (357,844 nodes and 371,226 edges) and a reusable benchmark of 1,637 QA pairs. The system achieved 0.830 nugget coverage, 0.819 answer faithfulness, 0.882 semantic similarity, and 0.788 PICOT match accuracy. In a 5-point Likert evaluation, five expert clinicians rated the system 4.66-4.84 across factual accuracy, faithfulness, relevance, safety, and PICO alignment. These findings demonstrate that the proposed EBM adaptation strategy improves retrieval and answer quality and is transferable to other clinical domains. The released resources also help address the scarcity of RAG datasets in sports rehabilitation.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Agentes Potencializados por LLMs Tendem a Ter Viés Contra Humanos? Explorando a Vulnerabilidade Dependente da Crença

Agentes potencializados por LLMs podem apresentar não apenas viés demográfico, mas também viés intergrupal desencadeado por pistas mínimas de 'nós' versus 'eles'. Este estudo investiga como a crença de um agente sobre a presença de humanos pode influenciar seu comportamento, introduzindo um novo vetor de ataque chamado Belief Poisoning Attack (BPA).

Fonte: arXiv cs.AI

RL • Score 96

Métodos Semânticos Podem Aprimorar Táticas em Esportes Coletivos? Uma Metodologia para Futebol com Aplicações Mais Amplas

Este artigo explora como o raciocínio em espaço semântico, tradicionalmente utilizado em linguística computacional, pode ser estendido à tomada de decisão tática em esportes coletivos. A metodologia proposta modela configurações táticas como estruturas semânticas composicionais, representando cada jogador como um vetor multidimensional que integra atributos técnicos, físicos e psicológicos.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

HFedMoE: Aprendizado Federado Heterogêneo Consciente de Recursos com Mixture-of-Experts

Embora o aprendizado federado (FL) permita o ajuste fino de grandes modelos de linguagem (LLMs) sem comprometer a privacidade dos dados, o tamanho substancial de um LLM torna o treinamento em dispositivos impraticável para clientes com recursos limitados, como dispositivos móveis. Modelos Mixture-of-Experts (MoE) surgiram como uma solução eficiente em termos de computação, ativando apenas um subconjunto esparso de especialistas durante o treinamento do modelo.

Fonte: arXiv cs.LG

Vision • Score 95

It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models

arXiv:2601.00090v1 Announce Type: new Abstract: Contemporary text-to-image models exhibit a surprising degree of mode collapse, as can be seen when sampling several images given the same text prompt. While previous work has attempted to address this issue by steering the model using guidance mechanisms, or by generating a large pool of candidates and refining them, in this work we take a different direction and aim for diversity in generations via noise optimization. Specifically, we show that a simple noise optimization objective can mitigate mode collapse while preserving the fidelity of the base model. We also analyze the frequency characteristics of the noise and show that alternative noise initializations with different frequency profiles can improve both optimization and search. Our experiments demonstrate that noise optimization yields superior results in terms of generation quality and variety.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Do LLMs Judge Distantly Supervised Named Entity Labels Well? Constructing the JudgeWEL Dataset

arXiv:2601.00411v1 Announce Type: new Abstract: We present judgeWEL, a dataset for named entity recognition (NER) in Luxembourgish, automatically labelled and subsequently verified using large language models (LLM) in a novel pipeline. Building datasets for under-represented languages remains one of the major bottlenecks in natural language processing, where the scarcity of resources and linguistic particularities make large-scale annotation costly and potentially inconsistent. To address these challenges, we propose and evaluate a novel approach that leverages Wikipedia and Wikidata as structured sources of weak supervision. By exploiting internal links within Wikipedia articles, we infer entity types based on their corresponding Wikidata entries, thereby generating initial annotations with minimal human intervention. Because such links are not uniformly reliable, we mitigate noise by employing and comparing several LLMs to identify and retain only high-quality labelled sentences. The resulting corpus is approximately five times larger than the currently available Luxembourgish NER dataset and offers broader and more balanced coverage across entity categories, providing a substantial new resource for multilingual and low-resource NER research.

Fonte: arXiv cs.CL

RL • Score 96

ClinicalReTrial: Um Agente de IA Autoevolutivo para Otimização de Protocolos de Ensaios Clínicos

O fracasso de ensaios clínicos continua sendo um gargalo central no desenvolvimento de medicamentos, onde pequenas falhas no design do protocolo podem comprometer irreversivelmente os resultados. Este artigo propõe o ClinicalReTrial, um framework de agente de IA autoevolutivo que aborda essa lacuna ao tratar o raciocínio de ensaios clínicos como um problema iterativo de redesign de protocolo.

Fonte: arXiv cs.AI

Theory/Optimization • Score 93

Otimização Bi-objetiva Guiada por Interpretabilidade: Alinhando Precisão e Explicabilidade

Este artigo apresenta a Otimização Bi-objetiva Guiada por Interpretabilidade (IGBO), um framework que treina modelos interpretáveis incorporando conhecimento de domínio estruturado por meio de uma formulação bi-objetiva. O IGBO codifica hierarquias de importância de características como um Grafo Acíclico Direcionado (DAG) e utiliza Gradientes Integrados Temporais (TIG) para medir a importância das características.

Fonte: arXiv cs.LG

NLP/LLMs • Score 90

Um Macaco de IA Consegue Uvas com Certeza -- Redes Neurais Sphere para Tomada de Decisão Confiável

Este artigo compara três categorias metodológicas de raciocínio neural: raciocínio LLM, raciocínio baseado em aprendizado supervisionado e raciocínio baseado em modelo explícito. Mostramos que o raciocínio via aprendizado supervisionado é menos atraente do que o raciocínio por construção de modelo explícito, e propomos uma nova versão de Redes Neurais Sphere que permite uma tomada de decisão confiável.

Fonte: arXiv cs.AI

MLOps/Systems • Score 93

Compreendendo Emoção no Discurso: Insights de Reconhecimento e Padrões Linguísticos para Geração

O reconhecimento de emoções em conversas (ERC) alcançou alta precisão, mas duas lacunas críticas permanecem: uma compreensão limitada das escolhas arquitetônicas que realmente importam e a falta de análise linguística conectando reconhecimento à geração. Abordamos ambas as lacunas por meio de uma análise sistemática do conjunto de dados IEMOCAP.

Fonte: arXiv cs.AI

RL • Score 96

Amostras Adversariais Não São Criadas Iguais

Na última década, diversas teorias foram propostas para explicar a vulnerabilidade generalizada das redes neurais profundas a ataques de evasão adversariais. Este trabalho defende que amostras que utilizam características frágeis, mas preditivas, e aquelas que não utilizam, representam dois tipos de fraquezas adversariais e devem ser diferenciadas na avaliação da robustez adversarial.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Talk Less, Verify More: Improving LLM Assistants with Semantic Checks and Execution Feedback

arXiv:2601.00224v1 Announce Type: new Abstract: As large language model (LLM) assistants become increasingly integrated into enterprise workflows, their ability to generate accurate, semantically aligned, and executable outputs is critical. However, current conversational business analytics (CBA) systems often lack built-in verification mechanisms, leaving users to manually validate potentially flawed results. This paper introduces two complementary verification techniques: Q*, which performs reverse translation and semantic matching between code and user intent, and Feedback+, which incorporates execution feedback to guide code refinement. Embedded within a generator-discriminator framework, these mechanisms shift validation responsibilities from users to the system. Evaluations on three benchmark datasets, Spider, Bird, and GSM8K, demonstrate that both Q* and Feedback+ reduce error rates and task completion time. The study also identifies reverse translation as a key bottleneck, highlighting opportunities for future improvement. Overall, this work contributes a design-oriented framework for building more reliable, enterprise-grade GenAI systems capable of trustworthy decision support.

Fonte: arXiv cs.CL

Vision • Score 90

Produção de Entropia em Machine Learning Sob Fluxo de Probabilidade de Fokker-Planck

Modelos de machine learning implantados em ambientes não estacionários enfrentam degradação de desempenho devido ao data drift. Embora existam muitas heurísticas de detecção de drift, a maioria carece de uma interpretação dinâmica fundamentada e fornece orientações limitadas sobre como equilibrar a frequência de retraining com o custo operacional. Neste trabalho, propomos um framework de retraining baseado em entropia fundamentado em dinâmicas estocásticas fora do equilíbrio.

Fonte: arXiv cs.LG

RL • Score 96

Uma abordagem multi-algoritmo para o balanceamento da carga de trabalho operacional de recursos humanos em um sistema de entrega urbana de última milha

A atribuição eficiente de carga de trabalho à força de trabalho é crucial em sistemas de entrega de pacotes de última milha. Este artigo aborda o problema do balanceamento da carga de trabalho operacional em sistemas de entrega urbana, propondo uma abordagem multi-algoritmo que otimiza o tempo de entrega e garante uma distribuição equilibrada da carga de trabalho entre os trabalhadores.

Fonte: arXiv cs.AI

RL • Score 96

Colocação Ótima de Táxis Consciente do Tráfego Usando Aprendizado por Reforço Baseado em Redes Neurais Gráficas

No contexto do transporte em cidades inteligentes, o emparelhamento eficiente da oferta de táxis com a demanda de passageiros requer a integração em tempo real de dados da rede de tráfego urbano e padrões de mobilidade. Este artigo apresenta um framework de aprendizado por reforço (RL) baseado em grafos para a colocação ótima de táxis em ambientes metropolitanos.

Fonte: arXiv cs.LG

NLP/LLMs • Score 92

Physio-DPO: Aligning Large Language Models with the Protein Energy Landscape to Eliminate Structural Hallucinations

arXiv:2601.00647v1 Announce Type: new Abstract: Large Protein Language Models have shown strong potential for generative protein design, yet they frequently produce structural hallucinations, generating sequences with high linguistic likelihood that fold into thermodynamically unstable conformations. Existing alignment approaches such as Direct Preference Optimization are limited in this setting, as they model preferences as binary labels and ignore the continuous structure of the physical energy landscape. We propose Physio-DPO, a physics-informed alignment framework that grounds protein language models in thermodynamic stability. Physio-DPO introduces a magnitude-aware objective that scales optimization updates according to the energy gap between native structures and physics-perturbed hard negatives. Experiments show that Physio-DPO consistently outperforms strong baselines including SFT, PPO, and standard DPO, reducing self-consistency RMSD to 1.28 Å and increasing foldability to 92.8%. Qualitative analysis further demonstrates that Physio-DPO effectively mitigates structural hallucinations by recovering biophysical interactions such as hydrophobic core packing and hydrogen bond networks.
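
The magnitude-aware idea can be sketched as a DPO loss whose per-pair logit is weighted by the energy gap; the tanh weighting below is an assumption of this sketch, standing in for the paper's exact magnitude-aware objective.

```python
import torch
import torch.nn.functional as F

def physio_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                    energy_gap, beta=0.1, alpha=1.0):
    """Magnitude-aware DPO sketch: the standard DPO logit is weighted by
    a term that grows with the energy gap between the native (preferred)
    and physics-perturbed (rejected) structures.

    logp_*: summed sequence log-probs under the policy; ref_logp_*: under
    the frozen reference model; energy_gap >= 0, one value per pair."""
    logits = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    weight = torch.tanh(alpha * energy_gap)   # larger gap -> stronger push
    return -(weight * F.logsigmoid(beta * logits)).mean()

b = 4
loss = physio_dpo_loss(torch.randn(b), torch.randn(b),
                       torch.randn(b), torch.randn(b),
                       energy_gap=torch.rand(b) * 5)
print(float(loss))
```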

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

RIMRULE: Improving Tool-Using Language Agents via MDL-Guided Rule Learning

arXiv:2601.00086v1 Announce Type: new Abstract: Large language models (LLMs) often struggle to use tools reliably in domain-specific settings, where APIs may be idiosyncratic, under-documented, or tailored to private workflows. This highlights the need for effective adaptation to task-specific tools. We propose RIMRULE, a neuro-symbolic approach for LLM adaptation based on dynamic rule injection. Compact, interpretable rules are distilled from failure traces and injected into the prompt during inference to improve task performance. These rules are proposed by the LLM itself and consolidated using a Minimum Description Length (MDL) objective that favors generality and conciseness. Each rule is stored in both natural language and a structured symbolic form, supporting efficient retrieval at inference time. Experiments on tool-use benchmarks show that this approach improves accuracy on both seen and unseen tools without modifying LLM weights. It outperforms prompting-based adaptation methods and complements finetuning. Moreover, rules learned from one LLM can be reused to improve others, including long reasoning LLMs, highlighting the portability of symbolic knowledge across architectures.

Source: arXiv cs.CL
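
A toy reading of the MDL consolidation step: keep a candidate rule only if it shrinks the total description length, rules plus the failures they leave uncovered. Everything below (rule representation, word-count coding, substring coverage) is an illustrative assumption, not RIMRULE's actual scoring:

```python
def description_length(rules, failure_traces):
    """Two-part MDL score: cost of the rules plus cost of failures
    the rules do not cover."""
    rule_cost = sum(len(r.split()) for r in rules)           # L(rules)
    uncovered = [t for t in failure_traces
                 if not any(r in t for r in rules)]          # naive coverage
    data_cost = sum(len(t.split()) for t in uncovered)       # L(data | rules)
    return rule_cost + data_cost

def consolidate(candidate_rules, failure_traces):
    """Greedily keep a rule only if it lowers the MDL score, which
    favors general, concise rules over memorized specifics."""
    kept = []
    for rule in candidate_rules:
        if description_length(kept + [rule], failure_traces) < \
           description_length(kept, failure_traces):
            kept.append(rule)
    return kept

failures = ["call api.search without auth token",
            "call api.search with bad page size"]
print(consolidate(["api.search", "auth token"], failures))  # ['api.search']
```

Note how the second, more specific rule is rejected: it adds coding cost without covering any new failure, which is the generality/conciseness pressure the MDL objective is meant to exert.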

NLP/LLMs • Score 95

Knowledge Distillation for Temporal Knowledge Graph Reasoning with Large Language Models

arXiv:2601.00202v1 Announce Type: new Abstract: Reasoning over temporal knowledge graphs (TKGs) is fundamental to improving the efficiency and reliability of intelligent decision-making systems and has become a key technological foundation for future artificial intelligence applications. Despite recent progress, existing TKG reasoning models typically rely on large parameter sizes and intensive computation, leading to high hardware costs and energy consumption. These constraints hinder their deployment on resource-constrained, low-power, and distributed platforms that require real-time inference. Moreover, most existing model compression and distillation techniques are designed for static knowledge graphs and fail to adequately capture the temporal dependencies inherent in TKGs, often resulting in degraded reasoning performance. To address these challenges, we propose a distillation framework specifically tailored for temporal knowledge graph reasoning. Our approach leverages large language models as teacher models to guide the distillation process, enabling effective transfer of both structural and temporal reasoning capabilities to lightweight student models. By integrating large-scale public knowledge with task-specific temporal information, the proposed framework enhances the student model's ability to model temporal dynamics while maintaining a compact and efficient architecture. Extensive experiments on multiple publicly available benchmark datasets demonstrate that our method consistently outperforms strong baselines, achieving a favorable trade-off between reasoning accuracy, computational efficiency, and practical deployability.

Source: arXiv cs.CL

NLP/LLMs • Score 96

Do LLM Chatbots Talk Too Much? The YapBench Benchmark

Large Language Models (LLMs) such as ChatGPT, Claude, and Gemini increasingly act as general-purpose copilots, yet they often respond at excessive length to simple requests, increasing cognitive load and inflating token-based inference cost. We present YapBench, a lightweight benchmark for quantifying user-visible overgeneration on prompts whose ideal answers are brief.

Source: arXiv cs.LG

Theory/Optimization • Score 89

From Continual Learning to SGD and Back: Better Rates for Continual Linear Models

We analyze the common continual-learning setting in which an overparameterized model is fitted sequentially to a set of jointly realizable tasks. We prove that fitting a task is equivalent to a single step of stochastic gradient descent (SGD) on a modified objective, and we establish new universal forgetting rates.

Source: arXiv stat.ML
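
For intuition, here is the standard overparameterized linear picture (a sketch of the setup, not the paper's exact statement): fitting task $t$ with data $(X_t, y_t)$ via minimum-norm interpolation from the previous weights gives the closed form

```latex
w_{t+1} \;=\; \arg\min_{w} \lVert w - w_t \rVert_2 \ \text{ s.t. }\ X_t w = y_t
\;\;\Longrightarrow\;\;
w_{t+1} \;=\; w_t + X_t^{\dagger}\,(y_t - X_t w_t),
```

a single projection-type step on the least-squares objective $\tfrac12\lVert X_t w - y_t\rVert_2^2$, which is the kind of task-fitting/SGD-step equivalence that lets SGD rate analyses transfer to continual learning.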

Evaluation/Benchmarks • Score 90

Cycling Race Time Prediction: A Personalized Machine Learning Approach Using Route Topology and Training Load

Predicting ride duration for a specific route is essential for training planning and event preparation. This work presents a machine learning approach that predicts ride duration from route-topology features combined with the athlete's current physical state, derived from training-load metrics.

Source: arXiv cs.LG

NLP/LLMs • Score 96

JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation

We present JP-TL-Bench, a lightweight, open benchmark designed to guide the iterative development of Japanese-English translation systems. The hard question is often "which of these two good translations is better?" rather than "is this translation acceptable?". This distinction is crucial for Japanese-English, where subtle choices in politeness, implicature, ellipsis, and register strongly affect perceived naturalness.

Source: arXiv cs.AI

Evaluation/Benchmarks • Score 96

A Sparse-Attention Deep Learning Model Integrating Heterogeneous Multimodal Features for Parkinson's Disease Severity Profiling

Characterizing the heterogeneous presentation of Parkinson's disease (PD) requires integrating biological and clinical markers within a unified predictive framework. We propose the Class-Weighted Sparse-Attention Fusion Network (SAFN), an interpretable deep learning framework for robust multimodal profiling that addresses limitations in interpretability and class imbalance.

Source: arXiv cs.LG

NLP/LLMs • Score 96

Trajectory Guard -- A Lightweight, Sequence-Aware Model for Real-Time Anomaly Detection in Agentic AI

Autonomous LLM agents generate multi-step action plans that can fail through contextual misalignment or structural incoherence. Existing anomaly-detection methods are ill-suited to this challenge. We present Trajectory Guard, a Siamese recurrent autoencoder that learns task-trajectory alignment, enabling unified detection of both incorrect plans and malformed plan structures.

Source: arXiv cs.LG

RL • Score 90

Geometric Regularization in Mixture-of-Experts: The Disconnect Between Weights and Activations

Mixture-of-Experts (MoE) models achieve efficiency through sparse activation, but the role of geometric regularization in expert specialization remains unclear. We apply an orthogonality loss to enforce diversity between experts and find that it fails in several respects, including increased weight-space overlap and inconsistent performance results.

Source: arXiv cs.LG

Applications • Score 90

Deep Networks Learn Deep Hierarchical Models

We consider supervised learning with $n$ labels and show that layerwise SGD on residual networks can efficiently learn a class of hierarchical models. This model class subsumes those previously shown to be learnable by deep learning algorithms, reaching the depth limit of efficient learnability.

Source: arXiv cs.LG

NLP/LLMs • Score 96

Neural Chains and Discrete Dynamical Systems

In this work we examine the analogy between machine learning (ML) applications based on the transformer architecture without self-attention, termed neural chains, and discrete dynamical systems associated with discretized versions of neural integral and partial differential equations (NIEs, PDEs). We present a comparative analysis of the numerical solution of the Burgers and Eikonal equations via standard numerical discretization and PINN learning.

Source: arXiv cs.LG

Vision • Score 93

Intelligent Fault Detection in the Electrical Power System of Nanosatellites

This paper presents a new fault-detection method for the electrical power system of nanosatellites operating without an Attitude Determination and Control Subsystem (ADCS) in LEO orbit. The fault-free system is simulated with a neural network, using solar radiation and solar-panel surface temperature as input data.

Source: arXiv cs.LG

Vision • Score 95

Sparse Additive Contextual Bandits: A Nonparametric Approach to Online Decision-Making with High-Dimensional Covariates

Personalized services are central to today's digital economy, and their sequential decisions are often modeled as contextual bandits. Modern applications face two key challenges: high-dimensional covariates and the need for nonparametric models to capture complex reward-covariate relationships. We propose a contextual bandit algorithm based on a sparse additive reward model that addresses both challenges.

Source: arXiv stat.ML

Evaluation/Benchmarks • Score 89

MCD: Marginal Contrastive Discrimination for Conditional Density Estimation

We consider the problem of conditional density estimation, a topic of broad interest in statistics and machine learning. Our method, called Marginal Contrastive Discrimination (MCD), rewrites the conditional density as the product of two factors: the marginal density of the target variable and a ratio of density functions that can be estimated through binary classification.

Source: arXiv stat.ML
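
The factorization behind this family of estimators can be written out explicitly; the classifier link below is the standard density-ratio identity (stated from general principles rather than from the paper):

```latex
p(y \mid x) \;=\; p(y)\, r(x,y),
\qquad
r(x,y) \;=\; \frac{p(x,y)}{p(x)\,p(y)} \;=\; \frac{d^{*}(x,y)}{1 - d^{*}(x,y)},
```

where $d^{*}$ is the Bayes-optimal binary classifier separating samples of the joint $p(x,y)$ from samples of the product of marginals $p(x)\,p(y)$, drawn in equal proportion.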

RL • Score 96

Quantum King-Ring Domination in Chess: A QAOA Approach

The Quantum Approximate Optimization Algorithm (QAOA) is usually tested on synthetic random instances, which lack semantic structure and human interpretability. We present Quantum King-Ring Domination (QKRD), a NISQ-scale benchmark derived from tactical chess positions, offering 5,000 structured instances. Using QKRD, we evaluate QAOA design choices and show that problem-informed techniques reveal advantages hidden on random instances.

Source: arXiv cs.LG

RL • Score 96

Sequential Reservoir Computing for Efficient High-Dimensional Spatiotemporal Forecasting

Forecasting high-dimensional spatiotemporal systems remains a computational challenge for recurrent neural networks (RNNs) and long short-term memory (LSTM) models. We introduce a Sequential Reservoir Computing (Sequential RC) architecture that decomposes one large reservoir into a series of smaller, interconnected reservoirs, improving efficiency and reducing computational cost.

Source: arXiv cs.LG
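
A rough illustration of the idea (a minimal echo-state sketch assuming plain tanh reservoirs, with each sub-reservoir's state feeding the next; sizes and scalings are arbitrary, and the paper's coupling scheme may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_reservoir(n_in, n_res, spectral_radius=0.9):
    W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
    W = rng.standard_normal((n_res, n_res))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))  # echo-state scaling
    return W_in, W

def run_chain(u_seq, reservoirs):
    """Drive a chain of small reservoirs: stage k is fed by stage k-1."""
    states = [np.zeros(W.shape[0]) for _, W in reservoirs]
    outputs = []
    for u in u_seq:
        signal = u
        for k, (W_in, W) in enumerate(reservoirs):
            states[k] = np.tanh(W_in @ signal + W @ states[k])
            signal = states[k]                   # cascade into the next stage
        outputs.append(np.concatenate(states))   # readout sees all stages
    return np.array(outputs)

chain = [make_reservoir(3, 50)] + [make_reservoir(50, 50) for _ in range(2)]
X = run_chain(rng.standard_normal((100, 3)), chain)   # (100, 150) features
```

A linear (e.g. ridge) readout would then be fitted on the concatenated states `X`, as in standard reservoir computing.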

RL • Score 96

A Comparative Analysis of Interpretable Machine Learning Methods

In recent years, machine learning (ML) has been widely adopted across sectors, including critical areas such as healthcare, finance, and law. This growing reliance has raised concerns about model interpretability and accountability, especially as legal and regulatory constraints are imposed on the use of black-box models. This study presents a comparative evaluation of 16 inherently interpretable methods across 216 real-world tabular datasets.

Source: arXiv cs.LG

Vision • Score 96

Unknown-Aware Attribution of AI-Generated Content

The rapid advance of photorealistic generative models has made attributing the origin of synthetic content crucial, moving from binary real-or-fake detection to identifying the specific model that produced an image. This study investigates distinguishing the outputs of a target generator (e.g., OpenAI DALL·E 3) from other sources.

Source: arXiv cs.LG

RL • Score 96

Can Optimal Transport Improve Federated Inverse Reinforcement Learning?

In this paper we introduce an optimal-transport-based approach to federated inverse reinforcement learning (IRL). Each client runs Maximum-Entropy IRL locally, respecting its computational and privacy constraints. The resulting reward functions are fused via a Wasserstein barycenter, which respects their underlying geometric structure. This work offers a communication-efficient framework for deriving a shared reward that generalizes across heterogeneous agents and environments.

Source: arXiv cs.LG
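
For one-dimensional reward distributions the Wasserstein barycenter has a closed form, the (weighted) average of quantile functions, which makes the fusion step easy to picture. A sketch of that special case (the paper's construction is more general):

```python
import numpy as np

def wasserstein_barycenter_1d(reward_samples, weights=None, grid=256):
    """W2 barycenter of 1-D empirical distributions via quantile averaging.

    reward_samples : list of 1-D arrays, one per client (local reward draws).
    """
    qs = np.linspace(0.0, 1.0, grid)
    quantiles = np.stack([np.quantile(s, qs) for s in reward_samples])
    if weights is None:
        weights = np.full(len(reward_samples), 1.0 / len(reward_samples))
    # Averaging quantile functions is exactly the W2 barycenter in 1-D.
    return quantiles.T @ weights    # barycenter's quantile function on `qs`

clients = [np.random.normal(m, 1.0, 500) for m in (-2.0, 0.0, 3.0)]
bary_q = wasserstein_barycenter_1d(clients)   # fused reward distribution
```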

Vision • Score 96

Optimized Hybrid Feature Engineering for Resource-Efficient Arrhythmia Detection in ECG Signals: An Optimization Framework

Cardiovascular diseases, arrhythmias in particular, remain a leading cause of global mortality, demanding continuous monitoring via the Internet of Medical Things (IoMT). This study proposes a data-centric, resource-efficient framework that prioritizes feature engineering over model complexity, achieving high diagnostic accuracy with a lightweight model.

Source: arXiv cs.LG

RL • Score 90

An Agentic Framework for Neuro-Symbolic Programming

Integrating symbolic constraints into deep learning models can make them more robust, interpretable, and data-efficient. However, doing so remains challenging and time-consuming. We propose AgenticDomiKnowS (ADS) to remove this bottleneck, enabling users to rapidly construct neuro-symbolic programs.

Source: arXiv cs.AI

Evaluation/Benchmarks • Score 96

Robust Graph Fine-Tuning with Adversarial Graph Prompting

Parameter-Efficient Fine-Tuning (PEFT) has emerged as a dominant paradigm for adapting pre-trained GNN models to specific tasks. However, existing PEFT methods are often markedly vulnerable to noise and attacks on graph topology and node attributes. We propose integrating adversarial learning into graph prompting, developing a novel Adversarial Graph Prompting (AGP) framework to achieve robust fine-tuning.

Source: arXiv cs.LG

RL • Score 96

Early Prediction of Liver Cirrhosis Up to Three Years in Advance: A Machine Learning Study Compared with FIB-4

Objective: To develop and evaluate machine learning (ML) models that predict incident liver cirrhosis one, two, and three years before diagnosis using routinely collected electronic health record (EHR) data, and to compare their performance with the FIB-4 score. Methods: We conducted a retrospective cohort study using de-identified EHR data from a large academic health system.

Source: arXiv cs.LG
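
For readers unfamiliar with the baseline: FIB-4 is a simple fibrosis index computed from routine labs. The formula below is the standard published definition; the cutoffs in the comment are commonly cited screening thresholds, included for context rather than taken from this paper:

```python
import math

def fib4(age_years, ast_u_per_l, alt_u_per_l, platelets_10e9_per_l):
    """FIB-4 = (age * AST) / (platelets * sqrt(ALT)).

    Common interpretation: < 1.30 suggests low risk of advanced
    fibrosis, > 2.67 high risk (thresholds vary by population).
    """
    return (age_years * ast_u_per_l) / (platelets_10e9_per_l
                                        * math.sqrt(alt_u_per_l))

print(round(fib4(61, 58, 40, 180), 2))   # 3.11 -> elevated
```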

RL • Score 90

Laplacian Kernelized Bandits

We study multi-user contextual bandits where users are related by a graph and their reward functions exhibit nonlinear behavior and graph homophily. We introduce a principled joint penalty on the collection of user reward functions, combining a graph-smoothness term with an individual roughness penalty.

Source: arXiv cs.LG
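
A penalty of the kind described, written out for RKHS-valued per-user reward functions $f_u$ (a sketch from the abstract's description; the exact weighting in the paper may differ):

```latex
\Omega(f_1,\dots,f_U) \;=\;
\lambda \sum_{(u,v)\in E} \lVert f_u - f_v \rVert_{\mathcal{H}}^{2}
\;+\; \mu \sum_{u=1}^{U} \lVert f_u \rVert_{\mathcal{H}}^{2},
```

where the first term ties graph neighbors together (homophily) and the second controls each user's individual roughness.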

RL • Score 93

Exploration at the Limits

In fixed-confidence best-arm identification (BAI), the goal is to identify the optimal option quickly while keeping the error probability below a desired threshold. We introduce a relaxed formulation that requires asymptotically valid error control relative to a minimum sample size, allowing a better fit to real-world scenarios.

Source: arXiv cs.LG

Vision • Score 96

IMBWatch -- a Spatio-Temporal Graph Neural Network Approach to Detecting Illicit Massage Businesses

Illicit Massage Businesses (IMBs) are a covert, persistent form of organized exploitation operating behind the facade of legitimate wellness services. Detecting IMBs is difficult because of coded digital advertising and frequent changes of staff and location. We present IMBWatch, a spatio-temporal graph neural network (ST-GNN) framework for large-scale IMB detection that combines graph convolution operations with temporal attention mechanisms.

Source: arXiv cs.LG

RL • Score 93

Reinforcement-Learned Unequal Error Protection for Quantized Semantic Embeddings

This paper addresses the pressing challenge of preserving semantic meaning in bandwidth-limited communication systems. We introduce a novel reinforcement learning framework that achieves per-dimension unequal protection via adaptive repetition coding, using a composite semantic-distortion metric that balances global embedding similarity with entity-level preservation.

Source: arXiv cs.LG
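
The core mechanism, spending more repetitions on embedding dimensions whose corruption hurts semantics most, can be sketched with a simple greedy allocator. This illustrates unequal repetition coding in general; the paper learns the allocation policy with RL rather than fixing it greedily:

```python
import numpy as np

def allocate_repetitions(importance, total_budget, min_rep=1):
    """Greedy unequal error protection: each extra repetition goes to the
    dimension with the highest remaining importance-per-repetition."""
    d = len(importance)
    reps = np.full(d, min_rep)
    for _ in range(total_budget - min_rep * d):
        gain = importance / reps       # diminishing-returns heuristic
        reps[np.argmax(gain)] += 1
    return reps

importance = np.array([5.0, 1.0, 0.5, 3.0])   # e.g. learned sensitivities
print(allocate_repetitions(importance, total_budget=12))   # [6 2 1 3]
```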

NLP/LLMs • Score 96

Reinforcement Learning with Function Approximation for Non-Markovian Processes

We study reinforcement learning methods with linear function approximation under non-Markovian state and cost processes. We first consider policy evaluation and show that the algorithm converges under suitable ergodicity conditions. Moreover, we show that the limit corresponds to the fixed point of a joint operator composed of an orthogonal projection and the Bellman operator of an auxiliary Markov decision process.

Source: arXiv cs.LG
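
In symbols, this is the classic projected-Bellman characterization, written here in its standard TD form (the paper's auxiliary-MDP construction refines what $T_\pi$ means):

```latex
\Phi\,\theta^{*} \;=\; \Pi\, T_{\pi}\!\bigl(\Phi\,\theta^{*}\bigr),
```

where $\Phi$ stacks the features, $\Pi$ is the orthogonal projection onto $\mathrm{span}(\Phi)$ under the process's stationary distribution, and $T_\pi$ is the Bellman operator of the auxiliary Markov decision process.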

RL • Score 96

Yahtzee: Reinforcement Learning Techniques for Stochastic Combinatorial Games

Yahtzee is a classic dice game with a stochastic, combinatorial structure and delayed rewards, making it an interesting mid-scale RL benchmark. This work formulates Yahtzee as a Markov Decision Process (MDP) and trains self-play agents using several policy-gradient methods.

Source: arXiv cs.AI
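
A minimal policy-gradient loop of the kind such work builds on: REINFORCE on a stub environment. The `DiceEnv` below is a toy single-reroll stand-in, not the full Yahtzee MDP, and the baseline constant is just the expected pip sum of five dice:

```python
import torch
import torch.nn as nn

class DiceEnv:
    """Toy stand-in: choose which of 5 dice to keep before one reroll;
    reward is the final pip sum."""
    def reset(self):
        self.dice = torch.randint(1, 7, (5,))
        return self.dice.float()
    def step(self, keep_mask):
        reroll = torch.randint(1, 7, (5,))
        self.dice = torch.where(keep_mask.bool(), self.dice, reroll)
        return self.dice.sum().item()

policy = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 5))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

env = DiceEnv()
for episode in range(2000):
    s = env.reset()
    probs = torch.sigmoid(policy(s / 6.0))    # per-die keep probability
    keep = torch.bernoulli(probs)             # sample the action
    logp = (keep * probs.clamp_min(1e-6).log()
            + (1 - keep) * (1 - probs).clamp_min(1e-6).log()).sum()
    reward = env.step(keep)
    loss = -(reward - 17.5) * logp            # REINFORCE with a baseline
    opt.zero_grad(); loss.backward(); opt.step()
```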

NLP/LLMs • Score 95

Robust Uncertainty Quantification for Factual Generation of Large Language Models

arXiv:2601.00348v1 Announce Type: new Abstract: The rapid advancement of large language model (LLM) technology has facilitated its integration into various domains of professional and daily life. However, the persistent challenge of LLM hallucination has emerged as a critical limitation, significantly compromising the reliability and trustworthiness of AI-generated content. This challenge has garnered significant attention within the scientific community, prompting extensive research efforts in hallucination detection and mitigation strategies. Current methodological frameworks reveal a critical limitation: traditional uncertainty quantification approaches demonstrate effectiveness primarily within conventional question-answering paradigms, yet exhibit notable deficiencies when confronted with non-canonical or adversarial questioning strategies. This performance gap raises substantial concerns regarding the dependability of LLM responses in real-world applications requiring robust critical thinking capabilities. This study aims to fill this gap by proposing an uncertainty quantification scenario in the task of generating with multiple facts. We have meticulously constructed a set of trap questions containing fake names. Based on this scenario, we propose a novel and robust uncertainty quantification method (RU). A series of experiments have been conducted to verify its effectiveness. The results show that the constructed set of trap questions performs excellently. Moreover, when compared with the baseline methods on four different models, our proposed method has demonstrated strong performance, with an average increase of 0.1-0.2 in ROCAUC values over the best-performing baseline, providing new insights and methods for addressing the hallucination issue of LLMs.

Source: arXiv cs.CL

Vision • Score 96

Multi-Agent Reinforcement Learning for Liquidity Games

This work explores the use of swarm methods to model financial-market liquidity, uniting Liquidity Games and Rational Swarms. The research proposes a theoretical model in which independent agents maximize market liquidity without the need for coordination, contributing to market efficiency and individual profitability.

Source: arXiv cs.AI

NLP/LLMs • Score 96

A Dynamic Bayesian Optimization Framework for Instruction Tuning in Partial Differential Equation Discovery

Large Language Models (LLMs) show promise for equation discovery, but their outputs are highly sensitive to prompt wording, a phenomenon we call instruction brittleness. To address this, we propose NeuroSymBO, which recasts prompt engineering as a sequential decision problem.

Source: arXiv cs.LG

RL • Score 92

Active Learning for Data-Driven Reduced-Order Models of Parametric Differential Systems with Bayesian Operator Inference

This work develops an active learning framework for intelligently enriching reduced-order models (ROMs) of parametric dynamical systems, which can serve as the basis for virtual assets in a digital twin. Data-driven ROMs are explainable, computationally efficient scientific machine learning models that aim to preserve the physics underlying complex dynamical simulations.

Source: arXiv stat.ML

Evaluation/Benchmarks • Score 95

Categorical Reparameterization with Denoising Diffusion Models

Gradient-based optimization with categorical variables typically relies on score-function estimators, which are unbiased but noisy, or on continuous relaxations that replace the discrete distribution with a smooth surrogate. In this paper we extend this family of relaxations by introducing a diffusion-based smooth reparameterization of categorical distributions, enabling a training-free diffusion sampler.

Source: arXiv stat.ML
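
For context, the continuous relaxation the abstract alludes to is typified by the Gumbel-softmax estimator, shown here as the standard baseline (this is the classic relaxation, not the paper's diffusion-based one):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10, requires_grad=True)   # 10-way categorical

# Relaxed, differentiable sample: softmax over Gumbel-perturbed logits.
soft = F.gumbel_softmax(logits, tau=0.5, hard=False)

# Straight-through variant: one-hot forward pass, soft gradients backward.
hard = F.gumbel_softmax(logits, tau=0.5, hard=True)

loss = soft.sum() + hard.sum()
loss.backward()              # gradients flow to `logits` in both cases
print(logits.grad.shape)     # torch.Size([4, 10])
```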

Vision • Score 96

Adaptive Causal Coordination Detection for Social Media: A Memory-Guided Framework with Semi-Supervised Learning

Detecting coordinated inauthentic behavior on social media is a critical challenge. We propose the Adaptive Causal Coordination Detection (ACCD) framework, which uses a progressive three-stage architecture to learn and retain optimized detection configurations. ACCD improves the identification of causal relationships and reduces the need for manual labeling, reaching an F1-score of 87.3% on coordinated-attack detection.

Source: arXiv cs.AI

Privacy/Security/Fairness • Score 89

Identification and Estimation under Multiple Treatment Versions: A Mixture-of-Experts Approach

The stable unit treatment value assumption (SUTVA) includes the condition that no multiple treatment versions exist in causal inference. This work introduces the Mixture-of-Experts framework into causal inference and develops a methodology for estimating the causal effects of latent versions, allowing explicit estimation of version-specific causal effects even when the versions are unobserved.

Source: arXiv stat.ML

NLP/LLMs • Score 96

Benchmarking Neural Surrogates on Realistic Spatiotemporal Multiphysics Flows

Predicting multiphysics dynamics is computationally expensive and challenging due to the tight coupling of heterogeneous, multiscale physical processes. We present REALM (REalistic AI Learning for Multiphysics), a rigorous benchmarking framework for testing neural surrogates on challenging reactive flows, with 11 high-fidelity datasets and a standardized training and evaluation protocol.

Source: arXiv cs.LG

RL • Score 93

Efficient Mixture-of-Agents Serving via Tree-Structured Routing, Adaptive Pruning, and Dependency-Aware Prefill-Decode Overlap

Mixture-of-Agents (MoA) inference can suffer from dense inter-agent communication and low hardware utilization, which inflate serving latency. We present a serving design that tackles these bottlenecks through algorithm-system co-design, yielding a substantial end-to-end latency reduction (up to 90%) while maintaining comparable accuracy.

Source: arXiv cs.AI

NLP/LLMs • Score 96

NL2CA: Auto-Formalizing Cognitive Decision-Making from Natural Language Using an Unsupervised CriticNL2LTL Framework

Cognitive computation models offer a formal, interpretable way to characterize human deliberation and decision-making, but their development remains labor-intensive. In this paper we propose NL2CA, a novel method for auto-formalizing cognitive decision-making rules from natural-language descriptions of human experience, fully automated and without human intervention.

Source: arXiv cs.AI

RL • Score 93

On the Convergence Rate of LoRA Gradient Descent

The low-rank adaptation (LoRA) algorithm for fine-tuning large models has become popular thanks to its remarkable performance and low computational requirements. This work presents the first non-asymptotic convergence analysis of the original LoRA gradient descent algorithm, without restrictive Lipschitz-smoothness assumptions.

Source: arXiv cs.LG
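
To fix notation: LoRA freezes a pretrained weight $W_0$ and learns a low-rank correction $BA$, with gradient descent run only on the factors $A$ and $B$. A minimal module in the standard LoRA parameterization (the alpha/r scaling follows common practice and may differ from the setup analyzed in the paper):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.W0 = nn.Linear(d_in, d_out, bias=False)
        self.W0.weight.requires_grad_(False)       # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero-init: starts at W0
        self.scale = alpha / r

    def forward(self, x):
        return self.W0(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(64, 64)
opt = torch.optim.SGD([layer.A, layer.B], lr=1e-2)   # GD only on the factors
loss = layer(torch.randn(4, 64)).pow(2).mean()
loss.backward(); opt.step()
```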

RL • Score 95

Sampling Multimodal Distributions with Warm Starting Points: Nonasymptotic Bounds for the Reweighted Annealed Leap-Point Sampler

Sampling from multimodal distributions is a central challenge in Bayesian inference and machine learning. This work introduces Reweighted ALPS (Re-ALPS), a modified version of the Annealed Leap-Point Sampler (ALPS) that removes the Gaussian-approximation assumption and attains a polynomial-time bound in a general setting.

Source: arXiv stat.ML

RL • Score 89

Prospects for Quantum Advantage in Machine Learning from Function Representability

Demonstrating quantum advantage on machine learning tasks requires navigating a complex landscape of proposed models and algorithms. We introduce a framework that links the structure of parameterized quantum circuits to the mathematical nature of the functions they can actually learn. The analysis reveals critical distinctions between fully simulable models and those that remain robustly quantum.

Source: arXiv stat.ML

NLP/LLMs • Score 95

Teaching and Critiquing Conceptualization and Operationalization in NLP

NLP researchers often invoke abstract concepts such as "interpretability", "bias", "reasoning", and "stereotypes" without defining them. This paper describes a seminar created for students to explore questions of conceptualization and operationalization, with an interdisciplinary reading list and an emphasis on discussion and critique.

Source: arXiv cs.CL

NLP/LLMs • Score 92

Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction

arXiv:2512.18880v1 Announce Type: new Abstract: Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.

Source: arXiv cs.CL

NLP/LLMs • Score 95

Learning to Prioritize IT Tickets: A Comparative Evaluation of Embedding-based Approaches and Fine-Tuned Transformer Models

arXiv:2512.17916v1 Announce Type: new Abstract: Prioritizing service tickets in IT Service Management (ITSM) is critical for operational efficiency but remains challenging due to noisy textual inputs, subjective writing styles, and pronounced class imbalance. We evaluate two families of approaches for ticket prioritization: embedding-based pipelines that combine dimensionality reduction, clustering, and classical classifiers, and a fine-tuned multilingual transformer that processes both textual and numerical features. Embedding-based methods exhibit limited generalization across a wide range of thirty configurations, with clustering failing to uncover meaningful structures and supervised models highly sensitive to embedding quality. In contrast, the proposed transformer model achieves substantially higher performance, with an average F1-score of 78.5% and weighted Cohen's kappa values of nearly 0.80, indicating strong alignment with true labels. These results highlight the limitations of generic embeddings for ITSM data and demonstrate the effectiveness of domain-adapted transformer architectures for operational ticket prioritization.

Source: arXiv cs.CL

RecSys • Score 96

Probabilistic Digital Twins of Users: Latent Representation Learning with Statistically Validated Semantics

Understanding user identity and behavior is central to applications such as personalization, recommendation, and decision support. We propose a probabilistic digital-twin framework in which each user is modeled as a latent stochastic state that generates observed behavioral data. The framework is applied to a dataset of user responses to capture stable aspects of user identity.

Source: arXiv cs.LG

Vision • Score 96

Grad: Guided Relation Diffusion Generation for Graph Augmentation in Graph Fraud Detection

Graph Fraud Detection (GFD) in financial scenarios has become an urgent research topic for protecting online payment security. As fraudsters' camouflage strategies evolve, we propose the Grad model, which uses a supervised contrastive learning module to sharpen the distinction between fraudsters and benign users while generating auxiliary homophilic relations.

Source: arXiv cs.LG

NLP/LLMs • Score 95

Toward Efficient Agents: A Co-Design of Inference Architecture and System

The rapid development of agents based on large language models (LLMs) has opened new possibilities for autonomous multi-turn reasoning and tool-based decision-making. However, their real-world deployment is hindered by severe inefficiencies that arise not from isolated model inference but from systemic latency accumulated across reasoning loops, context growth, and heterogeneous tool interactions.

Source: arXiv cs.CL

RL • Score 96

Demonstration-Guided Continual Reinforcement Learning in Dynamic Environments

Reinforcement learning (RL) excels in many applications but struggles in dynamic environments. Continual reinforcement learning (CRL) lets RL agents keep learning and adapting, yet balancing stability and plasticity remains a challenge. We propose demonstration-guided continual reinforcement learning (DGCRL), which uses a repository of external demonstrations to guide RL exploration and adaptation.

Source: arXiv cs.LG

RL • Score 96

ARC: Leveraging Compositional Representations for Cross-Problem Learning in VRPs

Vehicle Routing Problems (VRPs) with diverse real-world attributes have spurred recent interest in cross-problem learning approaches that generalize efficiently across variants. We propose ARC (Attribute Representation via Compositional Learning), a cross-problem learning framework that learns disentangled attribute representations by decomposing them into two complementary components.

Source: arXiv cs.LG

RL • Score 96

FairExpand: Individual Fairness on Graphs with Partial Similarity Information

Individual fairness, which requires that similar individuals be treated similarly by algorithmic systems, is a core principle in fair machine learning. This work presents FairExpand, a flexible framework that promotes individual fairness in partial-information settings, overcoming the limitation of existing methods that require predefined similarity information for all node pairs.

Source: arXiv cs.LG

NLP/LLMs • Score 96

A Hybrid Inductive-Transductive Network for Traffic Flow Imputation at Unsampled Locations

Accurately imputing traffic flow at unsensed locations is challenging. We propose HINT, a Hybrid Inductive-Transductive Network, which uses an inductive-transductive training strategy to treat speed as a transductive signal while learning flow inductively. HINT consistently outperforms inductive baselines on three real-world datasets.

Source: arXiv cs.LG

RL • Score 95

Unsupervised Feature Selection via Robust Autoencoder and Adaptive Graph Learning

Effective feature selection is essential for high-dimensional data analysis and machine learning. Unsupervised feature selection (UFS) aims to cluster data and identify the most discriminative features. We propose the Robust Autoencoder-based Unsupervised Feature Selection (RAEUFS) model, which uses a deep autoencoder to learn nonlinear feature representations while improving robustness to outliers.

Source: arXiv stat.ML

NLP/LLMs • Score 95

Solver-Independent Automatic Problem Formulation via LLMs for Expensive Simulation-Driven Design

In expensive simulation-driven design, translating ambiguous design requirements into a mathematical optimization formulation is a bottleneck for optimizing product performance. We propose APF, a framework for solver-independent automated problem formulation via LLMs that automatically converts engineers' natural-language requirements into executable optimization models.

Source: arXiv cs.CL

RL • Score 89

On Conditional Stochastic Interpolation for Generative Nonlinear Sufficient Dimension Reduction

Identifying sufficient low-dimensional structure in nonlinear sufficient dimension reduction (SDR) is a fundamental and challenging problem. We propose a new method, generative sufficient dimension reduction (GenSDR), which leverages modern generative models and provably recovers all of the information contained in the central $\tau$-field at both the population and sample levels.

Source: arXiv stat.ML

RL • Score 95

Central Limit Theorems for Ergodic Averages of Markov Chains and the Comparison of Sampling Algorithms for Heavy-Tailed Distributions

Establishing central limit theorems (CLTs) for ergodic averages of Markov chains is a fundamental problem in probability and its applications. This paper provides verifiable necessary conditions for CLTs of ergodic averages on general state spaces, focusing on drift conditions that also yield lower bounds on convergence rates to stationarity.

Source: arXiv stat.ML

NLP/LLMs • Score 96

ChronoDreamer: An Action-Conditioned World Model as an Online Simulator for Robotic Planning

We present ChronoDreamer, an action-conditioned world model for contact-rich robotic manipulation. Given a history of egocentric RGB frames, contact maps, actions, and joint states, ChronoDreamer predicts future video frames, contact distributions, and joint angles via a spatiotemporal transformer trained with MaskGIT-style masked prediction.

Source: arXiv cs.AI

NLP/LLMs • Score 95

LLM-based Few-Shot Early Rumor Detection with Imitation Agent

arXiv:2512.18352v1 Announce Type: new Abstract: Early Rumor Detection (EARD) aims to identify the earliest point at which a claim can be accurately classified based on a sequence of social media posts. This is especially challenging in data-scarce settings. While Large Language Models (LLMs) perform well in few-shot NLP tasks, they are not well-suited for time-series data and are computationally expensive for both training and inference. In this work, we propose a novel EARD framework that combines an autonomous agent and an LLM-based detection model, where the agent acts as a reliable decision-maker for early time point determination, while the LLM serves as a powerful rumor detector. This approach offers the first solution for few-shot EARD, necessitating only the training of a lightweight agent and allowing the LLM to remain training-free. Extensive experiments on four real-world datasets show our approach boosts performance across LLMs and surpasses existing EARD methods in accuracy and earliness.

Source: arXiv cs.CL

NLP/LLMs • Score 96

Convolutional Neural Operator-Based Transfer Learning for Solving PDEs

The convolutional neural operator is a recently proposed CNN-based architecture that guarantees structure-preserving continuous-discrete equivalence and enables genuine, alias-free learning of PDE solution operators. This neural operator has been shown to outperform, in certain cases, reference models such as DeepONet and the Fourier neural operator in surrogate accuracy.

Source: arXiv cs.LG

NLP/LLMs • Score 92

DramaBench: A Six-Dimension Evaluation Framework for Dramatic Script Continuation

Dramatic script continuation requires models to maintain character consistency, advance the plot coherently, and preserve dramatic structure, capabilities that existing benchmarks do not evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating dramatic script continuation along six independent dimensions.

Source: arXiv cs.CL

RL • Score 96

Offline Behavioral Data Selection

Behavior cloning is a widely adopted approach to offline policy learning from expert demonstrations. This paper reveals offline behavioral data saturation, where policy performance quickly plateaus with a small fraction of the dataset. We propose the Stepwise Dual Ranking (SDR) method to extract a compact, informative subset from large offline behavioral datasets.

Source: arXiv cs.LG

NLP/LLMs • Score 95

FASTRIC: A Prompt Specification Language for Verifiable LLM Interactions

Large Language Models (LLMs) execute complex interaction protocols but lack formal specifications for verifying execution against designer intent. We present FASTRIC, a Prompt Specification Language that makes the Finite State Machines (FSMs) implicit in natural-language prompts explicit, enabling conformance verification through execution-trace analysis.

Source: arXiv cs.CL
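
The verification idea reduces to checking an execution trace against an explicit FSM. A minimal checker follows; the states, events, and refund protocol are invented for illustration, and FASTRIC's concrete specification syntax is not shown in this summary:

```python
# Hypothetical FSM for a support protocol: the spec says the agent must
# greet, verify identity, and only then refund or escalate.
TRANSITIONS = {
    ("start", "greet"): "greeted",
    ("greeted", "verify_identity"): "verified",
    ("verified", "issue_refund"): "done",
    ("verified", "escalate"): "done",
}

def conforms(trace, start="start", accept=("done",)):
    """Replay an execution trace; any unlisted transition is a violation."""
    state = start
    for event in trace:
        if (state, event) not in TRANSITIONS:
            return False, f"illegal event {event!r} in state {state!r}"
        state = TRANSITIONS[(state, event)]
    return state in accept, state

print(conforms(["greet", "verify_identity", "issue_refund"]))  # (True, 'done')
print(conforms(["greet", "issue_refund"]))   # (False, 'illegal event ...')
```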

Theory/Optimization • Score 92

Ensuring Calibration Robustness in Split Conformal Prediction under Adversarial Attacks

Conformal prediction (CP) offers distribution-free, finite-sample coverage guarantees but relies critically on exchangeability, a condition often violated under distribution shift. We study the robustness of split conformal prediction under adversarial perturbations at test time, focusing on coverage validity and the size of the resulting prediction set.

Source: arXiv stat.ML

Vision • Score 93

The Dead Salmons of AI Interpretability

In a well-known neuroscience study, researchers placed a dead salmon in an MRI scanner and showed it images of humans in social situations; analogous "dead salmon" artifacts arise in AI interpretability analyses. This work proposes a statistical-causal reinterpretation, treating explanations of computational systems as parameters of a statistical model and emphasizing the importance of testing alternative hypotheses.

Source: arXiv cs.AI

NLP/LLMs • Score 95

Toward Reasoning-Preserving Unlearning in Multimodal Large Language Models

Machine unlearning aims to erase requested data from trained models without full retraining. For Reasoning Multimodal Large Language Models (RMLLMs) this is challenging, since intermediate reasoning steps can leak sensitive information. We present RMLLMU-Bench, the first benchmark for RMLLM unlearning that evaluates both reasoning leakage and reasoning retention.

Source: arXiv cs.CL

NLP/LLMs • Score 96

Sophia: A Persistent Artificial-Life Agent Framework

The development of LLMs has elevated AI agents from task-specific tools to long-lived decision-making entities. Yet most architectures remain static and reactive, limited to manually defined scenarios. We propose a third stratum, System 3, which oversees the agent's narrative identity and long-term adaptation, culminating in Sophia, a "Persistent Agent" wrapper that integrates a continuous self-improvement loop.

Source: arXiv cs.AI

NLP/LLMs • Score 95

Training LLMs with LogicReward for Faithful and Rigorous Reasoning

arXiv:2512.18196v1 Announce Type: new Abstract: Although LLMs exhibit strong reasoning capabilities, existing training methods largely depend on outcome-based feedback, which can produce correct answers with flawed reasoning. Prior work introduces supervision on intermediate steps but still lacks guarantees of logical soundness, which is crucial in high-stakes scenarios where logical consistency is paramount. To address this, we propose LogicReward, a novel reward system that guides model training by enforcing step-level logical correctness with a theorem prover. We further introduce Autoformalization with Soft Unification, which reduces natural language ambiguity and improves formalization quality, enabling more effective use of the theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks with simple training procedures. Further analysis shows that LogicReward enhances reasoning faithfulness, improves generalizability to unseen tasks such as math and commonsense reasoning, and provides a reliable reward signal even without ground-truth labels. We will release all data and code at https://llm-symbol.github.io/LogicReward.

Source: arXiv cs.CL

NLP/LLMs • Score 96

Rethinking Multi-Agent Intelligence Through the Lens of Small-World Networks

Large language models (LLMs) have enabled multi-agent systems (MAS) in which multiple agents argue, critique, and coordinate to solve complex tasks, making the communication topology a fundamental design choice. In this work we revisit classical small-world (SW) network theory and investigate how SW connectivity can serve as a design principle for MAS.

Source: arXiv cs.AI

RL • Score 95

Neural CDEs as Correctors for Learned Time Series Models

arXiv:2512.12116v2 Announce Type: replace-cross Abstract: Learned time-series models, whether continuous- or discrete-time, are widely used to forecast the states of a dynamical system. Such models generate multi-step forecasts either directly, by predicting the full horizon at once, or iteratively, by feeding back their own predictions at each step. In both cases, the multi-step forecasts are prone to errors. To address this, we propose a Predictor-Corrector mechanism where the Predictor is any learned time-series model and the Corrector is a neural controlled differential equation. The Predictor forecasts, and the Corrector predicts the errors of the forecasts. Adding these errors to the forecasts improves forecast performance. The proposed Corrector works with irregularly sampled time series and continuous- and discrete-time Predictors. Additionally, we introduce two regularization strategies to improve the extrapolation performance of the Corrector with accelerated training. We evaluate our Corrector with diverse Predictors, e.g., neural ordinary differential equations, Contiformer, and DLinear, on synthetic, physics simulation, and real-world forecasting datasets. The experiments demonstrate that the Predictor-Corrector mechanism consistently improves the performance compared to Predictor alone.

Source: arXiv stat.ML
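
The predictor-corrector composition itself is simple to state in code. In the sketch below the corrector is a GRU standing in for the paper's neural controlled differential equation, a deliberate simplification so the example stays self-contained:

```python
import torch
import torch.nn as nn

class PredictorCorrector(nn.Module):
    """final_forecast = predictor(history) + corrector(history, forecast)."""
    def __init__(self, predictor, d_obs, horizon, d_hidden=64):
        super().__init__()
        self.predictor = predictor                 # any learned TS model
        self.encoder = nn.GRU(d_obs, d_hidden, batch_first=True)
        self.head = nn.Linear(d_hidden, horizon * d_obs)
        self.horizon, self.d_obs = horizon, d_obs

    def forward(self, history):
        forecast = self.predictor(history)         # (B, horizon, d_obs)
        _, h = self.encoder(torch.cat([history, forecast], dim=1))
        error = self.head(h[-1]).view(-1, self.horizon, self.d_obs)
        return forecast + error                    # corrected forecast

# Toy predictor: repeat the last observation over the horizon.
predictor = lambda x: x[:, -1:, :].repeat(1, 12, 1)
model = PredictorCorrector(predictor, d_obs=3, horizon=12)
out = model(torch.randn(8, 50, 3))                 # (8, 12, 3)
```

Only the corrector's error term is learned here; the predictor can be any frozen or jointly trained forecaster, which matches the plug-in role the abstract describes.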

NLP/LLMs • Score 96

Stable and Efficient Single-Rollout RL for Multimodal Reasoning

Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm for improving the reasoning capabilities of Multimodal Large Language Models (MLLMs). We introduce MSSR, a group-free RLVR framework that achieves stable optimization and effective performance on multimodal reasoning, overcoming limitations of single-rollout variants in multimodal settings.

Source: arXiv cs.LG

NLP/LLMs • Score 96

ASTIF: Adaptive Semantic-Temporal Integration for Cryptocurrency Price Forecasting

Financial time-series forecasting is an information-fusion challenge, yet most existing models rely on static architectures that struggle to integrate heterogeneous knowledge sources. We propose ASTIF, an intelligent hybrid system that adapts its forecasting strategy in real time through confidence-based meta-learning, integrating complementary components to improve forecast accuracy.

Source: arXiv cs.AI

NLP/LLMs • Score 96

NEURO-GUARD: Neuro-Symbolic Generalization and Unbiased Adaptive Routing for Diagnostics -- Explainable Medical AI

Accurate, interpretable image-based diagnosis remains a central challenge in medical AI, especially in settings with limited data and high-stakes clinical decisions. We present NEURO-GUARD, a novel knowledge-guided framework that integrates Vision Transformers (ViTs) with language-driven reasoning, improving performance and robustness across domains.

Source: arXiv cs.AI

NLP/LLMs • Score 96

ESearch-R1: Learning Cost-Aware MLLM Agents for Interactive Embodied Search via Reinforcement Learning

Multimodal Large Language Models (MLLMs) have empowered embodied agents with remarkable planning and reasoning abilities. However, when faced with ambiguous natural-language instructions, current agents often fail to balance the high cost of physical exploration against the cognitive cost of human interaction. To close this gap, we propose ESearch-R1, a cost-aware embodied reasoning framework.

Source: arXiv cs.AI

Vision • Score 96

A Multimodal Bayesian Network for Robust Casualty Assessment in Autonomous Triage

Mass-casualty incidents can overwhelm emergency medical systems, and delays or errors in casualty assessment can lead to preventable deaths. We present a decision-support framework that combines the outputs of multiple computer vision models, estimating signs of severe hemorrhage, respiratory distress, physical alertness, and visible trauma, in a Bayesian network built entirely from expert-defined rules.

Source: arXiv cs.AI

NLP/LLMs • Score 93

External Hippocampus: Topological Cognitive Maps for Guiding Large Language Model Reasoning

This paper proposes the External Hippocampus framework, which models language-model reasoning from a cognitive-dynamics perspective as the flow of information energy through semantic space. The framework builds topological cognitive maps via dimensionality-reduction projection, enabling precise navigation and intervention in the energy flow at test time without substantial computational requirements.

Source: arXiv cs.AI

NLP/LLMs • Score 96

FC-MIR: A Mobile Screen-Awareness Framework for Intent-Aware Recommendation Based on Frame-Compressed Multimodal Trajectory Reasoning

Identifying user intent from mobile-interface operation trajectories is crucial for advancing UI understanding and enabling task-automation agents. We propose the FC-MIR framework, which uses keyframe sampling and adaptive concatenation to reduce visual redundancy and increase inference efficiency, integrating state-of-the-art MLLMs for trajectory summarization and intent prediction.

Source: arXiv cs.AI

NLP/LLMs • Score 96

MEEA: Mere-Exposure-Effect-Based Adversarial Optimization for Jailbreaking LLMs

The rapid advance of large language models (LLMs) has intensified concerns about the robustness of their safety alignment. We propose MEEA (Mere Exposure Effect Attack), a psychology-inspired automated framework for evaluating safety robustness in multi-turn interactions that exploits the mere-exposure effect. Our experiments show that MEEA consistently achieves higher attack success rates against models such as GPT-4 and Claude-3.5.

Source: arXiv cs.AI

Evaluation/Benchmarks • Score 93

Comparative Evaluation of Explainable Machine Learning Versus Linear Regression for Predicting County-Level Lung Cancer Mortality in the United States

Lung cancer (LC) is a leading cause of cancer-related mortality in the United States. Accurate prediction of LC mortality rates is crucial for guiding targeted interventions and addressing health disparities. This study applied three models, random forest (RF), gradient boosting regression (GBR), and linear regression (LR), to predict county-level LC mortality rates.

Source: arXiv cs.LG

RL • Score 96

Can We Test Theories of Consciousness in AI? Ablations, Markers, and Robustness

The search for reliable indicators of consciousness has fragmented into competing theoretical camps (Global Workspace Theory (GWT), Integrated Information Theory (IIT), and Higher-Order Theories (HOT)), each proposing distinct neural signatures. We adopt a synthetic neuro-phenomenology approach, building artificial agents to test the functional consequences of these theories through precise architectural ablations. We report dissociations suggesting that these theories describe complementary functional layers.

Source: arXiv cs.AI

Theory/Optimization • Score 89

Nonasymptotic Convergence Rates for Plug-and-Play Methods With MMSE Denoisers

arXiv:2510.27211v4 Announce Type: replace-cross Abstract: It is known that the minimum-mean-squared-error (MMSE) denoiser under Gaussian noise can be written as a proximal operator, which suffices for asymptotic convergence of plug-and-play (PnP) methods but does not reveal the structure of the induced regularizer or give convergence rates. We show that the MMSE denoiser corresponds to a regularizer that can be written explicitly as an upper Moreau envelope of the negative log-marginal density, which in turn implies that the regularizer is 1-weakly convex. Using this property, we derive (to the best of our knowledge) the first sublinear convergence guarantee for PnP proximal gradient descent with an MMSE denoiser. We validate the theory with a one-dimensional synthetic study that recovers the implicit regularizer. We also validate the theory with imaging experiments (deblurring and computed tomography), which exhibit the predicted sublinear behavior.

Source: arXiv stat.ML

RL • Score 93

ORPR: An OR-Guided Learning Model for Inventory Management with Pre-Training and Reinforcement

As the pursuit of synergy between Artificial Intelligence (AI) and Operations Research (OR) advances in the management of complex inventory systems, a critical challenge remains: how to effectively reconcile AI's adaptive perception with OR's structural rigor. We propose a novel OR-guided "Pre-Training and Reinforcement" framework.

Source: arXiv cs.AI

RL • Score 93

What Is the Price of Monotonicity? A Multi-Dataset Benchmark of Gradient Boosting with Monotonic Constraints for Credit PD

Financial institutions face a trade-off between predictive accuracy and interpretability when deploying machine learning models for credit risk. This paper evaluates gradient boosting models with and without monotonic constraints across five public datasets and three libraries, defining the Price of Monotonicity (PoM) as the relative change in standard performance metrics.

Source: arXiv cs.LG
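
Monotonic constraints are a one-line switch in the major boosting libraries, which is what makes a benchmark like this cheap to run. A sketch with XGBoost (the toy features and constraint directions are illustrative, and the PoM line is the straightforward relative-AUC reading of the paper's definition):

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy credit data: utilization should raise PD, income should lower it.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))                   # [utilization, income]
p = 1 / (1 + np.exp(-(1.5 * X[:, 0] - 1.0 * X[:, 1])))
y = rng.binomial(1, p)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

free = XGBClassifier(n_estimators=200).fit(X_tr, y_tr)
mono = XGBClassifier(n_estimators=200,
                     monotone_constraints="(1,-1)").fit(X_tr, y_tr)

auc_free = roc_auc_score(y_te, free.predict_proba(X_te)[:, 1])
auc_mono = roc_auc_score(y_te, mono.predict_proba(X_te)[:, 1])
print("PoM (relative AUC change):", (auc_free - auc_mono) / auc_free)
```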

NLP/LLMs • Score 96

Large Language Models as Discounted Bayesian Filters

Large Language Models (LLMs) show strong few-shot generalization through in-context learning, but their reasoning in dynamic, stochastic environments remains opaque. We introduce a Bayesian filtering framework for evaluating online inference in LLMs, revealing how their belief updates behave like exponential-forgetting filters.

Source: arXiv cs.AI
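
An exponential-forgetting (discounted) Bayesian filter, shown on a Bernoulli parameter where the update is exact in closed form. The discount gamma is the knob whose fitted value would characterize a model under this reading; the LLM-fitting step itself is not shown, and the particular discounting scheme below is one common choice among several:

```python
def discounted_bernoulli_filter(observations, gamma=0.9, a=1.0, b=1.0):
    """Beta-Bernoulli filtering with exponential forgetting.

    Before each update the pseudo-counts are shrunk toward the prior,
    so evidence from t steps ago is weighted by roughly gamma**t.
    """
    estimates = []
    for x in observations:                   # x in {0, 1}
        a = gamma * a + (1 - gamma) * 1.0 + x
        b = gamma * b + (1 - gamma) * 1.0 + (1 - x)
        estimates.append(a / (a + b))        # posterior mean
    return estimates

# Regime switch at t=30: a discounted filter tracks it; gamma=1 would lag.
obs = [1] * 30 + [0] * 30
print([round(v, 2) for v in discounted_bernoulli_filter(obs, gamma=0.8)][-5:])
```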

NLP/LLMs • Score 96

Observer, Not Player: Simulating Theory of Mind in LLMs through Game Observation

We present an interactive framework for evaluating whether large language models (LLMs) exhibit genuine "understanding" in a simple yet strategic setting. Using the game Rock-Paper-Scissors (RPS) as the example, our system casts the LLM as an Observer whose task is to identify the strategies in play and articulate the reasoning behind that judgment.

Source: arXiv cs.AI

RL • Score 93

Toward Guided Descent: Optimization Algorithms for Large-Scale Neural Network Training

Neural-network optimization remains one of the most significant and poorly understood challenges in modern AI research. Improvements in training algorithms can lead to better feature learning in foundation models, significant reductions in training time, and improved interpretability of how networks learn. This thesis investigates the evolution of optimization algorithms, showing how principled algorithmic design can demystify the training process.

Source: arXiv cs.LG

NLP/LLMs • Score 96

From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers

Transformers can implement both generalizable algorithms (e.g., induction heads) and simple positional shortcuts (e.g., memorizing fixed output positions). In this work we study how the choice of pre-training data distribution steers a shallow transformer toward one behavior or the other, analyzing gradient-based training of a one-layer transformer.

Source: arXiv cs.LG

Vision • Score 96

FedOAED: Federated On-Device Autoencoder Denoiser for Heterogeneous Data under Limited Client Availability

In recent years, machine learning (ML) and deep learning (DL) solutions have shown their potential across many applications, but strict data-sharing regulations such as GDPR and HIPAA have hampered data-driven deployments. Federated Learning (FL) emerges as a solution yet still faces heterogeneity challenges. In this work we present FedOAED, a novel algorithm that mitigates client drift and the variance induced by partial client participation.

Source: arXiv cs.LG

NLP/LLMs • Score 96

Programmatic Rule Generation for Document Forgery Detection Using Large Language Models

Document forgery poses a growing threat to legal, economic, and governmental processes, demanding increasingly sophisticated verification mechanisms. This work investigates how large language models (LLMs) can be adapted to generate rule-based plausibility checks for forgery detection using limited hardware resources.

Source: arXiv cs.AI

Vision • Score 96

A Dataset and Benchmarks for Atrial Fibrillation Detection from Electrocardiograms of Intensive Care Unit Patients

Objective: Atrial fibrillation (AF) is the most common cardiac arrhythmia among intensive care unit (ICU) patients and can cause adverse health effects. In this study we release a labeled ICU dataset and benchmarks for AF detection, comparing machine learning models across three data-driven artificial intelligence (AI) approaches.

Source: arXiv cs.LG

NLP/LLMs • Score 95

LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction

Large language models (LLMs) often generate hallucinated content that lacks factual or contextual grounding, limiting their reliability in critical applications. LLM-CAS casts real-time hallucination correction as a hierarchical reinforcement learning problem, enabling adaptive corrections without permanent parameter modification.

Source: arXiv cs.CL

RL • Score 96

LeJOT: An Intelligent Job Cost Orchestration Solution for the Databricks Platform

With rapid advances in big-data technologies, the Databricks platform has become fundamental for companies and research institutions. However, managing the growing operational cost of running jobs is a critical challenge. We present LeJOT, a job cost orchestration framework that uses machine learning for runtime prediction and a solver-based optimization model for real-time resource allocation.

Source: arXiv cs.LG

NLP/LLMs • Score 96

Rumo à Avaliação de Vulnerabilidades de Privacidade no Esquecimento Seletivo com Modelos de Linguagem de Grande Escala

Os avanços rápidos em inteligência artificial (IA) têm se concentrado no aprendizado a partir de dados para desenvolver sistemas de aprendizado informados. Com a implementação desses sistemas em áreas críticas, garantir sua privacidade e alinhamento com valores humanos é essencial. O esquecimento seletivo, ou machine unlearning, surge como uma abordagem promissora, mas também levanta preocupações significativas de privacidade, especialmente em domínios sensíveis.

Fonte: arXiv cs.LG

Vision • Score 96

Mapas auto-organizáveis para avaliação da qualidade da água em reservatórios e lagos: Uma revisão sistemática da literatura

A qualidade da água sustentável é fundamental para o equilíbrio ecológico e a segurança hídrica. Esta revisão examina a aplicação do Self-Organizing Map (SOM), uma técnica de IA não supervisionada, na avaliação da qualidade da água, abordando seleção de parâmetros, estratégias de amostragem e abordagens de agrupamento.

Fonte: arXiv cs.LG

MLOps/Systems • Score 93

Comparando Modelos Dinâmicos Através do Alinhamento de Campos Vetoriais Difeomórficos

Modelos de sistemas dinâmicos, como redes neurais recorrentes (RNNs), são cada vez mais populares na neurociência teórica para geração de hipóteses e análise de dados. Avaliar a dinâmica nesses modelos é crucial para entender seus mecanismos generativos aprendidos, mas enfrenta desafios significativos relacionados à comparação de dinâmicas e identificação de motivos importantes em modelos não lineares de alta dimensão.

Fonte: arXiv cs.LG

NLP/LLMs • Score 92

DACE For Railway Acronym Disambiguation

arXiv:2512.18357v1 Announce Type: new Abstract: Acronym Disambiguation (AD) is a fundamental challenge in technical text processing, particularly in specialized sectors where high ambiguity complicates automated analysis. This paper addresses AD within the context of the TextMine'26 competition on French railway documentation. We present DACE (Dynamic Prompting, Retrieval Augmented Generation, Contextual Selection, and Ensemble Aggregation), a framework that enhances Large Language Models through adaptive in-context learning and external domain knowledge injection. By dynamically tailoring prompts to acronym ambiguity and aggregating ensemble predictions, DACE mitigates hallucination and effectively handles low-resource scenarios. Our approach secured the top rank in the competition with an F1 score of 0.9069.

Fonte: arXiv cs.CL

Vision • Score 96

NOVA: Descobrindo Transformações Winograd Bem Condicionadas através da Otimização Numérica da Aritmética de Vandermonde

A convolução Winograd é o algoritmo padrão para inferência eficiente, reduzindo a complexidade aritmética em 2,25x para núcleos 3x3. No entanto, enfrenta uma barreira crítica na era moderna da computação de baixa precisão: a instabilidade numérica. Apresentamos o NOVA, um framework de descoberta que otimiza a seleção de pontos Winograd como um problema de otimização contínua.
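
Um esboço ilustrativo (suposição nossa, não o framework NOVA): os pontos de Winograd definem matrizes de Vandermonde, e o condicionamento dessas matrizes governa a estabilidade numérica em baixa precisão.

```python
import numpy as np

def cond_winograd_points(points):
    # Matriz de Vandermonde gerada pelos pontos de avaliacao escolhidos.
    V = np.vander(np.asarray(points, dtype=float), increasing=True)
    return np.linalg.cond(V)

print(cond_winograd_points([0, 1, -1, 2]))  # escolha classica para F(2,3)
print(cond_winograd_points([0, 1, -1, 8]))  # ponto distante: condicionamento bem pior
```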

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models

arXiv:2512.19004v1 Announce Type: new Abstract: Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization. We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.
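
A toy sketch of the confidence-based remasking mechanism described above (our illustration, not the authors' code): warm-start tokens from the auxiliary prior are kept only where confidence is high; the rest return to mask.

```python
import numpy as np

MASK = -1
rng = np.random.default_rng(0)
prior_tokens = rng.integers(0, 100, size=16)  # draft from a lightweight model
confidence = rng.random(16)                   # assumed per-position confidence
threshold = 0.6                               # the "prior skepticism" knob
init = np.where(confidence > threshold, prior_tokens, MASK)
print(f"{(init != MASK).mean():.0%} of positions warm-started:", init)
```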

Fonte: arXiv cs.CL

RL • Score 93

Quando a Aprendizagem se Renormaliza? Condições Suficientes para Dinâmicas Espectrais de Lei de Potência

A escalabilidade empírica de lei de potência tem sido amplamente observada em sistemas modernos de deep learning, mas suas origens teóricas e escopo de validade permanecem incompletamente compreendidos. O framework Generalized Resolution-Shell Dynamics (GRSD) modela a aprendizagem como transporte de energia espectral através de camadas de resolução logarítmica, oferecendo uma descrição dinâmica de treinamento.

Fonte: arXiv cs.LG

RL • Score 93

Família FedSUM: Métodos Eficientes de Aprendizado Federado sob Participação Arbitrária de Clientes

Os métodos de Aprendizado Federado (FL) são frequentemente projetados para padrões específicos de participação de clientes, limitando sua aplicabilidade em implementações práticas. Apresentamos a família de algoritmos FedSUM, que suporta participação arbitrária de clientes sem suposições adicionais sobre a heterogeneidade dos dados. Nosso framework modela a variabilidade de participação com duas métricas de atraso: o atraso máximo $\tau_{\text{max}}$ e o atraso médio $\tau_{\text{avg}}$.
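
Um esboço mínimo (interpretação nossa das métricas do resumo, não o código dos autores): o atraso de um cliente é o número de rodadas desde sua última participação, e $\tau_{\text{max}}$ e $\tau_{\text{avg}}$ resumem a pior e a média dessas lacunas.

```python
import numpy as np

def participation_delays(P):            # P: bool [n_clientes, n_rodadas]
    gaps = []
    for row in P:
        rounds = np.flatnonzero(row)    # rodadas em que o cliente participou
        gaps.extend(np.diff(rounds))    # lacunas entre participacoes sucessivas
    gaps = np.asarray(gaps)
    return gaps.max(), gaps.mean()      # (tau_max, tau_avg)

rng = np.random.default_rng(1)
P = rng.random((10, 200)) < 0.3         # participacao arbitraria (~30% por rodada)
print(participation_delays(P))
```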

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

CoPE: A Small Language Model for Steerable and Scalable Content Labeling

arXiv:2512.18027v1 Announce Type: new Abstract: This paper details the methodology behind CoPE, a policy-steerable small language model capable of fast and accurate content labeling. We present a novel training curriculum called Contradictory Example Training that enables the model to learn policy interpretation rather than mere policy memorization. We also present a novel method for generating content policies, called Binocular Labeling, which enables rapid construction of unambiguous training datasets. When evaluated across seven different harm areas, CoPE exhibits equal or superior accuracy to frontier models at only 1% of their size. We openly release a 9 billion parameter version of the model that can be run on a single consumer-grade GPU. Models like CoPE represent a paradigm shift for classifier systems. By turning an ML task into a policy writing task, CoPE opens up new design possibilities for the governance of online platforms.

Fonte: arXiv cs.CL

RL • Score 96

Previsão e Prognóstico dos Impactos de Secas de Curto Prazo Usando Machine Learning para Apoiar Esforços de Mitigação e Adaptação

A seca é um risco natural complexo que afeta sistemas ecológicos e humanos, resultando em perdas ambientais e econômicas significativas. Este estudo aplica técnicas de machine learning para vincular índices de seca a registros históricos de impactos de seca, gerando previsões de impactos de curto prazo e melhorando a previsibilidade desses impactos em prazos acionáveis.

Fonte: arXiv cs.LG

NLP/LLMs • Score 92

Graph-O1: Monte Carlo Tree Search with Reinforcement Learning for Text-Attributed Graph Reasoning

arXiv:2512.17912v1 Announce Type: new Abstract: Text-attributed graphs, where nodes and edges contain rich textual information, are widely used across diverse domains. A central challenge in this setting is question answering, which requires jointly leveraging unstructured text and the structured relational signals within the graph. Although Large Language Models (LLMs) have made significant advances in natural language understanding, their direct use for reasoning over text-attributed graphs remains limited. Retrieval-augmented generation methods that operate purely on text often treat passages as isolated units, ignoring the interconnected structure of the graph. Conversely, graph-based RAG methods that serialize large subgraphs into long textual sequences quickly become infeasible due to LLM context-length constraints, resulting in fragmented reasoning and degraded accuracy. To overcome these limitations, we introduce Graph-O1, an agentic GraphRAG framework that enables LLMs to conduct stepwise, interactive reasoning over graphs. Our approach integrates Monte Carlo Tree Search (MCTS) with end-to-end reinforcement learning, allowing the model to selectively explore and retrieve only the most informative subgraph components. The reasoning procedure is framed as a multi-turn interaction between the agent and the graph environment, and the agent is trained through a unified reward mechanism. Extensive experiments across multiple LLM backbones demonstrate that Graph-O1 consistently surpasses state-of-the-art baselines, producing answers that are more accurate, reliable, and interpretable.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression

arXiv:2512.17920v1 Announce Type: new Abstract: Large language models (LLMs) exhibit degraded performance under prompt compression, but the mechanisms remain poorly understood. We introduce the Compression-Decay Comprehension Test (CDCT), a benchmark that independently measures constraint compliance (CC) and semantic accuracy (SA) across compression levels. We evaluate 9 frontier LLMs across 8 concepts using 5 compression levels from extreme (c=0.0, ~2 words) to none (c=1.0, ~135 words). A three-judge LLM jury achieves almost perfect inter-rater agreement on CC (Fleiss' $\kappa=0.90$). We observe a universal U-curve pattern in constraint compliance (97.2% prevalence), with violations peaking at medium compression (c=0.5, ~27 words). Counterintuitively, models perform better at extreme compression than medium lengths. The dimensions are statistically orthogonal (r=0.193, p=0.084), with constraint effects 2.9x larger than semantic effects. Experimental validation via RLHF ablation confirms our constraint salience hypothesis: removing "helpfulness" signals improves CC by 598% on average (71/72 trials, p<0.001), with 79% achieving perfect compliance. This demonstrates that RLHF-trained helpfulness behaviors are the dominant cause of constraint violations at medium compression. Reasoning models outperform efficient models by 27.5% (Cohen's d=0.96). Our findings reveal a fundamental tension between RLHF alignment and instruction-following, providing actionable guidelines for improving deployed systems.

Fonte: arXiv cs.CL

RL • Score 92

Por que a Maioria dos Algoritmos de Bandit de Otimismo Tem a Mesma Análise de Arrependimento: Um Teorema Unificador Simples

Vários algoritmos de bandit estocásticos baseados em otimismo -- incluindo UCB, UCB-V, linear UCB e GP-UCB de braço finito -- alcançam arrependimento logarítmico usando provas que, apesar de diferenças superficiais, seguem essencialmente a mesma estrutura. Este artigo isola os ingredientes mínimos por trás dessas análises.
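
Um esboço mínimo (ilustração nossa, não o teorema do artigo): o UCB clássico, cuja análise segue exatamente a receita "otimismo + desigualdade de concentração" que o artigo unifica para toda a família.

```python
import numpy as np

def ucb(means, horizon, seed=0):
    rng = np.random.default_rng(seed)
    k = len(means)
    counts, sums, regret = np.zeros(k), np.zeros(k), 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                                  # puxa cada braco uma vez
        else:
            bonus = np.sqrt(2.0 * np.log(t) / counts)    # raio de confianca
            arm = int(np.argmax(sums / counts + bonus))  # indice otimista
        reward = rng.normal(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        regret += max(means) - means[arm]
    return regret

print(ucb([0.9, 0.8, 0.5], horizon=10_000))  # cresce ~log(t), como na analise
```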

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts

arXiv:2512.18608v1 Announce Type: new Abstract: Automated masking of Personally Identifiable Information (PII) is critical for privacy-preserving conversational systems. While current frontier large language models demonstrate strong PII masking capabilities, concerns about data handling and computational costs motivate exploration of whether lightweight models can achieve comparable performance. We compare encoder-decoder and decoder-only architectures by fine-tuning T5-small and Mistral-Instruct-v0.3 on English datasets constructed from the AI4Privacy benchmark. We create different dataset variants to study label standardization and PII representation, covering 24 standardized PII categories and higher-granularity settings. Evaluation using entity-level and character-level metrics, type accuracy, and exact match shows that both lightweight models achieve performance comparable to frontier LLMs for PII masking tasks. Label normalization consistently improves performance across architectures. Mistral achieves higher F1 and recall with greater robustness across PII types but incurs significantly higher generation latency. T5, while less robust in conversational text, offers more controllable structured outputs and lower inference cost, motivating its use in a real-time Discord bot for real-world PII redaction. Evaluation on live messages reveals performance degradation under informal inputs. These results clarify trade-offs between accuracy, robustness, and computational efficiency, demonstrating that lightweight models can provide effective PII masking while addressing data handling concerns associated with frontier LLMs.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Sobre a Universalidade das Arquiteturas Transformer; Quanta Atenção é Suficiente?

Transformers são cruciais em muitos campos da IA, como modelos de linguagem grandes, visão computacional e aprendizado por reforço. Este trabalho examina a universalidade em Transformers, revisa avanços recentes, incluindo refinamentos arquiteturais, e investiga inovações que informam a compreensão teórica e prática. Nosso objetivo é esclarecer o que se sabe sobre a expressividade dos Transformers e identificar direções-chave para pesquisas teóricas futuras.

Fonte: arXiv cs.LG

Vision • Score 96

Detecção de Out-of-Distribution em Complexos Moleculares via Modelos de Difusão para Grafos Irregulares

Modelos preditivos de machine learning geralmente se destacam em dados in-distribution, mas seu desempenho se degrada em entradas out-of-distribution (OOD). Apresentamos um framework probabilístico de detecção OOD para dados complexos de grafos 3D, construído sobre um modelo de difusão que aprende a densidade da distribuição de treinamento de maneira totalmente não supervisionada.

Fonte: arXiv cs.LG

Vision • Score 93

Planejamento de Trajetória para Agricultura Inteligente Baseada em UAV Usando Aprendizado Profundo Triplo Q com Imitação

Veículos aéreos não tripulados (UAVs) surgiram como uma plataforma auxiliar promissora para a agricultura inteligente, realizando simultaneamente detecção de ervas daninhas, reconhecimento e coleta de dados de sensores sem fio. No entanto, o planejamento de trajetória é desafiador devido à alta incerteza do ambiente e à capacidade limitada da bateria dos UAVs. Propomos um algoritmo inovador de aprendizado profundo triplo Q com imitação (ITDQN) para resolver esses problemas.

Fonte: arXiv cs.LG

RL • Score 96

Extensão de Base Contrafactual e Geometria Representacional: Um Modelo de Crescimento Conceitual com Restrições de MDL

O aprendizado de conceitos se torna possível apenas quando representações existentes falham em contabilizar a experiência. Este artigo propõe um framework geométrico onde o crescimento conceitual é modelado como extensão de base admissível avaliada sob um critério de Minimum Description Length (MDL). A experiência é representada como vetores em relação a um subespaço conceitual atual.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Redes Neurais Gráficas Aprimoradas por Recursos para Classificação de Modelos Geradores de Grafos Sintéticos: Um Estudo de Benchmarking

A capacidade de discriminar entre modelos geradores de grafos é fundamental para entender padrões estruturais complexos em grafos sintéticos e nas estruturas do mundo real que eles emulam. Este trabalho investiga a classificação de famílias de grafos sintéticos usando uma abordagem híbrida que combina Graph Neural Networks (GNNs) com recursos teóricos de grafos.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

O Número de Condição como um Proxy Invariante de Escala para Codificação de Informação em Unidades Neurais

Este artigo explora a relação entre o número de condição do tensor de pesos de uma rede neural e a extensão da informação codificada pela unidade de processamento associada, sob a perspectiva da teoria da informação. Argumenta-se que um número de condição elevado pode indicar que a unidade aprendeu a amplificar e comprimir informações de forma seletiva.
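
Um esboço ilustrativo (suposição nossa, não o código do artigo): o número de condição de uma matriz de pesos é a razão entre o maior e o menor valor singular e é invariante à escala, pois cond(cW) = cond(W).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))
s = np.linalg.svd(W, compute_uv=False)
cond = s[0] / s[-1]                        # razao maior/menor valor singular
print(round(cond, 3))
print(round(np.linalg.cond(10.0 * W), 3))  # igual: reescalar nao muda o proxy
```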

Fonte: arXiv stat.ML

RL • Score 92

Algoritmo Garantido de Regret a Qualquer Momento para Controle de Sistemas Lineares Quadráticos

Propomos um algoritmo computacionalmente eficiente que alcança um regret a qualquer momento de ordem $O(\sqrt{t})$, com dependência explícita nas dimensões do sistema e na solução da Equação de Riccati Algébrica Discreta (DARE). Nossa abordagem utiliza uma regularização adequadamente ajustada e uma estimativa inicial suficientemente precisa para construir elipsoides de confiança para o design de controle.
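
Um esboço mínimo, supondo o uso de scipy (não é o algoritmo do artigo): a DARE cuja solução aparece explicitamente no limite de arrependimento, e o ganho LQR ótimo correspondente.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

P = solve_discrete_are(A, B, Q, R)                  # solucao da DARE
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # ganho otimo: u = -K x
print(P, K, sep="\n")
```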

Fonte: arXiv stat.ML

NLP/LLMs • Score 96

DeliveryBench: Agentes Podem Lucrar no Mundo Real?

arXiv:2512.19234v1 Tipo de Anúncio: novo. Resumo: LLMs e VLMs estão sendo cada vez mais utilizados como agentes incorporados, mas os benchmarks existentes se concentram em tarefas simples de curto prazo e têm dificuldade em capturar as ricas restrições realistas que moldam a tomada de decisão no mundo real. Para fechar essa lacuna, propomos o DeliveryBench, um benchmark incorporado em escala de cidade baseado na profissão real de entrega de alimentos.

Fonte: arXiv cs.AI

Vision • Score 95

Data-driven particle dynamics: Structure-preserving coarse-graining for emergent behavior in non-equilibrium systems

arXiv:2508.12569v2 Announce Type: replace-cross Abstract: Multiscale systems are ubiquitous in science and technology, but are notoriously challenging to simulate as short spatiotemporal scales must be appropriately linked to emergent bulk physics. When expensive high-dimensional dynamical systems are coarse-grained into low-dimensional models, the entropic loss of information leads to emergent physics which are dissipative, history-dependent, and stochastic. To machine learn coarse-grained dynamics from time-series observations of particle trajectories, we propose a framework using the metriplectic bracket formalism that preserves these properties by construction; most notably, the framework guarantees discrete notions of the first and second laws of thermodynamics, conservation of momentum, and a discrete fluctuation-dissipation balance crucial for capturing non-equilibrium statistics. We introduce the mathematical framework abstractly before specializing to a particle discretization. As labels are generally unavailable for entropic state variables, we introduce a novel self-supervised learning strategy to identify emergent structural variables. We validate the method on benchmark systems and demonstrate its utility on two challenging examples: (1) coarse-graining star polymers at challenging levels of coarse-graining while preserving non-equilibrium statistics, and (2) learning models from high-speed video of colloidal suspensions that capture coupling between local rearrangement events and emergent stochastic dynamics. We provide open-source implementations in both PyTorch and LAMMPS, enabling large-scale inference and extensibility to diverse particle-based systems.

Fonte: arXiv stat.ML

MLOps/Systems • Score 92

Uma Perspectiva de Otimização Riemanniana do Método Gauss-Newton para Redes Neurais Feedforward

Neste trabalho, estabelecemos limites de convergência não assintóticos para o método Gauss-Newton no treinamento de redes neurais com ativações suaves. No regime subparametrizado, o fluxo de gradiente Gauss-Newton no espaço de parâmetros induz um fluxo de gradiente Riemanniano em uma subvariedade embutida de baixa dimensão do espaço de funções.

Fonte: arXiv stat.ML

Theory/Optimization • Score 87

Redução de Variância e Baixa Complexidade de Amostra em Otimização Estocástica via Método de Ponto Proximal

As garantias de alta probabilidade em otimização estocástica são frequentemente obtidas apenas sob suposições de ruído forte, como caudas sub-Gaussianas. Mostramos que tais garantias também podem ser alcançadas sob a suposição mais fraca de variância limitada, desenvolvendo um método estocástico de ponto proximal que combina um solucionador de subproblemas proximais com um amplificador de probabilidade.

Fonte: arXiv stat.ML

RL • Score 92

Modelos Aditivos Generalizados Baseados em Cluster Informados por Recursos Aleatórios de Fourier

A aprendizagem de máquina explicável busca equilibrar a precisão da previsão e a transparência do modelo, especialmente em cenários onde modelos preditivos de caixa-preta, como redes neurais profundas, apresentam bom desempenho, mas são difíceis de interpretar. Este trabalho introduz uma mistura de modelos aditivos generalizados (GAMs) que utilizam representações de recursos aleatórios de Fourier (RFF) para revelar estruturas localmente adaptativas nos dados.

Fonte: arXiv stat.ML

Vision • Score 92

A informação mútua normalizada é uma medida enviesada para classificação e detecção de comunidades

A informação mútua normalizada é amplamente utilizada como uma medida de similaridade para avaliar o desempenho de algoritmos de agrupamento e classificação. Neste artigo, argumentamos que os resultados retornados pela informação mútua normalizada são enviesados por duas razões: ignoram o conteúdo informativo da tabela de contingência e sua normalização simétrica introduz dependência espúria na saída do algoritmo. Apresentamos uma versão modificada da informação mútua que corrige essas falhas.
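
Uma demonstração ilustrativa do viés descrito acima (código nosso, não o dos autores): partições aleatórias independentes deveriam ter similaridade próxima de zero, mas a NMI cresce com o número de clusters mesmo sem estrutura compartilhada.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
n = 1000
for k in (2, 20, 200):
    a = rng.integers(0, k, size=n)   # rotulos aleatorios com k classes
    b = rng.integers(0, k, size=n)   # particao independente da primeira
    print(k, round(normalized_mutual_info_score(a, b), 3))
# Saida tipica: ~0 para k=2, mas claramente positiva para k=200 -- o tipo
# de dependencia espuria que o artigo se propoe a corrigir.
```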

Fonte: arXiv stat.ML

Theory/Optimization • Score 93

O Gargalo de Interação das Redes Neurais Profundas: Descoberta, Prova e Modulação

Compreender que tipos de estruturas cooperativas as redes neurais profundas (DNNs) podem representar continua sendo um problema fundamental, mas insuficientemente compreendido. Este trabalho investiga como as DNNs codificam interações sob diferentes níveis de complexidade contextual e como esses padrões de interação microscópica moldam a capacidade de representação macroscópica.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Contribuição Consciente de Dados via Destilação de Cadeia de Pensamento Orientada pela Comunidade

A atual era de desenvolvimento de IA enfatiza o treinamento de grandes modelos em conjuntos de dados cada vez maiores. Este paradigma levantou preocupações sobre privacidade de dados e escolha do consumidor. Neste artigo, discutimos a portabilidade de dados e a autonomia do usuário no contexto de LLMs que utilizam rastros de cadeia de pensamento (CoT) para gerar saídas finais a partir de entradas do usuário.

Fonte: arXiv cs.LG

MLOps/Systems • Score 89

Misturas de cadeias de Markov variacionais com seleção automática de componentes

A modelagem de estados de Markov ganhou popularidade em diversos campos científicos, pois reduz conjuntos de dados de séries temporais complexas em transições entre poucos estados. Este artigo propõe um modelo de dados de séries temporais usando uma mistura de cadeias de Markov, determinando automaticamente o número de componentes da mistura com o algoritmo variacional de expectativa-maximização.

Fonte: arXiv stat.ML

NLP/LLMs • Score 96

HARBOR: Modelo Holístico e Adaptativo de Avaliação de Risco para Cuidados de Saúde Comportamental

A avaliação de risco em cuidados de saúde comportamental continua sendo um desafio devido à natureza multimodal dos dados dos pacientes e à dinâmica temporal dos transtornos de humor e afetivos. Neste trabalho, apresentamos o HARBOR, um modelo de linguagem consciente da saúde comportamental projetado para prever um escore de humor e risco discreto, denominado Harbor Risk Score (HRS).

Fonte: arXiv cs.AI

RL • Score 93

A Geometria da Abstração: Aprendizado Contínuo via Quotienting Recursivo

Sistemas de aprendizado contínuo que operam em espaços de dimensão fixa enfrentam uma barreira geométrica fundamental: o problema do manifold plano. Neste trabalho, propomos uma resolução geométrica para esse paradoxo com base na Contração Métrica Recursiva, formalizando a abstração como uma deformação topológica.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

MemEvolve: Meta-Evolution of Agent Memory Systems

arXiv:2512.18746v1 Announce Type: new Abstract: Self-evolving memory systems are unprecedentedly reshaping the evolutionary paradigm of large language model (LLM)-based agents. Prior work has predominantly relied on manually engineered memory architectures to store trajectories, distill experience, and synthesize reusable tools, enabling agents to evolve on the fly within environment interactions. However, this paradigm is fundamentally constrained by the staticity of the memory system itself: while memory facilitates agent-level evolving, the underlying memory architecture cannot be meta-adapted to diverse task contexts. To address this gap, we propose MemEvolve, a meta-evolutionary framework that jointly evolves agents' experiential knowledge and their memory architecture, allowing agent systems not only to accumulate experience but also to progressively refine how they learn from it. To ground MemEvolve in prior research and foster openness in future self-evolving systems, we introduce EvolveLab, a unified self-evolving memory codebase that distills twelve representative memory systems into a modular design space (encode, store, retrieve, manage), providing both a standardized implementation substrate and a fair experimental arena. Extensive evaluations on four challenging agentic benchmarks demonstrate that MemEvolve achieves (I) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to $17.06\%$; and (II) strong cross-task and cross-LLM generalization, designing memory architectures that transfer effectively across diverse benchmarks and backbone models.
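
A minimal sketch of the (encode, store, retrieve, manage) design space the abstract describes (our illustration; all names here are hypothetical, not the EvolveLab API):

```python
from dataclasses import dataclass, field
from typing import Protocol

class Memory(Protocol):
    def encode(self, trajectory: str) -> str: ...
    def store(self, item: str) -> None: ...
    def retrieve(self, query: str, k: int) -> list[str]: ...
    def manage(self) -> None: ...

@dataclass
class KeywordMemory:
    items: list[str] = field(default_factory=list)

    def encode(self, trajectory: str) -> str:
        return trajectory.lower()               # trivial experience distillation

    def store(self, item: str) -> None:
        self.items.append(self.encode(item))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = set(query.lower().split())
        return sorted(self.items, key=lambda it: -len(q & set(it.split())))[:k]

    def manage(self) -> None:
        self.items = self.items[-100:]          # keep a bounded memory window

m = KeywordMemory()
m.store("Tool X solved the web-search subtask quickly")
print(m.retrieve("web search tool"))
# Meta-evolution would mutate these four components themselves, not just
# the stored experience.
```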

Fonte: arXiv cs.CL

RL • Score 95

From Natural Language to Control Signals: A Conceptual Framework for Semantic Channel Finding in Complex Experimental Infrastructure

arXiv:2512.18779v1 Announce Type: new Abstract: Modern experimental platforms such as particle accelerators, fusion devices, telescopes, and industrial process control systems expose tens to hundreds of thousands of control and diagnostic channels accumulated over decades of evolution. Operators and AI systems rely on informal expert knowledge, inconsistent naming conventions, and fragmented documentation to locate signals for monitoring, troubleshooting, and automated control, creating a persistent bottleneck for reliability, scalability, and language-model-driven interfaces. We formalize semantic channel finding (mapping natural-language intent to concrete control-system signals) as a general problem in complex experimental infrastructure, and introduce a four-paradigm framework to guide architecture selection across facility-specific data regimes. The paradigms span (i) direct in-context lookup over curated channel dictionaries, (ii) constrained hierarchical navigation through structured trees, (iii) interactive agent exploration using iterative reasoning and tool-based database queries, and (iv) ontology-grounded semantic search that decouples channel meaning from facility-specific naming conventions. We demonstrate each paradigm through proof-of-concept implementations at four operational facilities spanning two orders of magnitude in scale (from compact free-electron lasers to large synchrotron light sources) and diverse control-system architectures, from clean hierarchies to legacy environments. These implementations achieve 90-97% accuracy on expert-curated operational queries.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations

arXiv:2512.18906v1 Announce Type: new Abstract: Over the years, automatic MT metrics have hillclimbed benchmarks and presented strong and sometimes human-level agreement with human ratings. Yet they remain black-box, offering little insight into their decision-making and often failing under real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R's evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R's reasoning captures translation-relevant information and is practically useful.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

A Large Language Model Based Method for Complex Logical Reasoning over Knowledge Graphs

arXiv:2512.19092v1 Announce Type: new Abstract: Reasoning over knowledge graphs (KGs) with first-order logic (FOL) queries is challenging due to the inherent incompleteness of real-world KGs and the compositional complexity of logical query structures. Most existing methods rely on embedding entities and relations into continuous geometric spaces and answer queries via differentiable set operations. While effective for simple query patterns, these approaches often struggle to generalize to complex queries involving multiple operators, deeper reasoning chains, or heterogeneous KG schemas. We propose ROG (Reasoning Over knowledge Graphs with large language models), an ensemble-style framework that combines query-aware KG neighborhood retrieval with large language model (LLM)-based chain-of-thought reasoning. ROG decomposes complex FOL queries into sequences of simpler sub-queries, retrieves compact, query-relevant subgraphs as contextual evidence, and performs step-by-step logical inference using an LLM, avoiding the need for task-specific embedding optimization. Experiments on standard KG reasoning benchmarks demonstrate that ROG consistently outperforms strong embedding-based baselines in terms of mean reciprocal rank (MRR), with particularly notable gains on high-complexity query types. These results suggest that integrating structured KG retrieval with LLM-driven logical reasoning offers a robust and effective alternative for complex KG reasoning tasks.

Fonte: arXiv cs.CL

RL • Score 95

AraMix: Recycling, Refiltering, and Deduplicating to Deliver the Largest Arabic Pretraining Corpus

arXiv:2512.18834v1 Announce Type: new Abstract: We present AraMix, a deduplicated Arabic pretraining corpus containing approximately 178 billion tokens across 179 million documents. Rather than scraping the web again, AraMix demonstrates that substantial value lies in systematically reusing and curating existing pretraining datasets: we combine seven publicly available Arabic web datasets, apply quality filtering designed specifically for Arabic text to re-filter some datasets, and perform cross-dataset deduplication, both MinHash and sentence-level. This approach reveals that nearly 60% of tokens across these independently collected corpora are duplicates, redundancy that any new scraping efforts will reproduce. Our work suggests that for lower resource languages, investment in curation pipelines for existing data yields greater returns than additional web crawls, an approach that allowed us to curate the largest heavily filtered publicly available Arabic pretraining corpus.
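
A minimal sketch of cross-dataset MinHash deduplication (our illustration using the datasketch library, not the AraMix pipeline): near-duplicate documents usually collide in the same LSH buckets and are dropped before training.

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):
        m.update(token.encode("utf-8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the very lazy dog near the river bank",
    "c": "a completely unrelated web document about pretraining corpora",
}
lsh = MinHashLSH(threshold=0.5, num_perm=128)
kept = []
for key, text in docs.items():
    m = minhash(text)
    if lsh.query(m):              # near-duplicate of something already kept
        continue
    lsh.insert(key, m)
    kept.append(key)
print(kept)                        # typically ['a', 'c']
```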

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?

arXiv:2512.19117v1 Announce Type: new Abstract: This paper proposes an epistemological shift in the analysis of large generative models, replacing the category "Large Language Models" (LLM) with that of "Large Discourse Models" (LDM), and then with that of Artificial Discursive Agent (ADA). The theoretical framework is based on an ontological triad distinguishing three regulatory instances: the apprehension of the phenomenal regularities of the referential world, the structuring of embodied cognition, and the structural-linguistic sedimentation of the utterance within a socio-historical context. LDMs, operating on the product of these three instances (the document), model the discursive projection of a portion of human experience reified by the learning corpus. The proposed program aims to replace the "fascination/fear" dichotomy with public trials and procedures that make the place, uses, and limits of artificial discursive agents in contemporary social space decipherable, situating this approach within a perspective of governance and co-regulation involving the State, industry, civil society, and academia.

Fonte: arXiv cs.CL

RL • Score 96

Monitoramento da Monitorabilidade

A observabilidade na tomada de decisões de sistemas de IA modernos pode ser necessária para implantar com segurança agentes cada vez mais capazes. Monitorar a cadeia de pensamento (CoT) dos modelos de raciocínio atuais tem se mostrado eficaz na detecção de comportamentos inadequados. No entanto, essa 'monitorabilidade' pode ser frágil sob diferentes procedimentos de treinamento e fontes de dados.

Fonte: arXiv cs.AI

NLP/LLMs • Score 92

InstructNet: Uma Abordagem Inovadora para Classificação de Instruções Multirrótulo através de Aprendizado Profundo Avançado

As pessoas utilizam motores de busca para diversos tópicos e itens, desde essenciais do dia a dia até objetos mais especializados. Este estudo utiliza artigos 'How To' para determinar a categoria de instrução multirrótulo, empregando arquiteturas de redes neurais profundas baseadas em transformers, como XLNet e BERT, e alcançando uma precisão de 97,30% com a arquitetura XLNet.
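
Um esboço mínimo (suposição nossa, não o código do InstructNet): uma cabeça de classificação multirrótulo sobre um encoder pré-treinado; em transformers, `problem_type` troca a perda para BCE-com-logits, adequada a múltiplos rótulos por texto.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=12,                                 # numero de categorias (ilustrativo)
    problem_type="multi_label_classification",
)
enc = tok("how to fix a leaking tap at home", return_tensors="pt")
logits = model(**enc).logits                       # [1, 12] pontuacoes por rotulo
print((logits.sigmoid() > 0.5).nonzero())          # rotulos previstos (modelo nao treinado)
```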

Fonte: arXiv cs.CL

MLOps/Systems • Score 95

GenUQ: Predictive Uncertainty Estimates via Generative Hyper-Networks

arXiv:2509.21605v2 Announce Type: replace-cross Abstract: Operator learning is a recently developed generalization of regression to mappings between functions. It promises to drastically reduce expensive numerical integration of PDEs to fast evaluations of mappings between functional states of a system, i.e., surrogate and reduced-order modeling. Operator learning has already found applications in several areas such as modeling sea ice, combustion, and atmospheric physics. Recent approaches towards integrating uncertainty quantification into the operator models have relied on likelihood based methods to infer parameter distributions from noisy data. However, stochastic operators may yield actions from which a likelihood is difficult or impossible to construct. In this paper, we introduce, GenUQ, a measure-theoretic approach to UQ that avoids constructing a likelihood by introducing a generative hyper-network model that produces parameter distributions consistent with observed data. We demonstrate that GenUQ outperforms other UQ methods in three example problems, recovering a manufactured operator, learning the solution operator to a stochastic elliptic PDE, and modeling the failure location of porous steel under tension.

Fonte: arXiv stat.ML

RL • Score 96

Unificando Aprendizado por Reforço Causal: Revisão, Taxonomia, Algoritmos e Aplicações

Integrar inferência causal (CI) com aprendizado por reforço (RL) se tornou um paradigma poderoso para abordar limitações críticas no RL clássico, como baixa explicabilidade e falta de robustez. Este trabalho revisa avanços recentes na interseção entre CI e RL, categorizando abordagens existentes e discutindo desafios, sucessos empíricos e direções futuras de pesquisa.

Fonte: arXiv cs.AI

NLP/LLMs • Score 92

Um Framework de IA Agente para Treinamento de Habilidades de Estudantes de Medicina Geral

Avanços em grandes modelos de linguagem oferecem forte potencial para aprimorar pacientes simulados virtuais (VSPs) na educação médica, proporcionando alternativas escaláveis a métodos tradicionais que consomem muitos recursos. Apresentamos um framework agente para treinar habilidades de estudantes de medicina geral que unifica geração de vinhetas configuráveis, diálogo controlado com pacientes e feedback estruturado baseado em padrões.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

MoE-TransMov: Um Modelo Baseado em Transformer para Previsão do Próximo Ponto de Interesse (POI) em Movimentos Familiares e Não Familiares

A previsão precisa do próximo ponto de interesse (POI) nas trajetórias de mobilidade humana é crucial para serviços baseados em localização, permitindo recomendações mais oportunas e personalizadas. Propomos o MoE-TransMov, um modelo baseado em Transformer com arquitetura Mixture-of-Experts (MoE) que captura padrões de mobilidade distintos em diferentes contextos de movimento, melhorando a precisão das previsões.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

De Palavra a Mundo: Podem Modelos de Linguagem Grande Servir como Modelos de Mundo Baseados em Texto Implicitamente?

O aprendizado por reforço agente depende cada vez mais de escalabilidade orientada pela experiência, mas ambientes do mundo real continuam sendo não adaptativos e difíceis de escalar. Este estudo investiga se modelos de linguagem grande (LLMs) podem melhorar a eficiência do aprendizado em ambientes baseados em texto, apresentando um framework de três níveis para avaliação de modelos de mundo baseados em LLMs.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset

arXiv:2512.17915v1 Announce Type: new Abstract: The recently published Loquacious dataset aims to be a replacement for established English automatic speech recognition (ASR) datasets such as LibriSpeech or TED-Lium. The main goal of the Loquacious dataset is to provide properly defined training and test partitions across many acoustic and language domains, with an open license suitable for both academia and industry. To further promote the benchmarking and usability of this new dataset, we present additional resources in the form of n-gram language models (LMs), a grapheme-to-phoneme (G2P) model and pronunciation lexica, with open and public access. Utilizing those additional resources we show experimental results across a wide range of ASR architectures with different label units and topologies. Our initial experimental results indicate that the Loquacious dataset offers a valuable study case for a variety of common challenges in ASR.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

arXiv:2512.18399v1 Announce Type: new Abstract: Tokenization is a critical preprocessing step for large language models (LLMs), directly impacting training efficiency and downstream performance. General-purpose tokenizers trained predominantly on English and Latin-script languages exhibit suboptimal performance on morphologically rich languages such as Arabic, resulting in inflated token sequences and reduced compression efficiency. In this work, we present AraToken, an Arabic-optimized tokenizer built on SentencePiece Unigram algorithm with a comprehensive normalization pipeline addressing Arabic-specific orthographic variations including Alif variants, diacritics, and Arabic-Indic numerals. We systematically compare BPE, WordPiece, and SentencePiece algorithms across multiple configurations, demonstrating that SentencePiece with normalization achieves 18% lower fertility (1.199 vs 1.35 tokens/word) compared to unnormalized baselines. Furthermore, we introduce the Language Extension Pipeline (LEP), a method for integrating the optimized tokenizer into Qwen3-0.6B through vocabulary extension with mean subtoken initialization and selective transformer layer unfreezing. Our experiments show that LEP reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. We release our tokenizer, training scripts, and model checkpoints to facilitate Arabic NLP research.
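
A minimal sketch (our illustration, not the AraToken pipeline): normalize common Arabic orthographic variants, train a SentencePiece Unigram model, and measure fertility as tokens per whitespace word. The corpus file `arabic.txt` is an assumed placeholder.

```python
import re
import sentencepiece as spm

ALIF = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا"})   # unify Alif variants
DIACRITICS = re.compile("[\u064b-\u0652]")              # strip short-vowel marks

def normalize(line):
    return DIACRITICS.sub("", line.translate(ALIF))

spm.SentencePieceTrainer.train(
    input="arabic.txt", model_prefix="ara",
    vocab_size=8000, model_type="unigram",
)
sp = spm.SentencePieceProcessor(model_file="ara.model")

def fertility(lines):
    lines = [normalize(l) for l in lines]
    tokens = sum(len(sp.encode(l)) for l in lines)
    words = sum(len(l.split()) for l in lines)
    return tokens / words        # lower is better; the paper reports 1.199

print(fertility(open("arabic.txt", encoding="utf-8").read().splitlines()))
```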

Fonte: arXiv cs.CL

RL • Score 96

Treinamento de Modelos de Raciocínio Multimodal Grandes Necessita de Melhores Ideias: Um Framework de Três Estágios para Síntese e Seleção de Longas Cadeias de Pensamento

Modelos de Raciocínio Grandes (LRMs) demonstraram desempenho notável em tarefas complexas de raciocínio por meio de longas Cadeias de Pensamento (CoT). A extensão desses sucessos para raciocínio multimodal é desafiadora devido à complexidade de integrar diversas modalidades de entrada e à escassez de dados de treinamento de alta qualidade. Neste artigo, propomos o SynSelect, um novo framework de Síntese-Seleção em três estágios para gerar dados de CoT longa de alta qualidade voltados para tarefas de raciocínio multimodal.

Fonte: arXiv cs.AI

Theory/Optimization • Score 92

Inferência Causal como Adaptação de Distribuição: Otimizando o Risco ATE sob Incerteza de Propensão

Abordagens padrão para inferência causal, como Regressão de Resultado e Ajuste de Regressão Ponderada por Probabilidade Inversa (IPWRA), são geralmente derivadas através da lente da imputação de dados ausentes e teoria de identificação. Neste trabalho, unificamos esses métodos sob uma perspectiva de Machine Learning, reformulando a estimativa de ATE como um problema de adaptação de domínio sob mudança de distribuição.

Fonte: arXiv stat.ML

NLP/LLMs • Score 92

MDToC: Árvore Dinâmica Metacognitiva de Conceitos para Aumentar a Resolução de Problemas Matemáticos em Modelos de Linguagem de Grande Escala

Apesar dos avanços nas capacidades de raciocínio matemático, os Modelos de Linguagem de Grande Escala (LLMs) ainda enfrentam dificuldades na verificação de cálculos ao utilizar técnicas de prompting estabelecidas. Apresentamos o MDToC, uma abordagem em três fases que constrói uma árvore de conceitos, desenvolve cálculos verificados por precisão para cada conceito e utiliza votação majoritária para avaliar soluções concorrentes.

Fonte: arXiv cs.CL

Theory/Optimization • Score 92

Desdobramento Generalizado de Dados Usando Estatísticas Suficientes

Nosso objetivo é desenvolver uma estratégia geral para decompor uma variável aleatória $X$ em múltiplas variáveis aleatórias independentes, sem sacrificar informações sobre parâmetros desconhecidos. Este trabalho generaliza um procedimento recente, permitindo a reconstrução exata de $X$ a partir de funções conhecidas das variáveis independentes.

Fonte: arXiv stat.ML

Evaluation/Benchmarks • Score 92

Uma Função de Perda Convexa para Predição de Conjuntos com Compromissos Otimais Entre Tamanho e Cobertura Condicional

Consideramos problemas de aprendizado supervisionado em que previsões de conjuntos fornecem estimativas explícitas de incerteza. Usando integrais de Choquet (também conhecidas como extensões de Lovász), propomos uma função de perda convexa para funções de subconjunto não decrescentes obtidas como conjuntos de nível de uma função de valor real.

Fonte: arXiv stat.ML

RL • Score 92

A Single-Loop First-Order Algorithm for Linearly Constrained Bilevel Optimization

arXiv:2510.24710v2 Announce Type: replace-cross Abstract: We study bilevel optimization problems where the lower-level problems are strongly convex and have coupled linear constraints. To overcome the potential non-smoothness of the hyper-objective and the computational challenges associated with the Hessian matrix, we utilize penalty and augmented Lagrangian methods to reformulate the original problem as a single-level one. Especially, we establish a strong theoretical connection between the reformulated function and the original hyper-objective by characterizing the closeness of their values and derivatives. Based on this reformulation, we propose a single-loop, first-order algorithm for linearly constrained bilevel optimization (SFLCB). We provide rigorous analyses of its non-asymptotic convergence rates, showing an improvement over prior double-loop algorithms, from $O(\epsilon^{-3}\log(\epsilon^{-1}))$ to $O(\epsilon^{-3})$. The experiments corroborate our theoretical findings and demonstrate the practical efficiency of the proposed SFLCB algorithm. Simulation code is provided at https://github.com/ShenGroup/SFLCB.

Fonte: arXiv stat.ML

Evaluation/Benchmarks • Score 95

Descida de Espelho Variacional Online para Aprendizado Robusto na Ponte de Schrödinger

A Ponte de Schrödinger (SB) evoluiu para uma classe universal de modelos generativos probabilísticos. No entanto, os sinais de aprendizado estimados são intrinsecamente incertos, e a confiabilidade prometida pelos métodos existentes muitas vezes se baseia em cenários ótimos especulativos. Neste trabalho, propomos um framework de Descida de Espelho Variacional Online (OMD) para os problemas de SB, que proporciona maior estabilidade aos solucionadores de SB.

Fonte: arXiv stat.ML

NLP/LLMs • Score 92

LiR$^3$AG: A Lightweight Rerank Reasoning Strategy Framework for Retrieval-Augmented Generation

arXiv:2512.18329v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) effectively enhances Large Language Models (LLMs) by incorporating retrieved external knowledge into the generation process. Reasoning models improve LLM performance in multi-hop QA tasks, which require integrating and reasoning over multiple pieces of evidence across different documents to answer a complex question. However, they often introduce substantial computational costs, including increased token consumption and inference latency. To better understand and mitigate this trade-off, we conduct a comprehensive study of reasoning strategies for reasoning models in RAG multi-hop QA tasks. Our findings reveal that reasoning models adopt structured strategies to integrate retrieved and internal knowledge, primarily following two modes: Context-Grounded Reasoning, which relies directly on retrieved content, and Knowledge-Reconciled Reasoning, which resolves conflicts or gaps using internal knowledge. To this end, we propose a novel Lightweight Rerank Reasoning Strategy Framework for RAG (LiR$^3$AG) to enable non-reasoning models to transfer reasoning strategies by restructuring retrieved evidence into coherent reasoning chains. LiR$^3$AG reduces output-token overhead by 98% on average and inference time by 58.6%, while improving an 8B non-reasoning model's F1 by 6.2% to 22.5%, enough to surpass the performance of a 32B reasoning model in RAG, offering a practical and efficient path forward for RAG systems.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

Toward Human-Centered AI-Assisted Terminology Work

arXiv:2512.18859v1 Announce Type: new Abstract: The rapid diffusion of generative artificial intelligence is transforming terminology work. While this technology promises gains in efficiency, its unstructured adoption risks weakening professional autonomy, amplifying bias, and eroding linguistic and conceptual diversity. This paper argues that a human-centered approach to artificial intelligence has become a necessity for terminology work. Building on research in artificial intelligence and translation studies, it proposes a human-centered framework that conceptualizes artificial intelligence as a means of amplifying the terminologist's capabilities, rather than replacing them. The framework is organized around three interrelated dimensions: the augmented terminologist, ethical AI, and human-centered design. Together, these dimensions emphasize the compatibility of high automation with strong human control, the central role of terminologists in bias mitigation, and the importance of designing AI tools and workflows around the needs, values, and well-being of the terminologist. The paper concludes by stressing that current choices in AI adoption will shape not only terminological practice, but also the preservation of accuracy, adequacy, and diversity in terminology and specialized knowledge.

Fonte: arXiv cs.CL

NLP/LLMs • Score 93

CodeGEMM: Uma Abordagem Centrada em Codebook para GEMM Eficiente em LLMs Quantizados

A quantização apenas de pesos é amplamente utilizada para mitigar a natureza limitada da memória na inferência de LLM. Métodos baseados em codebook alcançam alta precisão em regimes de bits extremamente baixos (por exemplo, 2 bits). Apresentamos o CodeGEMM, um kernel GEMM centrado em codebook que substitui a dequantização por produtos internos pré-computados, melhorando a eficiência computacional e a utilização do subsistema de memória.
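
Um esboço conceitual (suposição nossa, não o kernel CodeGEMM): pesos quantizados viram índices em um codebook; em vez de dequantizar, os produtos internos entre codewords e blocos de ativação são pré-computados em uma tabela (LUT), e o GEMM vira um gather seguido de redução.

```python
import numpy as np

rng = np.random.default_rng(0)
g, n_codes, rows, groups = 4, 16, 64, 32             # grupos de 4 pesos, 16 codewords
codebook = rng.normal(size=(n_codes, g))
idx = rng.integers(0, n_codes, size=(rows, groups))  # pesos quantizados (indices)
x = rng.normal(size=groups * g)                      # vetor de ativacoes

lut = codebook @ x.reshape(groups, g).T              # [n_codes, groups]: LUT de produtos
y = lut[idx, np.arange(groups)].sum(axis=1)          # gather + soma por linha de saida

W = codebook[idx].reshape(rows, groups * g)          # GEMM denso de referencia
print(np.allclose(y, W @ x))                         # True: mesmos resultados
```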

Fonte: arXiv cs.LG

Applications • Score 89

Learning Confidence Ellipsoids and Applications to Robust Subspace Recovery

arXiv:2512.16875v2 Announce Type: replace-cross Abstract: We study the problem of finding confidence ellipsoids for an arbitrary distribution in high dimensions. Given samples from a distribution $D$ and a confidence parameter $\alpha$, the goal is to find the smallest volume ellipsoid $E$ which has probability mass $\Pr_{D}[E] \ge 1-\alpha$. Ellipsoids are a highly expressive class of confidence sets as they can capture correlations in the distribution, and can approximate any convex set. This problem has been studied in many different communities. In statistics, this is the classic minimum volume estimator introduced by Rousseeuw as a robust non-parametric estimator of location and scatter. However in high dimensions, it becomes NP-hard to obtain any non-trivial approximation factor in volume when the condition number $\beta$ of the ellipsoid (ratio of the largest to the smallest axis length) goes to $\infty$. This motivates the focus of our paper: can we efficiently find confidence ellipsoids with volume approximation guarantees when compared to ellipsoids of bounded condition number $\beta$? Our main result is a polynomial time algorithm that finds an ellipsoid $E$ whose volume is within a $O(\beta)^{\gamma d}$ multiplicative factor of the volume of best $\beta$-conditioned ellipsoid while covering at least $1-O(\alpha/\gamma)$ probability mass for any $\gamma < \alpha$. We complement this with a computational hardness result that shows that such a dependence seems necessary up to constants in the exponent. The algorithm and analysis uses the rich primal-dual structure of the minimum volume enclosing ellipsoid and the geometric Brascamp-Lieb inequality. As a consequence, we obtain the first polynomial time algorithm with approximation guarantees on worst-case instances of the robust subspace recovery problem.

Fonte: arXiv stat.ML

RL • Score 96

Inteligência Alinhada à Segurança Incorporada via Embeddings de Alinhamento Interno Diferenciáveis

Introduzimos a Inteligência Alinhada à Segurança Incorporada (ESAI), um framework teórico para aprendizado por reforço multi-agente que incorpora restrições de alinhamento diretamente nas representações internas dos agentes usando embeddings de alinhamento interno diferenciáveis. O framework ESAI integra quatro mecanismos principais e analisa condições de estabilidade para embeddings internos limitados.

Fonte: arXiv cs.LG

RL • Score 93

Redes Neurais Variacionais Baseadas em Microestrutura para Quantificação Robusta de Incertezas em Gêmeos Digitais de Materiais

As incertezas aleatórias (variabilidade irremovível na morfologia da microestrutura, no comportamento dos constituintes e nas condições de processamento) representam um grande desafio para o desenvolvimento de gêmeos digitais robustos em relação à incerteza. Apresentamos a Variational Deep Material Network (VDMN), um modelo substituto informado pela física que permite previsões eficientes e probabilísticas do comportamento dos materiais.

Fonte: arXiv cs.LG

NLP/LLMs • Score 92

LLMs on Drugs: Language Models Are Few-Shot Consumers

arXiv:2512.18546v1 Announce Type: new Abstract: Large language models (LLMs) are sensitive to the personas imposed on them at inference time, yet prompt-level "drug" interventions have never been benchmarked rigorously. We present the first controlled study of psychoactive framings on GPT-5-mini using ARC-Challenge. Four single-sentence prompts -- LSD, cocaine, alcohol, and cannabis -- are compared against a sober control across 100 validation items per condition, with deterministic decoding, full logging, Wilson confidence intervals, and Fisher exact tests. Control accuracy is 0.45; alcohol collapses to 0.10 (p = 3.2e-8), cocaine to 0.21 (p = 4.9e-4), LSD to 0.19 (p = 1.3e-4), and cannabis to 0.30 (p = 0.041), largely because persona prompts disrupt the mandated "Answer: " template. Persona text therefore behaves like a "few-shot consumable" that can destroy reliability without touching model weights. All experimental code, raw results, and analysis scripts are available at https://github.com/lexdoudkin/llms-on-drugs.
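
A minimal sketch of the statistics the abstract reports (our illustration): a Wilson 95% interval for an accuracy out of n=100, and a Fisher exact test comparing control (45/100) with the "alcohol" condition (10/100).

```python
from math import sqrt
from scipy.stats import fisher_exact

def wilson(k, n, z=1.96):
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

print(wilson(45, 100))                     # CI for the control accuracy
odds, p = fisher_exact([[45, 55], [10, 90]])
print(p)   # on the order of 1e-8, matching the magnitude reported above
```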

Fonte: arXiv cs.CL

RL • Score 96

Aprendizado por Reforço Confiável e Explicável para Controle de Processos Seguro e Eficiente em Energia: Um Caso de Uso em Sistemas Industriais de Ar Comprimido

Este artigo apresenta uma abordagem confiável de aprendizado por reforço para o controle de sistemas industriais de ar comprimido. Desenvolvemos um framework que possibilita operação segura e eficiente em energia sob condições de contorno realistas e introduzimos um pipeline de explicabilidade em múltiplos níveis. Uma avaliação empírica mostra que a política aprendida é fisicamente plausível e respeita consistentemente os limites do sistema.

Fonte: arXiv cs.LG

RL • Score 95

ReGal: A First Look at PPO-based Legal AI for Judgment Prediction and Summarization in India

arXiv:2512.18014v1 Announce Type: new Abstract: This paper presents an early exploration of reinforcement learning methodologies for legal AI in the Indian context. We introduce Reinforcement Learning-based Legal Reasoning (ReGal), a framework that integrates Multi-Task Instruction Tuning with Reinforcement Learning from AI Feedback (RLAIF) using Proximal Policy Optimization (PPO). Our approach is evaluated across two critical legal tasks: (i) Court Judgment Prediction and Explanation (CJPE), and (ii) Legal Document Summarization. Although the framework underperforms on standard evaluation metrics compared to supervised and proprietary models, it provides valuable insights into the challenges of applying RL to legal texts. These challenges include reward model alignment, legal language complexity, and domain-specific adaptation. Through empirical and qualitative analysis, we demonstrate how RL can be repurposed for high-stakes, long-document tasks in law. Our findings establish a foundation for future work on optimizing legal reasoning pipelines using reinforcement learning, with broader implications for building interpretable and adaptive legal AI systems.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

Narrative Consolidation: Formulating a New Task for Unifying Multi-Perspective Accounts

arXiv:2512.18041v1 Announce Type: new Abstract: Processing overlapping narrative documents, such as legal testimonies or historical accounts, often aims not for compression but for a unified, coherent, and chronologically sound text. Standard Multi-Document Summarization (MDS), with its focus on conciseness, fails to preserve narrative flow. This paper formally defines this challenge as a new NLP task: Narrative Consolidation, where the central objectives are chronological integrity, completeness, and the fusion of complementary details. To demonstrate the critical role of temporal structure in this task, we introduce Temporal Alignment Event Graph (TAEG), a graph structure that explicitly models chronology and event alignment. By applying a standard centrality algorithm to TAEG, our method functions as a version selection mechanism, choosing the most central representation of each event in its correct temporal position. In a study on the four Biblical Gospels, this structure-focused approach guarantees perfect temporal ordering (Kendall's Tau of 1.000) by design and dramatically improves content metrics (e.g., +357.2% in ROUGE-L F1). The success of this baseline method validates the formulation of Narrative Consolidation as a relevant task and establishes that an explicit temporal backbone is a fundamental component for its resolution.
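
A toy sketch of the TAEG idea (our illustration, not the authors' code): event slots form a temporal backbone, candidate versions attach to the slots they cover, and a centrality score selects one version per slot, so the output order matches chronology by construction.

```python
import networkx as nx
from scipy.stats import kendalltau

G = nx.Graph()
slots = ["e1", "e2", "e3"]                      # events in temporal order
versions = {                                     # candidate accounts per slot
    "e1": {"e1_mark": ["e1", "e2"], "e1_luke": ["e1"]},
    "e2": {"e2_matt": ["e2"]},
    "e3": {"e3_mark": ["e2", "e3"], "e3_john": ["e3"]},
}
for slot, cands in versions.items():
    for v, covered in cands.items():
        G.add_edges_from((v, s) for s in covered)

cent = nx.degree_centrality(G)
chosen = [max(versions[s], key=cent.get) for s in slots]  # one version per slot
order = [slots.index(v.split("_")[0]) for v in chosen]
tau, _ = kendalltau(order, sorted(order))
print(chosen, tau)   # versions emitted in slot order -> Kendall's tau = 1.0
```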

Source: arXiv cs.CL

NLP/LLMs • Score 95

Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset

arXiv:2512.18533v1 Announce Type: new Abstract: The proliferation of linguistically subtle political disinformation poses a significant challenge to automated fact-checking systems. Despite increasing emphasis on complex neural architectures, the empirical limits of text-only linguistic modeling remain underexplored. We present a systematic diagnostic evaluation of nine machine learning algorithms on the LIAR benchmark. By isolating lexical features (Bag-of-Words, TF-IDF) and semantic embeddings (GloVe), we uncover a hard "Performance Ceiling", with fine-grained classification not exceeding a Weighted F1-score of 0.32 across models. Crucially, a simple linear SVM (Accuracy: 0.624) matches the performance of pre-trained Transformers such as RoBERTa (Accuracy: 0.620), suggesting that model capacity is not the primary bottleneck. We further diagnose a massive "Generalization Gap" in tree-based ensembles, which achieve more than 99% training accuracy but collapse to approximately 25% on test data, indicating reliance on lexical memorization rather than semantic inference. Synthetic data augmentation via SMOTE yields no meaningful gains, confirming that the limitation is semantic (feature ambiguity) rather than distributional. These findings indicate that for political fact-checking, increasing model complexity without incorporating external knowledge yields diminishing returns.
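The headline baseline, TF-IDF features with a linear SVM, is easy to reproduce in outline; the data loading below is schematic (LIAR ships as TSV files with a six-way label column):

```python
# Minimal sketch of the strongest simple baseline described above:
# TF-IDF lexical features + linear SVM. Texts and labels are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

train_texts = ["statement one about taxes", "statement two about jobs"]
train_labels = ["true", "false"]
test_texts, test_labels = ["statement three about taxes"], ["true"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_texts, train_labels)
pred = clf.predict(test_texts)
print(f1_score(test_labels, pred, average="weighted"))  # weighted F1
```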

Source: arXiv cs.CL

NLP/LLMs • Score 89

Neologism Learning as a Parameter-Efficient Alternative to Fine-Tuning for Model Steering

arXiv:2512.18551v1 Announce Type: new Abstract: In language modeling, neologisms are new tokens trained to represent a concept not already included in a given model's vocabulary. Neologisms can be used to encourage specific behavior in models, for example by appending prompts with "Give me a neologism answer." Behavioral steering can also be achieved through fine-tuning, albeit with more compute and less flexibility: learning a neologism only trains d parameters and allows the user to still access the model's default behavior. We compare the performance of neologism learning against low-rank adaptation (LoRA) fine-tuning, finding that neologisms outperform fine-tuned models under a matched training setup (same data and hyperparameters). We also investigate self-verbalizations of neologisms, and observe that the model will occasionally make up its own new words when asked about a neologism.
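A minimal sketch of the neologism idea, training only the d-dimensional embedding row of one new token while the rest of the model stays frozen; the model name, token string, and training snippet are placeholders, not the paper's setup:

```python
# Sketch, assuming any HF causal LM: add one token, freeze everything,
# and let gradients flow only into that token's embedding row (d params).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                              # stand-in model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

tok.add_tokens(["<neo>"])                  # new token for the target concept
model.resize_token_embeddings(len(tok))
neo_id = tok.convert_tokens_to_ids("<neo>")

for p in model.parameters():               # freeze all weights...
    p.requires_grad = False
emb = model.get_input_embeddings().weight
emb.requires_grad = True                   # ...except the embedding matrix

def keep_only_neo_row(grad):               # zero grads for every other row
    mask = torch.zeros_like(grad)
    mask[neo_id] = 1.0
    return grad * mask
emb.register_hook(keep_only_neo_row)

opt = torch.optim.Adam([emb], lr=1e-3)
batch = tok("Give me a <neo> answer: the sky is", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
opt.step()                                 # only d parameters change
```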

Source: arXiv cs.CL

Vision • Score 96

EIA-SEC: An Improved Actor-Critic Framework for Collaborative Multi-UAV Control in Smart Agriculture

The widespread adoption of wireless communication technology has driven the development of smart agriculture, where unmanned aerial vehicles (UAVs) play a multifunctional role. In this work, we model a Markov decision process to solve the multi-UAV trajectory planning problem and propose the novel Elite Imitation Actor-Shared Ensemble Critic (EIA-SEC) framework. Experimental results show that EIA-SEC outperforms state-of-the-art baselines in reward performance, training stability, and convergence speed.

Source: arXiv cs.LG

Vision • Score 96

Conflict-Driven Clause Learning with VSIDS Heuristics for Discrete Facility Layout

This paper studies the use of Conflict-Driven Clause Learning (CDCL) with VSIDS heuristics as a computational engine for discrete facility layout problems. The layout problem is modeled as a combinatorial assignment problem with a dense logical structure arising from adjacency, separation, and slot-availability constraints.

Source: arXiv cs.AI

RL • Score 96

Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability

Neuron identification is a popular tool in mechanistic interpretability, aiming to uncover the human-interpretable concepts represented by individual neurons in deep networks. Although algorithms such as Network Dissection and CLIP-Dissect have achieved great empirical success, a rigorous theoretical foundation is still missing, which is crucial for enabling trustworthy explanations. This work presents the first theoretical analysis of fundamental challenges concerning the faithfulness and stability of neuron explanations.

Source: arXiv cs.AI

NLP/LLMs • Score 96

Propose, Solve, Verify: Self-Play Through Formal Verification

Training models purely through self-play (without human data) has been a long-standing goal in AI, but its effectiveness for training large language models remains uncertain, especially in code generation. We study self-play in the context of verified code generation, where formal verification provides reliable correctness signals.

Source: arXiv cs.AI

NLP/LLMs • Score 96

MSC-180: A Benchmark for Automated Formal Theorem Proving from the Mathematics Subject Classification

Automated Theorem Proving (ATP) is a central research direction in artificial intelligence for achieving formal reasoning and verification. We propose MSC-180, a benchmark based on the MSC2020 mathematics subject classification comprising 180 formal verification problems spanning undergraduate and graduate levels, to evaluate and drive the development of AI systems with genuine mathematical reasoning abilities.

Source: arXiv cs.AI

Vision • Score 93

Insider Threat Detection Using GCN and Bi-LSTM with Explicit and Implicit Graph Representations

Insider threat detection (ITD) is challenging due to the subtle and concealed nature of malicious activities performed by trusted users. This paper proposes a post-hoc ITD framework that integrates explicit and implicit graph representations with temporal modeling to capture complex patterns of user behavior.

Source: arXiv cs.AI

Evaluation/Benchmarks • Score 93

Assignment-Routing Optimization: Solvers for Constrained Problems

We study the Joint Routing-Assignment (JRA) problem, in which items must be assigned one-to-one to placeholders while simultaneously determining a Hamiltonian cycle that visits all nodes exactly once. We develop a solver tailored to practical packaging-planning scenarios with richer constraints.

Source: arXiv cs.AI

Evaluation/Benchmarks • Score 93

Social Comparison without Explicit Inference of Others' Reward Values: A Constructive Approach Using a Probabilistic Generative Model

Social comparison, the process of evaluating one's own rewards relative to those of others, plays a fundamental role in primate social cognition. This study investigates whether monkeys recognize only objective reward differences or infer others' subjective reward evaluations, using three computational models with different degrees of social information processing.

Source: arXiv cs.AI

NLP/LLMs • Score 96

Intelligent Human-Machine Partnership for Manufacturing: Enhancing Warehouse Planning through Simulation-Based Knowledge Graphs and LLM Collaboration

Manufacturing planners face complex operational challenges that require collaboration between human expertise and intelligent systems. Our framework integrates Knowledge Graphs and Large Language Model (LLM)-based agents to empower manufacturing practitioners, enabling natural-language interactions with operational data and improving analysis and decision-making.

Source: arXiv cs.AI

Vision • Score 93

Few-Shot Learning of a Graph-Based Neural Network Model Without Backpropagation

Proposing a structural-graph approach to classifying contour images in a few-shot regime without backpropagation, this work presents a model in which structure is the carrier of explanations. The image is encoded as an attributed graph, and generalization is achieved through the formation of concept attractors.

Source: arXiv cs.AI

Vision • Score 96

Agent-Based Output Drift Detection for Breast Cancer Response Prediction in a Multisite Clinical Decision Support System

Modern clinical decision support systems can simultaneously serve several independent medical imaging institutions, but their predictive performance can degrade across sites due to variations in patient populations, imaging hardware, and acquisition protocols. We propose an agent-based framework for detecting drift and assessing its severity in multisite clinical AI systems.

Source: arXiv cs.AI

NLP/LLMs • Score 96

Vox Deorum: A Hybrid LLM Architecture for AI in 4X / Grand-Strategy Games -- Lessons from Civilization V

The ability of Large Language Models to reason in natural language makes them promising for 4X and grand-strategy games, enabling more natural human-AI interactions. However, the complexity of these games and factors such as latency and cost can hinder real-world deployment of LLMs. We present Vox Deorum, a hybrid LLM+X architecture validated through 2,327 complete games.

Source: arXiv cs.AI

NLP/LLMs • Score 96

Reflective Confidence: Correcting Reasoning Failures via Online Self-Correction

Large language models (LLMs) have demonstrated strong performance on complex reasoning tasks using techniques such as chain-of-thought and self-consistency. However, ensemble-based approaches, especially self-consistency, often incur substantial computational overhead. We propose reflective confidence, a new reasoning framework that turns low-confidence signals into reflection triggers.

Source: arXiv cs.AI

NLP/LLMs • Score 96

IntelliCode: A Multi-Agent LLM Tutoring System with Centralized Learner Modeling

LLM-based tutors are typically single-turn assistants that lack persistent representations of learner knowledge, making long-term pedagogical support difficult. We present IntelliCode, a multi-agent LLM tutoring system that integrates mastery estimates, misconceptions, review schedules, and engagement signals into a centralized, versioned learner state.

Source: arXiv cs.AI

Evaluation/Benchmarks • Score 96

Clustering-Based Transfer Learning for a Dynamic Multimodal Multi-Objective Evolutionary Algorithm

Dynamic multimodal multi-objective optimization faces the challenge of simultaneously tracking multiple equivalent Pareto-optimal sets and maintaining population diversity in changing environments. This paper presents a new suite of test functions and a novel algorithm based on a clustering-based autoencoder dynamic response mechanism, aiming to improve diversity and convergence in evolutionary algorithms.

Source: arXiv cs.AI

NLP/LLMs • Score 96

Population-Evolve: a Parallel, Evolutionary Sampling Method for Mathematical Reasoning in LLMs

Test-time scaling has emerged in recent years as a promising direction for enhancing the reasoning capabilities of Large Language Models. In this work, we propose Population-Evolve, a training-free method inspired by Genetic Algorithms that optimizes LLM reasoning by maintaining a dynamic population of candidate solutions.

Source: arXiv cs.AI

NLP/LLMs • Score 96

Automatic Adaptation to Concept Complexity and Subjective Natural Concepts: A Chunking-Based Cognitive Model

A central problem in cognitive science concerns the fundamental psychological processes underlying the formation and retrieval of multiple types of concepts in short- and long-term memory (STM and LTM, respectively). We propose that chunking mechanisms play an essential role and show how the CogAct computational model grounds concept learning in fundamental cognitive processes and structures.

Source: arXiv cs.AI

NLP/LLMs • Score 93

Can Abstract Concepts from LLMs Improve SLM Performance?

Large language models (LLMs) excel at diverse tasks, but deploying them on resource-constrained devices remains challenging. We investigate the transferability of high-level concepts extracted from larger models to smaller language models (SLMs) at inference time, demonstrating performance improvements across a wide range of tasks.

Source: arXiv cs.AI

Evaluation/Benchmarks • Score 90

Learning Generalizable Neural Operators for Inverse Problems

Inverse problems challenge existing neural operator architectures because ill-posed inverse maps violate assumptions of continuity, uniqueness, and stability. We present B2B${}^{-1}$, a basis-to-basis neural operator framework that addresses this limitation, enabling deterministic, invertible, and probabilistic models to be learned within a single framework.

Source: arXiv cs.LG

Evaluation/Benchmarks • Score 93

KeenKT: Disambiguating Knowledge Mastery State for Knowledge Tracing

Knowledge Tracing (KT) aims to dynamically model a student's mastery of knowledge concepts based on their historical learning interactions. Most current methods rely on point estimates, which cannot distinguish true ability from lucky bursts or inattention, creating ambiguity in mastery judgments.

Source: arXiv cs.AI

NLP/LLMs • Score 96

CORE: Concept-Oriented Reinforcement for Closing the Definition-Application Gap in Mathematical Reasoning

Large language models (LLMs) often solve challenging mathematical exercises yet fail to apply the underlying concept when a problem demands genuine understanding. We present CORE (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal.

Source: arXiv cs.AI

NLP/LLMs • Score 96

Tool-Augmented Hybrid Reasoning with Distillation for Bilingual Mathematical Problem Solving

Bilingual mathematical problem solving requires a clear link between linguistic reasoning and symbolic computation. This paper presents HERALD (Hybrid Ensemble Reasoning with Adaptive Learning and Distillation), a framework that integrates reasoning and computation using NuminaMath-7B-TIR, GPT-4o, and Mistral-7B, offering a practical solution for multilingual mathematical reasoning with improved accuracy and clarity.

Source: arXiv cs.AI

NLP/LLMs • Score 96

Gabliteration: Adaptive Multi-Directional Neural Weight Modification for Selective Behavioral Alteration in Large Language Models

We present Gabliteration, a novel neural weight modification technique that advances beyond traditional abliteration methods by implementing adaptive multi-directional projections with regularized layer selection. Our approach overcomes fundamental limitations of existing methods, preserving model quality while modifying specific behavioral patterns.

Source: arXiv cs.AI

NLP/LLMs • Score 96

Recontextualization Mitigates Specification Gaming without Modifying the Specification

Developers often struggle to specify correct training labels and rewards. We propose recontextualization, which reduces how often language models game training signals by performing undesired behaviors that those signals mistakenly reinforce.

Source: arXiv cs.AI

Vision • Score 90

Conditioning Accept-Desirability Models in the Context of AGM-Like Belief Change

We discuss conditioning for Accept-Desirability models in an abstract decision-making framework in which uncertain rewards live in a general linear space. This setting makes it possible to unify classical and quantum probabilities and extend them to an imprecise-probability context. We introduce a new conditioning rule and associate a belief revision operator with it.

Source: arXiv cs.AI

NLP/LLMs • Score 96

$\beta(3,4)$ 'Attention' in Cognitive Agents: Ontology-Free Knowledge Representations with Promise-Theoretic Semantics

The semantics and dynamics of 'attention' are closely related to promise-theoretic notions developed for autonomous agents and can readily be expressed in a promise framework. This makes it possible to bridge vectorized Machine Learning and Knowledge Graph representations without implicitly relying on language models.

Source: arXiv cs.AI

MLOps/Systems • Score 96

The Procrustean Bed of Time Series: The Optimization Bias of Pointwise Loss Functions

Optimizing time-series models with pointwise loss functions (e.g., MSE) rests on a flawed assumption of pointwise independence and identical distribution (i.i.d.) that ignores causal temporal structure. This paper analyzes the Expectation of Optimization Bias (EOB) and reveals that the more deterministic and structured the time series, the more severe the bias caused by the pointwise loss function.

Source: arXiv cs.LG

NLP/LLMs • Score 95

Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications

arXiv:2505.14918v2 Announce Type: replace-cross Abstract: This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.
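The intra-rater metric described (the share of items on which all five replicates of a model agree) reduces to a few lines; the data layout below is an assumption:

```python
# Sketch of the intra-rater consistency check: fraction of items where
# every replicate returns the same label. Assumed layout:
# labels[m][r][i] = label from model m, replicate r, item i.
import numpy as np

rng = np.random.default_rng(0)
labels = rng.choice(["pos", "neg"], size=(1, 5, 1350))  # 1 model, 5 reps

def perfect_agreement(reps):
    """Share of items on which all replicates return the same label."""
    reps = np.asarray(reps)
    return float(np.mean(np.all(reps == reps[0], axis=0)))

# The paper reports 0.90-0.98 on real model outputs; this random demo
# data will naturally score much lower.
print(perfect_agreement(labels[0]))
```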

Source: arXiv stat.ML

RL • Score 89

Density Estimation via Mixture Discrepancy and Moments

Aiming to generalize histogram statistics to high-dimensional settings, density estimation via discrepancy-based sequential partition (DSP) was proposed to learn an adaptive piecewise-constant approximation. Mixture discrepancy and moment matching are used as surrogates for the star discrepancy, yielding DSP-mix and MSP, which are computationally tractable and exhibit reflection and rotation invariance.

Source: arXiv stat.ML

NLP/LLMs • Score 96

Understanding Chain-of-Thought in Large Language Models via Topological Data Analysis

With the development of large language models (LLMs) and the introduction of long chain-of-thought techniques, the reasoning ability of LLMs on complex problem solving has improved significantly. This work analyzes reasoning-chain quality from a structural perspective, using persistent homology from Topological Data Analysis (TDA) to map reasoning steps and extract topological features.

Source: arXiv cs.AI

NLP/LLMs • Score 96

Vibe Reasoning: Extracting Mathematical Capabilities from Frontier AI -- A Case Study on IMO 2025 Problem 6

We present Vibe Reasoning, a human-AI collaborative paradigm for solving hard mathematical problems. Our key insight is that frontier AI models already possess the knowledge needed to solve challenging problems but do not know how, what, or when to apply it. This work demonstrates the approach on IMO 2025 Problem 6.

Source: arXiv cs.AI

RL • Score 96

SD2AIL: Adversarial Imitation Learning from Synthetic Demonstrations via Diffusion Models

Adversarial Imitation Learning (AIL) is a dominant framework that infers rewards from expert demonstrations to guide policy optimization. Inspired by the success of diffusion models, we propose SD2AIL, which uses synthetic demonstrations to augment expert demonstrations and introduces a prioritized replay strategy to maximize demonstration effectiveness.

Source: arXiv cs.LG

RL • Score 96

The Challenger: When Do New Data Sources Justify Switching Machine Learning Models?

We study the problem of deciding whether and when an organization should replace a trained incumbent model with a challenger that uses newly available features. We develop a unified economic and statistical framework that relates learning-curve dynamics, data-acquisition and retraining costs, and discounting of future gains.

Source: arXiv cs.LG

RL • Score 96

APC-GNN++: An Adaptive Patient-Centric GNN with Contextual Attention and Mini-Graph Explainability for Diabetes Classification

We propose APC-GNN++, an adaptive patient-centric Graph Neural Network for diabetes classification. Our model integrates context-aware edge attention, confidence-guided mixing of node features and graph representations, and neighborhood-consistency regularization to better capture clinically meaningful relationships between patients.

Source: arXiv cs.LG

NLP/LLMs • Score 96

Parameter-Efficient Fine-Tuning for HAR: Integrating LoRA and QLoRA into Transformer Models

Human activity recognition (HAR) is a fundamental task in pervasive computing. This work investigates parameter-efficient fine-tuning techniques, specifically Low-Rank Adaptation (LoRA) and Quantized LoRA, as scalable alternatives to full model fine-tuning for HAR, demonstrating competitive performance with fewer trainable parameters and lower memory usage.

Source: arXiv cs.LG

RL • Score 96

AL-GNN: Privacy-Preserving, Replay-Free Continual Graph Learning via Analytic Learning

Continual graph learning (CGL) enables graph neural networks to learn incrementally from graph-structured data without forgetting previously acquired knowledge. AL-GNN is a novel framework that eliminates the need for backpropagation and replay buffers, using principles from analytic learning theory to optimize learning.

Source: arXiv cs.LG

Applications • Score 89

Structure of Classifier Boundaries: A Case Study for a Naive Bayes Classifier

Classifiers assign complex input data points to a small number of output categories. In this study, we analyze the structure of the boundary of a Bayes classifier, which comprises points whose neighbors are classified differently. We present a new uncertainty measure, Neighbor Similarity, which compares the outcome of an input point with the distribution of outcomes of its neighbors.

Source: arXiv stat.ML

RL • Score 95

Stochastic Optimization with Optimal Importance Sampling

arXiv:2504.03560v2 Announce Type: replace-cross Abstract: Importance Sampling (IS) is a widely used variance reduction technique for enhancing the efficiency of Monte Carlo methods, particularly in rare-event simulation and related applications. Despite its effectiveness, the performance of IS is highly sensitive to the choice of the proposal distribution and often requires stochastic calibration. While the design and analysis of IS have been extensively studied in estimation settings, applying IS within stochastic optimization introduces a fundamental challenge: the decision variable and the importance sampling distribution are mutually dependent, creating a circular optimization structure. This interdependence complicates both convergence analysis and variance control. We consider convex stochastic optimization problems with linear constraints and propose a single-loop stochastic approximation algorithm, based on a joint variant of Nesterov's dual averaging, that jointly updates the decision variable and the importance sampling distribution, without time-scale separation or nested optimization. The method is globally convergent and achieves minimal asymptotic variance among stochastic gradient schemes, matching the performance of an oracle sampler adapted to the optimal solution.

Source: arXiv stat.ML

NLP/LLMs • Score 96

MoE Pathfinder: Trajectory-Driven Expert Pruning

Mixture-of-experts (MoE) architectures in large language models (LLMs) achieve state-of-the-art performance across diverse tasks but face practical challenges such as deployment complexity and low activation efficiency. Expert pruning has emerged as a promising solution for reducing computational overhead and simplifying the deployment of MoE models.

Source: arXiv cs.LG

NLP/LLMs • Score 95

Q-KVComm: Efficient Multi-Agent Communication via Adaptive KV Cache Compression

Multi-agent Large Language Model (LLM) systems face a critical bottleneck: the redundant transmission of contextual information between agents consumes excessive bandwidth and computational resources. We present Q-KVComm, a novel protocol that enables the direct transmission of compressed key-value (KV) cache representations between LLM agents.

Source: arXiv cs.CL

NLP/LLMs • Score 92

SRS-Stories: Vocabulary-constrained multilingual story generation for language learning

arXiv:2512.18362v1 Announce Type: new Abstract: In this paper, we use large language models to generate personalized stories for language learners, using only the vocabulary they know. The generated texts are specifically written to teach the user new vocabulary simply by reading stories where it appears in context, while seamlessly reviewing recently learned vocabulary at the same time. The generated stories are enjoyable to read, and vocabulary reviewing and learning are optimized by a Spaced Repetition System. The experiments are conducted in three languages: English, Chinese, and Polish, evaluating three story generation methods and three strategies for enforcing lexical constraints. The results show that the generated stories are more grammatical and coherent, and provide better examples of word usage, than texts generated by the standard constrained beam search approach.
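The paper does not specify its SRS, so the sketch below uses the classic SM-2 update as one plausible scheduler for deciding which vocabulary the next story should surface; all names are illustrative:

```python
# SM-2-style spaced-repetition sketch (an assumption; the paper's actual
# SRS is unspecified). Cards with the shortest interval are due first.
from dataclasses import dataclass

@dataclass
class Card:
    word: str
    interval: float = 1.0   # days until next review
    ease: float = 2.5       # growth factor of the interval
    reps: int = 0

def review(card: Card, quality: int) -> Card:
    """SM-2 update; quality in 0..5, where >= 3 counts as recalled."""
    if quality < 3:
        card.reps, card.interval = 0, 1.0
    else:
        card.reps += 1
        card.interval = 1.0 if card.reps == 1 else (
            6.0 if card.reps == 2 else card.interval * card.ease)
    card.ease = max(1.3, card.ease + 0.1
                    - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    return card

deck = [Card("perro"), Card("gato")]
deck[0] = review(deck[0], quality=5)          # recalled easily
print(min(deck, key=lambda c: c.interval).word)  # next word to surface
```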

Source: arXiv cs.CL

Vision • Score 95

Statistical laws and linguistics inform meaning in naturalistic and fictional conversation

arXiv:2512.18072v1 Announce Type: new Abstract: Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type, with some generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps' law, which holds that vocabulary size scales with document length. Little work on Heaps' law has looked at conversation or considered how language features impact scaling. We measure Heaps' law for conversations recorded in two distinct mediums: 1. strangers brought together on video chat, and 2. fictional characters in movies. We find that the scaling of vocabulary size differs by part of speech. We discuss these findings through behavioral and linguistic frameworks.
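Heaps' law posits V(n) ≈ K·n^β for the vocabulary size V after n tokens; measuring and fitting it on any token stream takes only a few lines (the conversation corpora and per-part-of-speech splits are beyond this sketch):

```python
# Sketch: track vocabulary growth over a token stream and fit (K, beta)
# of Heaps' law V(n) ~ K * n^beta by least squares in log-log space.
import numpy as np

tokens = ("to be or not to be that is the question " * 50).split()

seen, vocab_sizes = set(), []
for t in tokens:
    seen.add(t)
    vocab_sizes.append(len(seen))     # V(n) after n tokens

n = np.arange(1, len(tokens) + 1)
beta, logK = np.polyfit(np.log(n), np.log(vocab_sizes), 1)
print(f"V(n) ~ {np.exp(logK):.2f} * n^{beta:.3f}")
```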

Source: arXiv cs.CL

NLP/LLMs • Score 95

On Finding Inconsistencies in Documents

arXiv:2512.18601v1 Announce Type: new Abstract: Professionals in academia, law, and finance audit their documents because inconsistencies can result in monetary, reputational, and scientific costs. Language models (LMs) have the potential to dramatically speed up this auditing process. To understand their abilities, we introduce a benchmark, FIND (Finding INconsistencies in Documents), where each example is a document with an inconsistency inserted manually by a domain expert. Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies. Surprisingly, gpt-5 also found undiscovered inconsistencies present in the original documents. For example, on 50 arXiv papers, we judged 136 out of 196 of the model's suggestions to be legitimate inconsistencies missed by the original authors. However, despite these findings, even the best models miss almost half of the inconsistencies in FIND, demonstrating that inconsistency detection is still a challenging task.

Source: arXiv cs.CL

RL • Score 92

Attractor learning for spatiotemporally chaotic dynamical systems using echo state networks with transfer learning

arXiv:2505.24099v2 Announce Type: replace-cross Abstract: In this paper, we explore the predictive capabilities of echo state networks (ESNs) for the generalized Kuramoto-Sivashinsky (gKS) equation, an archetypal nonlinear PDE that exhibits spatiotemporal chaos. Our research focuses on predicting changes in long-term statistical patterns of the gKS model that result from varying the dispersion relation or the length of the spatial domain. We use transfer learning to adapt ESNs to different parameter settings and successfully capture changes in the underlying chaotic attractor. Previous work has shown that transfer learning can be used effectively with ESNs for single-orbit prediction. The novelty of our paper lies in our use of this pairing to predict the long-term statistical properties of spatiotemporally chaotic PDEs. We also show that transfer learning nontrivially improves the length of time that predictions of individual gKS trajectories remain accurate.

Source: arXiv stat.ML

MLOps/Systems • Score 96

Modality-Dependent Memory Mechanisms in Cross-Modal Neuromorphic Computing

Spiking neural networks (SNNs) with memory promise energy-efficient neuromorphic computing, but their generalization across sensory modalities remains unexplored. We present the first comprehensive cross-modal ablation study of memory mechanisms in SNNs, evaluating Hopfield networks, Hierarchical Gated Recurrent Networks (HGRNs), and supervised contrastive learning (SCL) on visual (N-MNIST) and auditory (SHD) neuromorphic datasets.

Source: arXiv cs.LG

NLP/LLMs • Score 93

Secret Mixtures of Experts Inside Your LLM

arXiv:2512.18452v1 Announce Type: new. This paper investigates the MLP layers of dense LLMs, proposing that these layers secretly perform a sparse computation and are well approximated by sparsely activated Mixture of Experts (MoE) layers. We empirically validate this hypothesis on pre-trained LLMs, showing that the activation distribution is crucial to the results.

Source: arXiv cs.LG

NLP/LLMs • Score 89

KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction

arXiv:2512.17917v1 Announce Type: new Abstract: As the context length of current large language models (LLMs) rapidly increases, the memory demand for the Key-Value (KV) cache is becoming a bottleneck for LLM deployment and batch processing. Traditional KV cache compression methods typically involve permanently evicting or irreversibly merging "less important" tokens with low attention scores. This approach results in the unrecoverable loss of token information, which we call Contextual Amnesia, significantly degrading the model's information retrieval capability. To address this issue, we propose KVReviver, a reversible KV cache compression method based on the sketch algorithm. This method allows reconstructing compressed tokens from an additional data structure, thus enabling full-scale computation within limited memory. Experiments showed that in 2k-length contexts, it requires only 10% of the KV cache budget while maintaining identical end-to-end inference accuracy. For 32k-length contexts, it achieves equivalent or comparable accuracy (~2% accuracy loss) using merely 25% of the KV cache budget.
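The abstract does not detail the sketch structure, so the following generic count sketch with median-of-rows reconstruction illustrates how compressed state can remain approximately reversible; it is not KVReviver's algorithm:

```python
# Generic count sketch (illustrative only): compress a vector into a small
# table from which per-coordinate estimates can be reconstructed by taking
# the median over hash rows.
import numpy as np

class CountSketch:
    def __init__(self, rows, width, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.h = rng.integers(0, width, size=(rows, dim))   # bucket hash
        self.s = rng.choice([-1.0, 1.0], size=(rows, dim))  # sign hash
        self.table = np.zeros((rows, width))

    def add(self, x):
        for r in range(self.table.shape[0]):
            np.add.at(self.table[r], self.h[r], self.s[r] * x)

    def estimate(self):
        rows, dim = self.h.shape
        est = np.empty((rows, dim))
        for r in range(rows):                # unhash each row's estimate
            est[r] = self.s[r] * self.table[r, self.h[r]]
        return np.median(est, axis=0)        # robust to collisions

x = np.zeros(64)
x[3], x[17] = 2.5, -1.0
cs = CountSketch(rows=5, width=16, dim=64)
cs.add(x)
print(cs.estimate()[[3, 17]])                # approximately [2.5, -1.0]
```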

Source: arXiv cs.CL

NLP/LLMs • Score 96

TraCeR: Transformer-Based Competing-Risk Analysis with Longitudinal Covariates

Survival analysis is a critical tool for modeling time-to-event data. Recent deep learning models have relaxed several modeling assumptions, but incorporating longitudinal covariates remains challenging. We present TraCeR, a transformer-based survival analysis framework that handles longitudinal covariates and improves model calibration.

Source: arXiv cs.LG

Theory/Optimization • Score 89

CTTA-T: Continual Test-Time Adaptation for Text Understanding via Teacher-Student with a Domain-aware and Generalized Teacher

arXiv:2512.18321v1 Announce Type: new Abstract: Text understanding often suffers from domain shifts. To handle testing domains, domain adaptation (DA) is trained to adapt to a fixed and observed testing domain; a more challenging paradigm, test-time adaptation (TTA), cannot access the testing domain during training and adapts online to the testing samples during testing, where the samples are from a fixed domain. We aim to explore a more practical and underexplored scenario, continual test-time adaptation (CTTA) for text understanding, which involves a sequence of testing (unobserved) domains. Current CTTA methods struggle to reduce error accumulation over domains and to enhance generalization to unobserved domains: 1) noise filtering reduces accumulated errors but discards useful information, and 2) accumulating historical domains enhances generalization, but adaptive accumulation is hard to achieve. In this paper, we propose CTTA-T (continual test-time adaptation for text understanding), a framework adaptable to evolving target domains: it adopts a teacher-student framework in which the teacher is domain-aware and generalized for evolving domains. To improve teacher predictions, we propose a refine-then-filter step based on dropout-driven consistency, which calibrates predictions and removes unreliable guidance. For the adaptation-generalization trade-off, we construct a domain-aware teacher by dynamically accumulating cross-domain semantics via incremental PCA, which continuously tracks domain shifts. Experiments show that CTTA-T outperforms the baselines.
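The accumulation step the abstract names, incremental PCA over streaming representations, can be sketched directly with scikit-learn; the embedding source and sizes are placeholders:

```python
# Sketch: accumulate evolving domain semantics with incremental PCA over
# sentence embeddings as test batches stream in. Random vectors stand in
# for the real encoder outputs.
import numpy as np
from sklearn.decomposition import IncrementalPCA

dim, k = 768, 32                     # embedding size, retained components
ipca = IncrementalPCA(n_components=k)

rng = np.random.default_rng(0)
for step in range(3):                # one batch per incoming test domain
    batch = rng.normal(size=(64, dim))      # stand-in for embeddings
    ipca.partial_fit(batch)                 # accumulate cross-domain stats
    coords = ipca.transform(batch)          # domain-aware representation
    print(step, coords.shape, ipca.explained_variance_ratio_[:3].round(3))
```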

Source: arXiv cs.CL

Vision • Score 95

Testing for Latent Structure via the Wilcoxon--Wigner Random Matrix of Normalized Rank Statistics

This paper considers the problem of testing for latent structure in large symmetric data matrices. The goal is to develop a statistically principled methodology that is flexible in its applicability, computationally efficient, and insensitive to extreme variations in the data, thereby overcoming the limitations of existing approaches.

Source: arXiv stat.ML

RL • Score 95

Certified Defense on the Fairness of Graph Neural Networks

Graph Neural Networks (GNNs) have emerged as a prominent graph learning model, but they are vulnerable to attacks that can corrupt the fairness of their predictions. In this paper, we propose a framework named ELEGANT that provides a detailed theoretical analysis for certifying GNN fairness, requiring no retraining and making no assumptions about GNN structure or parameters.

Source: arXiv stat.ML

MLOps/Systems • Score 89

Universality of High-Dimensional Scaling Limits of Stochastic Gradient Descent

We consider high-dimensional statistical tasks whose loss depends on the data only through its projection onto a fixed-dimensional subspace. This includes classification of mixture distributions with cross-entropy loss using one- and two-layer networks. Our main result is that the ODE limits are universal, provided the initialization and ground-truth vectors are delocalized in coordinates.

Source: arXiv stat.ML

RL • Score 92

Deep Learning for Primordial $B$-Mode Extraction

The search for primordial gravitational waves is a central goal of cosmic microwave background (CMB) research. Isolating the characteristic $B$-mode polarization signal generated by primordial gravitational waves is challenging for several reasons, including the small amplitude of the signal and contamination by astrophysical foregrounds. This work demonstrates how deep learning networks can be applied to estimate and remove multiple sources of secondary $B$-mode polarization.

Source: arXiv stat.ML

RL • Score 92

Theoretical Convergence Guarantees for Variational Autoencoders

Variational Autoencoders (VAEs) are popular generative models used to sample from complex data distributions. This paper aims to fill gaps in the understanding of the theoretical properties of VAEs by providing non-asymptotic convergence guarantees for VAEs trained with the Stochastic Gradient Descent and Adam algorithms, deriving a convergence rate of $O(\log n / \sqrt{n})$.

Source: arXiv stat.ML

NLP/LLMs • Score 95

Keep It Light! Simplifying Image Clustering via Text-Free Adapters

In the context of pre-trained models, effective classification can be achieved with lightweight readout layers. This work demonstrates that, in deep clustering, performance competitive with more complex methods can be obtained using a highly simplified, text-free training pipeline. Our approach, Simple Clustering via Pre-trained models (SCP), uses feature representations from pre-trained vision models and positive data pairs.

Source: arXiv stat.ML

NLP/LLMs • Score 95

Toward Scalable and Valid Conditional Independence Testing with Spectral Representations

Conditional independence (CI) is central to causal inference, feature selection, and graphical modeling, but it is often impossible to test without additional assumptions. Existing CI tests rely on restrictive structural conditions, limiting their validity on real-world data. This work explores whether representation learning can help overcome these limitations.

Source: arXiv stat.ML

Theory/Optimization • Score 89

Optimal Source Training Is Suboptimal for Transfer

We prove that training a source model optimally for its own task is generically suboptimal when the goal is subsequent transfer. We study the source-side optimization problem in L2-SP ridge regression and show a fundamental mismatch between the regularization that is optimal for the source task and the regularization that is optimal for transfer.

Source: arXiv stat.ML

NLP/LLMs • Score 95

Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital

arXiv:2512.18658v1 Announce Type: new Abstract: Before closing venture capital financing rounds, lawyers conduct diligence that includes tying out the capitalization table: verifying that every security (for example, shares, options, warrants) and issuance term (for example, vesting schedules, acceleration triggers, transfer restrictions) is supported by large sets of underlying legal documentation. While LLMs continue to improve on legal benchmarks, specialized legal workflows, such as capitalization tie-out, remain out of reach even for strong agentic systems. The task requires multi-document reasoning, strict evidence traceability, and deterministic outputs that current approaches fail to reliably deliver. We characterize capitalization tie-out as an instance of a real-world benchmark for legal AI, analyze and compare the performance of existing agentic systems, and propose a world model architecture toward tie-out automation and, more broadly, as a foundation for applied legal intelligence.

Source: arXiv cs.CL

Theory/Optimization • Score 89

Stopping Rules for Stochastic Gradient Descent via Anytime-Valid Confidence Sequences

arXiv:2512.13123v3 Announce Type: replace-cross Abstract: We study stopping rules for stochastic gradient descent (SGD) for convex optimization from the perspective of anytime-valid confidence sequences. Classical analyses of SGD provide convergence guarantees in expectation or at a fixed horizon, but offer no statistically valid way to assess, at an arbitrary time, how close the current iterate is to the optimum. We develop an anytime-valid, data-dependent upper confidence sequence for the weighted average suboptimality of projected SGD, constructed via nonnegative supermartingales and requiring no smoothness or strong convexity. This confidence sequence yields a simple stopping rule that is provably $\varepsilon$-optimal with probability at least $1-\alpha$, with explicit bounds on the stopping time under standard stochastic approximation stepsizes. To the best of our knowledge, these are the first rigorous, time-uniform performance guarantees and finite-time $\varepsilon$-optimality certificates for projected SGD with general convex objectives, based solely on observable trajectory quantities.

Source: arXiv stat.ML

NLP/LLMs • Score 92

SAP: Syntactic Attention Pruning for Transformer-based Language Models

arXiv:2512.19125v1 Announce Type: new Abstract: This paper introduces Syntactic Attention Pruning (SAP), a novel method for effectively pruning attention heads in Transformer models. Unlike conventional approaches that rely solely on mathematical analysis of model weights and activations, SAP incorporates both the syntactic structure and attention patterns of sentences to guide the pruning process. By leveraging these linguistic features, SAP not only achieves performance comparable to state-of-the-art methods but also enhances the interpretability of model behavior. To further improve robustness, we propose Candidate Filtering (CF), a mechanism that prioritizes heads based on their contribution to model performance, mitigating degradation during pruning. Experimental results indicate that SAP effectively preserves critical heads with a high density of strong attention values, outperforming existing head pruning strategies in retrain-free settings. These findings position SAP as a promising foundation for a new direction in model compression research, offering high flexibility for pruning across all transformer-based language models.

Source: arXiv cs.CL

NLP/LLMs • Score 95

Evaluating the Challenges of LLMs in Real-world Medical Follow-up: A Comparative Study and An Optimized Framework

arXiv:2512.18999v1 Announce Type: new Abstract: When applied directly in an end-to-end manner to medical follow-up tasks, Large Language Models (LLMs) often suffer from uncontrolled dialog flow and inaccurate information extraction due to the complexity of follow-up forms. To address this limitation, we designed and compared two follow-up chatbot systems: an end-to-end LLM-based system (control group) and a modular pipeline with structured process control (experimental group). Experimental results show that while the end-to-end approach frequently fails on lengthy and complex forms, our modular method, built on task decomposition, semantic clustering, and flow management, substantially improves dialog stability and extraction accuracy. Moreover, it reduces the number of dialogue turns by 46.73% and lowers token consumption by 80% to 87.5%. These findings highlight the necessity of integrating external control mechanisms when deploying LLMs in high-stakes medical follow-up scenarios.

Source: arXiv cs.CL

NLP/LLMs • Score 92

Mitigating Spurious Correlations in NLI via LLM-Synthesized Counterfactuals and Dynamic Balanced Sampling

arXiv:2512.18462v1 Announce Type: new Abstract: Natural Language Inference (NLI) models frequently rely on spurious correlations rather than semantic reasoning. Existing mitigation strategies often incur high annotation costs or trigger catastrophic forgetting during fine-tuning. We propose an automated, scalable pipeline to address these limitations. First, we introduce Log-Frequency LMI (LF-LMI) to accurately detect semantic artifacts. Second, we generate a high-quality synthetic contrast set via an LLM-synthesis pipeline with multi-judge verification. Finally, we introduce Dynamic Balanced Sampling, a training strategy that rotates the original data distribution to prevent forgetting. Our method improves consistency on a challenging benchmark from 63.5% to 81.0% while maintaining 88.4% in-domain accuracy, significantly outperforming naive fine-tuning.
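LF-LMI is not fully specified in the abstract, so the sketch below computes plain local mutual information (LMI) between tokens and labels, the quantity a log-frequency variant would presumably correct; the data is toy:

```python
# Sketch: rank token-label pairs by local mutual information,
# LMI(w, y) = p(w, y) * log(p(w, y) / (p(w) p(y))), to surface candidate
# semantic artifacts. The paper's LF-LMI correction is not reproduced here.
import math
from collections import Counter

pairs = [("no one ever wins", "contradiction"),
         ("the cat sleeps", "entailment"),
         ("nobody ever agrees", "contradiction")]

tok_lab, tok, lab, n = Counter(), Counter(), Counter(), 0
for text, y in pairs:
    for w in text.split():
        tok_lab[(w, y)] += 1; tok[w] += 1; lab[y] += 1; n += 1

def lmi(w, y):
    p_wy = tok_lab[(w, y)] / n
    return p_wy * math.log(p_wy * n * n / (tok[w] * lab[y]))

scores = {wy: lmi(*wy) for wy in tok_lab}
print(sorted(scores, key=scores.get, reverse=True)[:3])  # likely artifacts
```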

Source: arXiv cs.CL

NLP/LLMs • Score 95

GeoSense-AI: Fast Location Inference from Crisis Microblogs

arXiv:2512.18225v1 Announce Type: new Abstract: This paper presents an applied AI pipeline for real-time geolocation from noisy microblog streams, unifying statistical hashtag segmentation, part-of-speech-driven proper-noun detection, dependency parsing around disaster lexicons, lightweight named-entity recognition, and gazetteer-grounded disambiguation to infer locations directly from text rather than sparse geotags. The approach operationalizes information extraction under streaming constraints, emphasizing low-latency NLP components and efficient validation against geographic knowledge bases to support situational awareness during emergencies. In head-to-head comparisons with widely used NER toolkits, the system attains strong F1 while being engineered for orders-of-magnitude faster throughput, enabling deployment in live crisis informatics settings. A production map interface demonstrates end-to-end AI functionality (ingest, inference, and visualization), surfacing locational signals at scale for floods, outbreaks, and other fast-moving events. By prioritizing robustness to informal text and streaming efficiency, GeoSense-AI illustrates how domain-tuned NLP and knowledge grounding can elevate emergency response beyond conventional geotag reliance.

Source: arXiv cs.CL

NLP/LLMs • Score 92

From Scratch to Fine-Tuned: A Comparative Study of Transformer Training Strategies for Legal Machine Translation

arXiv:2512.18593v1 Announce Type: new Abstract: In multilingual nations like India, access to legal information is often hindered by language barriers, as much of the legal and judicial documentation remains in English. Legal Machine Translation (L-MT) offers a scalable solution to this challenge by enabling accurate and accessible translations of legal documents. This paper presents our work for the JUST-NLP 2025 Legal MT shared task, focusing on English-Hindi translation using Transformer-based approaches. We experiment with two complementary strategies: fine-tuning a pre-trained OPUS-MT model for domain-specific adaptation and training a Transformer model from scratch using the provided legal corpus. Performance is evaluated using standard MT metrics, including SacreBLEU, chrF++, TER, ROUGE, BERTScore, METEOR, and COMET. Our fine-tuned OPUS-MT model achieves a SacreBLEU score of 46.03, significantly outperforming both baseline and from-scratch models. The results highlight the effectiveness of domain adaptation in enhancing translation quality and demonstrate the potential of L-MT systems to improve access to justice and legal transparency in multilingual contexts.
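The reported metrics can be computed with the sacrebleu package; a minimal sketch with toy hypotheses and references follows (scores on real data would of course require the shared-task corpus):

```python
# Sketch: corpus-level BLEU and chrF++ with sacrebleu on toy data.
import sacrebleu

hyps = ["the court dismissed the appeal"]
refs = [["the court rejected the appeal"]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hyps, refs)
chrf = sacrebleu.corpus_chrf(hyps, refs, word_order=2)  # chrF++ setting
print(round(bleu.score, 2), round(chrf.score, 2))
```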

Source: arXiv cs.CL

NLP/LLMs • Score 94

LLM Agents Implement an NLG System from Scratch: Building Interpretable Rule-Based RDF-to-Text Generators

arXiv:2512.18360v1 Announce Type: new Abstract: We present a novel neurosymbolic framework for RDF-to-text generation, in which the model is "trained" through collaborative interactions among multiple LLM agents rather than traditional backpropagation. The LLM agents produce rule-based Python code for a generator for the given domain, based on RDF triples only, with no in-domain human reference texts. The resulting system is fully interpretable, requires no supervised training data, and generates text nearly instantaneously using only a single CPU. Our experiments on the WebNLG and OpenDialKG data show that outputs produced by our approach reduce hallucination, with only slight fluency penalties compared to fine-tuned or prompted language models.
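A toy example of the kind of interpretable rule-based generator the agents produce: a property-to-template lookup over RDF triples. The templates and triples here are invented for illustration:

```python
# Illustrative rule-based RDF-to-text generator (not the authors' code):
# each RDF property maps to a sentence template, with a generic fallback.
TEMPLATES = {
    "birthPlace": "{s} was born in {o}.",
    "occupation": "{s} works as {o}.",
    "country":    "{s} is located in {o}.",
}

def verbalize(triples):
    out = []
    for s, p, o in triples:
        tpl = TEMPLATES.get(p, "{s} has {p} {o}.")   # fallback rule
        out.append(tpl.format(s=s.replace("_", " "),
                              p=p, o=o.replace("_", " ")))
    return " ".join(out)

print(verbalize([("Ada_Lovelace", "birthPlace", "London"),
                 ("Ada_Lovelace", "occupation", "mathematician")]))
```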

Source: arXiv cs.CL

NLP/LLMs • Score 95

Research on a hybrid LSTM-CNN-Attention model for text-based web content classification

arXiv:2512.18475v1 Announce Type: new Abstract: This study presents a hybrid deep learning architecture that integrates LSTM, CNN, and an Attention mechanism to enhance the classification of web content based on text. Pretrained GloVe embeddings are used to represent words as dense vectors that preserve semantic similarity. The CNN layer extracts local n-gram patterns and lexical features, while the LSTM layer models long-range dependencies and sequential structure. The integrated Attention mechanism enables the model to focus selectively on the most informative parts of the input sequence. A 5-fold cross-validation setup was used to assess the robustness and generalizability of the proposed solution. Experimental results show that the hybrid LSTM-CNN-Attention model achieved outstanding performance, with an accuracy of 0.98, precision of 0.94, recall of 0.92, and F1-score of 0.93. These results surpass the performance of baseline models based solely on CNNs, LSTMs, or transformer-based classifiers such as BERT. The combination of neural network components enabled the model to effectively capture both fine-grained text structures and broader semantic context. Furthermore, the use of GloVe embeddings provided an efficient and effective representation of textual data, making the model suitable for integration into systems with real-time or near-real-time requirements. The proposed hybrid architecture demonstrates high effectiveness in text-based web content classification, particularly in tasks requiring both syntactic feature extraction and semantic interpretation. By combining the presented mechanisms, the model addresses the limitations of individual architectures and achieves improved generalization. These findings support the broader use of hybrid deep learning approaches in NLP applications, especially where complex, unstructured textual data must be processed and classified with high reliability.
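The described stack (embeddings, Conv1D for local n-grams, LSTM for sequence structure, attention, pooled classifier) maps onto a short Keras sketch; layer sizes are placeholders and the GloVe matrix would be loaded separately:

```python
# Sketch of the hybrid architecture described above. Sizes are placeholders;
# in practice the Embedding layer would be initialized with GloVe weights.
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab, seq_len, emb_dim, n_classes = 20000, 200, 100, 4

inp = layers.Input(shape=(seq_len,))
x = layers.Embedding(vocab, emb_dim)(inp)        # swap in GloVe weights here
x = layers.Conv1D(128, 5, padding="same", activation="relu")(x)  # n-grams
x = layers.LSTM(128, return_sequences=True)(x)   # long-range dependencies
x = layers.Attention()([x, x])                   # self-attention over steps
x = layers.GlobalAveragePooling1D()(x)
out = layers.Dense(n_classes, activation="softmax")(x)

model = Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```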

Source: arXiv cs.CL

NLP/LLMs • Score 96

ScoutGPT: Capturing Player Impact from Team Action Sequences Using a GPT-Based Framework

Transfers play a crucial role in a football club's success, but predicting whether a transfer will succeed remains difficult due to the strong context dependence of on-pitch performance. To address this gap, we present EventGPT, a player-conditioned, value-aware next-event prediction model built on a GPT-style autoregressive transformer.

Source: arXiv cs.AI

NLP/LLMs • Score 96

Reinforcement Learning for a Self-Improving Agent with a Skill Library

Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in complex reasoning and multi-step interactions, but they struggle to continually improve and adapt in new environments. We propose a Reinforcement Learning (RL)-based approach to enhance agents' self-improvement capabilities with a skill library.

Source: arXiv cs.AI

NLP/LLMs • Score 95

Democratizing Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling

arXiv:2512.17326v1 Announce Type: new Abstract: Vision-language models (VLMs) have the potential to become co-pilots for pathologists. However, most VLMs either focus on small regions of interest within whole-slide images, provide only static slide-level outputs, or rely on data that is not publicly available, limiting reproducibility. Furthermore, training data containing WSIs paired with detailed clinical reports is scarce, restricting progress toward transparent and generalisable VLMs. We address these limitations with three main contributions. First, we introduce Polysome, a standardised tool for synthetic instruction generation. Second, we apply Polysome to the public HISTAI dataset, generating HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, we use HISTAI-Instruct to train ANTONI-{\alpha}, a VLM capable of visual-question answering (VQA). We show that ANTONI-{\alpha} outperforms MedGemma on WSI-level VQA tasks of tissue identification, neoplasm detection, and differential diagnosis. We also compare the performance of multiple incarnations of ANTONI-{\alpha} trained with different amounts of data. All methods, data, and code are publicly available.

Source: arXiv cs.CV

RL • Score 89

On the performance of multi-fidelity and reduced-dimensional neural emulators for inference of physiological boundary conditions

arXiv:2506.11683v2 Announce Type: replace Abstract: Solving inverse problems in cardiovascular modeling is particularly challenging due to the high computational cost of running high-fidelity simulations. In this work, we focus on Bayesian parameter estimation and explore different methods to reduce the computational cost of sampling from the posterior distribution by leveraging low-fidelity approximations. A common approach is to construct a surrogate model for the high-fidelity simulation itself. Another is to build a surrogate for the discrepancy between high- and low-fidelity models. This discrepancy, which is often easier to approximate, is modeled with either a fully connected neural network or a nonlinear dimensionality reduction technique that enables surrogate construction in a lower-dimensional space. A third possible approach is to treat the discrepancy between the high-fidelity and surrogate models as random noise and estimate its distribution using normalizing flows. This allows us to incorporate the approximation error into the Bayesian inverse problem by modifying the likelihood function. We validate five different methods, which are variations of the above, on analytical test cases by comparing them to posterior distributions derived solely from high-fidelity models, assessing both accuracy and computational cost. Finally, we demonstrate our approaches on two cardiovascular examples of increasing complexity: a lumped-parameter Windkessel model and a patient-specific three-dimensional anatomy.

Source: arXiv stat.ML

RL • Score 92

Targeted Learning for Variable Importance

arXiv:2411.02221v2 Announce Type: replace Abstract: Variable importance is one of the most widely used measures for interpreting machine learning with significant interest from both statistics and machine learning communities. Recently, increasing attention has been directed toward uncertainty quantification in these metrics. Current approaches largely rely on one-step procedures, which, while asymptotically efficient, can present higher sensitivity and instability in finite sample settings. To address these limitations, we propose a novel method by employing the targeted learning (TL) framework, designed to enhance robustness in inference for variable importance metrics. Our approach is particularly suited for conditional permutation variable importance. We show that it (i) retains the asymptotic efficiency of traditional methods, (ii) maintains comparable computational complexity, and (iii) delivers improved accuracy, especially in finite sample contexts. We further support these findings with numerical experiments that illustrate the practical advantages of our method and validate the theoretical results.

Source: arXiv stat.ML

Theory/Optimization • Score 92

Weighted Stochastic Differential Equation to Implement Wasserstein-Fisher-Rao Gradient Flow

arXiv:2512.17878v1 Announce Type: cross Abstract: Score-based diffusion models currently constitute the state of the art in continuous generative modeling. These methods are typically formulated via overdamped or underdamped Ornstein--Uhlenbeck-type stochastic differential equations, in which sampling is driven by a combination of deterministic drift and Brownian diffusion, resulting in continuous particle trajectories in the ambient space. While such dynamics enjoy exponential convergence guarantees for strongly log-concave target distributions, it is well known that their mixing rates deteriorate exponentially in the presence of nonconvex or multimodal landscapes, such as double-well potentials. Since many practical generative modeling tasks involve highly non-log-concave target distributions, considerable recent effort has been devoted to developing sampling schemes that improve exploration beyond classical diffusion dynamics. A promising line of work leverages tools from information geometry to augment diffusion-based samplers with controlled mass reweighting mechanisms. This perspective leads naturally to Wasserstein--Fisher--Rao (WFR) geometries, which couple transport in the sample space with vertical (reaction) dynamics on the space of probability measures. In this work, we formulate such reweighting mechanisms through the introduction of explicit correction terms and show how they can be implemented via weighted stochastic differential equations using the Feynman--Kac representation. Our study provides a preliminary but rigorous investigation of WFR-based sampling dynamics, and aims to clarify their geometric and operator-theoretic structure as a foundation for future theoretical and algorithmic developments.

Fonte: arXiv stat.ML
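
To make the transport-plus-reaction picture concrete, here is an illustrative weighted-particle scheme on a double-well target: Langevin transport plus Feynman-Kac reweighting with resampling. The reaction potential below is a simple illustrative choice, not the paper's explicit correction terms.

```python
import numpy as np

# Double-well target pi(x) ~ exp(-U(x)): the classic multimodal test case.
U = lambda x: (x**2 - 1.0) ** 2
gradU = lambda x: 4.0 * x * (x**2 - 1.0)

rng = np.random.default_rng(1)
n, dt = 2000, 1e-2
x = rng.standard_normal(n) * 0.1 + 1.0   # all particles start in one well
logw = np.zeros(n)                       # log importance weights

for _ in range(2000):
    # Wasserstein (transport) part: overdamped Langevin step.
    x += -gradU(x) * dt + np.sqrt(2 * dt) * rng.standard_normal(n)
    # Fisher-Rao (reaction) part: Feynman-Kac reweighting; the potential
    # U(x) - mean(U) is an illustrative choice of correction term.
    logw += -(U(x) - U(x).mean()) * dt
    # Resample when the effective sample size collapses.
    w = np.exp(logw - logw.max())
    w /= w.sum()
    if 1.0 / np.sum(w**2) < n / 2:
        x = rng.choice(x, size=n, p=w)
        logw[:] = 0.0
```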

NLP/LLMs • Score 96

A Solver-in-the-Loop Framework for Improving LLMs at Answer Set Programming for Logic Puzzle Solving

The rise of large language models (LLMs) has sparked interest in coding assistants. This paper presents a novel ASP-solver-in-the-loop approach to solver-guided instruction tuning, aimed at improving Answer Set Programming (ASP) code generation for solving combinatorial search problems. Our experiments show consistent improvements on two distinct datasets.

Fonte: arXiv cs.AI

MLOps/Systems • Score 95

A Systems-Theoretic View on the Convergence of Algorithms under Disturbances

arXiv:2512.17598v1 Announce Type: cross Abstract: Algorithms increasingly operate within complex physical, social, and engineering systems where they are exposed to disturbances, noise, and interconnections with other dynamical systems. This article extends known convergence guarantees of an algorithm operating in isolation (i.e., without disturbances) and systematically derives stability bounds and convergence rates in the presence of such disturbances. By leveraging converse Lyapunov theorems, we derive key inequalities that quantify the impact of disturbances. We further demonstrate how our result can be utilized to assess the effects of disturbances on algorithmic performance in a wide variety of applications, including communication constraints in distributed learning, sensitivity in machine learning generalization, and intentional noise injection for privacy. This underscores the role of our result as a unifying tool for algorithm analysis in the presence of noise, disturbances, and interconnections with other dynamical systems.

Fonte: arXiv stat.ML

RL • Score 92

On the Identification of Temporally Causal Representation with Instantaneous Dependence

arXiv:2405.15325v3 Announce Type: replace-cross Abstract: Temporally causal representation learning aims to identify the latent causal process from time series observations, but most methods require the assumption that the latent causal processes do not have instantaneous relations. Although some recent methods achieve identifiability in the instantaneous causality case, they require either interventions on the latent variables or grouping of the observations, which are in general difficult to obtain in real-world scenarios. To fill this gap, we propose an IDentification framework for instantaneOus Latent dynamics (IDOL) by imposing a sparse influence constraint that the latent causal processes have sparse time-delayed and instantaneous relations. Specifically, we establish identifiability results of the latent causal process based on sufficient variability and the sparse influence constraint by employing contextual information of time series data. Based on these theories, we incorporate a temporally variational inference architecture to estimate the latent variables and a gradient-based sparsity regularization to identify the latent causal process. Experimental results on simulation datasets illustrate that our method can identify the latent causal process. Furthermore, evaluations on multiple human motion forecasting benchmarks with instantaneous dependencies indicate the effectiveness of our method in real-world settings.

Fonte: arXiv stat.ML

RL • Score 89

Clustering and Pruning in Causal Data Fusion

arXiv:2505.15215v2 Announce Type: replace Abstract: Data fusion, the process of combining observational and experimental data, can enable the identification of causal effects that would otherwise remain non-identifiable. Although identification algorithms have been developed for specific scenarios, do-calculus remains the only general-purpose tool for causal data fusion, particularly when variables are present in some data sources but not others. However, approaches based on do-calculus may encounter computational challenges as the number of variables increases and the causal graph grows in complexity. Consequently, there exists a need to reduce the size of such models while preserving the essential features. For this purpose, we propose pruning (removing unnecessary variables) and clustering (combining variables) as preprocessing operations for causal data fusion. We generalize earlier results on a single data source and derive conditions for applying pruning and clustering in the case of multiple data sources. We give sufficient conditions for inferring the identifiability or non-identifiability of a causal effect in a larger graph based on a smaller graph and show how to obtain the corresponding identifying functional for identifiable causal effects. Examples from epidemiology and social science demonstrate the use of the results.

Fonte: arXiv stat.ML

Evaluation/Benchmarks • Score 89

Adjusting Model Size in Continual Gaussian Processes: How Big is Big Enough?

arXiv:2408.07588v5 Announce Type: replace Abstract: Many machine learning models require setting a parameter that controls their size before training, e.g., the number of neurons in DNNs or inducing points in GPs. Increasing capacity typically improves performance until all the information from the dataset is captured. After this point, computational cost keeps increasing, without improved performance. This leads to the question "How big is big enough?" We investigate this problem for Gaussian processes (single-layer neural networks) in continual learning. Here, data becomes available incrementally, and the final dataset size will therefore not be known before training, preventing the use of heuristics for setting a fixed model size. We develop a method to automatically adjust model size while maintaining near-optimal performance. Our experimental procedure follows the constraint that any hyperparameters must be set without seeing dataset properties, and we show that our method performs well across diverse datasets without the need to adjust its hyperparameters, showing that it requires less tuning than existing approaches.

Fonte: arXiv stat.ML

NLP/LLMs • Score 92

Towards Sharp Minimax Risk Bounds for Operator Learning

arXiv:2512.17805v1 Announce Type: cross Abstract: We develop a minimax theory for operator learning, where the goal is to estimate an unknown operator between separable Hilbert spaces from finitely many noisy input-output samples. For uniformly bounded Lipschitz operators, we prove information-theoretic lower bounds together with matching or near-matching upper bounds, covering both fixed and random designs under Hilbert-valued Gaussian noise and Gaussian white noise errors. The rates are controlled by the spectrum of the covariance operator of the measure that defines the error metric. Our setup is very general and allows for measures with unbounded support. A key implication is a curse of sample complexity which shows that the minimax risk for generic Lipschitz operators cannot decay at any algebraic rate in the sample size. We obtain essentially sharp characterizations when the covariance spectrum decays exponentially and provide general upper and lower bounds in slower-decay regimes.

Fonte: arXiv stat.ML

NLP/LLMs • Score 96

V-Agent: An Interactive Video Search System Using Vision-Language Models

We present V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversation. By fine-tuning a vision-language model (VLM) on a small video-preference dataset and augmenting it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Generalized infinite dimensional Alpha-Procrustes based geometries

arXiv:2511.09801v2 Announce Type: replace Abstract: This work extends the recently introduced Alpha-Procrustes family of Riemannian metrics for symmetric positive definite (SPD) matrices by incorporating generalized versions of the Bures-Wasserstein (GBW), Log-Euclidean, and Wasserstein distances. While the Alpha-Procrustes framework has unified many classical metrics in both finite- and infinite-dimensional settings, it previously lacked the structural components necessary to realize these generalized forms. We introduce a formalism based on unitized Hilbert-Schmidt operators and an extended Mahalanobis norm that allows the construction of robust, infinite-dimensional generalizations of GBW and Log-Hilbert-Schmidt distances. Our approach also incorporates a learnable regularization parameter that enhances geometric stability in high-dimensional comparisons. Preliminary experiments reproducing benchmarks from the literature demonstrate the improved performance of our generalized metrics, particularly in scenarios involving comparisons between datasets of varying dimension and scale. This work lays a theoretical and computational foundation for advancing robust geometric methods in machine learning, statistical inference, and functional data analysis.

Fonte: arXiv stat.ML

RL • Score 92

Constraint-based causal discovery with tiered background knowledge and latent variables in single or overlapping datasets

arXiv:2503.21526v3 Announce Type: replace Abstract: In this paper we consider the use of tiered background knowledge within constraint-based causal discovery. Our focus is on settings relaxing causal sufficiency, i.e., allowing for latent variables which may arise because relevant information could not be measured at all, or not jointly, as in the case of multiple overlapping datasets. We first present novel insights into the properties of the 'tiered FCI' (tFCI) algorithm. Building on this, we introduce a new extension of the IOD (integrating overlapping datasets) algorithm incorporating tiered background knowledge, the 'tiered IOD' (tIOD) algorithm. We show that under full usage of the tiered background knowledge tFCI and tIOD are sound, while simple versions of the tIOD and tFCI are sound and complete. We further show that the tIOD algorithm can often be expected to be considerably more efficient and informative than the IOD algorithm even beyond the obvious restriction of the Markov equivalence classes. We provide a formal result on the conditions for this gain in efficiency and informativeness. Our results are accompanied by a series of examples illustrating the exact role and usefulness of tiered background knowledge.

Fonte: arXiv stat.ML

NLP/LLMs • Score 96

Towards Explainable Conversational AI for Early Diagnosis with Large Language Models

Health systems worldwide face challenges such as inefficient diagnoses, rising costs, and limited access to specialists. This study presents a diagnostic chatbot powered by a Large Language Model (LLM), using GPT-4o and explainable AI techniques, that interacts with patients to extract and normalize symptoms and prioritize potential diagnoses.

Fonte: arXiv cs.AI

Privacy/Security/Fairness • Score 89

Differentially private Bayesian tests

arXiv:2401.15502v3 Announce Type: replace Abstract: Differential privacy has emerged as a significant cornerstone in the realm of scientific hypothesis testing utilizing confidential data. In reporting scientific discoveries, Bayesian tests are widely adopted since they effectively sidestep the key criticisms of P-values, namely, lack of interpretability and inability to quantify evidence in support of the competing hypotheses. We present a novel differentially private Bayesian hypothesis testing framework that arises naturally under a principled data generative mechanism, inherently maintaining the interpretability of the resulting inferences. Furthermore, by focusing on differentially private Bayes factors based on widely used test statistics, we circumvent the need to model the complete data generative mechanism and ensure substantial computational benefits. We also provide a set of sufficient conditions to establish results on Bayes factor consistency under the proposed framework. The utility of the devised technology is showcased via several numerical experiments.

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

Spatially-informed transformers: Injecting geostatistical covariance biases into self-attention for spatio-temporal forecasting

arXiv:2512.17696v1 Announce Type: cross Abstract: The modeling of high-dimensional spatio-temporal processes presents a fundamental dichotomy between the probabilistic rigor of classical geostatistics and the flexible, high-capacity representations of deep learning. While Gaussian processes offer theoretical consistency and exact uncertainty quantification, their prohibitive computational scaling renders them impractical for massive sensor networks. Conversely, modern transformer architectures excel at sequence modeling but inherently lack a geometric inductive bias, treating spatial sensors as permutation-invariant tokens without a native understanding of distance. In this work, we propose a spatially-informed transformer, a hybrid architecture that injects a geostatistical inductive bias directly into the self-attention mechanism via a learnable covariance kernel. By formally decomposing the attention structure into a stationary physical prior and a non-stationary data-driven residual, we impose a soft topological constraint that favors spatially proximal interactions while retaining the capacity to model complex dynamics. We demonstrate the phenomenon of "Deep Variography", where the network successfully recovers the true spatial decay parameters of the underlying process end-to-end via backpropagation. Extensive experiments on synthetic Gaussian random fields and real-world traffic benchmarks confirm that our method outperforms state-of-the-art graph neural networks. Furthermore, rigorous statistical validation confirms that the proposed method delivers not only superior predictive accuracy but also well-calibrated probabilistic forecasts, effectively bridging the gap between physics-aware modeling and data-driven learning.

Fonte: arXiv stat.ML
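
A compact PyTorch sketch of the core idea: attention logits receive an additive bias from a stationary exponential kernel of pairwise sensor distance, with a learnable range parameter. The module name and the specific kernel are illustrative choices, not the paper's exact code.

```python
import torch
import torch.nn as nn

class SpatialBiasAttention(nn.Module):
    """Self-attention over sensors with an exponential-kernel distance bias;
    the learnable range plays the role of the recovered variogram decay."""
    def __init__(self, dim, coords):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.log_rho = nn.Parameter(torch.zeros(()))       # learnable range
        self.register_buffer("dist", torch.cdist(coords, coords))

    def forward(self, x):                                   # x: (B, N, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        bias = -self.dist / self.log_rho.exp()   # stationary proximity prior
        attn = torch.softmax(logits + bias, dim=-1)  # prior + learned residual
        return attn @ v

coords = torch.rand(50, 2)                 # 50 sensors in the unit square
layer = SpatialBiasAttention(32, coords)
out = layer(torch.randn(4, 50, 32))        # (4, 50, 32)
```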

MLOps/Systems • Score 95

Penalized Fair Regression for Multiple Groups in Chronic Kidney Disease

arXiv:2512.17340v1 Announce Type: cross Abstract: Fair regression methods have the potential to mitigate societal bias concerns in health care, but there has been little work on penalized fair regression when multiple groups experience such bias. We propose a general regression framework that addresses this gap with unfairness penalties for multiple groups. Our approach is demonstrated for binary outcomes with true positive rate disparity penalties. It can be efficiently implemented through reduction to a cost-sensitive classification problem. We additionally introduce novel score functions for automatically selecting penalty weights. Our penalized fair regression methods are empirically studied in simulations, where they achieve a fairness-accuracy frontier beyond that of existing comparison methods. Finally, we apply these methods to a national multi-site primary care study of chronic kidney disease to develop a fair classifier for end-stage renal disease. There we find substantial improvements in fairness for multiple race and ethnicity groups who experience societal bias in the health care system without any appreciable loss in overall fit.

Fonte: arXiv stat.ML

Privacy/Security/Fairness • Score 88

Imputation Uncertainty in Interpretable Machine Learning Methods

arXiv:2512.17689v1 Announce Type: new Abstract: In real data, missing values occur frequently, which affects interpretation with interpretable machine learning (IML) methods. Recent work considers bias and shows that model explanations may differ between imputation methods, while ignoring the additional imputation uncertainty and its influence on variance and confidence intervals. We therefore compare the effects of different imputation methods on the confidence interval coverage probabilities of the IML methods permutation feature importance, partial dependence plots, and Shapley values. We show that single imputation leads to underestimation of variance and that, in most cases, only multiple imputation comes close to nominal coverage.

Fonte: arXiv stat.ML

Applications • Score 89

Fast and Robust: Computationally Efficient Covariance Estimation for Sub-Weibull Vectors

arXiv:2512.17632v1 Announce Type: new Abstract: High-dimensional covariance estimation is notoriously sensitive to outliers. While statistically optimal estimators exist for general heavy-tailed distributions, they often rely on computationally expensive techniques like semidefinite programming or iterative M-estimation ($O(d^3)$). In this work, we target the specific regime of Sub-Weibull distributions (characterized by stretched exponential tails $\exp(-t^\alpha)$). We investigate a computationally efficient alternative: the Cross-Fitted Norm-Truncated Estimator. Unlike element-wise truncation, our approach preserves the spectral geometry while requiring $O(Nd^2)$ operations, which represents the theoretical lower bound for constructing a full covariance matrix. Although spherical truncation is geometrically suboptimal for anisotropic data, we prove that within the Sub-Weibull class, the exponential tail decay compensates for this mismatch. Leveraging weighted Hanson-Wright inequalities, we derive non-asymptotic error bounds showing that our estimator recovers the optimal sub-Gaussian rate $\tilde{O}(\sqrt{r(\Sigma)/N})$ with high probability. This provides a scalable solution for high-dimensional data that exhibits tails heavier than Gaussian but lighter than polynomial decay.

Fonte: arXiv stat.ML
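
A sketch of what a cross-fitted, norm-truncated covariance estimator can look like, under the assumption that each half's truncation radius is set from the held-out half's norms; the median-based radius and the tau_scale knob are illustrative, not the paper's exact calibration.

```python
import numpy as np

def cross_fitted_truncated_cov(X, tau_scale=2.0):
    """Cross-fitted norm truncation: each half's truncation radius comes
    from the *other* half's norms; samples are shrunk onto that ball before
    averaging outer products. O(N d^2) total; tau_scale is a toy knob."""
    n = len(X)
    halves = (X[: n // 2], X[n // 2:])
    parts = []
    for A, B in (halves, halves[::-1]):
        tau = tau_scale * np.median(np.linalg.norm(B, axis=1))
        norms = np.linalg.norm(A, axis=1)
        shrink = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        Xt = A * shrink[:, None]             # spherical (norm) truncation
        parts.append(Xt.T @ Xt / len(A))
    return 0.5 * (parts[0] + parts[1])

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20)) * (1.0 + rng.pareto(3.0, (1000, 1)))
Sigma_hat = cross_fitted_truncated_cov(X)    # (20, 20), robust to heavy rows
```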

RL • Score 89

Perfect reconstruction of sparse signals using nonconvexity control and one-step RSB message passing

arXiv:2512.17426v1 Announce Type: new Abstract: We consider sparse signal reconstruction via minimization of the smoothly clipped absolute deviation (SCAD) penalty, and develop one-step replica-symmetry-breaking (1RSB) extensions of approximate message passing (AMP), termed 1RSB-AMP. Starting from the 1RSB formulation of belief propagation, we derive explicit update rules of 1RSB-AMP together with the corresponding state evolution (1RSB-SE) equations. A detailed comparison shows that 1RSB-AMP and 1RSB-SE agree remarkably well at the macroscopic level, even in parameter regions where replica-symmetric (RS) AMP, termed RS-AMP, diverges and where the 1RSB description itself is not expected to be thermodynamically exact. Fixed-point analysis of 1RSB-SE reveals a phase diagram consisting of success, failure, and diverging phases, as in the RS case. However, the diverging-region boundary now depends on the Parisi parameter due to the 1RSB ansatz, and we propose a new criterion -- minimizing the size of the diverging region -- rather than the conventional zero-complexity condition, to determine its value. Combining this criterion with the nonconvexity-control (NCC) protocol proposed in a previous RS study improves the algorithmic limit of perfect reconstruction compared with RS-AMP. Numerical solutions of 1RSB-SE and experiments with 1RSB-AMP confirm that this improved limit is achieved in practice, though the gain is modest and remains slightly inferior to the Bayes-optimal threshold. We also report the behavior of thermodynamic quantities -- overlaps, free entropy, complexity, and the non-self-averaging susceptibility -- that characterize the 1RSB phase in this problem.

Fonte: arXiv stat.ML

MLOps/Systems • Score 92

Sharp Structure-Agnostic Lower Bounds for General Functional Estimation

arXiv:2512.17341v1 Announce Type: new Abstract: The design of efficient nonparametric estimators has long been a central problem in statistics, machine learning, and decision making. Classical optimal procedures often rely on strong structural assumptions, which can be misspecified in practice and complicate deployment. This limitation has sparked growing interest in structure-agnostic approaches -- methods that debias black-box nuisance estimates without imposing structural priors. Understanding the fundamental limits of these methods is therefore crucial. This paper provides a systematic investigation of the optimal error rates achievable by structure-agnostic estimators. We first show that, for estimating the average treatment effect (ATE), a central parameter in causal inference, doubly robust learning attains optimal structure-agnostic error rates. We then extend our analysis to a general class of functionals that depend on unknown nuisance functions and establish the structure-agnostic optimality of debiased/double machine learning (DML). We distinguish two regimes -- one where double robustness is attainable and one where it is not -- leading to different optimal rates for first-order debiasing, and show that DML is optimal in both regimes. Finally, we instantiate our general lower bounds by deriving explicit optimal rates that recover existing results and extend to additional estimands of interest. Our results provide theoretical validation for widely used first-order debiasing methods and guidance for practitioners seeking optimal approaches in the absence of structural assumptions. This paper generalizes and subsumes the ATE lower bound established in Jin et al. (2024) by the same authors.

Fonte: arXiv stat.ML
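
For concreteness, the first-order debiased ATE estimator the paper validates is the textbook cross-fitted AIPW/DML recipe; a sketch with generic black-box nuisance learners (random forests here purely for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

def aipw_ate(X, T, Y, n_splits=2):
    """Cross-fitted AIPW/DML estimate of the ATE with black-box nuisances."""
    psi = np.zeros(len(Y))
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
        e = RandomForestClassifier().fit(X[train], T[train])
        m1 = RandomForestRegressor().fit(X[train][T[train] == 1], Y[train][T[train] == 1])
        m0 = RandomForestRegressor().fit(X[train][T[train] == 0], Y[train][T[train] == 0])
        ps = np.clip(e.predict_proba(X[test])[:, 1], 0.01, 0.99)
        mu1, mu0 = m1.predict(X[test]), m0.predict(X[test])
        # first-order debiased (influence-function) scores
        psi[test] = (mu1 - mu0
                     + T[test] * (Y[test] - mu1) / ps
                     - (1 - T[test]) * (Y[test] - mu0) / (1 - ps))
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(Y))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = X[:, 0] + 2.0 * T + rng.standard_normal(500)
ate, se = aipw_ate(X, T, Y)   # estimate should land near the true effect 2.0
```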

NLP/LLMs • Score 96

Accelerating Game Performance with Multi-modal LLMs via Input Prediction and Error Correction

Real-time sequential control agents are often limited by inference latency. We propose a speculate-and-correct framework that adapts the predict-and-verify philosophy of speculative execution to model-based control with TD-MPC2, allowing the agent to execute multiple planned actions without requiring immediate replanning.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs

arXiv:2512.17206v1 Announce Type: new Abstract: Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection of diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable modulation of the (vision-) language model's strategic behavior, thereby achieving consistent performance gains over standard RL methods.

Fonte: arXiv cs.CV

Vision • Score 95

AnyCXR: Human Anatomy Segmentation of Chest X-Rays at Any Acquisition Position Using Multi-Stage Domain-Randomized Synthetic Data with Imperfect Annotations and Conditional Joint Annotation Regularization Learning

Robust anatomical segmentation of chest X-rays (CXRs) remains challenging due to the scarcity of comprehensive annotations and the substantial variability of real-world acquisition conditions. We propose AnyCXR, a unified framework that enables generalizable multi-organ segmentation at arbitrary CXR projection angles using only synthetic supervision.

Fonte: arXiv cs.CV

RL • Score 95

Generative Multi-Objective Bayesian Optimization with Scalable Batch Evaluations for Sample-Efficient De Novo Molecular Design

arXiv:2512.17659v1 Announce Type: new Abstract: Designing molecules that must satisfy multiple, often conflicting objectives is a central challenge in molecular discovery. The enormous size of chemical space and the cost of high-fidelity simulations have driven the development of machine learning-guided strategies for accelerating design with limited data. Among these, Bayesian optimization (BO) offers a principled framework for sample-efficient search, while generative models provide a mechanism to propose novel, diverse candidates beyond fixed libraries. However, existing methods that couple the two often rely on continuous latent spaces, which introduces both architectural entanglement and scalability challenges. This work introduces an alternative, modular "generate-then-optimize" framework for de novo multi-objective molecular design/discovery. At each iteration, a generative model is used to construct a large, diverse pool of candidate molecules, after which a novel acquisition function, qPMHI (multi-point Probability of Maximum Hypervolume Improvement), is used to optimally select a batch of candidates most likely to induce the largest Pareto front expansion. The key insight is that qPMHI decomposes additively, enabling exact, scalable batch selection via only simple ranking of probabilities that can be easily estimated with Monte Carlo sampling. We benchmark the framework against state-of-the-art latent-space and discrete molecular optimization methods, demonstrating significant improvements across synthetic benchmarks and application-driven tasks. Specifically, in a case study related to sustainable energy storage, we show that our approach quickly uncovers novel, diverse, and high-performing organic (quinone-based) cathode materials for aqueous redox flow battery applications.

Fonte: arXiv stat.ML
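
A simplified sketch of the selection step for two objectives, assuming independent Gaussian posteriors per candidate: estimate, by Monte Carlo, each candidate's probability of achieving the maximum hypervolume improvement, then take the top q. The additive decomposition is what makes batch selection a simple ranking; the hv2d helper and the toy numbers are illustrative.

```python
import numpy as np

def hv2d(front, ref):
    """Hypervolume of a 2-objective front (both maximized) w.r.t. ref."""
    pts = front[np.argsort(-front[:, 0])]
    hv, y_prev = 0.0, ref[1]
    for x, y in pts:
        if y > y_prev:
            hv += (x - ref[0]) * (y - y_prev)
            y_prev = y
    return hv

def qpmhi_select(mu, sigma, pareto, ref, q=5, n_mc=256, rng=None):
    """Rank candidates by P(candidate achieves the max HV improvement)."""
    if rng is None:
        rng = np.random.default_rng(0)
    base = hv2d(pareto, ref)
    wins = np.zeros(len(mu))
    for _ in range(n_mc):
        y = rng.normal(mu, sigma)              # one joint posterior draw
        hvi = [hv2d(np.vstack([pareto, yi]), ref) - base for yi in y]
        wins[np.argmax(hvi)] += 1
    # additive decomposition => batch selection is a simple top-q ranking
    return np.argsort(-wins)[:q]

mu = np.array([[1.5, 1.0], [0.5, 2.2], [1.0, 1.8], [0.2, 0.3]])
sigma = np.full_like(mu, 0.2)                  # posterior std per objective
pareto = np.array([[1.2, 1.2], [0.8, 1.9]])    # current non-dominated set
batch = qpmhi_select(mu, sigma, pareto, ref=np.zeros(2), q=2)
```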

Vision • Score 95

Disentangled representations via score-based variational autoencoders

arXiv:2512.17127v1 Announce Type: new Abstract: We present the Score-based Autoencoder for Multiscale Inference (SAMI), a method for unsupervised representation learning that combines the theoretical frameworks of diffusion models and VAEs. By unifying their respective evidence lower bounds, SAMI formulates a principled objective that learns representations through score-based guidance of the underlying diffusion process. The resulting representations automatically capture meaningful structure in the data: SAMI recovers ground-truth generative factors in our synthetic dataset, learns factorized, semantic latent dimensions from complex natural images, and encodes video sequences into latent trajectories that are straighter than those of alternative encoders, despite training exclusively on static images. Furthermore, SAMI can extract useful representations from pre-trained diffusion models with minimal additional training. Finally, the explicitly probabilistic formulation provides new ways to identify semantically meaningful axes in the absence of supervised labels, and its mathematical exactness allows us to make formal statements about the nature of the learned representation. Overall, these results indicate that implicit structural information in diffusion models can be made explicit and interpretable through synergistic combination with a variational autoencoder.

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs

arXiv:2512.17319v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) demonstrate strong perception and reasoning performance on existing remote sensing (RS) benchmarks. However, most prior benchmarks rely on low-resolution imagery, and some high-resolution benchmarks suffer from flawed reasoning-task designs. We show that text-only LLMs can perform competitively with multimodal vision-language models on RS reasoning tasks without access to images, revealing a critical mismatch between current benchmarks and the intended evaluation of visual understanding. To enable faithful assessment, we introduce RSHR-Bench, a super-high-resolution benchmark for RS visual understanding and reasoning. RSHR-Bench contains 5,329 full-scene images with a long side of at least 4,000 pixels, with up to about 3 x 10^8 pixels per image, sourced from widely used RS corpora and UAV collections. We design four task families: multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation. These tasks cover nine perception categories and four reasoning types, supporting multi-turn and multi-image dialog. To reduce reliance on language priors, we apply adversarial filtering with strong LLMs followed by rigorous human verification. Overall, we construct 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or verified single-image evaluation VQA pairs. Evaluations across open-source, closed-source, and RS-specific VLMs reveal persistent performance gaps in super-high-resolution scenarios. Code: https://github.com/Yunkaidang/RSHR

Fonte: arXiv cs.CV

RL • Score 96

Sharing Knowledge without Sharing Data: Stitches can improve ensembles of disjointly trained models

arXiv:2512.17592v1 Announce Type: new Abstract: Deep learning has been shown to be very capable at performing many real-world tasks. However, this performance is often dependent on the presence of large and varied datasets. In some settings, like the medical domain, data is often fragmented across parties and cannot be readily shared. While federated learning addresses this situation, it requires synchronicity among parties training a single model together, exchanging information about model weights. We investigate how asynchronous collaboration, where only already trained models are shared (e.g., as part of a publication), affects performance, and propose stitching as a method for combining models. Taking a multi-objective perspective, where performance on each party's data is viewed independently, we find that training solely on a single party's data yields performance similar to merging with another party's data when evaluated on that party's own data, while performance on other parties' data is notably worse. Moreover, while an ensemble of such individually trained networks generalizes better, performance on each party's own dataset suffers. We find that combining intermediate representations of individually trained models with a well-placed pair of stitching layers allows this performance to recover to a competitive degree while maintaining improved generalization, showing that asynchronous collaboration can yield competitive results.

Fonte: arXiv cs.LG
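
A minimal sketch of stitching as described: two frozen, disjointly trained networks joined by one thin trainable adapter. The shapes, toy data, and the single linear stitch are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical frozen halves trained by different parties on disjoint data.
front = nn.Sequential(nn.Linear(32, 64), nn.ReLU())                     # party A
back = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))   # party B
for p in list(front.parameters()) + list(back.parameters()):
    p.requires_grad_(False)

# The stitch: one thin trainable layer mapping A's features into B's space.
stitch = nn.Linear(64, 64)
model = nn.Sequential(front, stitch, back)

opt = torch.optim.Adam(stitch.parameters(), lr=1e-3)
x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))
for _ in range(100):                     # only the stitch layer is updated
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
```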

MLOps/Systems • Score 96

NetworkFF: Unified Layer Optimization in Forward-Only Neural Networks

arXiv:2512.17531v1 Announce Type: new Abstract: The Forward-Forward algorithm eliminates backpropagation's memory constraints and biological implausibility through dual forward passes with positive and negative data. However, conventional implementations suffer from critical inter-layer isolation, where layers optimize goodness functions independently without leveraging collective learning dynamics. This isolation constrains representational coordination and limits convergence efficiency in deeper architectures. This paper introduces Collaborative Forward-Forward (CFF) learning, extending the original algorithm through inter-layer cooperation mechanisms that preserve forward-only computation while enabling global context integration. Our framework implements two collaborative paradigms: Fixed CFF (F-CFF) with constant inter-layer coupling and Adaptive CFF (A-CFF) with learnable collaboration parameters that evolve during training. The collaborative goodness function incorporates weighted contributions from all layers, enabling coordinated feature learning while maintaining memory efficiency and biological plausibility. Comprehensive evaluation on MNIST and Fashion-MNIST demonstrates significant performance improvements over baseline Forward-Forward implementations. These findings establish inter-layer collaboration as a fundamental enhancement to Forward-Forward learning, with immediate applicability to neuromorphic computing architectures and energy-constrained AI systems.

Fonte: arXiv cs.LG
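
An illustrative reading of Fixed CFF (not the authors' code): each layer's goodness is augmented by a constant-weighted, gradient-detached contribution from the other layers, so training stays forward-only while the objective couples layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

layers = nn.ModuleList([nn.Linear(784, 500), nn.Linear(500, 500)])
opt = torch.optim.Adam(layers.parameters(), lr=1e-3)
alpha, theta = 0.3, 2.0          # coupling strength and goodness threshold

def layer_goodness(x):
    gs, h = [], x
    for layer in layers:
        h = F.relu(layer(h))
        gs.append(h.pow(2).mean(dim=1))     # per-sample goodness
        h = h.detach()                      # forward-only: no backprop across layers
    return gs

def cff_loss(x, positive):
    gs = layer_goodness(x)
    total = 0.0
    for i, g in enumerate(gs):
        others = torch.stack([gs[j].detach() for j in range(len(gs)) if j != i]).mean(0)
        g_collab = g + alpha * others       # collaborative goodness
        sign = 1.0 if positive else -1.0    # push positives up, negatives down
        total = total + F.softplus(-sign * (g_collab - theta)).mean()
    return total

x_pos, x_neg = torch.rand(64, 784), torch.rand(64, 784)  # stand-in pos/neg data
opt.zero_grad()
(cff_loss(x_pos, True) + cff_loss(x_neg, False)).backward()
opt.step()
```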

RL • Score 96

Assessing Long-Term Electricity Market Design for Ambitious Decarbonization Targets using Multi-Agent Reinforcement Learning

arXiv:2512.17444v1 Announce Type: new Abstract: Electricity systems are key to transforming today's society into a carbon-free economy. Long-term electricity market mechanisms, including auctions, support schemes, and other policy instruments, are critical in shaping the electricity generation mix. In light of the need for more advanced tools to support policymakers and other stakeholders in designing, testing, and evaluating long-term markets, this work presents a multi-agent reinforcement learning model capable of capturing the key features of decarbonizing energy systems. Profit-maximizing generation companies make investment decisions in the wholesale electricity market, responding to system needs, competitive dynamics, and policy signals. The model employs independent proximal policy optimization, which was selected for suitability to the decentralized and competitive environment. Nevertheless, given the inherent challenges of independent learning in multi-agent settings, an extensive hyperparameter search ensures that decentralized training yields market outcomes consistent with competitive behavior. The model is applied to a stylized version of the Italian electricity system and tested under varying levels of competition, market designs, and policy scenarios. Results highlight the critical role of market design for decarbonizing the electricity sector and avoiding price volatility. The proposed framework allows assessing long-term electricity markets in which multiple policy and market mechanisms interact simultaneously, with market participants responding and adapting to decarbonization pathways.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Understanding Generalization in Role-Playing Models via Information Theory

arXiv:2512.17270v1 Announce Type: new Abstract: Role-playing models (RPMs) are widely used in real-world applications but underperform when deployed in the wild. This degradation can be attributed to distribution shifts, including user, character, and dialogue compositional shifts. Existing methods like LLM-as-a-judge fall short in providing a fine-grained diagnosis of how these shifts affect RPM generalization, and thus formal frameworks to characterize RPM generalization behaviors are lacking. To bridge these gaps, we introduce an information-theoretic metric, named reasoning-based effective mutual information difference (R-EMID), to measure RPM performance degradation in an interpretable way. We also derive an upper bound on R-EMID to predict the worst-case generalization performance of RPMs and theoretically reveal how various shifts contribute to RPM performance degradation. Moreover, we propose a co-evolving reinforcement learning framework to adaptively model the connection among user, character, and dialogue context and thus enhance the estimation of dialogue response generation probability, which is critical for calculating R-EMID. Finally, we evaluate the generalization performance of various RPMs using R-EMID, finding that user shift poses the highest risk among all shifts and that reinforcement learning is the most effective approach for enhancing RPM generalization.

Fonte: arXiv cs.LG

NLP/LLMs • Score 92

Perturb Your Data: Paraphrase-Guided Training Data Watermarking

arXiv:2512.17075v1 Announce Type: new Abstract: Training data detection is critical for enforcing copyright and data licensing, as Large Language Models (LLMs) are trained on massive text corpora scraped from the internet. We present SPECTRA, a watermarking approach that makes training data reliably detectable even when it comprises less than 0.001% of the training corpus. SPECTRA works by paraphrasing text using an LLM and assigning a score based on how likely each paraphrase is, according to a separate scoring model. A paraphrase is chosen so that its score closely matches that of the original text, to avoid introducing any distribution shifts. To test whether a suspect model has been trained on the watermarked data, we compare its token probabilities against those of the scoring model. We demonstrate that SPECTRA achieves a consistent p-value gap of over nine orders of magnitude when detecting data used for training versus data not used for training, which is greater than all baselines tested. SPECTRA equips data owners with a scalable, deploy-before-release watermark that survives even large-scale LLM training.

Fonte: arXiv cs.CL
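
A sketch of the two ends of the pipeline under stated assumptions: `score` is a stand-in callable for log-probability under the separate scoring model, and detection is reduced here to a simple paired z-test on token log-probabilities (the paper's actual test statistic may differ).

```python
import numpy as np
from scipy.stats import norm

def spectra_pick(original, paraphrases, score):
    """Choose the paraphrase whose score best matches the original's,
    so the watermark does not shift the text distribution."""
    s0 = score(original)
    return min(paraphrases, key=lambda p: abs(score(p) - s0))

def membership_pvalue(suspect_logprobs, scorer_logprobs):
    """Paired one-sided z-test on per-token log-probabilities.

    If the suspect model trained on the watermarked text, its token
    log-probs should systematically exceed the scoring model's."""
    d = np.asarray(suspect_logprobs) - np.asarray(scorer_logprobs)
    z = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    return norm.sf(z)   # small p-value => evidence of training on the data
```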

Vision • Score 96

Dialectics for Artificial Intelligence

The paper explores whether artificial intelligence can discover concepts from raw experience without human supervision. We propose a definition of 'concept' as a structure that is revisable, comparable, and alignable across agents, adopting an algorithmic-information perspective.

Fonte: arXiv cs.AI

Evaluation/Benchmarks • Score 93

meval: A Statistical Toolbox for Fine-Grained Model Performance Analysis

arXiv:2512.17409v1 Announce Type: new Abstract: Analyzing machine learning model performance stratified by patient and recording properties is becoming the accepted norm and often yields crucial insights about important model failure modes. Performing such analyses in a statistically rigorous manner is non-trivial, however. Appropriate performance metrics must be selected that allow for valid comparisons between groups of different sample sizes and base rates; metric uncertainty must be determined and multiple comparisons be corrected for, in order to assess whether any observed differences may be purely due to chance; and in the case of intersectional analyses, mechanisms must be implemented to find the most 'interesting' subgroups within combinatorially many subgroup combinations. We here present a statistical toolbox that addresses these challenges and enables practitioners to easily yet rigorously assess their models for potential subgroup performance disparities. While broadly applicable, the toolbox is specifically designed for medical imaging applications. The analyses provided by the toolbox are illustrated in two case studies, one in skin lesion malignancy classification on the ISIC2020 dataset and one in chest X-ray-based disease classification on the MIMIC-CXR dataset.

Fonte: arXiv cs.LG
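
The kind of analysis described can be sketched generically: per-subgroup AUROC with bootstrap confidence intervals and a Bonferroni correction over subgroups. This is not meval's actual API, just an illustration of the statistical recipe.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auroc_cis(y, score, groups, n_boot=1000, alpha=0.05):
    """Per-subgroup AUROC with bootstrap CIs, Bonferroni-corrected.

    y, score, groups: 1-D numpy arrays of labels, model scores, subgroup ids.
    """
    rng = np.random.default_rng(0)
    uniq = np.unique(groups)
    a = alpha / len(uniq)                    # correct for multiple subgroups
    results = {}
    for g in uniq:
        idx = np.flatnonzero(groups == g)
        boots = []
        for _ in range(n_boot):
            b = rng.choice(idx, size=len(idx), replace=True)
            if len(np.unique(y[b])) == 2:    # AUROC needs both classes
                boots.append(roc_auc_score(y[b], score[b]))
        lo, hi = np.quantile(boots, [a / 2, 1 - a / 2])
        results[g] = (roc_auc_score(y[idx], score[idx]), lo, hi)
    return results
```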

NLP/LLMs • Score 96

MMRAG-RFT: Two-Stage Reinforcement Fine-Tuning for Explainable Multi-modal Retrieval-Augmented Generation

Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly credible generation by integrating external multi-modal knowledge, delivering impressive performance in complex scenarios. However, existing methods fail to clarify the reasoning logic behind retrieval and answer generation. We propose introducing reinforcement learning to strengthen the reasoning capabilities of multi-modal language models.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Realistic Threat Perception Drives Intergroup Conflict: A Causal and Dynamic Analysis Using Generative Agent Simulations

Human conflict is often attributed to threats against material conditions and symbolic values, but how the two interact and which dominates remains unclear. This study uses simulations of agents driven by large language models (LLMs) in virtual societies to investigate these dynamics.

Fonte: arXiv cs.AI

RL • Score 96

Learning Safe Autonomous Driving Policies Using Predictive Safety Representations

arXiv:2512.17586v1 Announce Type: new Abstract: Safe reinforcement learning (SafeRL) is a prominent paradigm for autonomous driving, where agents are required to optimize performance under strict safety requirements. This dual objective creates a fundamental tension, as overly conservative policies limit driving efficiency while aggressive exploration risks safety violations. The Safety Representations for Safer Policy Learning (SRPL) framework addresses this challenge by equipping agents with a predictive model of future constraint violations and has shown promise in controlled environments. This paper investigates whether SRPL extends to real-world autonomous driving scenarios. Systematic experiments on the Waymo Open Motion Dataset (WOMD) and NuPlan demonstrate that SRPL can improve the reward-safety tradeoff, achieving statistically significant improvements in success rate (effect sizes r = 0.65-0.86) and cost reduction (effect sizes r = 0.70-0.83), with p < 0.05 for observed improvements. However, its effectiveness depends on the underlying policy optimizer and the dataset distribution. The results further show that predictive safety representations play a critical role in improving robustness to observation noise. Additionally, in zero-shot cross-dataset evaluation, SRPL-augmented agents demonstrate improved generalization compared to non-SRPL methods. These findings collectively demonstrate the potential of predictive safety representations to strengthen SafeRL for autonomous driving.

Fonte: arXiv cs.LG

RL • Score 96

SafeBench-Seq: A Homology-Clustered, CPU-Only Baseline for Protein Hazard Screening with Physicochemical/Composition Features and Cluster-Aware Confidence Intervals

arXiv:2512.17527v1 Announce Type: new Abstract: Foundation models for protein design raise concrete biosecurity risks, yet the community lacks a simple, reproducible baseline for sequence-level hazard screening that is explicitly evaluated under homology control and runs on commodity CPUs. We introduce SafeBench-Seq, a metadata-only, reproducible benchmark and baseline classifier built entirely from public data (SafeProtein hazards and UniProt benigns) and interpretable features (global physicochemical descriptors and amino-acid composition). To approximate "never-before-seen" threats, we homology-cluster the combined dataset at <=40% identity and perform cluster-level holdouts (no cluster overlap between train/test). We report discrimination (AUROC/AUPRC) and screening-operating points (TPR@1% FPR; FPR@95% TPR) with 95% bootstrap confidence intervals (n=200), and we provide calibrated probabilities via CalibratedClassifierCV (isotonic for Logistic Regression / Random Forest; Platt sigmoid for Linear SVM). We quantify probability quality using Brier score, Expected Calibration Error (ECE; 15 bins), and reliability diagrams. Shortcut susceptibility is probed via composition-preserving residue shuffles and length-/composition-only ablations. Empirically, random splits substantially overestimate robustness relative to homology-clustered evaluation; calibrated linear models exhibit comparatively good calibration, while tree ensembles retain slightly higher Brier/ECE. SafeBench-Seq is CPU-only, reproducible, and releases metadata only (accessions, cluster IDs, split labels), enabling rigorous evaluation without distributing hazardous sequences.

Fonte: arXiv cs.LG
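
A minimal sketch of the feature-and-calibration recipe with scikit-learn, on random stand-in sequences; the real benchmark uses SafeProtein/UniProt data, a richer descriptor set, and homology-clustered (<=40% identity) splits rather than this toy setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

AA = "ACDEFGHIKLMNPQRSTVWY"

def composition_features(seq):
    """Amino-acid composition plus length: cheap, interpretable features."""
    counts = np.array([seq.count(a) for a in AA], dtype=float)
    return np.append(counts / max(len(seq), 1), len(seq))

# Random stand-in sequences and labels, purely to make the sketch runnable.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list(AA), rng.integers(50, 300))) for _ in range(200)]
X = np.stack([composition_features(s) for s in seqs])
y = rng.integers(0, 2, len(seqs))

# Isotonic calibration on a linear model, as in the paper's LR variant.
clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                             method="isotonic", cv=5)
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]   # calibrated hazard probabilities
```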

NLP/LLMs • Score 92

XLM: A Python package for non-autoregressive language models

arXiv:2512.17065v1 Announce Type: new Abstract: In recent years, there has been a resurgence of interest in non-autoregressive text generation in the context of general language modeling. Unlike the well-established autoregressive language modeling paradigm, which has a plethora of standard training and inference libraries, implementations of non-autoregressive language modeling have largely been bespoke, making it difficult to perform systematic comparisons of different methods. Moreover, each non-autoregressive language model typically requires its own data collation, loss, and prediction logic, making it challenging to reuse common components. In this work, we present the XLM python package, which is designed to make implementing small non-autoregressive language models faster, with a secondary goal of providing a suite of small pre-trained models (through a companion xlm-models package) that can be used by the research community. The code is available at https://github.com/dhruvdcoder/xlm-core.

Fonte: arXiv cs.CL

Theory/Optimization • Score 90

Machine Learning for Static and Single-Event Dynamic Complex Network Analysis

arXiv:2512.17577v1 Announce Type: new Abstract: The primary objective of this thesis is to develop novel algorithmic approaches for Graph Representation Learning of static and single-event dynamic networks. In this direction, we focus on the family of Latent Space Models, and more specifically on the Latent Distance Model, which naturally conveys important network characteristics such as homophily, transitivity, and balance theory. Furthermore, this thesis aims to create structure-aware network representations, which lead to hierarchical expressions of network structure, community characterization, the identification of extreme profiles in networks, and the quantification of impact dynamics in temporal networks. Crucially, the methods presented are designed to define unified learning processes, eliminating the need for heuristics and multi-stage processes such as post-processing steps. Our aim is to work towards unified network embeddings that are both comprehensive and powerful, capable of characterizing network structures and adeptly handling the diverse tasks that graph analysis offers.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Deep Learning-Based Surrogate Creep Modelling in Inconel 625: A High-Temperature Alloy Study

arXiv:2512.17477v1 Announce Type: new Abstract: Time-dependent deformation, particularly creep, in high-temperature alloys such as Inconel 625 is a key factor in the long-term reliability of components used in aerospace and energy systems. Although Inconel 625 shows excellent creep resistance, finite-element creep simulations in tools such as ANSYS remain computationally expensive, often requiring tens of minutes for a single 10,000-hour run. This work proposes deep learning based surrogate models to provide fast and accurate replacements for such simulations. Creep strain data was generated in ANSYS using the Norton law under uniaxial stresses of 50 to 150 MPa and temperatures of 700 to 1000 $^\circ$C, and this temporal dataset was used to train two architectures: a BiLSTM Variational Autoencoder for uncertainty-aware and generative predictions, and a BiLSTM Transformer hybrid that employs self-attention to capture long-range temporal behavior. Both models act as surrogate predictors, with the BiLSTM-VAE offering probabilistic output and the BiLSTM-Transformer delivering high deterministic accuracy. Performance is evaluated using RMSE, MAE, and $R^2$. Results show that the BiLSTM-VAE provides stable and reliable creep strain forecasts, while the BiLSTM-Transformer achieves strong accuracy across the full time range. Latency tests indicate substantial speedup: while each ANSYS simulation requires 30 to 40 minutes for a given stress-temperature condition, the surrogate models produce predictions within seconds. The proposed framework enables rapid creep assessment for design optimization and structural health monitoring, and provides a scalable solution for high-temperature alloy applications.

Fonte: arXiv cs.LG
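
For reference, Norton-law data generation of the kind described reduces to integrating a constant secondary creep rate per stress-temperature condition; the sketch below uses illustrative constants, not calibrated Inconel 625 parameters.

```python
import numpy as np

def norton_creep_strain(sigma_mpa, temp_c, hours, A=1e-20, n=5.0, Q=280e3):
    """Secondary creep from the Norton law with Arrhenius temperature
    dependence: d(eps)/dt = A * sigma^n * exp(-Q / (R * T)).
    Constants here are illustrative, not calibrated Inconel 625 values."""
    R = 8.314                                   # J/(mol K)
    T = temp_c + 273.15
    rate = A * sigma_mpa**n * np.exp(-Q / (R * T))
    t = np.linspace(0.0, hours, 200)
    return t, rate * t                          # strain history over time

# Sweep the stress/temperature grid that feeds the surrogate training set.
histories = {}
for sigma in (50, 100, 150):                    # MPa
    for temp in (700, 850, 1000):               # degrees C
        histories[(sigma, temp)] = norton_creep_strain(sigma, temp, 10_000)
```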

Vision • Score 93

DeepShare: Sharing ReLU Across Channels and Layers for Efficient Private Inference

arXiv:2512.17398v1 Announce Type: new Abstract: Private Inference (PI) uses cryptographic primitives to perform privacy-preserving machine learning. In this setting, the owner of the network runs inference on the data of the client without learning anything about the data and without revealing any information about the model. It has been observed that a major computational bottleneck of PI is the calculation of the gate (i.e., ReLU), so a considerable amount of effort has been devoted to reducing the number of ReLUs in a given network. We focus on the DReLU, which is the non-linear step function of the ReLU, and show that one DReLU can serve many ReLU operations. We suggest a new activation module where the DReLU operation is only performed on a subset of the channels (prototype channels), while the rest of the channels (replicate channels) replicate the DReLU of each of their neurons from the corresponding neurons in one of the prototype channels. We then extend this idea to work across different layers. We show that this formulation can drastically reduce the number of DReLU operations in ResNet-type networks. Furthermore, our theoretical analysis shows that this new formulation can solve an extended version of the XOR problem using just one non-linearity and two neurons, something that traditional formulations and some PI-specific methods cannot achieve. We achieve new SOTA results on several classification setups, as well as on image segmentation.

Fonte: arXiv cs.LG
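
A sketch of the activation module: only prototype channels compute the expensive DReLU (step) mask, and every replicate channel multiplies its pre-activation by a borrowed mask. The modulo channel-to-prototype mapping is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SharedDReLU(nn.Module):
    """ReLU where only `n_proto` prototype channels compute the DReLU
    (step) mask; every replicate channel reuses a prototype's mask."""
    def __init__(self, channels, n_proto):
        super().__init__()
        self.n_proto = n_proto
        self.register_buffer("src", torch.arange(channels) % n_proto)

    def forward(self, x):                        # x: (B, C, H, W)
        mask = (x[:, : self.n_proto] > 0).to(x.dtype)   # DReLU on prototypes
        return x * mask[:, self.src]             # ReLU(x) = x * shared step

act = SharedDReLU(channels=64, n_proto=8)        # 8 masks serve 64 channels
y = act(torch.randn(2, 64, 16, 16))              # 8x fewer secure comparisons
```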

Vision • Score 95

SynergyWarpNet: Attention-Guided Cooperative Warping for Neural Portrait Animation

arXiv:2512.17331v1 Announce Type: new Abstract: Recent advances in neural portrait animation have demonstrated remarkable potential for applications in virtual avatars, telepresence, and digital content creation. However, traditional explicit warping approaches often struggle with accurate motion transfer or recovering missing regions, while recent attention-based warping methods, though effective, frequently suffer from high complexity and weak geometric grounding. To address these issues, we propose SynergyWarpNet, an attention-guided cooperative warping framework designed for high-fidelity talking head synthesis. Given a source portrait, a driving image, and a set of reference images, our model progressively refines the animation in three stages. First, an explicit warping module performs coarse spatial alignment between the source and driving image using 3D dense optical flow. Next, a reference-augmented correction module leverages cross-attention across 3D keypoints and texture features from multiple reference images to semantically complete occluded or distorted regions. Finally, a confidence-guided fusion module integrates the warped outputs with spatially adaptive fusion, using a learned confidence map to balance structural alignment and visual consistency. Comprehensive evaluations on benchmark datasets demonstrate state-of-the-art performance.

Fonte: arXiv cs.CV

NLP/LLMs • Score 92

EMAG: Self-Rectifying Diffusion Sampling with Exponential Moving Average Guidance

In diffusion and flow-matching generative models, guidance techniques are widely used to improve sample quality and consistency. We propose EMAG, a mechanism that modifies attention at inference time in diffusion transformers, producing harder, semantically faithful negatives and improving quality and human preference score (HPS) by +0.46 over CFG.

Fonte: arXiv cs.CV

Evaluation/Benchmarks • Score 89

Enhancing Long Document Long Form Summarisation with Self-Planning

arXiv:2512.17179v1 Announce Type: new Abstract: We introduce a novel approach for long-context summarisation, highlight-guided generation, that leverages sentence-level information as a content plan to improve the traceability and faithfulness of generated summaries. Our framework applies self-planning methods to identify important content and then generates a summary conditioned on the plan. We explore both end-to-end and two-stage variants of the approach, finding that the two-stage pipeline performs better on long and information-dense documents. Experiments on long-form summarisation datasets demonstrate that our method consistently improves factual consistency while preserving relevance and overall quality. On GovReport, our best approach improves ROUGE-L by 4.1 points and achieves about 35% gains in SummaC scores. Qualitative analysis shows that highlight-guided summarisation helps preserve important details, leading to more accurate and insightful summaries across domains.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping

arXiv:2512.17570v1 Announce Type: new Abstract: SSD-offloaded training offers a practical and promising approach to making LLM training cost-effective. Building on gradient accumulation with micro-batches, this paper introduces GreedySnake, a new SSD-offloaded training system that employs vertical scheduling, which executes all microbatches of a layer before proceeding to the next. Compared to existing systems that use horizontal scheduling (i.e., executing micro-batches sequentially), GreedySnake achieves higher training throughput with smaller batch sizes, bringing the system much closer to the ideal scenario predicted by the roofline model. To further mitigate the I/O bottleneck, GreedySnake overlaps part of the optimization step with the forward pass of the next iteration. Experimental results on A100 GPUs show that GreedySnake achieves saturated training throughput improvements over ZeRO-Infinity: 1.96x on 1 GPU and 1.93x on 4 GPUs for GPT-65B, and 2.53x on 1 GPU for GPT-175B. The code is open-sourced at https://github.com/npz7yyk/GreedySnake

Fonte: arXiv cs.LG

RL • Score 93

Humanlike AI Design Increases Anthropomorphism but Yields Divergent Engagement and Trust Outcomes Globally

More than a billion users worldwide interact with AI systems designed to mimic human characteristics. This study investigates the relationship between humanlike AI design and its effects on engagement and trust, using experiments across 10 nations. The results show that perceived humanness in AI is shaped by cultural factors, challenging narratives about risks inherent to humanlike design.

Fonte: arXiv cs.AI

NLP/LLMs • Score 93

A lightweight Spatial-Temporal Graph Neural Network for Long-term Time Series Forecasting

arXiv:2512.17453v1 Announce Type: new Abstract: We propose Lite-STGNN, a lightweight spatial-temporal graph neural network for long-term multivariate forecasting that integrates decomposition-based temporal modeling with learnable sparse graph structure. The temporal module applies trend-seasonal decomposition, while the spatial module performs message passing with low-rank Top-$K$ adjacency learning and conservative horizon-wise gating, enabling spatial corrections that enhance a strong linear baseline. Lite-STGNN achieves state-of-the-art accuracy on four benchmark datasets for horizons up to 720 steps, while being parameter-efficient and substantially faster to train than transformer-based methods. Ablation studies show that the spatial module yields 4.6% improvement over the temporal baseline, Top-$K$ enhances locality by 3.3%, and learned adjacency matrices reveal domain-specific interaction dynamics. Lite-STGNN thus offers a compact, interpretable, and efficient framework for long-term multivariate time series forecasting.

Fonte: arXiv cs.LG
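
Two of the described components are easy to sketch in PyTorch: a learnable low-rank Top-K adjacency, and moving-average trend-seasonal decomposition. Module names and hyperparameters are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankTopKGraph(nn.Module):
    """Learnable sparse adjacency: low-rank node embeddings, Top-K per row."""
    def __init__(self, n_nodes, rank=8, k=4):
        super().__init__()
        self.E1 = nn.Parameter(torch.randn(n_nodes, rank))
        self.E2 = nn.Parameter(torch.randn(n_nodes, rank))
        self.k = k

    def forward(self):
        logits = self.E1 @ self.E2.T                          # (N, N)
        topv, topi = logits.topk(self.k, dim=-1)
        masked = torch.full_like(logits, float("-inf")).scatter(-1, topi, topv)
        return torch.softmax(masked, dim=-1)        # sparse, row-stochastic

def trend_seasonal(x, kernel=25):
    """Moving-average trend plus residual seasonality (DLinear-style)."""
    pad = kernel // 2
    trend = F.avg_pool1d(F.pad(x, (pad, pad), mode="replicate"), kernel, stride=1)
    return trend, x - trend

x = torch.randn(8, 7, 96)            # (batch, nodes, lookback window)
trend, season = trend_seasonal(x)
A = LowRankTopKGraph(n_nodes=7)()    # adjacency for message passing on nodes
mixed = torch.einsum("ij,bjt->bit", A, season)   # one spatial correction step
```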

NLP/LLMs • Score 95

Data Augmentation Supporting a Conversational Agent Designed for Smoking Cessation Support Groups

arXiv:2512.17092v1 Announce Type: new Abstract: Online support groups for smoking cessation are economical and accessible, yet they often face challenges with low user engagement and stigma. The use of an automatic conversational agent would improve engagement by ensuring that all user comments receive a timely response. We address the challenge of insufficient high-quality data by employing a two-level data augmentation strategy: synthetic data augmentation and real data augmentation. First, we fine-tuned an open-source LLM to classify posts from our existing smoking cessation support groups and identify intents with low F1 scores (the harmonic mean of precision and recall). Then, for these intents, we generate additional synthetic data using prompt engineering with the GPT model, with an average of 87% of the generated synthetic posts deemed high quality by human annotators. Overall, the synthetic augmentation process resulted in 43% of the original posts being selected for augmentation, followed by 140% synthetic expansion of these posts. Additionally, we scraped more than 10,000 real posts from a related online support context, of which 73% were validated as good quality by human annotators. Each synthetic or scraped post underwent rigorous validation involving human reviewers to ensure quality and relevance. The validated new data, combined with the original support group posts, formed an augmented dataset used to retrain the intent classifier. Performance evaluation of the retrained model demonstrated a 32% improvement in F1, confirming the effectiveness of our data augmentation approach. Synthetic and real post augmentation led to similar performance improvements. This study provides a replicable framework for enhancing conversational agent performance in domains where data scarcity is a critical issue.

Source: arXiv cs.CL

Evaluation/Benchmarks • Score 90

Bayesian Optimisation: Which Constraints Matter?

arXiv:2512.17569v1 Announce Type: new Abstract: Bayesian optimisation has proven to be a powerful tool for expensive global black-box optimisation problems. In this paper, we propose new Bayesian optimisation variants of the popular Knowledge Gradient acquisition functions for problems with decoupled black-box constraints, in which subsets of the objective and constraint functions may be evaluated independently. In particular, our methods aim to take into account that often only a handful of the constraints may be binding at the optimum, and hence we should evaluate only relevant constraints when trying to optimise a function. We empirically benchmark these methods against existing methods and demonstrate their superiority over the state-of-the-art.

Source: arXiv cs.LG

NLP/LLMs • Score 96

Learning What to Write: Write-Gated KV for Efficient Long-Context Inference

arXiv:2512.17452v1 Announce Type: new Abstract: Long-context LLM inference is bottlenecked by the quadratic attention complexity and linear KV cache growth. Prior approaches mitigate this via post-hoc selection or eviction but overlook the root inefficiency: indiscriminate writing to persistent memory. In this paper, we formalize KV cache management as a causal system of three primitives: KV Admission, Selection, and Eviction. We instantiate KV Admission via Write-Gated KV, a lightweight mechanism that learns to predict token utility before it enters the cache. By filtering out low-utility states early to maintain a compact global cache alongside a sliding local cache, Write-Gated KV reduces memory usage by 46-57% and delivers 3.03-3.45$\times$ prefill and 1.89-2.56$\times$ decode speedups on Llama models with negligible accuracy loss, all while remaining compatible with FlashAttention and paged-KV systems. These results demonstrate that learning what to write is a principled and practical recipe for efficient long-context inference. Code is available at https://github.com/EMCLab-Sinica/WG-KV .
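
As a rough sketch of the admission idea (the `WriteGate` module, the fixed threshold, and the cache layout below are hypothetical, not the paper's implementation), a small learned scorer can decide per token whether its KV pair enters the persistent global cache, while a sliding window keeps recent local context regardless:

```python
import torch
import torch.nn as nn

class WriteGate(nn.Module):
    """Toy admission gate: predicts a token's utility before it is cached."""

    def __init__(self, d_model: int):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [batch, seq, d_model] -> admission probability per token
        return torch.sigmoid(self.scorer(h)).squeeze(-1)

def admit_kv(h, k, v, gate, local_window=128, threshold=0.5):
    """Keep a sliding local cache; admit only high-utility tokens globally."""
    score = gate(h)                                        # [batch, seq]
    keep = score > threshold                               # mask of admitted tokens
    global_k = [k[b, keep[b]] for b in range(k.size(0))]   # ragged per batch item
    global_v = [v[b, keep[b]] for b in range(v.size(0))]
    local_k, local_v = k[:, -local_window:], v[:, -local_window:]
    return global_k, global_v, local_k, local_v
```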

Source: arXiv cs.LG

NLP/LLMs • Score 96

Solomonoff-Inspired Hypothesis Ranking with LLMs for Prediction Under Uncertainty

Reasoning under uncertainty is a crucial challenge in AI, especially in real-world tasks where data-scarce problems demand systematic generalization. We propose a Solomonoff-inspired method that weights LLM-generated hypotheses by simplicity and predictive fit, yielding conservative, uncertainty-aware predictions.
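
A minimal sketch of the weighting the summary describes, using description length as the simplicity proxy (the helper name and toy hypotheses are illustrative; the paper's scoring of LLM-generated hypotheses may differ):

```python
import numpy as np

def solomonoff_style_weights(descriptions, log_likelihoods):
    """Weight hypotheses by simplicity and predictive fit.

    Implements w(h) proportional to 2^(-len(h)) * P(data | h), a crude
    stand-in for a Solomonoff-style prior over program-like hypotheses.
    """
    lengths = np.array([len(d) for d in descriptions], dtype=float)
    log_prior = -lengths * np.log(2.0)             # 2^(-length) simplicity prior
    log_post = log_prior + np.asarray(log_likelihoods, dtype=float)
    log_post -= log_post.max()                     # stabilize before exponentiating
    w = np.exp(log_post)
    return w / w.sum()

# A short hypothesis that fits slightly worse vs. a longer one that fits better.
w = solomonoff_style_weights(["x+1", "0.97*x + 1.04*sin(x)"], [-3.2, -2.9])
```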

Source: arXiv cs.AI

NLP/LLMs • Score 96

InfoTok: Adaptive Discrete Video Tokenization via Information-Theoretic Compression

Accurate and efficient discrete video tokenization is essential for processing long video sequences. This paper presents InfoTok, a principled framework for adaptive video tokenization, proving that existing training methods are suboptimal and introducing a new ELBO-based algorithm that approaches theoretical optimality.

Source: arXiv cs.AI

NLP/LLMs • Score 96

Adversarially Robust Detection of Harmful Online Content: A Computational Design Science Approach

arXiv:2512.17367v1 Announce Type: new Abstract: Social media platforms are plagued by harmful content such as hate speech, misinformation, and extremist rhetoric. Machine learning (ML) models are widely adopted to detect such content; however, they remain highly vulnerable to adversarial attacks, wherein malicious users subtly modify text to evade detection. Enhancing adversarial robustness is therefore essential, requiring detectors that can defend against diverse attacks (generalizability) while maintaining high overall accuracy. However, simultaneously achieving both optimal generalizability and accuracy is challenging. Following the computational design science paradigm, this study takes a sequential approach that first proposes a novel framework (Large Language Model-based Sample Generation and Aggregation, LLM-SGA) by identifying the key invariances of textual adversarial attacks and leveraging them to ensure that a detector instantiated within the framework has strong generalizability. Second, we instantiate our detector (Adversarially Robust Harmful Online Content Detector, ARHOCD) with three novel design components to improve detection accuracy: (1) an ensemble of multiple base detectors that exploits their complementary strengths; (2) a novel weight assignment method that dynamically adjusts weights based on each sample's predictability and each base detector's capability, with weights initialized using domain knowledge and updated via Bayesian inference; and (3) a novel adversarial training strategy that iteratively optimizes both the base detectors and the weight assignor. We addressed several limitations of existing adversarial robustness enhancement research and empirically evaluated ARHOCD across three datasets spanning hate speech, rumor, and extremist content. Results show that ARHOCD offers strong generalizability and improves detection accuracy under adversarial conditions.

Source: arXiv cs.LG

NLP/LLMs • Score 96

Task Schema and Binding: A Double Dissociation Study of In-Context Learning

arXiv:2512.17325v1 Announce Type: new Abstract: We provide causal mechanistic validation that in-context learning (ICL) decomposes into two separable mechanisms: Task Schema (abstract task type recognition) and Binding (specific input-output associations). Through activation patching experiments across 9 models from 7 Transformer families plus Mamba (370M-13B parameters), we establish three key findings: (1) Double dissociation: Task Schema transfers at 100% via late MLP patching, while Binding transfers at 62% via residual stream patching -- proving separable mechanisms. (2) Prior-Schema trade-off: Schema reliance inversely correlates with prior knowledge (Spearman rho = -0.596, p < 0.001, N=28 task-model pairs). (3) Architecture generality: the mechanism operates across all tested architectures, including the non-Transformer Mamba. These findings offer a mechanistic account of the ICL puzzle that contrasts with prior views treating ICL as a monolithic mechanism (whether retrieval-based, gradient descent-like, or purely Bayesian). By establishing that Schema and Binding are neurally dissociable -- not merely behavioral modes -- we provide causal evidence for dual-process theories of ICL. Models rely on Task Schema when prior knowledge is absent, but prior knowledge interferes through attentional mis-routing (72.7% recency bias) rather than direct output competition (0%). This explains why arbitrary mappings succeed (zero prior leads to full Schema reliance) while factual overrides fail -- and reveals that the true bottleneck is attentional, not output-level. Practical implications: Understanding these dual mechanisms enables more efficient prompt engineering -- reliable schema transfer reduces required demonstrations for novel tasks, while prior-aware design can mitigate the 38% binding failure rate in high-prior scenarios, improving ICL system reliability in production deployments.
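
For readers unfamiliar with the technique, activation patching of the kind used here can be set up with a forward hook: run the model on one input while overwriting a chosen layer's output with a cached activation from a donor run. The harness below is generic; the choice of layers and donor activations is the paper's experimental design, and `inputs` is assumed to be a dict of model kwargs.

```python
import torch

def patch_activation(model, layer, cached, inputs):
    """Run `model` on `inputs`, substituting `layer`'s output with `cached`."""

    def hook(module, args, output):
        return cached                    # replace the activation from this run

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            out = model(**inputs)
    finally:
        handle.remove()                  # always detach the hook afterwards
    return out
```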

Source: arXiv cs.LG

NLP/LLMs • Score 93

UniRel-R1: RL-Tuned LLM Reasoning for Relation-Centric Question Answering over Knowledge Graphs

Knowledge Graph Question Answering (KGQA) has traditionally focused on entity-centric queries that return a single answer entity. In this work, we introduce relation-centric KGQA, where the answer is a subgraph capturing the semantic connections between entities. We propose UniRel-R1, a unified framework that integrates subgraph selection, multi-stage graph pruning, and an LLM fine-tuned with reinforcement learning.

Source: arXiv cs.AI

NLP/LLMs • Score 95

Vision-Language Model Guided Image Restoration

arXiv:2512.17292v1 Announce Type: new Abstract: Many image restoration (IR) tasks require both pixel-level fidelity and high-level semantic understanding to recover realistic photos with fine-grained details. However, previous approaches often struggle to effectively leverage both the visual and linguistic knowledge. Recent efforts have attempted to incorporate Vision-language models (VLMs), which excel at aligning visual and textual features, into universal IR. Nevertheless, these methods fail to utilize the linguistic priors to ensure semantic coherence during the restoration process. To address this issue, in this paper, we propose the Vision-Language Model Guided Image Restoration (VLMIR) framework, which leverages the rich vision-language priors of VLMs, such as CLIP, to enhance IR performance through improved visual perception and semantic understanding. Our approach consists of two stages: VLM-based feature extraction and diffusion-based image restoration. In the first stage, we extract complementary visual and linguistic representations of input images by condensing the visual perception and high-level semantic priors through VLMs. Specifically, we align the embeddings of captions from low-quality and high-quality images using a cosine similarity loss with LoRA fine-tuning, and employ a degradation predictor to decompose degradation and clean image content embeddings. These complementary visual and textual embeddings are then integrated into a diffusion-based model via cross-attention mechanisms for enhanced restoration. Extensive experiments and ablation studies demonstrate that VLMIR achieves superior performance across both universal and degradation-specific IR tasks, underscoring the critical role of integrated visual and linguistic knowledge from VLMs in advancing image restoration capabilities.
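
The stage-one caption alignment can be illustrated in a few lines; the sketch below assumes precomputed caption embeddings for low- and high-quality versions of the same image and omits the LoRA fine-tuning machinery around it:

```python
import torch
import torch.nn.functional as F

def caption_alignment_loss(lq_emb: torch.Tensor, hq_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity loss pulling LQ caption embeddings toward HQ ones."""
    lq = F.normalize(lq_emb, dim=-1)
    hq = F.normalize(hq_emb, dim=-1)
    return (1.0 - (lq * hq).sum(dim=-1)).mean()   # 0 when embeddings align
```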

Source: arXiv cs.CV

Vision • Score 95

Infinite Homography as Robust Conditioning for Camera-Controlled Video Generation

Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes. A key challenge is ensuring fidelity to the specified camera pose while maintaining view consistency and reasoning about occluded geometry from limited observations. We present InfCam, a camera-controlled video-to-video generation framework with high pose fidelity.

Source: arXiv cs.CV

Vision • Score 95

Pro-Pose: Unpaired Full-Body Portrait Synthesis via Canonical UV Maps

arXiv:2512.17143v1 Announce Type: new Abstract: Photographs of people taken by professional photographers typically present the person in beautiful lighting, with an interesting pose, and flattering quality. This is unlike common photos people take of themselves. In this paper, we explore how to create a "professional" version of a person's photograph, i.e., in a chosen pose, in a simple environment, with good lighting, and standard black top/bottom clothing. A key challenge is to preserve the person's unique identity, face, and body features while transforming the photo. If a large paired dataset existed of the same person photographed both "in the wild" and by a professional photographer, the problem would potentially be easier to solve. However, such data does not exist, especially for a large variety of identities. To that end, we propose two key insights: 1) Our method transforms the input photo and person's face to a canonical UV space, which is further coupled with a reposing methodology to model occlusions and novel view synthesis. Operating in UV space allows us to leverage existing unpaired datasets. 2) We personalize the output photo via multi-image finetuning. Our approach yields high-quality, reposed portraits and achieves strong qualitative and quantitative performance on real-world imagery.

Source: arXiv cs.CV

NLP/LLMs • Score 89

ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration

Diffusion Transformers (DiTs) have achieved state-of-the-art performance in generative modeling, but their high computational cost hinders real-time deployment. ProCache is a dynamic feature-caching framework that addresses existing limitations, offering a training-free acceleration solution with a constraint-aware caching pattern and a selective computation module.

Source: arXiv cs.CV

RL • Score 96

Adaptive Graph Pruning with Sudden-Events Evaluation for Traffic Prediction using Online Semi-Decentralized ST-GNNs

arXiv:2512.17352v1 Announce Type: new Abstract: Spatio-Temporal Graph Neural Networks (ST-GNNs) are well-suited for processing high-frequency data streams from geographically distributed sensors in smart mobility systems. However, their deployment at the edge across distributed compute nodes (cloudlets) creates substantial communication overhead due to repeated transmission of overlapping node features between neighbouring cloudlets. To address this, we propose an adaptive pruning algorithm that dynamically filters redundant neighbour features while preserving the most informative spatial context for prediction. The algorithm adjusts pruning rates based on recent model performance, allowing each cloudlet to focus on regions experiencing traffic changes without compromising accuracy. Additionally, we introduce the Sudden Event Prediction Accuracy (SEPA), a novel event-focused metric designed to measure responsiveness to traffic slowdowns and recoveries, which are often missed by standard error metrics. We evaluate our approach in an online semi-decentralized setting with traditional FL, server-free FL, and Gossip Learning on two large-scale traffic datasets, PeMS-BAY and PeMSD7-M, across short-, mid-, and long-term prediction horizons. Experiments show that, in contrast to standard metrics, SEPA exposes the true value of spatial connectivity in predicting dynamic and irregular traffic. Our adaptive pruning algorithm maintains prediction accuracy while significantly lowering communication cost in all online semi-decentralized settings, demonstrating that communication can be reduced without compromising responsiveness to critical traffic events.

Source: arXiv cs.LG

NLP/LLMs • Score 95

Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images

arXiv:2512.17306v1 Announce Type: new Abstract: Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.

Source: arXiv cs.CV

NLP/LLMs • Score 95

Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model

arXiv:2512.17313v1 Announce Type: new Abstract: Despite the impressive zero-shot capabilities of Vision-Language Models (VLMs), they often struggle in downstream tasks with distribution shifts from the pre-training data. Few-Shot Adaptation (FSA-VLM) has emerged as a key solution, typically using Parameter-Efficient Fine-Tuning (PEFT) to adapt models with minimal data. However, these PEFT methods are constrained by their reliance on fixed, handcrafted prompts, which are often insufficient to understand the semantics of classes. While some studies have proposed leveraging image-induced prompts to provide additional clues for classification, they introduce prohibitive computational overhead at inference. Therefore, we introduce Auxiliary Descriptive Knowledge (ADK), a novel framework that efficiently enriches text representations without compromising efficiency. ADK first leverages a Large Language Model to generate a rich set of descriptive prompts for each class offline. These pre-computed features are then deployed in two ways: (1) as Compositional Knowledge, an averaged representation that provides rich semantics, especially beneficial when class names are ambiguous or unfamiliar to the VLM; and (2) as Instance-Specific Knowledge, where a lightweight, non-parametric attention mechanism dynamically selects the most relevant descriptions for a given image. This approach provides two additional types of knowledge alongside the handcrafted prompt, thereby facilitating category distinction across various domains. Also, ADK acts as a parameter-free, plug-and-play component that enhances existing PEFT methods. Extensive experiments demonstrate that ADK consistently boosts the performance of multiple PEFT baselines, setting a new state-of-the-art across various scenarios.
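
One plausible reading of the two knowledge types in code (the `adk_logits` helper, the mixing weight `alpha`, and the tensor shapes are assumptions for illustration, not the paper's API):

```python
import torch
import torch.nn.functional as F

def adk_logits(image_feat: torch.Tensor, desc_feats: torch.Tensor, alpha: float = 0.5):
    """Score classes with averaged plus attention-selected description features.

    image_feat: L2-normalized [dim] image embedding.
    desc_feats: L2-normalized [num_classes, num_desc, dim] description embeddings,
                precomputed offline from LLM-generated class descriptions.
    """
    # (1) Compositional knowledge: average the descriptions of each class.
    comp = F.normalize(desc_feats.mean(dim=1), dim=-1)       # [C, dim]
    comp_logits = comp @ image_feat                          # [C]

    # (2) Instance-specific knowledge: non-parametric attention that upweights
    # the descriptions most relevant to this particular image.
    attn = torch.softmax(desc_feats @ image_feat, dim=1)     # [C, num_desc]
    inst = F.normalize((attn.unsqueeze(-1) * desc_feats).sum(dim=1), dim=-1)
    inst_logits = inst @ image_feat                          # [C]

    return alpha * comp_logits + (1 - alpha) * inst_logits
```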

Source: arXiv cs.CV

Theory/Optimization • Score 89

DGH: Dynamic Gaussian Hair

arXiv:2512.17094v1 Announce Type: new Abstract: The creation of photorealistic dynamic hair remains a major challenge in digital human modeling because of the complex motions, occlusions, and light scattering. Existing methods often resort to static capture and physics-based models that do not scale as they require manual parameter fine-tuning to handle the diversity of hairstyles and motions, and heavy computation to obtain high-quality appearance. In this paper, we present Dynamic Gaussian Hair (DGH), a novel framework that efficiently learns hair dynamics and appearance. We propose: (1) a coarse-to-fine model that learns temporally coherent hair motion dynamics across diverse hairstyles; (2) a strand-guided optimization module that learns a dynamic 3D Gaussian representation for hair appearance with support for differentiable rendering, enabling gradient-based learning of view-consistent appearance under motion. Unlike prior simulation-based pipelines, our approach is fully data-driven, scales with training data, and generalizes across various hairstyles and head motion sequences. Additionally, DGH can be seamlessly integrated into a 3D Gaussian avatar framework, enabling realistic, animatable hair for high-fidelity avatar representation. DGH achieves promising geometry and appearance results, providing a scalable, data-driven alternative to physics-based simulation and rendering.

Source: arXiv cs.CV

NLP/LLMs • Score 95

Rotterdam artery-vein segmentation (RAV) dataset

arXiv:2512.17322v1 Announce Type: new Abstract: Purpose: To provide a diverse, high-quality dataset of color fundus images (CFIs) with detailed artery-vein (A/V) segmentation annotations, supporting the development and evaluation of machine learning algorithms for vascular analysis in ophthalmology. Methods: CFIs were sampled from the longitudinal Rotterdam Study (RS), encompassing a wide range of ages, devices, and capture conditions. Images were annotated using a custom interface that allowed graders to label arteries, veins, and unknown vessels on separate layers, starting from an initial vessel segmentation mask. Connectivity was explicitly verified and corrected using connected component visualization tools. Results: The dataset includes 1024x1024-pixel PNG images in three modalities: original RGB fundus images, contrast-enhanced versions, and RGB-encoded A/V masks. Image quality varied widely, including challenging samples typically excluded by automated quality assessment systems, but judged to contain valuable vascular information. Conclusion: This dataset offers a rich and heterogeneous source of CFIs with high-quality segmentations. It supports robust benchmarking and training of machine learning models under real-world variability in image quality and acquisition settings. Translational Relevance: By including connectivity-validated A/V masks and diverse image conditions, this dataset enables the development of clinically applicable, generalizable machine learning tools for retinal vascular analysis, potentially improving automated screening and diagnosis of systemic and ocular diseases.

Source: arXiv cs.CV

RL • Score 95

Alternating Direction Method of Multipliers for Nonlinear Matrix Decompositions

We present an algorithm based on the alternating direction method of multipliers (ADMM) for solving nonlinear matrix decompositions (NMD). Given an input matrix $X \in \mathbb{R}^{m \times n}$ and a factorization rank $r \ll \min(m, n)$, NMD seeks matrices $W \in \mathbb{R}^{m \times r}$ and $H \in \mathbb{R}^{r \times n}$ such that $X \approx f(WH)$, where $f$ is an elementwise nonlinear function.
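
The summary states the problem but not the updates, so the sketch below solves the ReLU instance $X \approx \max(0, WH)$ with a naive alternating scheme built on the same splitting $Z = WH$ that ADMM-style NMD solvers use; the paper's actual ADMM iterations (dual variables, penalty parameters) are more involved.

```python
import numpy as np

def relu_nmd_naive(X, r, iters=200, seed=0):
    """Naive alternating solver for X ≈ max(0, WH)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.standard_normal((m, r))
    H = rng.standard_normal((r, n))
    for _ in range(iters):
        WH = W @ H
        # Z-update: match X where ReLU is active; stay nonpositive elsewhere.
        Z = np.where(X > 0, X, np.minimum(WH, 0.0))
        # (W, H)-updates: alternating least squares on ||Z - WH||_F.
        W = Z @ np.linalg.pinv(H)
        H = np.linalg.pinv(W) @ Z
    return W, H
```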

Source: arXiv stat.ML

Vision • Score 92

MatLat: Material Latent Space for PBR Texture Generation

arXiv:2512.17302v1 Announce Type: new Abstract: We propose a generative framework for producing high-quality PBR textures on a given 3D mesh. As large-scale PBR texture datasets are scarce, our approach focuses on effectively leveraging the embedding space and diffusion priors of pretrained latent image generative models while learning a material latent space, MatLat, through targeted fine-tuning. Unlike prior methods that freeze the embedding network and thus lead to distribution shifts when encoding additional PBR channels and hinder subsequent diffusion training, we fine-tune the pretrained VAE so that new material channels can be incorporated with minimal latent distribution deviation. We further show that correspondence-aware attention alone is insufficient for cross-view consistency unless the latent-to-image mapping preserves locality. To enforce this locality, we introduce a regularization in the VAE fine-tuning that crops latent patches, decodes them, and aligns the corresponding image regions to maintain strong pixel-latent spatial correspondence. Ablation studies and comparison with previous baselines demonstrate that our framework improves PBR texture fidelity and that each component is critical for achieving state-of-the-art performance.

Source: arXiv cs.CV

Theory/Optimization • Score 90

Explanation Beyond Intuition: A Testable Criterion for Inherent Explainability

arXiv:2512.17316v1 Announce Type: new Abstract: Inherent explainability is the gold standard in Explainable Artificial Intelligence (XAI). However, there is not a consistent definition or test to demonstrate inherent explainability. Work to date either characterises explainability through metrics, or appeals to intuition - "we know it when we see it". We propose a globally applicable criterion for inherent explainability. The criterion uses graph theory for representing and decomposing models for structure-local explanation, and recomposing them into global explanations. We form the structure-local explanations as annotations, a verifiable hypothesis-evidence structure that allows for a range of explanatory methods to be used. This criterion matches existing intuitions on inherent explainability, and provides justifications why a large regression model may not be explainable but a sparse neural network could be. We differentiate "explainable" -- a model that allows for explanation -- and "explained" -- one that has a verified explanation. Finally, we provide a full explanation of PREDICT -- a Cox proportional hazards model of cardiovascular disease risk, which is in active clinical use in New Zealand. It follows that PREDICT is inherently explainable. This work provides structure to formalise other work on explainability, and allows regulators a flexible but rigorous test that can be used in compliance frameworks.

Source: arXiv cs.LG

Applications • Score 90

M2RU: Memristive Minion Recurrent Unit for Continual Learning at the Edge

arXiv:2512.17299v1 Announce Type: new Abstract: Continual learning on edge platforms remains challenging because recurrent networks depend on energy-intensive training procedures and frequent data movement that are impractical for embedded deployments. This work introduces M2RU, a mixed-signal architecture that implements the minion recurrent unit for efficient temporal processing with on-chip continual learning. The architecture integrates weighted-bit streaming, which enables multi-bit digital inputs to be processed in crossbars without high-resolution conversion, and an experience replay mechanism that stabilizes learning under domain shifts. M2RU achieves 15 GOPS at 48.62 mW, corresponding to 312 GOPS per watt, and maintains accuracy within 5 percent of software baselines on sequential MNIST and CIFAR-10 tasks. Compared with a CMOS digital design, the accelerator provides 29X improvement in energy efficiency. Device-aware analysis shows an expected operational lifetime of 12.2 years under continual learning workloads. These results establish M2RU as a scalable and energy-efficient platform for real-time adaptation in edge-level temporal intelligence.

Source: arXiv cs.LG

NLP/LLMs • Score 95

CheXPO-v2: Preference Optimization for Chest X-ray VLMs with Knowledge Graph Consistency

arXiv:2512.17213v1 Announce Type: new Abstract: Medical Vision-Language Models (VLMs) are prone to hallucinations, compromising clinical reliability. While reinforcement learning methods like Group Relative Policy Optimization (GRPO) offer a low-cost alignment solution, their reliance on sparse, outcome-based rewards inadvertently encourages models to "overthink" -- generating verbose, convoluted, and unverifiable Chain-of-Thought reasoning to justify answers. This focus on outcomes obscures factual errors and poses significant safety risks. To address this, we propose CheXPO-v2, a novel alignment framework that shifts from outcome to process supervision. Our core innovation is a Knowledge Graph Consistency Reward mechanism driven by Entity-Relation Matching. By explicitly parsing reasoning steps into structured "Disease, Relation, Anatomy" triplets, we provide fine-grained supervision that penalizes incoherent logic and hallucinations at the atomic level. Integrating this with a hard-example mining strategy, our approach significantly outperforms GRPO and state-of-the-art models on benchmarks like MIMIC-CXR-VQA. Crucially, CheXPO-v2 achieves new state-of-the-art accuracy using only 5k samples, demonstrating exceptional data efficiency while producing clinically sound and verifiable reasoning. The project source code is publicly available at: https://github.com/ecoxial2007/CheX-Phi4MM.
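
In the spirit of the Entity-Relation Matching reward (a deliberately simplified, hypothetical version, not the paper's scorer), a reasoning chain can be graded by the fraction of its parsed triplets supported by a reference knowledge graph:

```python
def kg_consistency_reward(pred_triplets, reference_kg):
    """Fraction of (disease, relation, anatomy) triplets found in the graph."""
    if not pred_triplets:
        return 0.0
    hits = sum(1 for t in pred_triplets if t in reference_kg)
    return hits / len(pred_triplets)

# Toy usage: one supported triplet out of two claimed.
kg = {("pleural effusion", "located_at", "right lower lobe")}
r = kg_consistency_reward(
    [("pleural effusion", "located_at", "right lower lobe"),
     ("cardiomegaly", "located_at", "right lower lobe")],
    kg,
)   # r == 0.5
```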

Source: arXiv cs.CV

RL • Score 96

Alzheimer's Disease Brain Network Mining

arXiv:2512.17276v1 Announce Type: new Abstract: Machine learning approaches for Alzheimer's disease (AD) diagnosis face a fundamental challenge. Clinical assessments are expensive and invasive, leaving ground-truth labels available for only a fraction of neuroimaging datasets. We introduce Multi-view Adaptive Transport Clustering for Heterogeneous Alzheimer's Disease (MATCH-AD), a semi-supervised framework that integrates deep representation learning, graph-based label propagation, and optimal transport theory to address this limitation. The framework leverages manifold structure in neuroimaging data to propagate diagnostic information from limited labeled samples to larger unlabeled populations, while using Wasserstein distances to quantify disease progression between cognitive states. Evaluated on nearly five thousand subjects from the National Alzheimer's Coordinating Center, encompassing structural MRI measurements from hundreds of brain regions, cerebrospinal fluid biomarkers, and clinical variables, MATCH-AD achieves near-perfect diagnostic accuracy despite ground-truth labels for less than one-third of subjects. The framework substantially outperforms all baseline methods, achieving a kappa indicating almost-perfect agreement compared to weak agreement for the best baseline, a qualitative transformation in diagnostic reliability. Performance remains clinically useful even under severe label scarcity, and we provide theoretical convergence guarantees with proven bounds on label propagation error and transport stability. These results demonstrate that principled semi-supervised learning can unlock the diagnostic potential of the vast repositories of partially annotated neuroimaging data accumulating worldwide, substantially reducing annotation burden while maintaining accuracy suitable for clinical deployment.

Source: arXiv cs.LG

Vision • Score 95

WDFFU-Mamba: A Wavelet-guided Dual-attention Feature Fusion Mamba for Breast Tumor Segmentation in Ultrasound Images

arXiv:2512.17278v1 Announce Type: new Abstract: Breast ultrasound (BUS) image segmentation plays a vital role in assisting clinical diagnosis and early tumor screening. However, challenges such as speckle noise, imaging artifacts, irregular lesion morphology, and blurred boundaries severely hinder accurate segmentation. To address these challenges, this work aims to design a robust and efficient model capable of automatically segmenting breast tumors in BUS images. We propose a novel segmentation network named WDFFU-Mamba, which integrates wavelet-guided enhancement and dual-attention feature fusion within a U-shaped Mamba architecture. A Wavelet-denoised High-Frequency-guided Feature (WHF) module is employed to enhance low-level representations through noise-suppressed high-frequency cues. A Dual Attention Feature Fusion (DAFF) module is also introduced to effectively merge skip-connected and semantic features, improving contextual consistency. Extensive experiments on two public BUS datasets demonstrate that WDFFU-Mamba achieves superior segmentation accuracy, significantly outperforming existing methods in terms of Dice coefficient and 95th percentile Hausdorff Distance (HD95). The combination of wavelet-domain enhancement and attention-based fusion greatly improves both the accuracy and robustness of BUS image segmentation, while maintaining computational efficiency. The proposed WDFFU-Mamba model not only delivers strong segmentation performance but also exhibits desirable generalization ability across datasets, making it a promising solution for real-world clinical applications in breast tumor ultrasound analysis.

Source: arXiv cs.CV

Theory/Optimization • Score 88

Spectral Concentration at the Edge of Stability: Information Geometry of Kernel Associative Memory

arXiv:2511.23083v4 Announce Type: replace-cross Abstract: High-capacity kernel Hopfield networks exhibit a "Ridge of Optimization" characterized by extreme stability. While previously linked to "Spectral Concentration", its origin remains elusive. Here, we analyze the network dynamics on a statistical manifold, revealing that the Ridge corresponds to the Edge of Stability, a critical boundary where the Fisher Information Matrix becomes singular. We demonstrate that the apparent Euclidean force antagonism is a manifestation of "Dual Equilibrium" in the Riemannian space. This unifies learning dynamics and capacity via the Minimum Description Length principle, offering a geometric theory of self-organized criticality.

Source: arXiv stat.ML

NLP/LLMs • Score 95

CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning

arXiv:2512.17312v1 Announce Type: new Abstract: Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a reward for balanced and adaptive tool-call, which balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism of executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models.

Source: arXiv cs.CV

RL • Score 93

A Theoretical Analysis of State Similarity Between Markov Decision Processes

arXiv:2512.17265v1 Announce Type: new Abstract: The bisimulation metric (BSM) is a powerful tool for analyzing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to state similarity between multiple MDPs remains challenging. Prior work has attempted to extend BSM to pairs of MDPs, but a lack of well-established mathematical properties has limited further theoretical analysis between MDPs. In this work, we formally establish a generalized bisimulation metric (GBSM) for measuring state similarity between arbitrary pairs of MDPs, which is rigorously proven with three fundamental metric properties, i.e., GBSM symmetry, inter-MDP triangle inequality, and a distance bound on identical spaces. Leveraging these properties, we theoretically analyze policy transfer, state aggregation, and sampling-based estimation across MDPs, obtaining explicit bounds that are strictly tighter than existing ones derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.
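
For context, the standard within-MDP bisimulation metric that GBSM generalizes is usually defined (following Ferns et al., up to weighting constants) as the unique fixed point of

$d(s, t) = \max_{a \in \mathcal{A}} \Big( |R(s,a) - R(t,a)| + \gamma \, W_1(d)\big(P(\cdot \mid s,a),\, P(\cdot \mid t,a)\big) \Big),$

where $W_1(d)$ is the 1-Wasserstein distance with ground metric $d$; states close under $d$ have provably close optimal values. Per the abstract, GBSM extends this kind of recursion to state pairs drawn from two different MDPs while recovering symmetry and an inter-MDP triangle inequality.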

Source: arXiv cs.LG

MLOps/Systems • Score 93

SDUM: A Scalable Deep Unrolled Model for Universal MRI Reconstruction

SDUM is a universal framework that combines a Restormer-based reconstructor, a learned coil sensitivity map estimator (CSME), and sampling-aware weighted data consistency (SWDC). It exhibits foundation-model-like scaling behavior, achieving state-of-the-art results in MRI reconstruction challenges without task-specific fine-tuning.

Source: arXiv cs.AI

Vision • Score 95

It's Not Always Greener on the Other Side: Perception of Greenery Across Demographics and Personalities in Multiple Cities

Quantifying and assessing urban greenery is crucial for planning and development, reflecting the enduring importance of green spaces for multiple climate and well-being dimensions in cities. This work measures the gaps between subjective perceptions and objective measures of greenery, using data from a comprehensive survey of 1,000 people across five countries.

Source: arXiv cs.CV

RL • Score 96

Electric Vehicle Charging Load Forecasting: An Experimental Comparison of Machine Learning Methods

arXiv:2512.17257v1 Announce Type: new Abstract: With the growing popularity of electric vehicles as a means of addressing climate change, concerns have emerged regarding their impact on electric grid management. As a result, predicting EV charging demand has become a timely and important research problem. While substantial research has addressed energy load forecasting in transportation, relatively few studies systematically compare multiple forecasting methods across different temporal horizons and spatial aggregation levels in diverse urban settings. This work investigates the effectiveness of five time series forecasting models, ranging from traditional statistical approaches to machine learning and deep learning methods. Forecasting performance is evaluated for short-, mid-, and long-term horizons (on the order of minutes, hours, and days, respectively), and across spatial scales ranging from individual charging stations to regional and city-level aggregations. The analysis is conducted on four publicly available real-world datasets, with results reported independently for each dataset. To the best of our knowledge, this is the first work to systematically evaluate EV charging demand forecasting across such a wide range of temporal horizons and spatial aggregation levels using multiple real-world datasets.

Source: arXiv cs.LG

MLOps/Systems • Score 96

MINPO: Memory-Informed Neural Pseudo-Operator to Resolve Nonlocal Spatiotemporal Dynamics

arXiv:2512.17273v1 Announce Type: new Abstract: Many physical systems exhibit nonlocal spatiotemporal behaviors described by integro-differential equations (IDEs). Classical methods for solving IDEs require repeatedly evaluating convolution integrals, whose cost increases quickly with kernel complexity and dimensionality. Existing neural solvers can accelerate selected instances of these computations, yet they do not generalize across diverse nonlocal structures. In this work, we introduce the Memory-Informed Neural Pseudo-Operator (MINPO), a unified framework for modeling nonlocal dynamics arising from long-range spatial interactions and/or long-term temporal memory. MINPO, employing either Kolmogorov-Arnold Networks (KANs) or multilayer perceptron networks (MLPs) as encoders, learns the nonlocal operator and its inverse directly through neural representations, and then explicitly reconstructs the unknown solution fields. The learning is guarded by a lightweight nonlocal consistency loss term to enforce coherence between the learned operator and the reconstructed solution. The MINPO formulation makes it possible to naturally capture and efficiently resolve nonlocal spatiotemporal dependencies governed by a wide spectrum of IDEs and their subsets, including fractional PDEs. We evaluate the efficacy of MINPO in comparison with classical techniques and state-of-the-art neural-based strategies based on MLPs, such as A-PINN and fPINN, along with their newly-developed KAN variants, A-PIKAN and fPIKAN, designed to facilitate a fair comparison. Our study offers compelling evidence of the accuracy of MINPO and demonstrates its robustness in handling (i) diverse kernel types, (ii) different kernel dimensionalities, and (iii) the substantial computational demands arising from repeated evaluations of kernel integrals. MINPO, thus, generalizes beyond problem-specific formulations, providing a unified framework for systems governed by nonlocal operators.
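
As a schematic example of the equation class involved (not taken from the paper), a generic IDE couples a local differential operator with a kernel-weighted nonlocal term:

$\frac{\partial u}{\partial t}(x,t) = \mathcal{L}[u](x,t) + \int_{\Omega} K(x, y)\, u(y, t)\, \mathrm{d}y,$

where $\mathcal{L}$ is a differential operator and $K$ is the interaction or memory kernel. Classical solvers must re-evaluate the integral at every step, which is precisely the cost MINPO's learned pseudo-operator is designed to avoid.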

Source: arXiv cs.LG

NLP/LLMs • Score 95

DAVE: A VLM Vision Encoder for Document Understanding and Web Agents

arXiv:2512.17221v1 Announce Type: new Abstract: While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.

Source: arXiv cs.CV

RL • Score 93

Value Under Ignorance in Universal Artificial Intelligence

In this work, we generalize the AIXI reinforcement learning agent to admit a broader class of utility functions. Assigning a utility to every possible interaction history forces us to confront the ambiguity that some hypotheses in the agent's belief distribution predict only a finite prefix of the history, which is interpreted as a chance of death equal to a quantity called the semimeasure loss.

Source: arXiv cs.AI

RL • Score 96

Security Risks of Agentic Vehicles: A Systematic Analysis of Cognitive and Cross-Layer Threats

Agentic AI is increasingly being explored in both manual and autonomous vehicles, giving rise to the notion of Agentic Vehicles (AgVs). This paper investigates security threats in AgVs, including OWASP-style risks and cyberattacks from other layers that affect the agentic layer. A new framework is proposed to analyze these risks on current and emerging vehicle platforms.

Source: arXiv cs.AI

RL • Score 93

Learning solution operator of dynamical systems with diffusion maps kernel ridge regression

arXiv:2512.17203v1 Announce Type: new Abstract: Many scientific and engineering systems exhibit complex nonlinear dynamics that are difficult to predict accurately over long time horizons. Although data-driven models have shown promise, their performance often deteriorates when the geometric structures governing long-term behavior are unknown or poorly represented. We demonstrate that a simple kernel ridge regression (KRR) framework, when combined with a dynamics-aware validation strategy, provides a strong baseline for long-term prediction of complex dynamical systems. By employing a data-driven kernel derived from diffusion maps, the proposed Diffusion Maps Kernel Ridge Regression (DM-KRR) method implicitly adapts to the intrinsic geometry of the system's invariant set, without requiring explicit manifold reconstruction or attractor modeling, procedures that often limit predictive performance. Across a broad range of systems, including smooth manifolds, chaotic attractors, and high-dimensional spatiotemporal flows, DM-KRR consistently outperforms state-of-the-art random feature, neural-network and operator-learning methods in both accuracy and data efficiency. These findings underscore that long-term predictive skill depends not only on model expressiveness, but critically on respecting the geometric constraints encoded in the data through dynamically consistent model selection. Together, simplicity, geometry awareness, and strong empirical performance point to a promising path for reliable and efficient learning of complex dynamical systems.
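
Part of the appeal is that kernel ridge regression itself is a closed-form solver. The sketch below uses an RBF kernel as a stand-in where DM-KRR would plug in a diffusion-maps kernel, and the toy one-step dynamics are invented for illustration:

```python
import numpy as np

def rbf_kernel(A, B, eps=1.0):
    """Gaussian kernel matrix between row-wise point sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / eps)

def krr_fit_predict(K_train, y_train, K_test, lam=1e-3):
    """KRR in closed form: alpha = (K + lam*I)^(-1) y; prediction = K_test @ alpha."""
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + lam * np.eye(n), y_train)
    return K_test @ alpha

# Learn a one-step solution operator x_{t+1} = g(x_t) from sampled pairs.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
Y = np.tanh(X @ rng.standard_normal((3, 3)))      # stand-in dynamics g
pred = krr_fit_predict(rbf_kernel(X, X), Y, rbf_kernel(X[:5], X))
```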

Source: arXiv cs.LG

Applications • Score 90

Research on a Dead-Reckoning Navigation Algorithm for Self-Propelled Pipeline Robots in Complex Three-Dimensional Pipelines

In the field of gas pipeline localization, existing methods rely mainly on dedicated locating instruments. This work presents a self-propelled pipeline robot that localizes complex pipelines without external towing, using a method that integrates inertial navigation with wheel odometers and improves accuracy through an extended Kalman filter (EKF) algorithm.
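
Since the summary does not specify the robot's state or measurement models, the following shows only a generic EKF predict/update cycle of the kind such inertial/odometry fusion builds on, with the linearized matrices F, B, H treated as given placeholders:

```python
import numpy as np

def ekf_step(x, P, u, z, F, B, H, Q, R):
    """One predict/update cycle with already-linearized models.

    x, P: state estimate and covariance; u: odometry input; z: measurement
    (e.g., IMU heading); Q, R: process and measurement noise covariances.
    """
    # Predict: propagate the state with the wheel-odometry input.
    x_pred = F @ x + B @ u
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the new measurement.
    y = z - H @ x_pred                       # innovation
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```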

Source: arXiv cs.AI

NLP/LLMs • Score 96

The Effect of Negation on CLIP in Medical Imaging: Limitations of Contrastive Language-Image Pretraining

arXiv:2512.17121v1 Announce Type: new Abstract: Large vision-language models like CLIP are increasingly used in medical imaging tasks due to their ability to align images and text without the need for extensive labeled data. This makes them particularly useful for applications like image retrieval, report generation, and classification in clinical settings. A potential issue with this approach is that CLIP-based models often underperform when interpreting negated phrases, which is especially problematic in the context of medical diagnosis. In this study, we evaluate the Stanford AIMI CheXagent model on its ability to correctly retrieve chest X-ray images using prompts with and without negation. The goal of this project is to understand where this model fails and then use it as a base model to improve its retrieval accuracy via fine-tuning methods outlined in previous work. Results from this study show improvement in the handling of negation in the CLIP model, with a slight decrease in accuracy on positive prompt evaluation. Alongside retrieval accuracy, we examined internal model behavior through token attribution, t-SNE projection, and attention-head ablation to better characterize how each fine-tuning approach reshaped the text encoder's representation of negated clinical language. Through this work, we hope to better understand the internal behavior of CLIP and improve its handling of negation using clinically relevant language, improving its reliability in medical AI devices.

Source: arXiv cs.LG

RL • Score 92

Text-Conditioned Background Generation for Editable Multi-Layer Documents

arXiv:2512.17151v1 Announce Type: new Abstract: We present a framework for document-centric background generation with multi-page editing and thematic continuity. To ensure text regions remain readable, we employ a "latent masking" formulation that softly attenuates updates in the diffusion space, inspired by smooth barrier functions in physics and numerical optimization. In addition, we introduce Automated Readability Optimization (ARO), which automatically places semi-transparent, rounded backing shapes behind text regions. ARO determines the minimal opacity needed to satisfy perceptual contrast standards (WCAG 2.2) relative to the underlying background, ensuring readability while maintaining aesthetic harmony without human intervention. Multi-page consistency is maintained through a summarization-and-instruction process, where each page is distilled into a compact representation that recursively guides subsequent generations. This design reflects how humans build continuity by retaining prior context, ensuring that visual motifs evolve coherently across an entire document. Our method further treats a document as a structured composition in which text, figures, and backgrounds are preserved or regenerated as separate layers, allowing targeted background editing without compromising readability. Finally, user-provided prompts allow stylistic adjustments in color and texture, balancing automated consistency with flexible customization. Our training-free framework produces visually coherent, text-preserving, and thematically aligned documents, bridging generative modeling with natural design workflows.
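
The WCAG machinery ARO targets is public and easy to reproduce. Below is a sketch of the minimal-opacity search the abstract describes, using the WCAG relative-luminance and contrast-ratio formulas; compositing in sRGB is a simplification, and `min_opacity` is a hypothetical helper rather than the paper's routine:

```python
def rel_luminance(rgb):
    """WCAG relative luminance of an sRGB color given as 0-255 components."""
    def lin(c):
        c = float(c) / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (lin(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast(c1, c2):
    """WCAG contrast ratio, ranging from 1:1 to 21:1."""
    hi, lo = sorted((rel_luminance(c1), rel_luminance(c2)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)

def min_opacity(text, shape, background, target=4.5):
    """Smallest backing-shape opacity whose composite meets `target` contrast."""
    for i in range(101):
        a = i / 100.0
        blended = tuple(a * s + (1 - a) * b for s, b in zip(shape, background))
        if contrast(text, blended) >= target:
            return a
    return 1.0

# Black text over a mid-gray photo region, with a white backing shape.
alpha = min_opacity(text=(0, 0, 0), shape=(255, 255, 255), background=(120, 120, 120))
```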

Source: arXiv cs.CV

RL • Score 93

UniCoMTE: A Universal Counterfactual Framework for Explaining Time-Series Classifiers on ECG Data

arXiv:2512.17100v1 Announce Type: cross Abstract: Machine learning models, particularly deep neural networks, have demonstrated strong performance in classifying complex time series data. However, their black-box nature limits trust and adoption, especially in high-stakes domains such as healthcare. To address this challenge, we introduce UniCoMTE, a model-agnostic framework for generating counterfactual explanations for multivariate time series classifiers. The framework identifies temporal features that most heavily influence a model's prediction by modifying the input sample and assessing its impact on the model's prediction. UniCoMTE is compatible with a wide range of model architectures and operates directly on raw time series inputs. In this study, we evaluate UniCoMTE's explanations on a time series ECG classifier. We quantify explanation quality by comparing our explanations' comprehensibility to that of established techniques (LIME and SHAP) and assessing their generalizability to similar samples. Furthermore, clinical utility is assessed through a questionnaire completed by medical experts who review counterfactual explanations presented alongside original ECG samples. Results show that our approach produces concise, stable, and human-aligned explanations that outperform existing methods in both clarity and applicability. By linking model predictions to meaningful signal patterns, the framework advances the interpretability of deep learning models for real-world time series applications.

Source: arXiv cs.AI

NLP/LLMs • Score 95

ResSVD: Residual Compensated SVD for Large Language Model Compression

arXiv:2505.20112v3 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated impressive capabilities in a wide range of downstream natural language processing tasks. Nevertheless, their considerable sizes and memory demands hinder practical deployment, underscoring the importance of developing efficient compression strategies. Singular value decomposition (SVD) decomposes a matrix into orthogonal components, enabling efficient low-rank approximation. This is particularly suitable for LLM compression, where weight matrices often exhibit significant redundancy. However, current SVD-based methods neglect the residual matrix from truncation, resulting in significant truncation loss. Additionally, compressing all layers of the model results in severe performance degradation. To overcome these limitations, we propose ResSVD, a new post-training SVD-based LLM compression method. Specifically, we leverage the residual matrix generated during the truncation process to reduce truncation loss. Moreover, under a fixed overall compression ratio, we selectively compress the last few layers of the model, which mitigates error propagation and significantly improves the performance of compressed models. Comprehensive evaluations of ResSVD on diverse LLM families and multiple benchmark datasets indicate that ResSVD consistently achieves superior performance over existing counterpart methods, demonstrating its practical effectiveness.
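
As background for what ResSVD compensates, plain truncated SVD and the residual it discards look as follows; the compensation scheme itself (and the choice to compress only the last few layers) is the paper's contribution and is not reproduced here:

```python
import numpy as np

def svd_compress(W, r):
    """Rank-r compression W ≈ A @ B, returning the discarded residual too."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]           # m x r factor (U_r scaled by singular values)
    B = Vt[:r]                     # r x n factor
    E = W - A @ B                  # residual matrix dropped by plain truncation
    return A, B, E

W = np.random.randn(512, 512)
A, B, E = svd_compress(W, r=64)
rel_loss = np.linalg.norm(E) / np.linalg.norm(W)   # relative truncation loss
```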

Source: arXiv cs.CL

Theory/Optimization • Score 90

Universal consistency of the $k$-NN rule in metric spaces and Nagata dimension. III

arXiv:2512.17058v1 Announce Type: new Abstract: We prove the last remaining implication allowing us to claim the equivalence of the following conditions for a complete separable metric space $X$: (1) The $k$-nearest neighbour classifier is (weakly) universally consistent in $X$, (2) The strong Lebesgue--Besicovitch differentiation property holds in $X$ for every locally finite Borel measure, (3) $X$ is sigma-finite dimensional in the sense of Nagata. The equivalence (2)$\iff$(3) was announced by Preiss (1983), while a detailed proof of the implication (3)$\Rightarrow$(2) has appeared in Assouad and Quentin de Gromard (2006). The implication (2)$\Rightarrow$(1) was established by Cérou and Guyader (2006). We prove the implication (1)$\Rightarrow$(3). The result was conjectured in the first article in the series (Collins, Kumari, Pestov 2020), and here we also correct a wrong claim made in the second article (Kumari and Pestov 2024).

Source: arXiv cs.LG

Vision • Score 95

Interpretable Similarity of Synthetic Image Utility

arXiv:2512.17080v1 Announce Type: new Abstract: Synthetic medical image data can unlock the potential of deep learning (DL)-based clinical decision support (CDS) systems through the creation of large scale, privacy-preserving, training sets. Despite the significant progress in this field, there is still a largely unanswered research question: "How can we quantitatively assess the similarity of a synthetically generated set of images with a set of real images in a given application domain?". Today, answers to this question are mainly provided via user evaluation studies, inception-based measures, and the classification performance achieved on synthetic images. This paper proposes a novel measure to assess the similarity between synthetically generated and real sets of images, in terms of their utility for the development of DL-based CDS systems. Inspired by generalized neural additive models, and unlike inception-based measures, the proposed measure is interpretable (Interpretable Utility Similarity, IUS), explaining why a synthetic dataset could be more useful than another one in the context of a CDS system based on clinically relevant image features. The experimental results on publicly available datasets from various color medical imaging modalities including endoscopic, dermoscopic and fundus imaging, indicate that selecting synthetic images of high utility similarity using IUS can result in relative improvements of up to 54.6% in terms of classification performance. The generality of IUS for synthetic data assessment is demonstrated also for greyscale X-ray and ultrasound imaging modalities. IUS implementation is available at https://github.com/innoisys/ius

Source: arXiv cs.CV

Vision • Score 95

Endo-SemiS: Towards Robust Semi-Supervised Image Segmentation for Endoscopic Video

arXiv:2512.16977v1 Announce Type: new Abstract: In this paper, we present Endo-SemiS, a semi-supervised segmentation framework for providing reliable segmentation of endoscopic video frames with limited annotation. Endo-SemiS uses four strategies to improve performance by effectively utilizing all available data, particularly unlabeled data: (1) cross-supervision between two individual networks that supervise each other; (2) uncertainty-guided pseudo-labels from unlabeled data, generated by selecting high-confidence regions to improve their quality; (3) joint pseudo-label supervision, which aggregates reliable pixels from the pseudo-labels of both networks to provide accurate supervision for unlabeled data; and (4) mutual learning, where both networks learn from each other at the feature and image levels, reducing variance and guiding them toward a consistent solution. Additionally, a separate corrective network utilizes spatiotemporal information from endoscopy video to further improve segmentation performance. Endo-SemiS is evaluated on two clinical applications: kidney stone laser lithotomy from ureteroscopy and polyp screening from colonoscopy. Compared to state-of-the-art segmentation methods, Endo-SemiS achieves substantially superior results on both datasets with limited labeled data. The code is publicly available at https://github.com/MedICL-VU/Endo-SemiS
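
The uncertainty-guided selection in strategy (2) can be sketched in a few lines of PyTorch; the confidence threshold and the `ignore_index` convention are illustrative assumptions:

```python
import torch

def confident_pseudo_labels(logits, threshold=0.9, ignore_index=255):
    """Keep only high-confidence pixels as pseudo-labels.

    Low-confidence pixels are set to `ignore_index` so they contribute
    no gradient when the pseudo-labels supervise the peer network.
    """
    probs = torch.softmax(logits, dim=1)        # [B, C, H, W]
    conf, labels = probs.max(dim=1)             # per-pixel confidence and class
    labels[conf < threshold] = ignore_index     # mask out unreliable regions
    return labels
```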

Source: arXiv cs.CV

Applications • Score 90

BumpNet: A Sparse Neural Network Framework for Learning PDE Solutions

arXiv:2512.17198v1 Announce Type: new Abstract: We introduce BumpNet, a sparse neural network framework for PDE numerical solution and operator learning. BumpNet is based on meshless basis function expansion, in a similar fashion to radial-basis function (RBF) networks. Unlike RBF networks, the basis functions in BumpNet are constructed from ordinary sigmoid activation functions. This enables the efficient use of modern training techniques optimized for such networks. All parameters of the basis functions, including shape, location, and amplitude, are fully trainable. Model parsimony and h-adaptivity are effectively achieved through dynamically pruning basis functions during training. BumpNet is a general framework that can be combined with existing neural architectures for learning PDE solutions: here, we propose Bump-PINNs (BumpNet with physics-informed neural networks) for solving general PDEs; Bump-EDNN (BumpNet with evolutionary deep neural networks) to solve time-evolution PDEs; and Bump-DeepONet (BumpNet with deep operator networks) for PDE operator learning. Bump-PINNs are trained using the same collocation-based approach used by PINNs, Bump-EDNN uses a BumpNet only in the spatial domain and uses EDNNs to advance the solution in time, while Bump-DeepONets employ a BumpNet regression network as the trunk network of a DeepONet. Extensive numerical experiments demonstrate the efficiency and accuracy of the proposed architecture.
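
The core construction, an RBF-like bump assembled from ordinary sigmoids, is easy to reproduce: a difference of two shifted sigmoids is approximately `amplitude` inside the bump and near zero outside. The parameter names below are illustrative; in BumpNet these quantities are trainable and bumps are pruned during training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, center, width, sharpness, amplitude):
    """Localized basis function built from two ordinary sigmoids."""
    rising = sigmoid(sharpness * (x - (center - width)))    # turns on at left edge
    falling = sigmoid(sharpness * (x - (center + width)))   # turns on at right edge
    return amplitude * (rising - falling)

x = np.linspace(-3, 3, 601)
phi = bump(x, center=0.5, width=0.4, sharpness=12.0, amplitude=1.0)
```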

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Lights, Camera, Consistency: A Multi-Stage Pipeline for AI Video Stories with Stable Characters

Generating long, cohesive video stories with consistent characters is a significant challenge for current text-to-video AI. We present a method that approaches video generation much like a filmmaker, using a pipeline that involves writing a detailed script and generating consistent visuals for each character.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Towards Safer Chatbots: Automated Policy Compliance Evaluation of Custom GPTs

arXiv:2502.01436v3 Announce Type: replace Abstract: User-configured chatbots built on top of large language models are increasingly available through centralized marketplaces such as OpenAI's GPT Store. While these platforms enforce usage policies intended to prevent harmful or inappropriate behavior, the scale and opacity of customized chatbots make systematic policy enforcement challenging. As a result, policy-violating chatbots remain publicly accessible despite existing review processes. This paper presents a fully automated method for evaluating the compliance of Custom GPTs with the marketplace's usage policy using black-box interaction. The method combines large-scale GPT discovery, policy-driven red-teaming prompts, and automated compliance assessment using an LLM-as-a-judge. We focus on three policy-relevant domains explicitly addressed in OpenAI's usage policies: Romantic, Cybersecurity, and Academic GPTs. We validate our compliance assessment component against a human-annotated ground-truth dataset, achieving an F1 score of 0.975 for binary policy violation detection. We then apply the method in a large-scale empirical study of 782 Custom GPTs retrieved from the GPT Store. The results show that 58.7% of the evaluated GPTs exhibit at least one policy-violating response, with substantial variation across policy domains. A comparison with the base models (GPT-4 and GPT-4o) indicates that most violations originate from model-level behavior, while customization tends to amplify these tendencies rather than create new failure modes. Our findings reveal limitations in current review mechanisms for user-configured chatbots and demonstrate the feasibility of scalable, behavior-based policy compliance evaluation.

Fonte: arXiv cs.CL

RL • Score 92

Computational analysis reveals historical trajectory of East-Polynesian lunar calendars

arXiv:2512.17525v1 Announce Type: cross Abstract: We investigate a type of lunar calendar known as lists of the 'nights of the moon', found throughout East Polynesia, including Rapa Nui (Easter Island). Using computational methods, we analyzed the lexical and structural divergence of 49 calendric lists from all major archipelagos, each containing about 30 night names. Our results, presented as a rooted phylogenetic tree, show a clear split into two main groups: one including lists from Rapa Nui, Mangareva, and the Marquesas; the other comprising lists from New Zealand, Hawaii, the Cook Islands, the Austral Islands, Tahiti, and the Tuamotu. This pattern aligns with a recent alternative classification of East Polynesian languages into 'Distal' (Marquesan, Mangarevan, Rapanui) and 'Proximal' (Maori, Hawaiian, Tahitian, etc.) subgroups. Since both language and lunar calendars are symbolic systems passed down and changed within communities - and given the geographic isolation of many archipelagos - we interpret this correspondence as evidence that the early divergence of East Polynesian lunar calendars mirrors early population movements and language splits in the region.

Fonte: arXiv cs.CL

MLOps/Systems • Score 93

Digitizing Nepal's Written Heritage: A Comprehensive HTR Pipeline for Old Nepali Manuscripts

arXiv:2512.17111v1 Announce Type: new Abstract: This paper presents the first end-to-end pipeline for Handwritten Text Recognition (HTR) for Old Nepali, a historically significant but low-resource language. We adopt a line-level transcription approach and systematically explore encoder-decoder architectures and data-centric techniques to improve recognition accuracy. Our best model achieves a Character Error Rate (CER) of 4.9%. In addition, we implement and evaluate decoding strategies and analyze token-level confusions to better understand model behaviour and error patterns. While the dataset we used for evaluation is confidential, we release our training code, model configurations, and evaluation scripts to support further research in HTR for low-resource historical scripts.
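
For reference, the reported Character Error Rate is the Levenshtein (edit) distance between reference and hypothesis transcriptions, normalized by the reference length; a compact single-row dynamic-programming implementation:

    def cer(ref, hyp):
        # Character Error Rate: edit distance / reference length
        m, n = len(ref), len(hyp)
        d = list(range(n + 1))
        for i in range(1, m + 1):
            prev, d[0] = d[0], i
            for j in range(1, n + 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                       prev + (ref[i - 1] != hyp[j - 1]))
        return d[n] / max(m, 1)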

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Predictive Modeling of Maritime Radar Data Using Transformer Architecture

arXiv:2512.17098v1 Announce Type: new Abstract: Maritime autonomous systems require robust predictive capabilities to anticipate vessel motion and environmental dynamics. While transformer architectures have revolutionized AIS-based trajectory prediction and demonstrated feasibility for sonar frame forecasting, their application to maritime radar frame prediction remains unexplored, creating a critical gap given radar's all-weather reliability for navigation. This survey systematically reviews predictive modeling approaches relevant to maritime radar, with emphasis on transformer architectures for spatiotemporal sequence forecasting, where existing representative methods are analyzed according to data type, architecture, and prediction horizon. Our review shows that, while the literature has demonstrated transformer-based frame prediction for sonar sensing, no prior work addresses transformer-based maritime radar frame prediction, thereby defining a clear research gap and motivating a concrete research direction for future work in this area.

Fonte: arXiv cs.CV

MLOps/Systems • Score 95

Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus

arXiv:2411.07892v2 Announce Type: replace Abstract: Podcasts provide highly diverse content to a massive listener base through a unique on-demand modality. However, limited data has prevented large-scale computational analysis of the podcast ecosystem. To fill this gap, we introduce a massive dataset of over 1.1M podcast transcripts that is largely comprehensive of all English language podcasts available through public RSS feeds from May and June of 2020. This data is not limited to text, but rather includes audio features and speaker turns for a subset of 370K episodes, and speaker role inferences and other metadata for all 1.1M episodes. Using this data, we also conduct a foundational investigation into the content, structure, and responsiveness of this ecosystem. Together, our data and analyses open the door to continued computational research of this popular and impactful medium.

Fonte: arXiv cs.CL

RL • Score 96

Learning to Plan, Planning to Learn: Adaptive Hierarchical RL-MPC for Sample-Efficient Decision Making

arXiv:2512.17091v1 Announce Type: cross Abstract: We propose a new approach for solving planning problems with a hierarchical structure, fusing reinforcement learning and MPC planning. Our formulation tightly and elegantly couples the two planning paradigms. It leverages reinforcement learning actions to inform the MPPI sampler, and adaptively aggregates MPPI samples to inform the value estimation. The resulting adaptive process leverages further MPPI exploration where value estimates are uncertain, and improves training robustness and the quality of the resulting policies. This results in a robust planning approach that can handle complex planning problems and easily adapts to different applications, as demonstrated over several domains, including race driving, modified Acrobot, and Lunar Lander with added obstacles. Our results in these domains show better data efficiency and overall performance in terms of both rewards and task success, with up to a 72% increase in success rate compared to existing approaches, as well as accelerated convergence (2.1x) compared to non-adaptive sampling.
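
As background for the sampler the abstract builds on, a vanilla MPPI aggregation step is sketched below; the paper's adaptive variant additionally conditions sampling on RL actions and aggregates samples into value estimates, which is not shown here:

    import numpy as np

    def mppi_update(costs, noises, u_nominal, lam=1.0):
        # Softmin-weight each of K sampled rollouts by its cost, then blend
        # the sampled control noise into the nominal control sequence.
        w = np.exp(-(costs - costs.min()) / lam)            # costs: (K,)
        w /= w.sum()
        return u_nominal + np.tensordot(w, noises, axes=1)  # noises: (K, T, U)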

Fonte: arXiv cs.AI

Vision • Score 95

PhysFire-WM: A Physics-Informed World Model for Emulating Fire Spread Dynamics

arXiv:2512.17152v1 Announce Type: new Abstract: Fine-grained fire prediction plays a crucial role in emergency response. Infrared images and fire masks provide complementary thermal and boundary information, yet current methods are predominantly limited to binary mask modeling with inherent signal sparsity, failing to capture the complex dynamics of fire. While world models show promise in video generation, their physical inconsistencies pose significant challenges for fire forecasting. This paper introduces PhysFire-WM, a Physics-informed World Model for emulating Fire spread dynamics. Our approach internalizes combustion dynamics by encoding structured priors from a Physical Simulator to rectify physical discrepancies, coupled with a Cross-task Collaborative Training strategy (CC-Train) that alleviates the issue of limited information in mask-based modeling. Through parameter sharing and gradient coordination, CC-Train effectively integrates thermal radiation dynamics and spatial boundary delineation, enhancing both physical realism and geometric accuracy. Extensive experiments on a fine-grained multimodal fire dataset demonstrate the superior accuracy of PhysFire-WM in fire spread prediction. Validation underscores the importance of physical priors and cross-task collaboration, providing new insights for applying physics-informed world models to disaster prediction.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

Dynamic Tool Dependency Retrieval for Efficient Function Calling

arXiv:2512.17052v1 Announce Type: new Abstract: Function calling agents powered by Large Language Models (LLMs) select external tools to automate complex tasks. On-device agents typically use a retrieval module to select relevant tools, improving performance and reducing context length. However, existing retrieval methods rely on static and limited inputs, failing to capture multi-step tool dependencies and evolving task context. This limitation often introduces irrelevant tools that mislead the agent, degrading efficiency and accuracy. We propose Dynamic Tool Dependency Retrieval (DTDR), a lightweight retrieval method that conditions on both the initial query and the evolving execution context. DTDR models tool dependencies from function calling demonstrations, enabling adaptive retrieval as plans unfold. We benchmark DTDR against state-of-the-art retrieval methods across multiple datasets and LLM backbones, evaluating retrieval precision, downstream task accuracy, and computational efficiency. Additionally, we explore strategies to integrate retrieved tools into prompts. Our results show that dynamic tool retrieval improves function calling success rates between 23% and 104% compared to state-of-the-art static retrievers.
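
A minimal sketch of retrieval conditioned on both the query and the evolving execution context; tool_embs, the tool-dependency matrix dep (assumed to be estimated from function-calling demonstrations), and the mixing weight lam are illustrative assumptions rather than DTDR's actual components:

    import numpy as np

    def retrieve_tools(q, tool_embs, dep, used, k=5, lam=0.5):
        # Score tools by query similarity plus dependency affinity with
        # tools already invoked earlier in this episode.
        score = tool_embs @ q                  # (n_tools,)
        if used:                               # indices of tools called so far
            score = score + lam * dep[:, used].mean(axis=1)
        return np.argsort(-score)[:k]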

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

PILAR: Personalizing Augmented Reality Interactions with Human-Centered, Trustworthy LLM-Based Explanations for Everyday Use Cases

Artificial intelligence (AI)-driven augmented reality (AR) systems are becoming increasingly integrated into everyday life, raising the need for explainability in real-time interactions. PILAR is a novel framework that uses a pre-trained large language model (LLM) to generate personalized, contextual explanations, improving the user experience in AI-based AR systems.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering

arXiv:2512.17396v1 Announce Type: cross Abstract: In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

LLM-as-a-qualitative-judge: automating error analysis in natural language generation

arXiv:2506.09147v4 Announce Type: replace Abstract: Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach whose main output is a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be made to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that instance-specific issues output by LLM-as-a-qualitative-judge match those annotated by humans in 2/3 cases, and that LLM-as-a-qualitative-judge is capable of producing error type reports resembling the reports composed by human annotators. We also demonstrate in a case study how the use of LLM-as-a-qualitative-judge can substantially improve NLG system performance. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.

Fonte: arXiv cs.CL

MLOps/Systems • Score 95

Regularized Random Fourier Features and Finite Element Reconstruction for Operator Learning in Sobolev Space

arXiv:2512.17884v1 Announce Type: cross Abstract: Operator learning is a data-driven approximation of mappings between infinite-dimensional function spaces, such as the solution operators of partial differential equations. Kernel-based operator learning can offer accurate, theoretically justified approximations that require less training than standard methods. However, they can become computationally prohibitive for large training sets and can be sensitive to noise. We propose a regularized random Fourier feature (RRFF) approach, coupled with a finite element reconstruction map (RRFF-FEM), for learning operators from noisy data. The method uses random features drawn from multivariate Student's $t$ distributions, together with frequency-weighted Tikhonov regularization that suppresses high-frequency noise. We establish high-probability bounds on the extreme singular values of the associated random feature matrix and show that when the number of features $N$ scales like $m \log m$ with the number of training samples $m$, the system is well-conditioned, which yields estimation and generalization guarantees. Detailed numerical experiments on benchmark PDE problems, including advection, Burgers', Darcy flow, Helmholtz, Navier-Stokes, and structural mechanics, demonstrate that RRFF and RRFF-FEM are robust to noise and achieve improved performance with reduced training time compared to the unregularized random feature model, while maintaining competitive accuracy relative to kernel and neural operator methods.
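
A minimal sketch of the RRFF idea, assuming independent Student-t draws per coordinate as a stand-in for the paper's multivariate Student's t and a ridge penalty growing with the squared frequency norm; the finite element reconstruction map is not shown:

    import numpy as np

    def rrff_fit(X, y, N=512, df=4.0, lam=1e-3, seed=0):
        # Random Fourier features with heavy-tailed frequencies plus a
        # frequency-weighted Tikhonov penalty that damps high frequencies.
        rng = np.random.default_rng(seed)
        W = rng.standard_t(df, size=(N, X.shape[1]))   # feature frequencies
        b = rng.uniform(0.0, 2.0 * np.pi, N)
        Phi = np.cos(X @ W.T + b)                      # (m, N) feature matrix
        pen = lam * (1.0 + np.sum(W * W, axis=1))      # grows with |w|^2
        coef = np.linalg.solve(Phi.T @ Phi + np.diag(pen), Phi.T @ y)
        return W, b, coef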

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents

arXiv:2503.10689v2 Announce Type: replace Abstract: Recent advances in large language models (LLMs) have led to a growing interest in developing LLM-based agents for automating web tasks. However, these agents often struggle with even simple tasks on real-world websites due to their limited capability to understand and process complex web page structures. In this work, we introduce LCoW, a framework for Learning language models to Contextualize complex Web pages into a more comprehensible form, thereby enhancing decision making by LLM agents. LCoW decouples web page understanding from decision making by training a separate contextualization module that transforms complex web pages into a comprehensible format, which is then utilized by the decision-making agent. We demonstrate that our contextualization module effectively integrates with LLM agents of various scales to significantly enhance their decision-making capabilities in web automation tasks. Notably, LCoW improves the success rates of closed-source LLMs (e.g., Gemini-1.5-flash, GPT-4o, Claude-3.5-Sonnet) by an average of 15.6%, and demonstrates a 23.7% average improvement in success rates for open-source LMs (e.g., Llama-3.1-8B, Llama-3.1-70B) on the WorkArena benchmark. Moreover, the Gemini-1.5-flash agent with LCoW achieves state-of-the-art results on the WebShop benchmark, outperforming human experts. The relevant code materials are available at our project page: https://lcowiclr2025.github.io.

Fonte: arXiv cs.CL

RL • Score 96

Distributed Learning in Markovian Restless Bandits over Interference Graphs for Stable Spectrum Sharing

arXiv:2512.17161v1 Announce Type: new Abstract: We study distributed learning for spectrum access and sharing among multiple cognitive communication entities, such as cells, subnetworks, or cognitive radio users (collectively referred to as cells), in communication-constrained wireless networks modeled by interference graphs. Our goal is to achieve a globally stable and interference-aware channel allocation. Stability is defined through a generalized Gale-Shapley multi-to-one matching, a well-established solution concept in wireless resource allocation. We consider wireless networks where L cells share S orthogonal channels and cannot simultaneously use the same channel as their neighbors. Each channel evolves as an unknown restless Markov process with cell-dependent rewards, making this the first work to establish global Gale-Shapley stability for channel allocation in a stochastic, temporally varying restless environment. To address this challenge, we develop SMILE (Stable Multi-matching with Interference-aware LEarning), a communication-efficient distributed learning algorithm that integrates restless bandit learning with graph-constrained coordination. SMILE enables cells to distributedly balance exploration of unknown channels with exploitation of learned information. We prove that SMILE converges to the optimal stable allocation and achieves logarithmic regret relative to a genie with full knowledge of expected utilities. Simulations validate the theoretical guarantees and demonstrate SMILE's robustness, scalability, and efficiency across diverse spectrum-sharing scenarios.

Fonte: arXiv cs.LG

MLOps/Systems • Score 89

Generative modeling of conditional probability distributions on the level-sets of collective variables

arXiv:2512.17374v1 Announce Type: new Abstract: Given a probability distribution $\mu$ in $\mathbb{R}^d$ represented by data, we study in this paper the generative modeling of its conditional probability distributions on the level-sets of a collective variable $\xi: \mathbb{R}^d \rightarrow \mathbb{R}^k$, where $1 \le k<d$. We propose a general and efficient learning approach that is able to learn generative models on different level-sets of $\xi$ simultaneously. To improve the learning quality on level-sets in low-probability regions, we also propose a strategy for data enrichment by utilizing data from enhanced sampling techniques. We demonstrate the effectiveness of our proposed learning approach through concrete numerical examples. The proposed approach is potentially useful for the generative modeling of molecular systems in biophysics, for instance.

Fonte: arXiv stat.ML

RL • Score 96

When Reasoning Meets Its Laws

arXiv:2512.17901v1 Announce Type: new. This paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in Large Reasoning Models (LRMs). We propose the compute law and a supplementary accuracy law, and introduce LoRe-Bench to measure these properties in reasoning models. Evaluations show that most models exhibit reasonable monotonicity but lack compositionality.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens

arXiv:2512.17375v1 Announce Type: new Abstract: Reward models and LLM-as-a-Judge systems are central to modern post-training pipelines such as RLHF, DPO, and RLAIF, where they provide scalar feedback and binary decisions that guide model selection and RL-based fine-tuning. We show that these judge systems exhibit a recurring vulnerability: short sequences of low-perplexity control tokens can flip many binary evaluations from correct "No" judgments to incorrect "Yes" judgments by steering the last-layer logit gap. These control tokens are patterns that a policy model could plausibly generate during post-training, and thus represent realistic reward-hacking risks rather than worst-case adversarial strings. Our method, AdvJudge-Zero, uses the model's next-token distribution and beam-search exploration to discover diverse control-token sequences from scratch, and our analysis shows that the induced hidden-state perturbations concentrate in a low-rank "soft mode" that is anti-aligned with the judge's refusal direction. Empirically, these tokens cause very high false positive rates when large open-weight and specialized judge models score incorrect answers on math and reasoning benchmarks. Finally, we show that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives while preserving evaluation quality.
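
A toy greedy stand-in for the paper's beam-search discovery; judge_logits is an assumed black-box callable returning next-token logits for a token-id sequence, and cands would be a pool of low-perplexity candidate tokens:

    def greedy_control_suffix(judge_logits, ids, yes_id, no_id, cands, steps=4):
        # Greedily grow a short suffix that widens the judge's Yes-vs-No
        # last-layer logit gap, one token at a time.
        gap = lambda seq: judge_logits(seq)[yes_id] - judge_logits(seq)[no_id]
        for _ in range(steps):
            ids = max((ids + [t] for t in cands), key=gap)
        return ids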

Fonte: arXiv cs.LG

Vision • Score 95

Multi-Level Distortion-Aware Deformable Network for Omnidirectional Image Super-Resolution

With the growing popularity of augmented and virtual reality applications, image processing for Omnidirectional Images (ODIs) has attracted increasing attention. Omnidirectional Image Super-Resolution (ODISR) is a promising technique for improving the visual quality of ODIs. We propose a novel Multi-level Distortion-aware Deformable Network (MDDN) for ODISR, designed to expand the sampling range and receptive field.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

PAACE: A Planning-Aware Automated Context Engineering Framework

Large Language Models (LLMs) are increasingly used in complex workflows involving planning, tool use, reflection, and interaction with external knowledge systems. This work presents PAACE, a unified framework for optimizing the evolving state of LLM agents through next-task relevance modeling, planning-structure analysis, instruction co-refinement, and function-preserving compression.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters

arXiv:2505.14886v2 Announce Type: replace Abstract: Winning competitive debates requires sophisticated reasoning and argument skills. Competitive debate poses unique challenges: (1) time constraints force debaters to make strategic choices about which points to pursue rather than covering all possible arguments; (2) the persuasiveness of the debate relies on the back-and-forth interaction between arguments, which a single final game status cannot evaluate. To address these challenges, we propose TreeDebater, a novel debate framework that excels in competitive debate. We introduce two tree structures: the Rehearsal Tree and the Debate Flow Tree. The Rehearsal Tree anticipates attacks and defenses to evaluate the strength of a claim, while the Debate Flow Tree tracks the debate status to identify the active actions. TreeDebater allocates its time budget among candidate actions and uses the speech time controller and feedback from the simulated audience to revise its statement. Human evaluation at both the stage and debate levels shows that TreeDebater outperforms the state-of-the-art multi-agent debate system, with a +15.6% improvement in stage-level persuasiveness with DeepSeek and a +10% win in debate-level opinion shift. Further investigation shows that TreeDebater employs better strategies, devoting its limited time to important debate actions, in line with the strategies of human debate experts.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Generating Completions for Broca's Aphasic Sentences Using Large Language Models

arXiv:2412.17669v2 Announce Type: replace Abstract: Broca's aphasia is a type of aphasia characterized by non-fluent, effortful and agrammatic speech production with relatively good comprehension. Since traditional aphasia treatment methods are often time-consuming, labour-intensive, and do not reflect real-world conversations, applying natural language processing-based approaches such as Large Language Models (LLMs) could potentially contribute to improving existing treatment approaches. To address this issue, we explore the use of sequence-to-sequence LLMs for completing Broca's aphasic sentences. We first generate synthetic Broca's aphasic data using a rule-based system designed to mirror the linguistic characteristics of Broca's aphasic speech. Using this synthetic data (without authentic aphasic samples), we then fine-tune four pre-trained LLMs on the task of completing agrammatic sentences. We evaluate our fine-tuned models on both synthetic and authentic Broca's aphasic data. We demonstrate the LLMs' capability for reconstructing agrammatic sentences, with performance improving for longer input utterances. Our results highlight the LLMs' potential for advancing communication aids for individuals with Broca's aphasia and possibly other clinical populations.
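
A toy illustration of rule-based synthesis of agrammatic text, assuming only function-word deletion and crude inflection stripping; the paper's actual rule system is more elaborate:

    import re

    FUNCTION_WORDS = {"the", "a", "an", "is", "are", "was", "were", "to", "of"}

    def agrammatize(sentence):
        # Drop function words and strip -ing/-ed endings to mimic
        # telegraphic speech: "the boy is running" -> "boy runn".
        words = [w for w in sentence.lower().split() if w not in FUNCTION_WORDS]
        return " ".join(re.sub(r"(ing|ed)$", "", w) for w in words)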

Fonte: arXiv cs.CL

Evaluation/Benchmarks • Score 96

Bridging Training and Merging Through Momentum-Aware Optimization

arXiv:2512.17109v1 Announce Type: new Abstract: Training large neural networks and merging task-specific models both exploit low-rank structure and require parameter importance estimation, yet these challenges have been pursued in isolation. Current workflows compute curvature information during training, discard it, then recompute similar information for merging, wasting computation and discarding valuable trajectory data. We introduce a unified framework that maintains factorized momentum and curvature statistics during training, then reuses this information for geometry-aware model composition. The proposed method achieves memory efficiency comparable to state-of-the-art approaches while accumulating task saliency scores that enable curvature-aware merging without post-hoc Fisher computation. We establish convergence guarantees for non-convex objectives with approximation error bounded by gradient singular value decay. On natural language understanding benchmarks, curvature-aware parameter selection outperforms magnitude-only baselines across all sparsity levels, with multi-task merging improving over strong baselines. The proposed framework exhibits rank-invariant convergence and superior hyperparameter robustness compared to existing low-rank optimizers. By treating the optimization trajectory as a reusable asset rather than discarding it, our approach eliminates redundant computation while enabling more principled model composition.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos

arXiv:2512.17229v1 Announce Type: new Abstract: Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which can also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending the model's context length, they may miss useful information or incur considerable computation. In fact, when answering a given question, only a small amount of crucial information is required. We therefore propose an efficient question-aware memory mechanism that enables MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies this task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy is employed by introducing a few special memory tokens to achieve purposeful compression. This allows models to effectively seek critical clues while reducing visual tokens. Then, because the history context has a significant impact, we recurrently aggregate and store these memory tokens to update the history context, which is reused for subsequent sub-segments. Furthermore, to more effectively measure a model's long video understanding ability, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset featuring critical, concrete clues scattered throughout entire videos. Experimental results demonstrate that our method enables MLLMs with a limited context length of 32K to efficiently process 100K tokens (3600 frames, an hour-long video sampled at 1 fps), requiring only 2 minutes and 37GB of GPU memory. Evaluation results across multiple long video benchmarks illustrate that our method can more effectively seek critical clues from massive information.

Fonte: arXiv cs.CV

Applications • Score 90

How to Square Tensor Networks and Circuits Without Squaring Them

arXiv:2512.17090v1 Announce Type: cross Abstract: Squared tensor networks (TNs) and their extension as computational graphs, known as squared circuits, have been used as expressive distribution estimators that still support closed-form marginalization. However, the squaring operation introduces additional complexity when computing the partition function or marginalizing variables, which hinders their applicability in ML. To solve this issue, canonical forms of TNs are parameterized via unitary matrices to simplify the computation of marginals. However, these canonical forms do not apply to circuits, as circuits can represent factorizations that do not directly map to a known TN. Inspired by the ideas of orthogonality in canonical forms and determinism in circuits enabling tractable maximization, we show how to parameterize squared circuits to overcome their marginalization overhead. Our parameterizations unlock efficient marginalization even in factorizations different from TNs but encoded as circuits, whose structure would otherwise make marginalization computationally hard. Finally, our experiments on distribution estimation show that our proposed conditions on squared circuits come with no expressiveness loss, while enabling more efficient learning.

Fonte: arXiv cs.AI

NLP/LLMs • Score 92

SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

arXiv:2512.17419v1 Announce Type: cross Abstract: Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on Python-based bug fixes. We introduce SWE-Bench++, an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike synthetic approaches, our pipeline harvests live pull requests to cover both bug fixes and feature requests across 11 languages. SWE-Bench++ turns GitHub pull requests (PRs) into reproducible, execution-based tasks via four stages: programmatic sourcing, environment synthesis, test oracle extraction, and quality assurance. A final hint-guided trajectory synthesis step converts instances that strong models fail on into training trajectories. Our initial benchmark consists of 11,133 instances from 3,971 repositories across 11 languages. On a subset of 1,782 instances of this benchmark, today's strongest models perform as follows: claude-sonnet-4.5 achieves 36.20% pass@10, gpt-5-2025-08-07 34.57%, gemini/gemini-2.5-pro 24.92%, and gpt-4o 16.89%. We further demonstrate the utility of our dataset by showing that fine-tuning on SWE-Bench++ instances yields measurable improvements on the SWE-bench Multilingual benchmark. SWE-Bench++ provides a scalable, multilingual benchmark for evaluating and improving repository-level code generation.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Large Language Models as Pokémon Battle Agents: Strategic Play and Content Generation

arXiv:2512.17308v1 Announce Type: new Abstract: Strategic decision-making in Pokémon battles presents a unique testbed for evaluating large language models. Pokémon battles demand reasoning about type matchups, statistical trade-offs, and risk assessment, skills that mirror human strategic thinking. This work examines whether Large Language Models (LLMs) can serve as competent battle agents, capable of both making tactically sound decisions and generating novel, balanced game content. We developed a turn-based Pokémon battle system where LLMs select moves based on battle state rather than pre-programmed logic. The framework captures essential Pokémon mechanics: type effectiveness multipliers, stat-based damage calculations, and multi-Pokémon team management. Through systematic evaluation across multiple model architectures, we measured win rates, decision latency, type-alignment accuracy, and token efficiency. These results suggest LLMs can function as dynamic game opponents without domain-specific training, offering a practical alternative to reinforcement learning for turn-based strategic games. The dual capability of tactical reasoning and content creation positions LLMs as both players and designers, with implications for procedural generation and adaptive difficulty systems in interactive entertainment.
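
A highly simplified sketch of the type-effectiveness reasoning the battle state exposes to an LLM agent; the chart excerpt and damage formula are illustrative, not the paper's exact mechanics:

    TYPE_CHART = {("water", "fire"): 2.0, ("fire", "water"): 0.5,
                  ("electric", "water"): 2.0, ("fire", "grass"): 2.0}

    def damage(power, attack, defense, move_type, target_type):
        # Base move power scaled by stats and the type-effectiveness
        # multiplier the agent must reason about when choosing a move.
        mult = TYPE_CHART.get((move_type, target_type), 1.0)
        return int(power * attack / max(defense, 1) * mult)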

Fonte: arXiv cs.AI

RL • Score 93

SFBD-OMNI: Bridge models for lossy measurement restoration with limited clean samples

arXiv:2512.17051v1 Announce Type: new Abstract: In many real-world scenarios, obtaining fully observed samples is prohibitively expensive or even infeasible, while partial and noisy observations are comparatively easy to collect. In this work, we study distribution restoration with abundant noisy samples, assuming the corruption process is available as a black-box generator. We show that this task can be framed as a one-sided entropic optimal transport problem and solved via an EM-like algorithm. We further provide a test criterion to determine whether the true underlying distribution is recoverable under per-sample information loss, and show that in otherwise unrecoverable cases, a small number of clean samples can render the distribution largely recoverable. Building on these insights, we introduce SFBD-OMNI, a bridge model-based framework that maps corrupted sample distributions to the ground-truth distribution. Our method generalizes Stochastic Forward-Backward Deconvolution (SFBD; Lu et al., 2025) to handle arbitrary measurement models beyond Gaussian corruption. Experiments across benchmark datasets and diverse measurement settings demonstrate significant improvements in both qualitative and quantitative performance.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Incorporating Noise Error-Level Embeddings to Improve LLM-Assisted Robustness in Persian Speech Recognition

Automatic Speech Recognition (ASR) systems suffer significant performance degradation in noisy environments, especially for low-resource languages such as Persian. This study presents a robust ASR error-correction framework that combines multiple hypotheses with noise-aware modeling, using noisy Persian speech to generate hypotheses and introducing Error Level Noise (ELN) as a representation that quantifies noise-induced linguistic distortions.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

ShareChat: A Dataset of Chatbot Conversations in the Wild

arXiv:2512.17843v1 Announce Type: new Abstract: While Large Language Models (LLMs) have evolved into distinct platforms with unique interface designs and capabilities, existing public datasets treat models as generic text generators, stripping away the interface context that actively shapes user interaction. To address this limitation, we present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms: ChatGPT, Claude, Gemini, Perplexity, and Grok. ShareChat distinguishes itself by preserving native platform affordances often lost in standard logs, including reasoning traces, source links, and code artifacts, while spanning 101 languages over the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. We demonstrate the dataset's multifaceted utility through three representative analyses: (1) analyzing conversation completeness to measure user intent satisfaction; (2) evaluating source citation behaviors in content generation; and (3) conducting temporal analysis to track evolving usage patterns. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild.

Fonte: arXiv cs.CL

NLP/LLMs • Score 89

LookAhead Tuning: Safer Language Models via Partial Answer Previews

arXiv:2503.19041v4 Announce Type: replace Abstract: Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often compromises their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety. The method introduces two simple strategies that modify training data by previewing partial answer prefixes, thereby minimizing perturbations to the model's initial token distributions and maintaining its built-in safety mechanisms. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs.
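
A minimal sketch of the preview idea as we read the abstract: part of the answer is surfaced in the prompt so the first completion tokens stay close to the base model's distribution; the template and the preview length k are assumptions:

    def lookahead_example(instruction, answer, k=8):
        # Preview the first k answer tokens inside the prompt so fine-tuning
        # barely perturbs the model's initial token distribution.
        preview = " ".join(answer.split()[:k])
        prompt = f"{instruction}\nThe answer begins: {preview}\n"
        return {"prompt": prompt, "completion": answer}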

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Investigating the Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the capacity to autonomously conceive, investigate, and reason across scientific domains, is still missing. We present an operational definition of SGI grounded in the Practical Inquiry Model (PIM) and operationalize it through four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning.

Fonte: arXiv cs.AI

RL • Score 95

Affect, Body, Cognition, Demographics, and Emotion: The ABCDE of Text Features for Computational Affective Science

arXiv:2512.17752v1 Announce Type: new Abstract: Work in Computational Affective Science and Computational Social Science explores a wide variety of research questions about people, emotions, behavior, and health. Such work often relies on language data that is first labeled with relevant information, such as the use of emotion words or the age of the speaker. Although many resources and algorithms exist to enable this type of labeling, discovering, accessing, and using them remains a substantial impediment, particularly for practitioners outside of computer science. Here, we present the ABCDE dataset (Affect, Body, Cognition, Demographics, and Emotion), a large-scale collection of over 400 million text utterances drawn from social media, blogs, books, and AI-generated sources. The dataset is annotated with a wide range of features relevant to computational affective and social science. ABCDE facilitates interdisciplinary research across numerous fields, including affective science, cognitive science, the digital humanities, sociology, political science, and computational linguistics.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

arXiv:2512.17131v1 Announce Type: cross Abstract: We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method in its primal averaging formulation that addresses key limitations of recent averaging-based optimizers such as single-worker DiLoCo and Schedule-Free (SF) in the non-distributed setting. These two recent algorithmic approaches improve the performance of base optimizers, such as AdamW, through different iterate averaging strategies. Schedule-Free explicitly maintains a uniform average of past weights, while single-worker DiLoCo performs implicit averaging by periodically aggregating trajectories, called pseudo-gradients, to update the model parameters. However, single-worker DiLoCo's periodic averaging introduces a two-loop structure, increasing its memory requirements and number of hyperparameters. GPA overcomes these limitations by decoupling the interpolation constant in the primal averaging formulation of Nesterov. This decoupling enables GPA to smoothly average iterates at every step, generalizing and improving upon single-worker DiLoCo. Empirically, GPA consistently outperforms single-worker DiLoCo while removing the two-loop structure, simplifying hyperparameter tuning, and reducing its memory overhead to a single additional buffer. On the Llama-160M model, GPA provides a 24.22% speedup in terms of steps to reach the baseline (AdamW's) validation loss. Likewise, GPA achieves speedups of 12% and 27% on small and large batch setups, respectively, to attain AdamW's validation accuracy on the ImageNet ViT workload. Furthermore, we prove that for any base optimizer with regret bounded by $O(\sqrt{T})$, where $T$ is the number of iterations, GPA can match or exceed the convergence guarantee of the original optimizer, depending on the choice of interpolation constants.
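
A deliberately simplified sketch of per-step iterate averaging with a decoupled interpolation constant c; base_step stands for any base optimizer update (e.g., AdamW on the fast iterate), and the paper's exact primal-averaging formulation differs in its details:

    def gpa_step(x_avg, z, base_step, c):
        # Advance the fast iterate z with the base optimizer, then fold it
        # into the averaged weights at every step (no outer loop needed).
        z = base_step(z)
        x_avg = (1.0 - c) * x_avg + c * z
        return x_avg, z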

Fonte: arXiv cs.AI

NLP/LLMs • Score 92

Linear Personality Probing and Steering in LLMs: A Big Five Study

arXiv:2512.17639v1 Announce Type: new Abstract: Large language models (LLMs) exhibit distinct and consistent personalities that greatly impact trust and engagement. While this means that personality frameworks would be highly valuable tools to characterize and control LLMs' behavior, current approaches remain either costly (post-training) or brittle (prompt engineering). Probing and steering via linear directions has recently emerged as a cheap and efficient alternative. In this paper, we investigate whether linear directions aligned with the Big Five personality traits can be used for probing and steering model behavior. Using Llama 3.3 70B, we generate descriptions of 406 fictional characters and their Big Five trait scores. We then prompt the model with these descriptions and questions from the Alpaca questionnaire, allowing us to sample hidden activations that vary along personality traits in known, quantifiable ways. Using linear regression, we learn a set of per-layer directions in activation space, and test their effectiveness for probing and steering model behavior. Our results suggest that linear directions aligned with trait-scores are effective probes for personality detection, while their steering capabilities strongly depend on context, producing reliable effects in forced-choice tasks but limited influence in open-ended generation or when additional context is present in the prompt.
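
A minimal sketch of the probe-and-steer recipe, assuming H holds sampled hidden states (n_samples x d_model) and scores the corresponding trait values; the steering scale alpha is an assumption:

    import numpy as np

    def fit_trait_direction(H, scores):
        # Least-squares direction in activation space predicting a trait score.
        w, *_ = np.linalg.lstsq(H, scores, rcond=None)
        return w / np.linalg.norm(w)

    def steer(h, direction, alpha=4.0):
        # Probe: h @ direction estimates the trait; steer: shift along it.
        return h + alpha * direction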

Fonte: arXiv cs.CL

RL • Score 96

SHARP-QoS: Sparsely-gated Hierarchical Adaptive Routing for joint Prediction of QoS

arXiv:2512.17262v1 Announce Type: new Abstract: Dependable service-oriented computing relies on multiple Quality of Service (QoS) parameters that are essential to assess service optimality. However, real-world QoS data are extremely sparse, noisy, and shaped by hierarchical dependencies arising from QoS interactions, and geographical and network-level factors, making accurate QoS prediction challenging. Existing methods often predict each QoS parameter separately, requiring multiple similar models, which increases computational cost and leads to poor generalization. Although recent joint QoS prediction studies have explored shared architectures, they suffer from negative transfer due to loss-scaling caused by inconsistent numerical ranges across QoS parameters and further struggle with inadequate representation learning, resulting in degraded accuracy. This paper presents a unified strategy for joint QoS prediction, called SHARP-QoS, that addresses these issues using three components. First, we introduce a dual mechanism to extract hierarchical features from both QoS and contextual structures via hyperbolic convolution formulated in the Poincaré ball. Second, we propose an adaptive feature-sharing mechanism that allows feature exchange across informative QoS and contextual signals. A gated feature fusion module is employed to support dynamic feature selection among structural and shared representations. Third, we design an EMA-based loss balancing strategy that allows stable joint optimization, thereby mitigating the negative transfer. Evaluations on three datasets with two, three, and four QoS parameters demonstrate that SHARP-QoS outperforms both single- and multi-task baselines. Extensive study shows that our model effectively addresses major challenges, including sparsity, robustness to outliers, and cold-start, while maintaining moderate computational overhead, underscoring its capability for reliable joint QoS prediction.
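
One plausible reading of the EMA-based loss balancing, sketched below: each task loss is normalized by an exponential moving average of its own magnitude so that QoS parameters with larger numeric ranges cannot dominate the joint objective. The function and parameter names are assumptions:

    def ema_balanced_loss(losses, ema, beta=0.99, eps=1e-8):
        # losses: dict of per-task scalar losses; ema: persistent dict state.
        total = 0.0
        for name, loss in losses.items():
            ema[name] = beta * ema.get(name, float(loss)) + (1 - beta) * float(loss)
            total = total + loss / (ema[name] + eps)
        return total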

Fonte: arXiv cs.LG

RL • Score 96

Navigating Knowledge-Base-Driven Taxonomic Expansions of Entity Sets

Recognizing similarities between entities is central to both human cognition and computational intelligence. Entity Set Expansion is the task of identifying additional entities that share relevant semantic properties, but this linear approach does not reveal the richer taxonomic structures present in knowledge resources. A new logic-based framework introduces the concept of an expansion graph.

Fonte: arXiv cs.AI

Vision • Score 95

Stakeholder Suite: A Unified AI Framework for Mapping Actors, Topics and Arguments in Public Debates

arXiv:2512.17347v1 Announce Type: new Abstract: Public debates surrounding infrastructure and energy projects involve complex networks of stakeholders, arguments, and evolving narratives. Understanding these dynamics is crucial for anticipating controversies and informing engagement strategies, yet existing tools in media intelligence largely rely on descriptive analytics with limited transparency. This paper presents Stakeholder Suite, a framework deployed in operational contexts for mapping actors, topics, and arguments within public debates. The system combines actor detection, topic modeling, argument extraction and stance classification in a unified pipeline. Tested on multiple energy infrastructure projects as a case study, the approach delivers fine-grained, source-grounded insights while remaining adaptable to diverse domains. The framework achieves strong retrieval precision and stance accuracy, producing arguments judged relevant in 75% of pilot use cases. Beyond quantitative metrics, the tool has proven effective for operational use: helping project teams visualize networks of influence, identify emerging controversies, and support evidence-based decision-making.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Can Large Reasoning Models Improve Accuracy on Mathematical Tasks Using Flawed Thinking?

Chain-of-thought (CoT) prompting has become central to mathematical reasoning in large language models, yet models remain brittle to early mistakes. We investigate whether training on deliberately flawed reasoning traces can teach models to detect and recover from such errors without degrading standard problem-solving ability.

Fonte: arXiv cs.AI

Vision • Score 92

Comparison of deep learning models: CNN and VGG-16 in identifying pornographic content

arXiv:2512.16947v1 Announce Type: new Abstract: In 2020, a total of 59,741 websites were blocked by the Indonesian government due to containing negative content, including pornography, with 14,266 websites falling into this category. However, these blocked websites could still be accessed by the public using virtual private networks (VPNs). This motivated research into quickly identifying pornographic content. This study aims to develop a system capable of identifying websites suspected of containing pornographic image content, using a deep learning approach with convolutional neural network (CNN) and Visual Geometry Group 16 (VGG-16) models. The two models were then explored comprehensively and holistically to determine which was more effective at detecting pornographic content quickly. Comparing the CNN and VGG-16 models, the best result was obtained in the eighth experiment using the CNN model, trained for 50 epochs with a learning rate of 0.001, reaching an accuracy of 0.9487 (94.87%). This indicates that the CNN model is more effective at detecting pornographic content quickly and accurately than the VGG-16 model.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

Atom: Efficient On-Device Video-Language Pipelines Through Modular Reuse

arXiv:2512.17108v1 Announce Type: new Abstract: Recent advances in video-language models have enabled powerful applications like video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging due to redundant model loads and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and reuses them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency without sacrificing performance. On commodity smartphones, Atom achieves 27--33% faster execution compared to non-reuse baselines, with only marginal performance drop ($\leq$ 2.3 Recall@1 in retrieval, $\leq$ 1.5 CIDEr in captioning). These results position Atom as a practical, scalable approach for efficient video-language understanding on edge devices.
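
The reuse-centric design can be sketched as a simple module cache; get_module and the loader callables are hypothetical placeholders, not Atom's API:

    _CACHE = {}

    def get_module(name, loader):
        # Load each heavyweight module (visual encoder, language decoder, ...)
        # once, then share it across captioning, reasoning, and indexing.
        if name not in _CACHE:
            _CACHE[name] = loader()
        return _CACHE[name]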

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

GB-DQN: Gradient Boosted DQN Models for Non-stationary Reinforcement Learning

arXiv:2512.17034v1 Announce Type: new Abstract: Non-stationary environments pose a fundamental challenge for deep reinforcement learning, as changes in dynamics or rewards invalidate learned value functions and cause catastrophic forgetting. We propose Gradient-Boosted Deep Q-Networks (GB-DQN), an adaptive ensemble method that addresses model drift through incremental residual learning. Instead of retraining a single Q-network, GB-DQN constructs an additive ensemble in which each new learner is trained to approximate the Bellman residual of the current ensemble after drift. We provide theoretical results showing that each boosting step reduces the empirical Bellman residual and that the ensemble converges to the post-drift optimal value function under standard assumptions. Experiments across a diverse set of control tasks with controlled dynamics changes demonstrate faster recovery, improved stability, and greater robustness compared to DQN and common non-stationary baselines.
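
A minimal PyTorch sketch of the boosting step, assuming models is the current frozen ensemble of Q-networks and batch holds (s, a, r, s2, done) tensors; the next learner would regress onto the returned residuals:

    import torch

    def ensemble_q(models, s):
        # Additive boosted ensemble: Q(s, .) is the sum over all learners.
        return sum(m(s) for m in models)

    def bellman_residuals(models, batch, gamma=0.99):
        s, a, r, s2, done = batch
        with torch.no_grad():
            boot = r + gamma * (1 - done) * ensemble_q(models, s2).max(1).values
            q_sa = ensemble_q(models, s).gather(1, a.unsqueeze(1)).squeeze(1)
        return boot - q_sa   # regression target for the new learner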

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports

arXiv:2512.17776v1 Announce Type: new Abstract: As large language models (LLMs) advance, deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert reporting, evaluations that rely heavily on LLM judges can fail to capture issues that require expert judgment, and source verification typically covers only a limited subset of explicitly cited statements rather than report-wide factual reliability. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimensions) operationalized into 130 fine-grained rubric items. DEER further provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies all claims across the entire report, including both cited and uncited ones, and quantifies external-evidence quality. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.

Fonte: arXiv cs.CL

RL • Score 92

Convergence Guarantees for Federated SARSA with Local Training and Heterogeneous Agents

arXiv:2512.17688v1 Announce Type: cross Abstract: We present a novel theoretical analysis of Federated SARSA (FedSARSA) with linear function approximation and local training. We establish convergence guarantees for FedSARSA in the presence of heterogeneity, both in local transitions and rewards, providing the first sample and communication complexity bounds in this setting. At the core of our analysis is a new, exact multi-step error expansion for single-agent SARSA, which is of independent interest. Our analysis precisely quantifies the impact of heterogeneity, demonstrating the convergence of FedSARSA with multiple local updates. Crucially, we show that FedSARSA achieves linear speed-up with respect to the number of agents, up to higher-order terms due to Markovian sampling. Numerical experiments support our theoretical findings.

Fonte: arXiv stat.ML

RL • Score 96

DiffeoMorph: Learning 3D Shape Morphing Using Differentiable Agent-Based Simulations

Biological systems can form complex three-dimensional structures through the collective behavior of identical agents. In this work, we present DiffeoMorph, a differentiable framework for learning a morphogenesis protocol that guides a population of agents to morph into a target 3D shape, using an attention-based SE(3)-equivariant graph neural network.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Adversarial VR: An Open-Source Testbed for Evaluating Adversarial Robustness in VR Cybersickness Detection and Mitigation

Automated deep learning (DL)-based cybersickness detection methods can improve user comfort and interaction. However, these systems are vulnerable to adversarial attacks, which can degrade model performance and disrupt the immersive user experience. This paper presents Adversarial-VR, a real-time testbed for evaluating cybersickness detection and mitigation strategies under adversarial conditions.

Fonte: arXiv cs.AI

MLOps/Systems • Score 93

Fault Diagnosis and Quantification for Photovoltaic Arrays Based on Differentiable Physical Models

Accurate fault diagnosis and quantification are essential for the reliable operation and intelligent maintenance of photovoltaic (PV) arrays. This paper proposes a novel fault-quantification approach for PV strings based on a differentiable fast fault simulation model (DFFSM), which provides analytical gradients with respect to fault parameters and achieves high accuracy in quantifying different fault types.

Fonte: arXiv cs.LG

MLOps/Systems • Score 90

Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation

Mixture-of-Experts (MoE) models scale capacity through sparse activation but strain memory and bandwidth. Offloading relieves GPU memory by fetching experts on demand, but token-level routing causes irregular transfers that leave inference I/O-bound. We present Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation, which restores router-guided accuracy using precomputed low-rank compensators.
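
As a rough sketch of the low-rank compensation idea, assuming an SVD-based construction the abstract does not specify: each expert's weight is approximated as a shared base matrix plus a precomputed rank-r correction, so only the small factors U and V cross the host-to-GPU link when an expert is paged in. Names, the rank choice, and the construction are illustrative assumptions.

```python
import numpy as np

def build_compensator(w_expert, w_base, rank=16):
    """Precompute a rank-r correction of one expert relative to the base."""
    u, s, vt = np.linalg.svd(w_expert - w_base, full_matrices=False)
    U = u[:, :rank] * s[:rank]        # (d_out, r), columns scaled by sigma
    V = vt[:rank, :]                  # (r, d_in)
    return U, V

def compensated_forward(x, w_base, U, V):
    """Apply (W_base + U @ V) to x without materializing the full expert."""
    return x @ w_base.T + (x @ V.T) @ U.T

# Usage: w_base stays resident on the GPU; per expert, only (U, V) is fetched,
# shrinking each transfer from O(d_out * d_in) to O(r * (d_out + d_in)).
```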

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

A Women's Health Benchmark for Large Language Models

As large language models (LLMs) are increasingly used as primary sources of health information, their accuracy on women's health has not been properly examined. We present the Women's Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically on women's health, revealing alarming gaps in accuracy.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks reveals significant limitations. To address these challenges, we investigate more stable and effective advantage-estimation strategies, introducing turn-PPO, a variant that operates on a turn-level MDP formulation.
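
The abstract does not spell out the estimator, so the following is only one plausible reading of a turn-level MDP: run generalized advantage estimation over dialogue turns rather than tokens, then let every token in a turn share that turn's advantage in the clipped PPO objective. The turn segmentation, gamma, and lambda below are assumptions.

```python
import numpy as np

def turn_level_gae(turn_rewards, turn_values, gamma=1.0, lam=0.95):
    """GAE where each timestep is one agent turn, not one token."""
    T = len(turn_rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = turn_values[t + 1] if t + 1 < T else 0.0
        delta = turn_rewards[t] + gamma * next_v - turn_values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

# Every token generated in turn t would then be credited with adv[t]
# when computing the clipped PPO surrogate loss.
```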

Fonte: arXiv cs.LG

Vision • Score 96

Lightweight Physics-Informed Machine Learning for Aviation Visibility Nowcasting Across Multiple Weather Regimes

Nowcasting of low-visibility and precipitation events is crucial for aviation safety and operational efficiency. This study presents a lightweight gradient-boosting (XGBoost) framework trained exclusively on surface observation (METAR) data and enhanced through feature engineering guided by thermodynamic principles.
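
A minimal sketch of what thermodynamics-guided feature engineering on METAR fields can look like, followed by a gradient-boosted classifier. The column names, the use of dewpoint depression and a Magnus-formula humidity proxy, and the XGBoost hyperparameters are illustrative assumptions; the paper's exact features may differ.

```python
import numpy as np
import pandas as pd
import xgboost as xgb

def add_thermo_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive physically motivated predictors from raw METAR fields."""
    # Dewpoint depression: near-zero values indicate saturation (fog risk).
    df["dewpoint_depression"] = df["temp_c"] - df["dewpoint_c"]
    # Relative humidity via the Magnus approximation of vapor pressure.
    e_t = 6.112 * np.exp(17.62 * df["temp_c"] / (243.12 + df["temp_c"]))
    e_d = 6.112 * np.exp(17.62 * df["dewpoint_c"] / (243.12 + df["dewpoint_c"]))
    df["rel_humidity"] = 100.0 * e_d / e_t
    return df

features = ["temp_c", "dewpoint_c", "dewpoint_depression", "rel_humidity",
            "wind_speed_kt", "pressure_hpa"]
model = xgb.XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.05)
# model.fit(train_df[features], train_df["low_visibility_event"])
```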

Fonte: arXiv cs.LG

Vision • Score 95

Towards Pixel-Wise Anomaly Location for High-Resolution PCBA via Self-Supervised Image Reconstruction

arXiv:2512.17296v1 Announce Type: new Abstract: Automated defect inspection of assembled Printed Circuit Board Assemblies (PCBA) is quite challenging due to insufficient labeled data and micro-defects spanning just a few pixels in visually complex, high-resolution images. To address these challenges, we present HiSIR-Net, a High-resolution Self-supervised Image Reconstruction framework for pixel-wise PCBA anomaly localization. Our design combines two lightweight modules that make this practical on real 4K-resolution boards: (i) a Selective Input-Reconstruction Gate (SIR-Gate) that lets the model decide where to trust reconstruction versus the original input, thereby reducing irrelevant reconstruction artifacts and false alarms; and (ii) a Region-level Optimized Patch Selection (ROPS) scheme with positional cues to select overlapping patch reconstructions coherently across arbitrary resolutions. Organically integrating these mechanisms yields clean, high-resolution anomaly maps with a low false-positive (FP) rate. To bridge the gap in high-resolution PCBA datasets, we further contribute SIPCBA-500, a self-collected dataset of 500 images. We conduct extensive experiments on our SIPCBA-500 as well as public benchmarks, demonstrating the superior localization performance of our method while running at practical speed. Full code and dataset will be made available upon acceptance.
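
For orientation, here is the generic self-supervised recipe this line of work builds on: reconstruct overlapping patches and score anomalies by averaged per-pixel reconstruction error. This sketch does not reproduce the SIR-Gate or ROPS modules; the patch grid, the RGB (H, W, 3) layout, and the `reconstruct` model callable are assumptions.

```python
import numpy as np

def anomaly_map(image, patch_corners, reconstruct, patch_size=256):
    """Accumulate per-pixel reconstruction error over overlapping patches.

    image: float RGB array of shape (H, W, 3); patch_corners: list of
    valid top-left (y, x) positions; reconstruct: model returning a patch.
    """
    h, w = image.shape[:2]
    err = np.zeros((h, w), dtype=np.float32)
    cnt = np.zeros((h, w), dtype=np.float32)
    for (y, x) in patch_corners:
        crop = image[y:y + patch_size, x:x + patch_size]
        recon = reconstruct(crop)
        e = np.abs(crop.astype(np.float32) - recon).mean(axis=-1)
        err[y:y + patch_size, x:x + patch_size] += e
        cnt[y:y + patch_size, x:x + patch_size] += 1.0
    return err / np.maximum(cnt, 1.0)     # averaged overlapping scores
```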

Fonte: arXiv cs.CV

NLP/LLMs • Score 93

Compression is Routing: Reconstruction Error as an Intrinsic Signal for Modular Language Models

Large Language Models (LLMs) face challenges such as context-length limits and high inference costs. This paper proposes the architectural philosophy that 'Compression is Routing', presenting an 87M-parameter Transformer autoencoder that achieves 64x sequence compression and validates reconstruction error as an Intrinsic Distribution Fingerprint.
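
A minimal sketch of the routing idea as stated: compress the input once, then use reconstruction error under each specialist module as an in-distribution score and dispatch to the best-fitting one. The `encoder`/`decoders` interfaces and the mean-squared-error criterion are assumptions for illustration.

```python
import numpy as np

def route_by_reconstruction(x, encoder, decoders):
    """Send x to the module whose decoder reconstructs it best."""
    z = encoder(x)                                  # compressed sequence
    errors = {name: float(np.mean((dec(z) - x) ** 2))   # fingerprint per module
              for name, dec in decoders.items()}
    best = min(errors, key=errors.get)
    return best, errors
```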

Fonte: arXiv cs.LG

Vision • Score 95

Any-Optical-Model: A Universal Foundation Model for Optical Remote Sensing

Optical satellites, with their diverse band configurations and ground sampling distances, provide indispensable evidence for tasks ranging from ecosystem surveillance to emergency response. However, significant discrepancies in band composition and spatial resolution across optical sensors pose major challenges for existing Remote Sensing Foundation Models (RSFMs).

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Toward Ethical AI Through Bayesian Uncertainty in Neural Question Answering

arXiv:2512.17677v1 Announce Type: new Abstract: We explore Bayesian reasoning as a means to quantify uncertainty in neural networks for question answering. Starting with a multilayer perceptron on the Iris dataset, we show how posterior inference conveys confidence in predictions. We then extend this to language models, applying Bayesian inference first to a frozen head and finally to LoRA-adapted transformers, evaluated on the CommonsenseQA benchmark. Rather than aiming for state-of-the-art accuracy, we compare Laplace approximations against maximum a posteriori (MAP) estimates to highlight uncertainty calibration and selective prediction. This allows models to abstain when confidence is low. An "I don't know" response not only improves interpretability but also illustrates how Bayesian methods can contribute to more responsible and ethical deployment of neural question-answering systems.
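
The selective-prediction step reduces to a simple rule: average the predictive distributions over posterior samples and abstain when the top class is not confident enough. The sampler interface and the threshold value in this sketch are illustrative assumptions.

```python
import numpy as np

def selective_answer(logits_per_sample, labels, threshold=0.7):
    """logits_per_sample: (S, C) answer logits from S posterior draws."""
    shifted = logits_per_sample - logits_per_sample.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    mean_probs = probs.mean(axis=0)          # Bayesian model average
    top = int(mean_probs.argmax())
    if mean_probs[top] < threshold:
        return "I don't know"                # abstain under high uncertainty
    return labels[top]
```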

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

QSMOTE-PGM/kPGM: Imbalanced Dataset Classification Based on QSMOTE and kPGM

Quantum-inspired machine learning (QiML) uses mathematical structures from quantum theory to enhance classical algorithms, focusing on inner-product structures in high-dimensional feature spaces. This work presents a unified theoretical and empirical comparison of PGM- and kPGM-based classifiers, analyzing their performance in synthetic oversampling scenarios using Quantum SMOTE (QSMOTE) variants.

Fonte: arXiv cs.LG

RL • Score 96

A Privacy-Preserving Synthetic Dataset of Individual Daily Trajectories for City-Scale Mobility Analytics

Urban mobility data are essential for urban planning and transport demand forecasting, but phone-derived GPS traces cannot be shared because of re-identification risks. This study presents a privacy-preserving synthetic dataset that reconstructs daily trajectories from aggregated inputs, enabling realistic analyses without personal identifiers.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Globally Optimal Solution to the Generalized Relative Pose Estimation Problem using Affine Correspondences

arXiv:2512.17188v1 Announce Type: new Abstract: Mobile devices equipped with a multi-camera system and an inertial measurement unit (IMU) are widely used nowadays, such as self-driving cars. The task of relative pose estimation using visual and inertial information has important applications in various fields. To improve the accuracy of relative pose estimation of multi-camera systems, we propose a globally optimal solver using affine correspondences to estimate the generalized relative pose with a known vertical direction. First, a cost function about the relative rotation angle is established after decoupling the rotation matrix and translation vector, which minimizes the algebraic error of geometric constraints from affine correspondences. Then, the global optimization problem is converted into two polynomials with two unknowns based on the characteristic equation and its first derivative is zero. Finally, the relative rotation angle can be solved using the polynomial eigenvalue solver, and the translation vector can be obtained from the eigenvector. Besides, a new linear solution is proposed when the relative rotation is small. The proposed solver is evaluated on synthetic data and real-world datasets. The experiment results demonstrate that our method outperforms comparable state-of-the-art methods in accuracy.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?

arXiv:2512.17394v1 Announce Type: new Abstract: Theory of Mind (ToM) -- the ability to attribute beliefs, desires, and emotions to others -- is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, where human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse theory of mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.

Fonte: arXiv cs.CL

Vision • Score 96

BIONIX: A Low-Cost Wireless Prosthetic Arm with Dual-Signal EEG and EMG Control

Affordable upper-limb prostheses often lack intuitive control systems, limiting functionality and accessibility for amputees in low-resource settings. This project presents a low-cost neuromuscular control system that integrates electroencephalography (EEG) and electromyography (EMG) to enable real-time control of a prosthetic arm.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Subjective Question Generation and Answer Evaluation using NLP

arXiv:2512.17289v1 Announce Type: new Abstract: Natural Language Processing (NLP) is one of the most revolutionary technologies today. It uses artificial intelligence to understand human text and spoken words. It is used for text summarization, grammar checking, sentiment analysis, and advanced chatbots and has many more potential use cases. Furthermore, it has also made its mark on the education sector. Much research and advancements have already been conducted on objective question generation; however, automated subjective question generation and answer evaluation are still in progress. An automated system to generate subjective questions and evaluate the answers can help teachers assess student work and enhance the student's learning experience by allowing them to self-assess their understanding after reading an article or a chapter of a book. This research aims to improve current NLP models or make a novel one for automated subjective question generation and answer evaluation from text input.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

CIFE: Code Instruction-Following Evaluation

arXiv:2512.17387v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment, developers also expect adherence to explicit requirements for robustness, formatting, and security. Existing benchmarks primarily assess correctness through test-case execution, offering limited insight into how reliably models follow such constraints. We introduce a benchmark of 1,000 Python tasks, each paired with an average of 7 developer-specified constraints spanning 13 categories. Constraints are curated through a four-stage human-LLM pipeline to ensure they are atomic, relevant, and objective. We evaluate 14 open- and closed-source models using complementary adherence metrics and propose the C2A Score, a composite measure that jointly captures correctness and constraint compliance. Results reveal a substantial gap between partial and strict satisfaction: while strong models achieve over 90% partial adherence, strict adherence remains between 39% and 66%. These findings highlight that trustworthy code generation requires not only correctness but also consistent adherence to developer intent.
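
The abstract does not define the C2A Score, so the following is purely one plausible shape for a composite that "jointly captures correctness and constraint compliance": per task, gate the constraint-adherence fraction by whether the tests pass, then average. The gating-by-product choice is an assumption, not the paper's formula.

```python
def c2a_score(tasks):
    """tasks: list of dicts with 'passed' (bool) and 'constraints' (list of bool)."""
    scores = []
    for t in tasks:
        adherence = sum(t["constraints"]) / len(t["constraints"])
        scores.append(float(t["passed"]) * adherence)   # correctness gates credit
    return sum(scores) / len(scores)

example = [{"passed": True,  "constraints": [True, True, False]},
           {"passed": False, "constraints": [True, True, True]}]
print(c2a_score(example))   # -> 0.333...: a correct-but-sloppy task earns 2/3
```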

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Bangla MedER: Multi-BERT Ensemble Approach for the Recognition of Bangla Medical Entity

arXiv:2512.17769v1 Announce Type: new Abstract: Medical Entity Recognition (MedER) is an essential NLP task for extracting meaningful entities from the medical corpus. Nowadays, MedER-based research outcomes can remarkably contribute to the development of automated systems in the medical sector, ultimately enhancing patient care and outcomes. While extensive research has been conducted on MedER in English, low-resource languages like Bangla remain underexplored. Our work aims to bridge this gap. For Bangla medical entity recognition, this study first examined a number of transformer models, including BERT, DistilBERT, ELECTRA, and RoBERTa. We also propose a novel Multi-BERT Ensemble approach that outperformed all baseline models with the highest accuracy of 89.58%. Notably, it provides an 11.80% accuracy improvement over the single-layer BERT model, demonstrating its effectiveness for this task. A major challenge in MedER for low-resource languages is the lack of annotated datasets. To address this issue, we developed a high-quality dataset tailored for the Bangla MedER task. The dataset was used to evaluate the effectiveness of our model through multiple performance metrics, demonstrating its robustness and applicability. Our findings highlight the potential of Multi-BERT Ensemble models in improving MedER for Bangla and set the foundation for further advancements in low-resource medical NLP.

Fonte: arXiv cs.CL

Multimodal • Score 95

Peeking Into The Future For Contextual Biasing

arXiv:2512.17657v1 Announce Type: new Abstract: While end-to-end (E2E) automatic speech recognition (ASR) models excel at general transcription, they struggle to recognize rare or unseen named entities (e.g., contact names, locations), which are critical for downstream applications like virtual assistants. In this paper, we propose a contextual biasing method for attention based encoder decoder (AED) models using a list of candidate named entities. Instead of predicting only the next token, we simultaneously predict multiple future tokens, enabling the model to "peek into the future" and score potential candidate entities in the entity list. Moreover, our approach leverages the multi-token prediction logits directly without requiring additional entity encoders or cross-attention layers, significantly reducing architectural complexity. Experiments on Librispeech demonstrate that our approach achieves up to 50.34% relative improvement in named entity word error rate compared to the baseline AED model.
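
A hedged sketch of the "peek into the future" scoring: at a decoding step the model emits log-probabilities for the next k positions, and each candidate entity is scored by summing the log-probability its tokens receive at the corresponding future positions. The array layout and the way scores would feed back into biasing are assumptions.

```python
import numpy as np

def score_entities(future_logprobs, entity_token_ids):
    """future_logprobs: (k, V) log-probs for the next k positions.

    entity_token_ids: dict mapping entity name -> list of token ids.
    Returns a per-entity score usable as a biasing bonus.
    """
    scores = {}
    for name, tok_ids in entity_token_ids.items():
        k = min(len(tok_ids), future_logprobs.shape[0])
        scores[name] = float(sum(future_logprobs[i, tok_ids[i]]
                                 for i in range(k)))
    return scores
```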

Fonte: arXiv cs.CL

Applications • Score 89

Dion2: A Simple Method for Reducing Matrices in Muon

The Muon optimizer shows strong empirical performance and has solid theoretical grounding. However, the super-linear cost of its orthonormalization step introduces overhead that grows with scale. We present Dion2, a much simpler method than prior approaches for reducing the matrix involved in Muon's computation.
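
For context, a minimal sketch of the kind of orthonormalization step whose cost Dion2 targets: a Newton-Schulz iteration that pushes a gradient matrix toward its nearest semi-orthogonal factor without computing an SVD. This is the textbook cubic variant, not Muon's tuned polynomial, and the step count and scaling are illustrative assumptions.

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    """Approximate U @ V.T from the SVD G = U S V.T, iteratively."""
    X = G / (np.linalg.norm(G) + eps)        # scale so singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X      # each step costs O(m * n * min(m, n))
    return X
```

Because each iteration multiplies full matrices, the cost grows super-linearly in width; reducing the matrix fed into this step is what shrinks the overhead at scale.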

Fonte: arXiv cs.LG

RL • Score 96

The Role of Islamic Ethics in Preventing the Abuse of Artificial Intelligence (AI)-Based Deepfakes

The rapid development of AI-driven deepfake technology has raised global concerns about the manipulation of false information and the theft of online identities. This study aims to formulate a comprehensive Islamic ethical framework to mitigate the risks of deepfake misuse, proposing strategic recommendations for the regulation and responsible governance of the technology.

Fonte: arXiv cs.AI

RL • Score 96

Conservative Bias in Multi-Teacher Learning: Why Agents Prefer Low-Reward Advisers

Interactive reinforcement learning (IRL) has shown promise for enabling autonomous agents and robots to learn complex behaviors from human teachers, but the dynamics of teacher selection remain poorly understood. This paper reveals an unexpected phenomenon in IRL: learning agents prefer conservative, low-reward teachers over teachers offering 20x higher rewards.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models

arXiv:2512.17385v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness heavily relies on supervised training with extensive labeled (e.g., question-answering pairs) or unlabeled datasets (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, even unlabeled code snippets. We introduce the problem space probing, test understanding probing, solution space probing, and knowledge consolidation and reinforcement to probe the internal knowledge and confidence patterns existing in LLMs. Further, IPC identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation to train UCoder (coder with unsupervised learning). We validate the proposed approach across multiple code benchmarks, demonstrating that unsupervised methods can achieve competitive performance compared to supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation tasks, opening new directions for training code LLMs in resource-constrained scenarios.
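
A common instantiation of the self-consistency step, sketched under assumed interfaces (`run`, hashable outputs): execute sampled candidate programs on shared inputs and keep the candidate whose output signature agrees with the largest cluster. This illustrates the mechanism in general, not IPC's exact pipeline.

```python
from collections import Counter

def pick_by_self_consistency(candidates, test_inputs, run):
    """run(code, x) executes a candidate on input x and returns its output."""
    signatures = []
    for code in candidates:
        try:
            sig = tuple(run(code, x) for x in test_inputs)
        except Exception:
            sig = None                      # crashing candidates never win
        signatures.append(sig)
    valid = [s for s in signatures if s is not None]
    if not valid:
        return None
    majority_sig, _ = Counter(valid).most_common(1)[0]
    return candidates[signatures.index(majority_sig)]
```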

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators

arXiv:2512.17267v1 Announce Type: new Abstract: Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too-slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward to equal effect as a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
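
The composition step ("composed via regression to maximize correlation with human signal") admits a very small sketch: fit a regularized linear map from per-example metric scores to the scarce human ratings, then use the fitted combination as the synthesized metric. Ridge regression and the random stand-in arrays below are illustrative assumptions, not the released toolkit.

```python
import numpy as np
from sklearn.linear_model import Ridge

# X: (n_examples, n_metrics) scores from retrieved and generated judges;
# y: (n_examples,) human feedback on the same <100 examples.
rng = np.random.default_rng(0)
X = rng.random((80, 6))                    # stand-in for real metric scores
y = X @ np.array([0.5, 0.1, 0.0, 0.3, 0.0, 0.1]) + 0.05 * rng.standard_normal(80)

composite = Ridge(alpha=1.0).fit(X, y)     # the learned composite metric
new_scores = composite.predict(rng.random((10, 6)))
```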

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

FORMSpoT: A Decade of Tree-Level, Country-Scale Forest Monitoring

arXiv:2512.17021v1 Announce Type: new Abstract: The recent decline of the European forest carbon sink highlights the need for spatially explicit and frequently updated forest monitoring tools. Yet, existing satellite-based disturbance products remain too coarse to detect changes at the scale of individual trees, typically below 100 m$^{2}$. Here, we introduce FORMSpoT (Forest Mapping with SPOT Time series), a decade-long (2014-2024) nationwide mapping of forest canopy height at 1.5 m resolution, together with annual disturbance polygons (FORMSpoT-$\Delta$) covering mainland France. Canopy heights were derived from annual SPOT-6/7 composites using a hierarchical transformer model (PVTv2) trained on high-resolution airborne laser scanning (ALS) data. To enable robust change detection across heterogeneous acquisitions, we developed a dedicated post-processing pipeline combining co-registration and spatio-temporal total variation denoising. Validation against ALS revisits across 19 sites and 5,087 National Forest Inventory plots shows that FORMSpoT-$\Delta$ substantially outperforms existing disturbance products. In mountainous forests, where disturbances are small and spatially fragmented, FORMSpoT-$\Delta$ achieves an F1-score of 0.44, an order of magnitude higher than existing benchmarks. By enabling tree-level monitoring of forest dynamics at national scale, FORMSpoT-$\Delta$ provides a unique tool to analyze management practices, detect early signals of forest decline, and better quantify carbon losses from subtle disturbances such as thinning or selective logging. These results underscore the critical importance of sustaining very high-resolution satellite missions like SPOT and open-data initiatives such as DINAMIS for monitoring forests under climate change.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

When the Gold Standard isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content

arXiv:2512.17738v1 Announce Type: new Abstract: User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation particularly challenging: what counts as a "good" translation depends on the level of standardness desired in the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. Through a case study on large language models (LLMs), we show that translation scores are highly sensitive to prompts with explicit translation instructions for UGC, and that they improve when these align with the dataset's guidelines. We argue that when preserving UGC style is important, fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Confidence-Credibility Aware Weighted Ensembles of Small LLMs Outperform Large LLMs in Emotion Detection

arXiv:2512.17630v1 Announce Type: new Abstract: This paper introduces a confidence-weighted, credibility-aware ensemble framework for text-based emotion detection, inspired by Condorcet's Jury Theorem (CJT). Unlike conventional ensembles that often rely on homogeneous architectures, our approach combines architecturally diverse small transformer-based large language models (sLLMs) - BERT, RoBERTa, DistilBERT, DeBERTa, and ELECTRA, each fully fine-tuned for emotion classification. To preserve error diversity, we minimize parameter convergence while taking advantage of the unique biases of each model. A dual-weighted voting mechanism integrates both global credibility (validation F1 score) and local confidence (instance-level probability) to dynamically weight model contributions. Experiments on the DAIR-AI dataset demonstrate that our credibility-confidence ensemble achieves a macro F1 score of 93.5 percent, surpassing state-of-the-art benchmarks and significantly outperforming large-scale LLMs, including Falcon, Mistral, Qwen, and Phi, even after task-specific Low-Rank Adaptation (LoRA). With only 595M parameters in total, our small LLMs ensemble proves more parameter-efficient and robust than models up to 7B parameters, establishing that carefully designed ensembles of small, fine-tuned models can outperform much larger LLMs in specialized natural language processing (NLP) tasks such as emotion detection.
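
The dual-weighted vote described here has a compact form: each model's class probabilities are scaled by its global credibility (validation macro-F1) and by its local confidence on the instance (here taken as its max probability, which is an assumption), then summed. A small worked example follows.

```python
import numpy as np

def dual_weighted_vote(prob_rows, credibilities):
    """prob_rows: list of (C,) probability vectors, one per model."""
    agg = np.zeros_like(prob_rows[0])
    for p, cred in zip(prob_rows, credibilities):
        confidence = p.max()              # instance-level confidence
        agg += cred * confidence * p      # credibility x confidence weighting
    return int(agg.argmax())

# Three models over four emotion classes; model 2 disagrees but is both
# less credible and less confident, so it cannot flip the decision.
probs = [np.array([0.7, 0.1, 0.1, 0.1]),
         np.array([0.3, 0.4, 0.2, 0.1]),
         np.array([0.6, 0.2, 0.1, 0.1])]
print(dual_weighted_vote(probs, credibilities=[0.93, 0.90, 0.88]))  # -> 0
```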

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

Governance-Aware Hybrid Fine-Tuning for Multilingual Large Language Models

arXiv:2512.17344v1 Announce Type: new Abstract: We present a governance-aware hybrid fine-tuning framework for multilingual, low-resource adaptation of large language models. The core algorithm combines gradient-aligned low-rank updates with structured orthogonal transformations through layer-wise mixing and introduces unitary constraints in selected sub-layers to stabilize deep optimization. In tandem with lightweight, label-free data governance steps, including language identification, near-duplicate removal, and quality filtering, the framework targets accuracy, calibration, and cross-language parity under tight compute budgets. Across XNLI and FLORES, the hybrid approach delivers consistent gains over strong PEFT baselines while maintaining directional balance and improving probability calibration, as shown in Tables II and III. It is more resilient to lightweight orthographic variants, as shown in Table IV, and benefits additively from simple governance steps, as shown in Table V. Training footprint measurements indicate modest overhead and a favorable cost-quality frontier, as shown in Table VI and Figure 2. Together, these results show that hybrid and unitary PEFT provide a stable and accessible path to resource-efficient multilingual adaptation when paired with practical data governance.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding

arXiv:2512.17220v1 Announce Type: new Abstract: Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora

arXiv:2512.17756v1 Announce Type: new Abstract: Comprehension of ancient texts plays an important role in archaeology and understanding of Chinese history and civilization. The rapid development of large language models needs benchmarks that can evaluate their comprehension of ancient characters. Existing Chinese benchmarks are mostly targeted at modern Chinese and transmitted documents in ancient Chinese, but the part of excavated documents in ancient Chinese is not covered. To meet this need, we propose the AncientBench, which aims to evaluate the comprehension of ancient characters, especially in the scenario of excavated documents. The AncientBench is divided into four dimensions, which correspond to the four competencies of ancient character comprehension: glyph comprehension, pronunciation comprehension, meaning comprehension, and contextual comprehension. The benchmark also contains ten tasks, including radical, phonetic radical, homophone, cloze, translation, and more, providing a comprehensive framework for evaluation. We convened archaeological researchers to conduct experimental evaluations, proposed an ancient model as baseline, and conducted extensive experiments on the currently best-performing large language models. The experimental results reveal the great potential of large language models in ancient textual scenarios as well as the gap with humans. Our research aims to promote the development and application of large language models in the field of archaeology and ancient Chinese language.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Simulstream: Open-Source Toolkit for Evaluation and Demonstration of Streaming Speech-to-Text Translation Systems

arXiv:2512.17648v1 Announce Type: new Abstract: Streaming Speech-to-Text Translation (StreamST) requires producing translations concurrently with incoming speech, imposing strict latency constraints and demanding models that balance partial-information decision-making with high translation quality. Research efforts on the topic have so far relied on the SimulEval repository, which is no longer maintained and does not support systems that revise their outputs. In addition, it has been designed for simulating the processing of short segments, rather than long-form audio streams, and it does not provide an easy method to showcase systems in a demo. As a solution, we introduce simulstream, the first open-source framework dedicated to unified evaluation and demonstration of StreamST systems. Designed for long-form speech processing, it supports not only incremental decoding approaches, but also re-translation methods, enabling for their comparison within the same framework both in terms of quality and latency. In addition, it also offers an interactive web interface to demo any system built within the tool.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers

arXiv:2512.17351v1 Announce Type: new Abstract: Understanding architectural differences in language models is challenging, especially at academic-scale pretraining (e.g., 1.3B parameters, 100B tokens), where results are often dominated by noise and randomness. To overcome this, we introduce controlled synthetic pretraining tasks that isolate and evaluate core model capabilities. Within this framework, we discover CANON LAYERS: lightweight architectural components -- named after the musical term "canon" -- that promote horizontal information flow across neighboring tokens. Canon layers compute weighted sums of nearby token representations and integrate seamlessly into Transformers, linear attention, state-space models, or any sequence architecture. We present 12 key results. This includes how Canon layers enhance reasoning depth (e.g., by $2\times$), reasoning breadth, knowledge manipulation, etc. They lift weak architectures like NoPE to match RoPE, and linear attention to rival SOTA linear models like Mamba2/GDN -- validated both through synthetic tasks and real-world academic-scale pretraining. This synthetic playground offers an economical, principled path to isolate core model capabilities often obscured at academic scales. Equipped with infinite high-quality data, it may even PREDICT how future architectures will behave as training pipelines improve -- e.g., through better data curation or RL-based post-training -- unlocking deeper reasoning and hierarchical inference.
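
Since the abstract defines canon layers as weighted sums of nearby token representations, a minimal causal PyTorch sketch is easy to state: a learnable weight per offset into the recent past, applied residually. The window size, initialization, and the past-only (causal) restriction are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CanonLayer(nn.Module):
    """Causal weighted sum over a small window of past token states."""
    def __init__(self, dim, window=4):
        super().__init__()
        self.window = window
        self.weights = nn.Parameter(torch.ones(window) / window)

    def forward(self, x):                       # x: (batch, seq, dim)
        # Left-pad the sequence axis so position t only mixes in t-k.
        padded = F.pad(x, (0, 0, self.window - 1, 0))
        seq = x.shape[1]
        out = torch.zeros_like(x)
        for k in range(self.window):            # offset 0 = current token
            start = self.window - 1 - k
            out = out + self.weights[k] * padded[:, start:start + seq]
        return x + out                          # residual horizontal flow
```

Dropped into a Transformer block (or a state-space block), this adds only `window` parameters while giving every position direct access to its neighbors.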

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience

arXiv:2512.17260v1 Announce Type: new Abstract: Large language models have recently made significant progress to generate rigorous mathematical proofs. In contrast, utilizing LLMs for theorem proving in formal languages (such as Lean) remains challenging and computationally expensive, particularly when addressing problems at the undergraduate level and beyond. In this work, we present \textbf{Seed-Prover 1.5}, a formal theorem-proving model trained via large-scale agentic reinforcement learning, alongside an efficient test-time scaling (TTS) workflow. Through extensive interactions with Lean and other tools, the model continuously accumulates experience during the RL process, substantially enhancing the capability and efficiency of formal theorem proving. Furthermore, leveraging recent advancements in natural language proving, our TTS workflow efficiently bridges the gap between natural and formal languages. Compared to state-of-the-art methods, Seed-Prover 1.5 achieves superior performance with a smaller compute budget. It solves \textbf{88\% of PutnamBench} (undergraduate-level), \textbf{80\% of Fate-H} (graduate-level), and \textbf{33\% of Fate-X} (PhD-level) problems. Notably, using our system, we solved \textbf{11 out of 12 problems} from Putnam 2025 within 9 hours. Our findings suggest that scaling learning from experience, driven by high-quality formal feedback, holds immense potential for the future of formal mathematical reasoning.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation

arXiv:2512.17083v1 Announce Type: cross Abstract: Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of prior work, evaluation practice in dialogue topic segmentation remains dominated by strict boundary matching and F1-based metrics, even as modern LLM-based conversational systems increasingly rely on segmentation to manage conversation history beyond the model's fixed context window, where unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation objective for dialogue topic segmentation that treats boundary density and segment coherence as primary criteria, alongside window-tolerant F1 (W-F1). Through extensive cross-dataset empirical evaluation, we show that reported performance differences across dialogue segmentation benchmarks are driven not by model quality, but by annotation granularity mismatches and sparse boundary labels. This indicates that many reported improvements arise from evaluation artifacts rather than improved boundary detection. We evaluated multiple, structurally distinct dialogue segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Across these settings, we observe high segment coherence combined with extreme oversegmentation relative to sparse labels, producing misleadingly low exact-match F1 scores. We show that topic segmentation is best understood as selecting an appropriate granularity rather than predicting a single correct boundary set. We operationalize this view by explicitly separating boundary scoring from boundary selection.
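
Window-tolerant F1 (W-F1) can be sketched in a few lines: a predicted boundary counts as a hit if an unmatched reference boundary lies within +/- w utterances. The greedy one-to-one matching below is an assumption; the paper may define the tolerance differently.

```python
def window_f1(pred, ref, w=2):
    """pred, ref: sorted lists of boundary indices; w: tolerance window."""
    unmatched = list(ref)
    tp = 0
    for p in pred:
        hit = next((r for r in unmatched if abs(p - r) <= w), None)
        if hit is not None:
            unmatched.remove(hit)       # one-to-one matching
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(window_f1(pred=[4, 11, 19], ref=[5, 12, 25], w=2))  # -> 0.667
```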

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs

arXiv:2512.17189v1 Announce Type: new Abstract: Medical Vision-Language Models (MedVLMs) show immense promise in clinical applicability. However, their reliability is hindered by hallucinations, where models often fail to derive answers from visual evidence, instead relying on learned textual priors. Existing mitigation strategies for MedVLMs have distinct limitations: training-based methods rely on costly expert annotations, limiting scalability, while training-free interventions like contrastive decoding, though data-efficient, apply a global, untargeted correction whose effects in complex real-world clinical settings can be unreliable. To address these challenges, we introduce Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our module leverages an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model's focus onto specified regions, reinforcing anatomical understanding and suppressing factually incorrect outputs. Extensive experiments across diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, demonstrate our method's effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.
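
Of the three re-weighting tiers, the logits level is the simplest to sketch: contrast logits from the region-masked (focused) forward pass against logits from the unfocused pass, while never reviving tokens the focused pass deems implausible. The alpha amplification and beta plausibility cutoff are illustrative assumptions in the spirit of standard contrastive decoding, not ARCD's exact schedule.

```python
import numpy as np

def contrastive_logits(logits_region, logits_plain, alpha=1.0, beta=0.1):
    """Adjust next-token logits toward region-grounded evidence."""
    contrast = (1 + alpha) * logits_region - alpha * logits_plain
    # Adaptive plausibility: keep only tokens the focused pass supports.
    cutoff = logits_region.max() + np.log(beta)
    return np.where(logits_region >= cutoff, contrast, -np.inf)
```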

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL

Deploying accurate, enterprise-grade Text-to-SQL systems involves a difficult trade-off among cost, security, and performance. Current solutions force companies to choose between expensive proprietary Large Language Models (LLMs) and under-performing Small Language Models (SLMs). We propose Struct-SQL, a novel Knowledge Distillation (KD) framework that trains an SLM to emulate a powerful LLM.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Systemic Risk Radar: A Multi-Layer Graph Framework for Early Warning of Market Collapse

Financial crises emerge when structural vulnerabilities accumulate across sectors, markets, and investor behavior. The Systemic Risk Radar (SRR) models financial markets as multi-layer graphs to detect early signals of systemic fragility and transitions into collapse regimes. We evaluate SRR on three major crises: the Dot-com crash, the Global Financial Crisis, and the COVID-19 shock.

Fonte: arXiv cs.AI

Vision • Score 96

Bots Don't Stand Still: A Longitudinal Study of Bot Behavioral Change, Temporal Drift, and Feature-Structure Evolution

Social bots are deeply embedded in online platforms for promotion, persuasion, and manipulation. Most bot-detection systems still treat behavioral features as static, implicitly assuming that bots behave stationarily over time. This study analyzes changes in individual behavioral signals and their interrelations, drawing on 2,615 promotional bot accounts and 2.8 million tweets.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

arXiv:2512.17012v1 Announce Type: new Abstract: Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.

Fonte: arXiv cs.CV

Vision • Score 96

Another Fit Falls: Conformal Prediction as a Calibration Standard for Machine Learning in High-Energy Physics

Machine learning techniques are essential in modern collider research, but their probabilistic outputs often lack calibrated uncertainty estimates. Conformal prediction (CP) offers a simple, distribution-free framework for calibrating arbitrary predictive models with guaranteed, rigorous uncertainty quantification. In this work, we investigate CP as a unifying calibration layer for machine learning applications in high-energy physics.
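
The generic split-conformal recipe the abstract refers to fits in a few lines: calibrate a nonconformity score on held-out data, then emit prediction sets whose finite-sample coverage is 1 - alpha for any underlying model. The score choice (1 minus the softmax of the true class) and the toy calibration values are illustrative, not HEP-specific code.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Quantile of calibration nonconformity scores with the n+1 correction."""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

def prediction_set(probs, threshold):
    """Classification: include every class c whose score 1 - p_c is small."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]

# cal[i] = 1 - p_model(true class of calibration example i)
cal = np.array([0.05, 0.20, 0.10, 0.30, 0.15, 0.25, 0.08, 0.12])
th = conformal_threshold(cal, alpha=0.2)
print(prediction_set([0.70, 0.25, 0.05], th))   # -> [0] for this calibration
```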

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

AVM: Towards Structure-Preserving Neural Response Modeling in the Visual Cortex Across Stimuli and Individuals

arXiv:2512.16948v1 Announce Type: new Abstract: While deep learning models have shown strong performance in simulating neural responses, they often fail to clearly separate stable visual encoding from condition-specific adaptation, which limits their ability to generalize across stimuli and individuals. We introduce the Adaptive Visual Model (AVM), a structure-preserving framework that enables condition-aware adaptation through modular subnetworks, without modifying the core representation. AVM keeps a Vision Transformer-based encoder frozen to capture consistent visual features, while independently trained modulation paths account for neural response variations driven by stimulus content and subject identity. We evaluate AVM in three experimental settings, including stimulus-level variation, cross-subject generalization, and cross-dataset adaptation, all of which involve structured changes in inputs and individuals. Across two large-scale mouse V1 datasets, AVM outperforms the state-of-the-art V1T model by approximately 2% in predictive correlation, demonstrating robust generalization, interpretable condition-wise modulation, and high architectural efficiency. Specifically, AVM achieves a 9.1% improvement in explained variance (FEVE) under the cross-dataset adaptation setting. These results suggest that AVM provides a unified framework for adaptive neural modeling across biological and experimental conditions, offering a scalable solution under structural constraints. Its design may inform future approaches to cortical modeling in both neuroscience and biologically inspired AI systems.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv:2512.16978v1 Announce Type: new Abstract: Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both; and while some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and graded rubric for interpretable, and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in our LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: https://github.com/mbzuai-oryx/longshot.

Fonte: arXiv cs.CV

RL • Score 96

Enhancing Tree Species Classification: Insights from YOLOv8 and Explainable AI Applied to TLS Point Cloud Projections

Classifying tree species has been a core research area in forest remote sensing for decades. New sensors and classification approaches, such as TLS and deep learning, achieve state-of-the-art accuracy, but their decision processes remain opaque. We propose a novel method that links Finer-CAM explanations to segments of TLS projections, systematically assessing which features drive species discrimination.

Fonte: arXiv cs.AI

Vision • Score 95

Diagnostic Performance of Universal-Learning Ultrasound AI Across Multiple Organs and Tasks: the UUSIC25 Challenge

arXiv:2512.17279v1 Announce Type: new Abstract: IMPORTANCE: Current ultrasound AI remains fragmented into single-task tools, limiting clinical utility compared to versatile modern ultrasound systems. OBJECTIVE: To evaluate the diagnostic accuracy and efficiency of single general-purpose deep learning models for multi-organ classification and segmentation. DESIGN: The Universal UltraSound Image Challenge 2025 (UUSIC25) involved developing algorithms on 11,644 images (public/private). Evaluation used an independent, multi-center test set of 2,479 images, including data from a center completely unseen during training to assess generalization. OUTCOMES: Diagnostic performance (Dice Similarity Coefficient [DSC]; Area Under the Receiver Operating Characteristic Curve [AUC]) and computational efficiency (inference time, GPU memory). RESULTS: Of 15 valid algorithms, the top model (SMART) achieved a macro-averaged DSC of 0.854 across 5 segmentation tasks and AUC of 0.766 for binary classification. Models showed high capability in segmentation (e.g., fetal head DSC: 0.942) but variability in complex tasks subject to domain shift. Notably, in breast cancer molecular subtyping, the top model's performance dropped from AUC 0.571 (internal) to 0.508 (unseen external center), highlighting generalization challenges. CONCLUSIONS: General-purpose AI models achieve high accuracy and efficiency across multiple tasks using a single architecture. However, performance degradation on unseen data suggests domain generalization is critical for future clinical deployment.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning

arXiv:2512.17227v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) demonstrate significant potential but remain brittle in complex, long-chain visual reasoning tasks. A critical failure mode is "visual forgetting", where models progressively lose visual grounding as reasoning extends, a phenomenon aptly described as "think longer, see less". We posit this failure stems from current training paradigms prematurely entangling two distinct cognitive skills: (1) abstract logical reasoning ("how-to-think") and (2) strategic visual perception ("when-to-look"). This creates a foundational cold-start deficiency -- weakening abstract reasoning -- and a strategic perception deficit, as models lack a policy for when to perceive. In this paper, we propose a novel curriculum-based framework to disentangle these skills. First, we introduce a disentangled Supervised Fine-Tuning (SFT) curriculum that builds a robust abstract reasoning backbone on text-only data before anchoring it to vision with a novel Perception-Grounded Chain-of-Thought (PG-CoT) paradigm. Second, we resolve the strategic perception deficit by formulating timing as a reinforcement learning problem. We design a Pivotal Perception Reward that teaches the model when to look by coupling perceptual actions to linguistic markers of cognitive uncertainty (e.g., "wait", "verify"), thereby learning an autonomous grounding policy. Our contributions include the formalization of these two deficiencies and the development of a principled, two-stage framework to address them, transforming the model from a heuristic-driven observer to a strategic, grounded reasoner. Code: https://github.com/gaozilve-max/learning-when-to-look

Fonte: arXiv cs.CV

Vision • Score 95

Robust Scene Coordinate Regression via Geometrically-Consistent Global Descriptors

arXiv:2512.17226v1 Announce Type: new Abstract: Recent learning-based visual localization methods use global descriptors to disambiguate visually similar places, but existing approaches often derive these descriptors from geometric cues alone (e.g., covisibility graphs), limiting their discriminative power and reducing robustness in the presence of noisy geometric constraints. We propose an aggregator module that learns global descriptors consistent with both geometrical structure and visual similarity, ensuring that images are close in descriptor space only when they are visually similar and spatially connected. This corrects erroneous associations caused by unreliable overlap scores. Using a batch-mining strategy based solely on the overlap scores and a modified contrastive loss, our method trains without manual place labels and generalizes across diverse environments. Experiments on challenging benchmarks show substantial localization gains in large-scale environments while preserving computational and memory efficiency. Code is available at https://github.com/sontung/robust_scr.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval

Large Language Model (LLM) agents increasingly rely on long-term memory and Retrieval-Augmented Generation (RAG) to persist experiences and improve future performance. This paper presents MemoryGraft, a novel indirect injection attack that compromises agent behavior by implanting malicious experiences into the agent's long-term memory.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Can Synthetic Images Serve as Effective and Efficient Class Prototypes?

Vision-Language Models (VLMs) have shown strong performance on zero-shot image classification tasks. However, existing methods such as Contrastive Language-Image Pre-training (CLIP) rely on annotated image-text pairs, which raises the cost and precision requirements of preparing high-quality datasets. We present the LGCLIP framework, which uses a Large Language Model (LLM) to generate class-specific prompts, enabling lightweight and efficient synthesis of reference images.

Fonte: arXiv cs.CV

Evaluation/Benchmarks • Score 92

A Generic Machine Learning Framework for Radio Frequency Fingerprinting

arXiv:2510.09775v3 Announce Type: replace-cross Abstract: Fingerprinting radio frequency (RF) emitters typically involves finding unique characteristics that are featured in their received signal. These fingerprints are nuanced, but sufficiently detailed, motivating the pursuit of methods that can successfully extract them. The downstream task that requires the most meticulous RF fingerprinting (RFF) is known as specific emitter identification (SEI), which entails recognising each individual transmitter. RFF and SEI have a long history, with numerous defence and civilian applications such as signal intelligence, electronic surveillance, and physical-layer authentication of wireless devices, to name a few. In recent years, data-driven RFF approaches have become popular due to their ability to automatically learn intricate fingerprints. They generally deliver superior performance when compared to traditional RFF techniques that are often labour-intensive, inflexible, and only applicable to a particular emitter type or transmission scheme. In this paper, we present a generic and versatile machine learning (ML) framework for data-driven RFF with several popular downstream tasks such as SEI, emitter data association (EDA) and RF emitter clustering (RFEC). It is emitter-type agnostic. We then demonstrate the introduced framework for several tasks using real RF datasets for spaceborne surveillance, signal intelligence, and counter-drone applications.

Fonte: arXiv stat.ML

Privacy/Security/Fairness • Score 90

Graph Attention Networks for Epilepsy Detection from EEG Signals Using Affordable Hardware in Low-Resource Settings

Epilepsy remains underdiagnosed in low-income countries due to a shortage of neurologists and expensive diagnostic tools. We propose a graph-based deep learning framework for detecting epilepsy from low-cost electroencephalography (EEG) hardware, tested on recordings from Nigeria and Guinea-Bissau.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Mitty: Diffusion-Based Human-to-Robot Video Generation

Learning directly from human demonstration videos is an important milestone for scalable, generalizable robot learning. We present Mitty, a Diffusion Transformer that enables video In-Context Learning for Human2Robot video generation, building on a pretrained video diffusion model and requiring no action labels.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Sample, Don't Search: Rethinking Test-Time Alignment for Language Models

arXiv:2504.03790v2 Announce Type: replace-cross Abstract: Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). A practical solution to aligning language models at test time using additional computation without degradation, our approach expands the limits of the capability that can be obtained from off-the-shelf language models without further training.

Fonte: arXiv stat.ML

MLOps/Systems • Score 92

Look-Ahead Reasoning on Learning Platforms

arXiv:2511.14745v2 Announce Type: replace-cross Abstract: On many learning platforms, the optimization criteria guiding model training reflect the priorities of the designer rather than those of the individuals they affect. Consequently, users may act strategically to obtain more favorable outcomes. While past work has studied strategic user behavior on learning platforms, the focus has largely been on strategic responses to a deployed model, without considering the behavior of other users. In contrast, look-ahead reasoning takes into account that user actions are coupled, and -- at scale -- impact future predictions. Within this framework, we first formalize level-k thinking, a concept from behavioral economics, where users aim to outsmart their peers by looking one step ahead. We show that, while convergence to an equilibrium is accelerated, the equilibrium remains the same, providing no benefit of higher-level reasoning for individuals in the long run. Then, we focus on collective reasoning, where users take coordinated actions by optimizing through their joint impact on the model. By contrasting collective with selfish behavior, we characterize the benefits and limits of coordination; a new notion of alignment between the learner's and the users' utilities emerges as a key concept. Look-ahead reasoning can be seen as a generalization of algorithmic collective action; we thus offer the first results characterizing the utility trade-offs of coordination when contesting algorithmic systems.

Fonte: arXiv stat.ML

RL • Score 93

Translating the Rashomon Effect to Sequential Decision-Making Tasks

The Rashomon effect describes the phenomenon in which multiple models trained on the same data produce identical predictions yet differ in the features they use internally. This work translates the Rashomon effect to sequential decision making, defining it as multiple policies that exhibit identical behavior while differing in internal structure. Our experiments demonstrate that the Rashomon effect exists in sequential decision-making tasks.

Fonte: arXiv cs.AI

RL • Score 89

Incremental Generation is Necessary and Sufficient for Universality in Flow-Based Modelling

arXiv:2511.09902v2 Announce Type: replace-cross Abstract: Incremental flow-based denoising models have reshaped generative modelling, but their empirical advantage still lacks a rigorous approximation-theoretic foundation. We show that incremental generation is necessary and sufficient for universal flow-based generation on the largest natural class of self-maps of $[0,1]^d$ compatible with denoising pipelines, namely the orientation-preserving homeomorphisms of $[0,1]^d$. All our guarantees are uniform on the underlying maps and hence imply approximation both samplewise and in distribution. Using a new topological-dynamical argument, we first prove an impossibility theorem: the class of all single-step autonomous flows, independently of the architecture, width, depth, or Lipschitz activation of the underlying neural network, is meagre and therefore not universal in the space of orientation-preserving homeomorphisms of $[0,1]^d$. By exploiting algebraic properties of autonomous flows, we conversely show that every orientation-preserving Lipschitz homeomorphism on $[0,1]^d$ can be approximated at rate $O(n^{-1/d})$ by a composition of at most $K_d$ such flows, where $K_d$ depends only on the dimension. Under additional smoothness assumptions, the approximation rate can be made dimension-free, and $K_d$ can be chosen uniformly over the class being approximated. Finally, by linearly lifting the domain into one higher dimension, we obtain structured universal approximation results for continuous functions and for probability measures on $[0,1]^d$, the latter realized as pushforwards of empirical measures with vanishing $1$-Wasserstein error.
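
To make the positive direction concrete, the guarantee can be restated schematically (under the abstract's assumptions; $n$ is the abstract's rate parameter for the size of the approximating flows):

$$
\sup_{x \in [0,1]^d} \left\| T(x) - \big(\Phi^{(K_d)} \circ \cdots \circ \Phi^{(1)}\big)(x) \right\| = O\!\left(n^{-1/d}\right),
$$

where $T$ is an orientation-preserving Lipschitz homeomorphism of $[0,1]^d$ and each $\Phi^{(k)}$ is the time-$1$ map of a single autonomous flow. The impossibility theorem says no single such $\Phi$ suffices, while a composition of at most $K_d$ of them does.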

Fonte: arXiv stat.ML

Applications • Score 89

On Agnostic PAC Learning in the Small Error Regime

arXiv:2502.09496v2 Announce Type: replace-cross Abstract: Binary classification in the classic PAC model exhibits a curious phenomenon: Empirical Risk Minimization (ERM) learners are suboptimal in the realizable case yet optimal in the agnostic case. Roughly speaking, this owes itself to the fact that non-realizable distributions $\mathcal{D}$ are simply more difficult to learn than realizable distributions -- even when one discounts a learner's error by $\mathrm{err}(h^*_{\mathcal{D}})$, the error of the best hypothesis in $\mathcal{H}$ for $\mathcal{D}$. Thus, optimal agnostic learners are permitted to incur excess error on (easier-to-learn) distributions $\mathcal{D}$ for which $\tau = \mathrm{err}(h^*_{\mathcal{D}})$ is small. Recent work of Hanneke, Larsen, and Zhivotovskiy (FOCS '24) addresses this shortcoming by including $\tau$ itself as a parameter in the agnostic error term. In this more fine-grained model, they demonstrate tightness of the error lower bound $\tau + \Omega \left(\sqrt{\frac{\tau (d + \log(1 / \delta))}{m}} + \frac{d + \log(1 / \delta)}{m} \right)$ in a regime where $\tau > d/m$, and leave open the question of whether there may be a higher lower bound when $\tau \approx d/m$, with $d$ denoting $\mathrm{VC}(\mathcal{H})$. In this work, we resolve this question by exhibiting a learner which achieves error $c \cdot \tau + O \left(\sqrt{\frac{\tau (d + \log(1 / \delta))}{m}} + \frac{d + \log(1 / \delta)}{m} \right)$ for a constant $c \leq 2.1$, thus matching the lower bound when $\tau \approx d/m$. Further, our learner is computationally efficient and is based upon careful aggregations of ERM classifiers, making progress on two other questions of Hanneke, Larsen, and Zhivotovskiy (FOCS '24). We leave open the interesting question of whether our approach can be refined to lower the constant from 2.1 to 1, which would completely settle the complexity of agnostic learning.

Fonte: arXiv stat.ML

Vision • Score 95

EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories

arXiv:2512.17320v1 Announce Type: new Abstract: The widespread adoption of text-to-image (T2I) generation has raised concerns about privacy, bias, and copyright violations. Concept erasure techniques offer a promising solution by selectively removing undesired concepts from pre-trained models without requiring full retraining. However, these methods are often evaluated on a limited set of concepts, relying on overly simplistic and direct prompts. To test the boundaries of concept erasure techniques, and assess whether they truly remove targeted concepts from model representations, we introduce EMMA, a benchmark that evaluates five key dimensions of concept erasure over 12 metrics. EMMA goes beyond standard metrics like image quality and time efficiency, testing robustness under challenging conditions, including indirect descriptions, visually similar non-target concepts, and potential gender and ethnicity bias, providing a socially aware analysis of method behavior. Using EMMA, we analyze five concept erasure methods across five domains (objects, celebrities, art styles, NSFW, and copyright). Our results show that existing methods struggle with implicit prompts (i.e., generating the erased concept when it is indirectly referenced) and visually similar non-target concepts (i.e., failing to generate non-targeted concepts resembling the erased one), while some amplify gender and ethnicity bias compared to the original model.

Fonte: arXiv cs.CV

MLOps/Systems • Score 96

UmniBench: An Omni-dimensional Benchmark for Unified Understanding and Generation Models

UmniBench is a benchmark designed for unified multimodal models (UMMs), enabling omni-dimensional evaluation. It assesses understanding, generation, and editing within a single process, using human-vetted prompts and question-answer pairs. Covering 13 major domains and more than 200 concepts, UmniBench provides a comprehensive, fine-grained evaluation of UMMs.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

On the Role of Contextual Information and Ego States in the Behavior of LLM Agents for Transactional Analysis Dialogues

LLM-powered agents are used in areas ranging from customer support to education, with growing interest in their ability to act in more human-like ways. This paper proposes a Multi-Agent System (MAS) inspired by Transactional Analysis (TA) theory, in which each agent is split into three ego states: Parent, Adult, and Child, enriching the response process with a contextual information retrieval mechanism.

Fonte: arXiv cs.AI

Vision • Score 95

ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching

arXiv:2512.17178v1 Announce Type: new Abstract: Contrastive Language-Image Pretraining (CLIP) has achieved remarkable performance in various multimodal tasks. However, it still struggles with compositional image-text matching, particularly in accurately associating objects with their corresponding attributes, because its inherent global representation often overlooks fine-grained semantics for attribute binding. Existing methods often require additional training or extensive hard negative sampling, yet they frequently show limited generalization to novel compositional concepts and fail to fundamentally address the drawbacks of global representations. In this paper, we propose ABE-CLIP, a novel training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. Specifically, we employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in the text, thereby mitigating attribute confusion and improving semantic precision. We further introduce a Local Token-Patch Alignment strategy that computes similarity scores between refined textual tokens and their most relevant image patches. By aggregating localized similarity scores, ABE-CLIP computes the final image-text similarity. Experiments on multiple datasets demonstrate that ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training.
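
The local scoring step can be illustrated with a short sketch: each refined text token is matched to its most similar image patch, and the localized similarities are aggregated into a final score. The tensor shapes and the max/mean aggregation here are assumptions chosen for clarity, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def local_alignment_score(text_tokens: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
    """Hedged sketch of local token-patch alignment.

    text_tokens:   (T, D) refined token embeddings (objects and attributes)
    image_patches: (P, D) patch embeddings from the image encoder
    """
    t = F.normalize(text_tokens, dim=-1)
    v = F.normalize(image_patches, dim=-1)
    sim = t @ v.T                        # (T, P) cosine similarities
    per_token = sim.max(dim=1).values    # best-matching patch per token
    return per_token.mean()              # aggregate into an image-text score
```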

Fonte: arXiv cs.CV

RL • Score 95

Safeguarded Stochastic Polyak Step Sizes for Non-smooth Optimization: Robust Performance Without Small (Sub)Gradients

arXiv:2512.02342v2 Announce Type: replace-cross Abstract: The stochastic Polyak step size (SPS) has proven to be a promising choice for stochastic gradient descent (SGD), delivering competitive performance relative to state-of-the-art methods on smooth convex and non-convex optimization problems, including deep neural network training. However, extensions of this approach to non-smooth settings remain in their early stages, often relying on interpolation assumptions or requiring knowledge of the optimal solution. In this work, we propose a novel SPS variant, Safeguarded SPS (SPS$_{safe}$), for the stochastic subgradient method, and provide rigorous convergence guarantees for non-smooth convex optimization with no need for strong assumptions. We further incorporate momentum into the update rule, yielding equally tight theoretical results. On non-smooth convex benchmarks, our experiments are consistent with the theoretical predictions on how the safeguard affects the convergence neighborhood. On deep neural networks the proposed step size achieves competitive performance to existing adaptive baselines and exhibits stable behavior across a wide range of problem settings. Moreover, in these experiments, the gradient norms under our step size do not collapse to (near) zero, indicating robustness to vanishing gradients.
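
The basic mechanism can be sketched in a few lines: take the classic Polyak ratio between a suboptimality estimate and the squared (sub)gradient norm, then cap it with a safeguard. The lower bound `f_lb` and the simple `min` cap are placeholders for exposition; the paper's SPS$_{safe}$ safeguard is more refined.

```python
import numpy as np

def sps_safe_step(x, f, subgrad, f_lb=0.0, gamma_max=1.0, eps=1e-12):
    """One subgradient step with a safeguarded stochastic Polyak step size
    (illustrative sketch, not the paper's exact rule)."""
    g = subgrad(x)
    gamma = (f(x) - f_lb) / (np.dot(g, g) + eps)  # Polyak ratio
    gamma = min(gamma, gamma_max)                 # safeguard against blow-up
    return x - gamma * g
```

The cap is what keeps the step well behaved in non-smooth problems, where interpolation fails and subgradient norms need not shrink near the solution.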

Fonte: arXiv stat.ML

RL • Score 95

A Certified Unlearning Approach without Access to Source Data

arXiv:2506.06486v3 Announce Type: replace-cross Abstract: With the growing adoption of data privacy regulations, the ability to erase private or copyrighted information from trained models has become a crucial requirement. Traditional unlearning methods often assume access to the complete training dataset, which is unrealistic in scenarios where the source data is no longer available. To address this challenge, we propose a certified unlearning framework that enables effective data removal without access to the original training data samples. Our approach utilizes a surrogate dataset that approximates the statistical properties of the source data, allowing for controlled noise scaling based on the statistical distance between the two. While our theoretical guarantees assume knowledge of the exact statistical distance, practical implementations typically approximate this distance, resulting in potentially weaker but still meaningful privacy guarantees. This ensures strong guarantees on the model's behavior post-unlearning while maintaining its overall utility. We establish theoretical bounds, introduce practical noise calibration techniques, and validate our method through extensive experiments on both synthetic and real-world datasets. The results demonstrate the effectiveness and reliability of our approach in privacy-sensitive settings.

Fonte: arXiv stat.ML

RL • Score 96

Unexpected Knowledge: Auditing Search Recommendations on Wikipedia and Grokipedia

Encyclopedic knowledge platforms are essential gateways for exploring information online. The recent release of Grokipedia, a fully AI-generated encyclopedia, offers a new alternative to traditional platforms such as Wikipedia. This work presents the first comparative analysis of the search mechanisms of Wikipedia and Grokipedia.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Towards Human-Guided, Data-Centric LLM Co-Pilots

arXiv:2501.10321v3 Announce Type: replace-cross Abstract: Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the disconnect between the needs of domain experts and translating these needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in complex real-world settings where raw data often contains complex issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this, we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel, multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets, we demonstrate CliMB-DC's ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines for handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains -- healthcare, finance, social sciences and more -- to actively participate in driving real-world impact using ML.

Fonte: arXiv stat.ML

Evaluation/Benchmarks • Score 93

Optimizing Text Search: A New Pattern-Matching Algorithm Based on Ukkonen's Approach

In computer science, the efficiency of text search algorithms is crucial for processing large volumes of data in fields such as natural language processing and bioinformatics. This study investigates text search algorithms, focusing on optimizing suffix trees through methods such as splitting and Ukkonen's algorithm, and demonstrates superior time and space efficiency compared to traditional methods.
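
For contrast with the optimized construction, a naive suffix trie makes the trade-off concrete: inserting every suffix explicitly costs O(n^2) time and space, whereas Ukkonen's algorithm builds a compressed suffix tree online in O(n). The sketch below is the naive baseline only, with a hypothetical `$` terminator assumed absent from the text.

```python
class TrieNode:
    __slots__ = ("children",)
    def __init__(self):
        self.children = {}

def build_suffix_trie(text: str) -> TrieNode:
    """Naive O(n^2) suffix trie; illustrates what Ukkonen's O(n) online
    construction of a compressed suffix tree improves upon."""
    root = TrieNode()
    text = text + "$"                      # unique terminator (assumed unused)
    for i in range(len(text)):             # insert every suffix explicitly
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, TrieNode())
    return root

def contains(root: TrieNode, pattern: str) -> bool:
    """After construction, any substring query runs in O(m) for |pattern| = m."""
    node = root
    for ch in pattern:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return True
```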

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Studying the Effects of Collaboration in Interactive Theme Discovery Systems

arXiv:2408.09030v4 Announce Type: replace Abstract: NLP-assisted solutions have gained considerable traction to support qualitative data analysis. However, there does not exist a unified evaluation framework that can account for the many different settings in which qualitative researchers may employ them. In this paper, we take a first step in this direction by proposing an evaluation framework to study the way in which different tools may result in different outcomes depending on the collaboration strategy employed. Specifically, we study the impact of synchronous vs. asynchronous collaboration using two different NLP-assisted qualitative research tools and present a comprehensive analysis of significant differences in the consistency, cohesiveness, and correctness of their outputs.

Fonte: arXiv cs.CL

RL • Score 95

A Survey on Archetypal Analysis

arXiv:2504.12392v2 Announce Type: replace-cross Abstract: Archetypal analysis (AA) was originally proposed in 1994 by Adele Cutler and Leo Breiman as a computational procedure for extracting distinct aspects, so-called archetypes, from observations, with each observational record approximated as a mixture (i.e., convex combination) of these archetypes. AA thereby provides straightforward, interpretable, and explainable representations for feature extraction and dimensionality reduction, facilitating the understanding of the structure of high-dimensional data and enabling wide applications across the sciences. However, AA also faces challenges, particularly as the associated optimization problem is non-convex. This is the first survey that provides researchers and data mining practitioners with an overview of the methodologies and opportunities that AA offers, surveying the many applications of AA across disparate fields of science, as well as best practices for modeling data with AA and its limitations. The survey concludes by explaining crucial future research directions concerning AA.
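
As a minimal working sketch of the optimization problem AA poses, the snippet below fits X ≈ A B X with the rows of A and B constrained to the probability simplex, via alternating projected gradient steps. Real implementations use more robust solvers (e.g., active-set non-negative least squares), and the step size here is an arbitrary assumption; the problem is non-convex, as the survey emphasizes.

```python
import numpy as np

def project_rows_to_simplex(v):
    """Euclidean projection of each row of v onto the probability simplex."""
    u = np.sort(v, axis=1)[:, ::-1]
    css = np.cumsum(u, axis=1) - 1.0
    rho = np.arange(1, v.shape[1] + 1)
    k = (u - css / rho > 0).sum(axis=1)            # support size per row
    tau = css[np.arange(v.shape[0]), k - 1] / k
    return np.maximum(v - tau[:, None], 0.0)

def archetypal_analysis(X, k, iters=500, lr=1e-3, seed=0):
    """Hedged sketch of AA: minimize ||X - A B X||_F^2 with simplex rows."""
    rng = np.random.default_rng(seed)
    A = project_rows_to_simplex(rng.random((X.shape[0], k)))  # sample weights
    B = project_rows_to_simplex(rng.random((k, X.shape[0])))  # archetype weights
    for _ in range(iters):
        Z = B @ X                            # archetypes: convex combos of data
        R = A @ Z - X                        # reconstruction residual
        A = project_rows_to_simplex(A - lr * (R @ Z.T))
        B = project_rows_to_simplex(B - lr * (A.T @ R @ X.T))
    return A, B @ X                          # mixture weights and archetypes
```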

Fonte: arXiv stat.ML

RL • Score 95

Fairness via Independence: A (Conditional) Distance Covariance Framework

arXiv:2412.00720v2 Announce Type: replace-cross Abstract: We explore fairness from a statistical perspective by selectively utilizing either conditional distance covariance or distance covariance statistics as measures to assess the independence between predictions and sensitive attributes. We boost fairness with independence by adding a distance covariance-based penalty to the model's training. Additionally, we present the matrix form of empirical (conditional) distance covariance for parallel calculations to enhance computational efficiency. Theoretically, we provide a proof for the convergence between empirical and population (conditional) distance covariance, establishing necessary guarantees for batch computations. Through experiments conducted on a range of real-world datasets, we have demonstrated that our method effectively bridges the fairness gap in machine learning. Our code is available at https://github.com/liuhaixias1/Fair_dc/.
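
The empirical statistic at the heart of the penalty is short enough to state in full: double-center the pairwise distance matrices of predictions and sensitive attributes, then average their elementwise product. The population version vanishes exactly under independence, which is what motivates it as a fairness regularizer. This is the standard V-statistic form; the paper's matrix formulation for batch computation is analogous.

```python
import numpy as np

def distance_covariance_sq(X, Y):
    """Empirical squared distance covariance of paired samples X (n, p), Y (n, q)."""
    a = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # (n, n) distances
    b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()  # double centering
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return (A * B).mean()
```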

Fonte: arXiv stat.ML

RL • Score 92

Quantifying Uncertainty in the Presence of Distribution Shifts

arXiv:2506.18283v2 Announce Type: replace Abstract: Neural networks make accurate predictions but often fail to provide reliable uncertainty estimates, especially under covariate distribution shifts between training and testing. To address this problem, we propose a Bayesian framework for uncertainty estimation that explicitly accounts for covariate shifts. While conventional approaches rely on fixed priors, the key idea of our method is an adaptive prior, conditioned on both training and new covariates. This prior naturally increases uncertainty for inputs that lie far from the training distribution in regions where predictive performance is likely to degrade. To efficiently approximate the resulting posterior predictive distribution, we employ amortized variational inference. Finally, we construct synthetic environments by drawing small bootstrap samples from the training data, simulating a range of plausible covariate shifts using only the original dataset. We evaluate our method on both synthetic and real-world data. It yields substantially improved uncertainty estimates under distribution shifts.

Fonte: arXiv stat.ML

RL • Score 91

Refined Analysis of Federated Averaging and Federated Richardson-Romberg

arXiv:2412.01389v2 Announce Type: replace Abstract: In this paper, we present a novel analysis of FedAvg with constant step size, relying on the Markov property of the underlying process. We demonstrate that the global iterates of the algorithm converge to a stationary distribution and analyze its resulting bias and variance relative to the problem's solution. We provide a first-order bias expansion in both homogeneous and heterogeneous settings. Interestingly, this bias decomposes into two distinct components: one that depends solely on stochastic gradient noise and another on client heterogeneity. Finally, we introduce a new algorithm based on the Richardson-Romberg extrapolation technique to mitigate this bias.
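
The extrapolation step itself is one line once the bias expansion is in hand: if the stationary point of constant-step FedAvg satisfies $\theta(\eta) = \theta^* + c\,\eta + O(\eta^2)$, running at two step sizes and combining cancels the first-order term. The `run_fedavg` callable and its `step_size` keyword below are hypothetical stand-ins for exposition.

```python
def richardson_romberg(run_fedavg, eta):
    """Richardson-Romberg extrapolation sketch:
        2 * theta(eta / 2) - theta(eta) = theta* + O(eta^2),
    removing the O(eta) bias of constant-step-size FedAvg."""
    theta_full = run_fedavg(step_size=eta)
    theta_half = run_fedavg(step_size=eta / 2)
    return 2 * theta_half - theta_full
```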

Fonte: arXiv stat.ML

RL • Score 92

Regularized Langevin Dynamics for Combinatorial Optimization

arXiv:2502.00277v4 Announce Type: replace-cross Abstract: This work proposes a simple yet effective sampling framework for combinatorial optimization (CO). Our method builds on discrete Langevin dynamics (LD), an efficient gradient-guided generative paradigm. However, we observe that directly applying LD often leads to limited exploration. To overcome this limitation, we propose the Regularized Langevin Dynamics (RLD), which enforces an expected distance between the sampled and current solutions, effectively avoiding local minima. We develop two CO solvers on top of RLD, one based on simulated annealing (SA) and the other based on a neural network (NN). Empirical results on three classic CO problems demonstrate that both of our methods can achieve comparable or better performance against the previous state-of-the-art (SOTA) SA- and NN-based solvers. In particular, our SA algorithm reduces the runtime of the previous SOTA SA method by up to 80%, while achieving equal or superior performance. In summary, RLD offers a promising framework for enhancing both traditional heuristics and NN models to solve CO problems. Our code is available at https://github.com/Shengyu-Feng/RLD4CO.
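
The regularization idea can be illustrated without the gradient-guided machinery: inside a simulated-annealing loop, propose moves that flip a Binomial number of coordinates so that the expected Hamming distance from the current solution is fixed, rather than single-bit flips that explore slowly. This sketch shows the expected-distance constraint only, not the paper's discrete Langevin update.

```python
import math
import random

def rld_style_anneal(energy, x0, steps=10_000, expected_dist=4, t0=1.0):
    """Simulated annealing with proposals at a fixed expected Hamming distance
    from the current binary solution (illustrative sketch)."""
    x, e = list(x0), energy(x0)
    n = len(x)
    p = expected_dist / n                     # per-coordinate flip probability
    for t in range(steps):
        temp = t0 * (1 - t / steps) + 1e-6    # linear cooling schedule
        y = [1 - xi if random.random() < p else xi for xi in x]
        e_new = energy(y)
        if e_new < e or random.random() < math.exp((e - e_new) / temp):
            x, e = y, e_new                   # Metropolis accept
    return x, e
```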

Fonte: arXiv stat.ML

RL • Score 96

About Time: Model-Free Reinforcement Learning with Timed Reward Machines

Reward specification plays a central role in reinforcement learning (RL), guiding agent behavior. In this paper, we propose timed reward machines (TRMs), an extension of reward machines that incorporates timing constraints into the reward structure, enabling more expressive specifications and optimal policies in model-free RL frameworks.
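
A minimal encoding makes the extension concrete: each transition of the machine carries a clock guard, and a reward fires only if the time elapsed in the current machine state falls inside the guard interval. The class below is a hypothetical illustration, not the paper's formal definition.

```python
from dataclasses import dataclass

@dataclass
class TimedRewardMachine:
    """Illustrative TRM: transitions[(state, prop)] = (lo, hi, reward, next)."""
    transitions: dict
    state: str = "u0"
    clock: float = 0.0

    def step(self, prop: str, dt: float) -> float:
        self.clock += dt
        if (self.state, prop) in self.transitions:
            lo, hi, reward, nxt = self.transitions[(self.state, prop)]
            if lo <= self.clock <= hi:              # timing guard satisfied
                self.state, self.clock = nxt, 0.0   # fire and reset the clock
                return reward
        return 0.0

# Example: reward 1 only if "goal" is observed within 10 time units.
trm = TimedRewardMachine({("u0", "goal"): (0.0, 10.0, 1.0, "u1")})
```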

Fonte: arXiv cs.AI

Vision • Score 95

DESSERT: Diffusion-based Event-driven Single-frame Synthesis via Residual Training

arXiv:2512.17323v1 Announce Type: new Abstract: Video frame prediction extrapolates future frames from previous frames, but suffers from prediction errors in dynamic scenes due to the lack of information about the next frame. Event cameras address this limitation by capturing per-pixel brightness changes asynchronously with high temporal resolution. Prior research on event-based video frame prediction has leveraged motion information from event data, often by predicting event-based optical flow and reconstructing frames via pixel warping. However, such approaches introduce holes and blurring when pixel displacement is inaccurate. To overcome this limitation, we propose DESSERT, a diffusion-based event-driven single-frame synthesis framework via residual training. Leveraging a pre-trained Stable Diffusion model, our method is trained on inter-frame residuals to ensure temporal consistency. The training pipeline consists of two stages: (1) an Event-to-Residual Alignment Variational Autoencoder (ER-VAE) that aligns the event frame between anchor and target frames with the corresponding residual, and (2) a diffusion model that denoises the residual latent conditioned on event data. Furthermore, we introduce Diverse-Length Temporal (DLT) augmentation, which improves robustness by training on frame segments of varying temporal lengths. Experimental results demonstrate that our method outperforms existing event-based reconstruction, image-based video frame prediction, event-based video frame prediction, and one-sided event-based video frame interpolation methods, producing sharper and more temporally consistent frame synthesis.

Fonte: arXiv cs.CV

Theory/Optimization • Score 92

Unifying Distributionally Robust Optimization via Optimal Transport Theory

arXiv:2308.05414v2 Announce Type: replace-cross Abstract: In recent years, two prominent paradigms have shaped distributionally robust optimization (DRO), modeling distributional ambiguity through $\phi$-divergences and Wasserstein distances, respectively. While the former focuses on ambiguity in likelihood ratios, the latter emphasizes ambiguity in outcomes and uses a transportation cost function to capture geometric structure in the outcome space. This paper proposes a unified framework that bridges these approaches by leveraging optimal transport (OT) with conditional moment constraints. Our formulation enables adversarial distributions to jointly perturb likelihood ratios and outcomes, yielding a generalized OT coupling between the nominal and perturbed distributions. We further establish key duality results and develop tractable reformulations that highlight the practical power of our unified approach.
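
For orientation, the Wasserstein-side special case that the framework generalizes is the standard OT-DRO problem together with its well-known strong dual:

$$
\sup_{Q:\ \mathsf{OT}_c(P, Q) \le \delta} \mathbb{E}_Q[\ell(Z)]
\;=\;
\inf_{\lambda \ge 0} \Big\{ \lambda \delta + \mathbb{E}_P\Big[ \sup_{z}\, \big\{ \ell(z) - \lambda\, c(Z, z) \big\} \Big] \Big\}.
$$

The unified formulation replaces the plain transport constraint with an OT coupling under conditional moment constraints, so the adversary can perturb likelihood ratios (the $\phi$-divergence side) and outcomes (the Wasserstein side) jointly.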

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

The Stochastic Occupation Kernel (SOCK) Method for Learning Stochastic Differential Equations

arXiv:2505.11622v2 Announce Type: replace Abstract: We present a novel kernel-based method for learning multivariate stochastic differential equations (SDEs). The method follows a two-step procedure: we first estimate the drift term function, then the (matrix-valued) diffusion function given the drift. Occupation kernels are integral functionals on a reproducing kernel Hilbert space (RKHS) that aggregate information over a trajectory. Our approach leverages vector-valued occupation kernels for estimating the drift component of the stochastic process. For diffusion estimation, we extend this framework by introducing operator-valued occupation kernels, enabling the estimation of an auxiliary matrix-valued function as a positive semi-definite operator, from which we readily derive the diffusion estimate. This enables us to avoid common challenges in SDE learning, such as intractable likelihoods, by optimizing a reconstruction-error-based objective. We propose a simple learning procedure that retains strong predictive accuracy while using Fenchel duality to promote efficiency. We validate the method on simulated benchmarks and a real-world dataset of Amyloid imaging in healthy and Alzheimer's disease subjects.

Fonte: arXiv stat.ML

RL • Score 89

SCAFFLSA: Taming Heterogeneity in Federated Linear Stochastic Approximation and TD Learning

arXiv:2402.04114v3 Announce Type: replace Abstract: In this paper, we analyze the sample and communication complexity of the federated linear stochastic approximation (FedLSA) algorithm. We explicitly quantify the effects of local training with agent heterogeneity. We show that the communication complexity of FedLSA scales polynomially with the inverse of the desired accuracy $\epsilon$. To overcome this, we propose SCAFFLSA, a new variant of FedLSA that uses control variates to correct for client drift, and establish its sample and communication complexities. We show that for statistically heterogeneous agents, its communication complexity scales logarithmically with the desired accuracy, similar to Scaffnew. An important finding is that, compared to the existing results for Scaffnew, the sample complexity scales with the inverse of the number of agents, a property referred to as linear speed-up. Achieving this linear speed-up requires completely new theoretical arguments. We apply the proposed method to federated temporal difference learning with linear function approximation and analyze the corresponding complexity improvements.
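
The drift correction can be sketched as a Scaffold-style round: each agent runs local stochastic LSA steps corrected by the difference between global and local control variates, then updates its control variate from the displacement of its iterate. Shapes, the sampling interface, and full participation are assumptions for exposition, not the paper's exact algorithm.

```python
import numpy as np

def scafflsa_style_round(theta, agents, c_global, c_locals, eta, H):
    """One round of federated LSA with control-variate drift correction.

    Each agent draws stochastic observations (A, b) and runs H local steps of
    theta <- theta - eta * (A @ theta - b + c_global - c_i)."""
    thetas, new_c = [], []
    for sample_Ab, c_i in zip(agents, c_locals):
        th = theta.copy()
        for _ in range(H):
            A, b = sample_Ab()                         # noisy local observation
            th -= eta * (A @ th - b + c_global - c_i)  # drift-corrected step
        new_c.append(c_i - c_global + (theta - th) / (eta * H))
        thetas.append(th)
    # Server averages iterates; with full participation the new global control
    # variate is the mean of the updated local ones.
    return np.mean(thetas, axis=0), np.mean(new_c, axis=0), new_c
```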

Fonte: arXiv stat.ML

Privacy/Security/Fairness • Score 92

Mitigating Forgetting in Low Rank Adaptation

arXiv:2512.17720v1 Announce Type: cross Abstract: Parameter-efficient fine-tuning methods, such as Low-Rank Adaptation (LoRA), enable fast specialization of large pre-trained models to different downstream applications. However, this process often leads to catastrophic forgetting of the model's prior domain knowledge. We address this issue with LaLoRA, a weight-space regularization technique that applies a Laplace approximation to Low-Rank Adaptation. Our approach estimates the model's confidence in each parameter and constrains updates in high-curvature directions, preserving prior knowledge while enabling efficient target-domain learning. By applying the Laplace approximation only to the LoRA weights, the method remains lightweight. We evaluate LaLoRA by fine-tuning a Llama model for mathematical reasoning and demonstrate an improved learning-forgetting trade-off, which can be directly controlled via the method's regularization strength. We further explore different loss landscape curvature approximations for estimating parameter confidence, analyze the effect of the data used for the Laplace approximation, and study robustness across hyperparameters.
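
The regularizer has a compact generic form: penalize deviation of the LoRA weights from their pre-fine-tuning values, weighted by a diagonal curvature (confidence) estimate from the Laplace approximation. The diagonal-Fisher choice below is one of several curvature approximations the paper explores; treat it as a sketch, not the exact objective.

```python
import torch

def lalora_style_loss(task_loss, lora_params, prior_params, fisher_diag, strength):
    """Task loss plus a Laplace (curvature-weighted quadratic) penalty applied
    to the LoRA weights only, keeping the regularization lightweight."""
    penalty = sum(
        (f * (p - p0).pow(2)).sum()
        for p, p0, f in zip(lora_params, prior_params, fisher_diag)
    )
    return task_loss + 0.5 * strength * penalty
```

Updates along high-curvature directions, where the prior model is confident, are discouraged; `strength` is the knob that directly trades target-domain learning against forgetting.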

Fonte: arXiv stat.ML

Vision • Score 93

Fose: Fusing One-Step Diffusion and an End-to-End Network for Pansharpening

Pansharpening is an important image fusion task that combines low-resolution multispectral images (LRMSI) and high-resolution panchromatic (PAN) images to obtain high-resolution multispectral images (HRMSI). This work proposes a novel four-stage training strategy that yields a lightweight network, Fose, which fuses a one-step diffusion model with an E2E model, significantly improving efficiency.

Fonte: arXiv cs.AI

Privacy/Security/Fairness • Score 92

Low-Rank Filtering and Smoothing for Sequential Deep Learning

arXiv:2410.06800v2 Announce Type: replace-cross Abstract: Learning multiple tasks sequentially requires neural networks to balance retaining knowledge with being flexible enough to adapt to new tasks. Regularizing network parameters is a common approach, but it rarely incorporates prior knowledge about task relationships, and limits information flow to future tasks only. We propose a Bayesian framework that treats the network's parameters as the state space of a nonlinear Gaussian model, unlocking two key capabilities: (1) A principled way to encode domain knowledge about task relationships, allowing, e.g., control over which layers should adapt between tasks. (2) A novel application of Bayesian smoothing, allowing task-specific models to also incorporate knowledge from models learned later. This does not require direct access to their data, which is crucial, e.g., for privacy-critical applications. These capabilities rely on efficient filtering and smoothing operations, for which we propose diagonal plus low-rank approximations of the precision matrix in the Laplace approximation (LR-LGF). Empirical results demonstrate the efficiency of LR-LGF and the benefits of the unlocked capabilities.
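
The efficiency claim rests on standard linear algebra: with a diagonal-plus-low-rank precision matrix, the solves inside filtering and smoothing reduce via the Woodbury identity to a small r-by-r system. A minimal sketch of the generic solve (not the paper's full filter):

```python
import numpy as np

def dplr_solve(d, U, v):
    """Solve (diag(d) + U @ U.T) x = v in O(n r^2 + r^3) via Woodbury,
    for d of shape (n,), U of shape (n, r), v of shape (n,)."""
    Dinv_v = v / d
    Dinv_U = U / d[:, None]
    S = np.eye(U.shape[1]) + U.T @ Dinv_U   # small (r, r) capacitance matrix
    return Dinv_v - Dinv_U @ np.linalg.solve(S, U.T @ Dinv_v)
```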

Fonte: arXiv stat.ML