NLP / LLMs

See the articles under this label, with PT-BR translations.


Articles

🧠 NLP / LLMs: 326 article(s) found


NLP/LLMs • Score 85

Semantic Alignment of Multilingual Knowledge Graphs via Contextualized Vector Projections

The article presents our work on a cross-lingual ontology alignment system that uses embedding-based cosine similarity matching. Ontology entities are contextually enriched through descriptions created with novel techniques. We evaluate our work on the OAEI-2022 multifarm track, achieving a 71% F1 score, indicating the effectiveness of our alignment pipeline.
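
For readers unfamiliar with this kind of pipeline, here is a minimal sketch of embedding-based alignment. The `embed` stub stands in for whatever multilingual encoder the authors actually used, and the threshold and greedy 1-best matching are arbitrary illustrative choices, not the paper's method.

```python
# Minimal sketch of embedding-based cross-lingual ontology alignment;
# `embed` is a placeholder for any multilingual sentence encoder.
import numpy as np

def embed(texts):
    """Stub: return unit-norm embeddings for a list of entity descriptions."""
    rng = np.random.default_rng(0)                 # illustration only
    vecs = rng.normal(size=(len(texts), 384))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def align(src_entities, tgt_entities, threshold=0.8):
    """Match source to target entities by cosine similarity of descriptions."""
    src_vecs, tgt_vecs = embed(src_entities), embed(tgt_entities)
    sims = src_vecs @ tgt_vecs.T                   # cosine sim (unit vectors)
    matches = []
    for i, row in enumerate(sims):
        j = int(np.argmax(row))                    # greedy 1-best match
        if row[j] >= threshold:
            matches.append((src_entities[i], tgt_entities[j], float(row[j])))
    return matches
```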

Source: arXiv cs.AI

NLP/LLMs • Score 85

Can Large Language Models Solve Engineering Equations? A Systematic Comparison of Direct Prediction and Solver-Assisted Approaches

arXiv:2601.01774v1 Announce Type: new Abstract: Transcendental equations requiring iterative numerical solution pervade engineering practice, from fluid mechanics friction factor calculations to orbital position determination. We systematically evaluate whether Large Language Models can solve these equations through direct numerical prediction or whether a hybrid architecture combining LLM symbolic manipulation with classical iterative solvers proves more effective. Testing six state-of-the-art models (GPT-5.1, GPT-5.2, Gemini-3-Flash, Gemini-2.5-Lite, Claude-Sonnet-4.5, Claude-Opus-4.5) on 100 problems spanning seven engineering domains, we compare direct prediction against solver-assisted computation where LLMs formulate governing equations and provide initial conditions while Newton-Raphson iteration performs numerical solution. Direct prediction yields mean relative errors of 0.765 to 1.262 across models, while solver-assisted computation achieves 0.225 to 0.301, representing error reductions of 67.9% to 81.8%. Domain-specific analysis reveals dramatic improvements in Electronics (93.1%) due to exponential equation sensitivity, contrasted with modest gains in Fluid Mechanics (7.2%) where LLMs exhibit effective pattern recognition. These findings establish that contemporary LLMs excel at symbolic manipulation and domain knowledge retrieval but struggle with precision-critical iterative arithmetic, suggesting their optimal deployment as intelligent interfaces to classical numerical solvers rather than standalone computational engines.
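
The solver-assisted architecture the abstract describes can be pictured with a short sketch: the LLM's contribution (a residual equation plus an initial guess) is hard-coded below, and Newton-Raphson performs the precision-critical iteration. The Colebrook friction-factor equation is used as the example because the abstract cites fluid-mechanics friction factors; the pipe parameters are assumed values.

```python
# Hedged sketch of the solver-assisted pattern: the LLM's role (formulating
# the residual and an initial guess) is mocked; Newton-Raphson iterates.
import math

def newton_raphson(f, x0, tol=1e-10, max_iter=100, h=1e-8):
    """Generic Newton iteration with a finite-difference derivative."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        dfx = (f(x + h) - fx) / h                  # numerical slope
        x -= fx / dfx
    return x

# Colebrook equation: 1/sqrt(f) = -2 log10(eps/(3.7 D) + 2.51/(Re sqrt(f))).
eps_over_D, Re = 1e-4, 1e6                         # assumed pipe parameters
residual = lambda f: (1 / math.sqrt(f)
                      + 2 * math.log10(eps_over_D / 3.7 + 2.51 / (Re * math.sqrt(f))))
f = newton_raphson(residual, x0=0.02)              # x0 as an LLM might propose
print(f"friction factor ~ {f:.6f}")
```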

Source: arXiv cs.AI

NLP/LLMs • Score 85

COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs

arXiv:2601.01836v1 Announce Type: new Abstract: As large language models are deployed in high-stakes enterprise applications, from healthcare to finance, ensuring adherence to organization-specific policies has become essential. Yet existing safety evaluations focus exclusively on universal harms. We present COMPASS (Company/Organization Policy Alignment Assessment), the first systematic framework for evaluating whether LLMs comply with organizational allowlist and denylist policies. We apply COMPASS to eight diverse industry scenarios, generating and validating 5,920 queries that test both routine compliance and adversarial robustness through strategically designed edge cases. Evaluating seven state-of-the-art models, we uncover a fundamental asymmetry: models reliably handle legitimate requests (>95% accuracy) but catastrophically fail at enforcing prohibitions, refusing only 13-40% of adversarial denylist violations. These results demonstrate that current LLMs lack the robustness required for policy-critical deployments, establishing COMPASS as an essential evaluation framework for organizational AI safety.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Simulated Reasoning is Reasoning

arXiv:2601.02043v1 Announce Type: new Abstract: Reasoning has long been understood as a pathway between stages of understanding. Proper reasoning leads to understanding of a given subject. This reasoning was conceptualized as a process of understanding in a particular way, i.e., "symbolic reasoning". Foundational Models (FM) demonstrate that this is not a necessary condition for many reasoning tasks: they can "reason" by way of imitating the process of "thinking out loud", testing the produced pathways, and iterating on these pathways on their own. This leads to some form of reasoning that can solve problems on its own or with few-shot learning, but appears fundamentally different from human reasoning due to its lack of grounding and common sense, leading to brittleness of the reasoning process. These insights promise to substantially alter our assessment of reasoning and its necessary conditions, but also inform the approaches to safety and robust defences against this brittleness of FMs. This paper offers and discusses several philosophical interpretations of this phenomenon, argues that the previously apt metaphor of the "stochastic parrot" has lost its relevance and thus should be abandoned, and reflects on different normative elements in the safety- and appropriateness-considerations emerging from these reasoning models and their growing capacity.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Clinical Knowledge Graph Construction and Evaluation with Multi-LLMs via Retrieval-Augmented Generation

arXiv:2601.01844v1 Announce Type: new Abstract: Large language models (LLMs) offer new opportunities for constructing knowledge graphs (KGs) from unstructured clinical narratives. However, existing approaches often rely on structured inputs and lack robust validation of factual accuracy and semantic consistency, limitations that are especially problematic in oncology. We introduce an end-to-end framework for clinical KG construction and evaluation directly from free text using multi-agent prompting and a schema-constrained Retrieval-Augmented Generation (KG-RAG) strategy. Our pipeline integrates (1) prompt-driven entity, attribute, and relation extraction; (2) entropy-based uncertainty scoring; (3) ontology-aligned RDF/OWL schema generation; and (4) multi-LLM consensus validation for hallucination detection and semantic refinement. Beyond static graph construction, the framework supports continuous refinement and self-supervised evaluation, enabling iterative improvement of graph quality. Applied to two oncology cohorts (PDAC and BRCA), our method produces interpretable, SPARQL-compatible, and clinically grounded knowledge graphs without relying on gold-standard annotations. Experimental results demonstrate consistent gains in precision, relevance, and ontology compliance over baseline methods.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Improving Behavioral Alignment in LLM Social Simulations via Context Formation and Navigation

arXiv:2601.01546v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to simulate human behavior in experimental settings, but they systematically diverge from human decisions in complex decision-making environments, where participants must anticipate others' actions and form beliefs based on observed behavior. We propose a two-stage framework for improving behavioral alignment. The first stage, context formation, explicitly specifies the experimental design to establish an accurate representation of the decision task and its context. The second stage, context navigation, guides the reasoning process within that representation to make decisions. We validate this framework through a focal replication of a sequential purchasing game with quality signaling (Kremer and Debo, 2016), extending to a crowdfunding game with costly signaling (Cason et al., 2025) and a demand-estimation task (Gui and Toubia, 2025) to test generalizability across decision environments. Across four state-of-the-art (SOTA) models (GPT-4o, GPT-5, Claude-4.0-Sonnet-Thinking, DeepSeek-R1), we find that complex decision-making environments require both stages to achieve behavioral alignment with human benchmarks, whereas the simpler demand-estimation task requires only context formation. Our findings clarify when each stage is necessary and provide a systematic approach for designing and diagnosing LLM social simulations as complements to human subjects in behavioral research.

Source: arXiv cs.AI

NLP/LLMs • Score 85

XAI-MeD: Explainable Knowledge Guided Neuro-Symbolic Framework for Domain Generalization and Rare Class Detection in Medical Imaging

arXiv:2601.02008v1 Announce Type: new Abstract: Explainability, domain generalization, and rare-class reliability are critical challenges in medical AI, where deep models often fail under real-world distribution shifts and exhibit bias against infrequent clinical conditions. This paper introduces XAI-MeD, an explainable medical AI framework that integrates clinically accurate expert knowledge into deep learning through a unified neuro-symbolic architecture. XAI-MeD is designed to improve robustness under distribution shift, enhance rare-class sensitivity, and deliver transparent, clinically aligned interpretations. The framework encodes clinical expertise as logical connectives over atomic medical propositions, transforming them into machine-checkable, class-specific rules. Their diagnostic utility is quantified through weighted feature-satisfaction scores, enabling a symbolic reasoning branch that complements neural predictions. A confidence-weighted fusion integrates symbolic and deep outputs, while a Hunt-inspired adaptive routing mechanism guided by Entropy Imbalance Gain (EIG) and Rare-Class Gini mitigates class imbalance, high intra-class variability, and uncertainty. We evaluate XAI-MeD across diverse modalities on four challenging tasks, including (i) Seizure Onset Zone (SOZ) localization from rs-fMRI and (ii) Diabetic Retinopathy grading; experiments across 6 multicenter datasets demonstrate substantial performance improvements, including 6 percent gains in cross-domain generalization and a 10 percent improved rare-class F1 score, far outperforming state-of-the-art deep learning baselines. Ablation studies confirm that the clinically grounded symbolic components act as effective regularizers, ensuring robustness to distribution shifts. XAI-MeD thus provides a principled, clinically faithful, and interpretable approach to multimodal medical AI.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Reading Between the Lines: Deconfounding Causal Estimates using Text Embeddings and Deep Learning

arXiv:2601.01511v1 Announce Type: new Abstract: Estimating causal treatment effects in observational settings is frequently compromised by selection bias arising from unobserved confounders. While traditional econometric methods struggle when these confounders are orthogonal to structured covariates, high-dimensional unstructured text often contains rich proxies for these latent variables. This study proposes a Neural Network-Enhanced Double Machine Learning (DML) framework designed to leverage text embeddings for causal identification. Using a rigorous synthetic benchmark, we demonstrate that unstructured text embeddings capture critical confounding information that is absent from structured tabular data. However, we show that standard tree-based DML estimators retain substantial bias (+24%) due to their inability to model the continuous topology of embedding manifolds. In contrast, our deep learning approach reduces bias to -0.86% with optimized architectures, effectively recovering the ground-truth causal parameter. These findings suggest that deep learning architectures are essential for satisfying the unconfoundedness assumption when conditioning on high-dimensional natural language data.
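
As a rough illustration of the estimator family involved, here is a single-split partialling-out (Robinson-style) DML sketch with neural nuisance models over embedding features. The paper's actual architecture and full cross-fitting scheme are not specified here, so treat this as generic.

```python
# Partialling-out DML with neural nuisance models over text embeddings X;
# cross-fitting is reduced to one split for brevity (illustrative only).
import numpy as np
from sklearn.neural_network import MLPRegressor

def dml_theta(X, T, Y, seed=0):
    """X, T, Y: numpy arrays (embeddings, treatment, outcome)."""
    n = len(Y)
    idx = np.random.default_rng(seed).permutation(n)
    a, b = idx[: n // 2], idx[n // 2:]
    # Nuisance models fit on fold a, residualize on fold b.
    m_y = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500).fit(X[a], Y[a])
    m_t = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500).fit(X[a], T[a])
    ry = Y[b] - m_y.predict(X[b])                  # outcome residuals
    rt = T[b] - m_t.predict(X[b])                  # treatment residuals
    return float(np.dot(rt, ry) / np.dot(rt, rt))  # residual-on-residual slope
```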

Source: arXiv cs.AI

NLP/LLMs • Score 85

Empowering Small Language Models with Factual Hallucination-Aware Reasoning for Financial Classification

arXiv:2601.01378v1 Announce Type: new Abstract: Small language models (SLMs) are increasingly used for financial classification due to their fast inference and local deployability. However, compared with large language models, SLMs are more prone to factual hallucinations in reasoning and exhibit weaker classification performance. This raises a natural question: Can mitigating factual hallucinations improve SLMs' financial classification? To address this, we propose a three-step pipeline named AAAI (Association Identification, Automated Detection, and Adaptive Inference). Experiments on three representative SLMs reveal that: (1) factual hallucinations are positively correlated with misclassifications; (2) encoder-based verifiers effectively detect factual hallucinations; and (3) incorporating feedback on factual errors enables SLMs' adaptive inference that enhances classification performance. We hope this pipeline contributes to trustworthy and effective applications of SLMs in finance.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Theory Trace Card: Theory-Driven Socio-Cognitive Evaluation of LLMs

arXiv:2601.01878v1 Announce Type: new Abstract: Socio-cognitive benchmarks for large language models (LLMs) often fail to predict real-world behavior, even when models achieve high benchmark scores. Prior work has attributed this evaluation-deployment gap to problems of measurement and validity. While these critiques are insightful, we argue that they overlook a more fundamental issue: many socio-cognitive evaluations proceed without an explicit theoretical specification of the target capability, leaving the assumptions linking task performance to competence implicit. Without this theoretical grounding, benchmarks that exercise only narrow subsets of a capability are routinely misinterpreted as evidence of broad competence: a gap that creates a systemic validity illusion by masking the failure to evaluate the capability's other essential dimensions. To address this gap, we make two contributions. First, we diagnose and formalize this theory gap as a foundational failure that undermines measurement and enables systematic overgeneralization of benchmark results. Second, we introduce the Theory Trace Card (TTC), a lightweight documentation artifact designed to accompany socio-cognitive evaluations, which explicitly outlines the theoretical basis of an evaluation, the components of the target capability it exercises, its operationalization, and its limitations. We argue that TTCs enhance the interpretability and reuse of socio-cognitive evaluations by making explicit the full validity chain, which links theory, task operationalization, scoring, and limitations, without modifying benchmarks or requiring agreement on a single theory.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Admissibility Alignment

This article introduces Admissibility Alignment: a reformulation of AI alignment as a property of selecting admissible actions and decisions over outcome distributions under uncertainty, evaluated through the behavior of candidate policies.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Counterfactual Self-Questioning for Stable Policy Optimization in Language Models

arXiv:2601.00885v1 Announce Type: new Abstract: Recent work on language model self-improvement shows that models can refine their own reasoning through reflection, verification, debate, or self-generated rewards. However, most existing approaches rely on external critics, learned reward models, or ensemble sampling, which increases complexity and training instability. We propose Counterfactual Self-Questioning, a framework in which a single language model generates and evaluates counterfactual critiques of its own reasoning. The method produces an initial reasoning trace, formulates targeted questions that challenge potential failure points, and generates alternative reasoning trajectories that expose incorrect assumptions or invalid steps. These counterfactual trajectories provide structured relative feedback that can be directly used for policy optimization without auxiliary models. Experiments on multiple mathematical reasoning benchmarks show that counterfactual self-questioning improves accuracy and training stability, particularly for smaller models, enabling scalable self-improvement using internally generated supervision alone.

Source: arXiv cs.AI

NLP/LLMs • Score 85

ChaosBench-Logic: A Benchmark for Logical and Symbolic Reasoning on Chaotic Dynamical Systems

arXiv:2601.01982v1 Announce Type: new Abstract: Large language models (LLMs) excel at natural language tasks but remain brittle in domains requiring precise logical and symbolic reasoning. Chaotic dynamical systems provide an especially demanding test because chaos is deterministic yet often misinterpreted as randomness or complexity. We introduce ChaosBench-Logic, a benchmark that evaluates LLM reasoning across 30 diverse dynamical systems using a unified first-order logic (FOL) ontology. Each system is annotated with truth assignments for 11 semantic predicates, and 621 questions are generated across seven reasoning categories, including multi-hop implications, cross-system analogies, counterfactual reasoning, bias probes, and multi-turn dialogues. We define metrics for logical accuracy, implication consistency, dialogue coherence, and contradiction, and we release an open-source evaluation pipeline. Initial experiments show that frontier LLMs such as GPT-4, Claude 3.5 Sonnet, Gemini 2.5 Flash, and the open-source LLaMA-3 70B achieve 91-94% per-item accuracy, yet still score 0% on compositional items and exhibit fragile global coherence. Dialogue-level accuracy ranges from 53.1% (GPT-4 CoT) to 75.5% (LLaMA-3 zero-shot). ChaosBench-Logic provides a rigorous testbed for diagnosing such failures and a foundation for developing neuro-symbolic approaches that improve scientific reasoning in LLMs.
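
To give a flavor of the implication-consistency checks such a benchmark can run: given a system's truth assignment over semantic predicates, violations of known FOL rules can be flagged mechanically. The rules and predicate names below are invented examples, not the benchmark's actual ontology.

```python
# Toy consistency check in the spirit of ChaosBench-Logic: verify that a
# model's asserted predicates respect known implications.
RULES = [("is_chaotic", "is_deterministic"),       # chaos implies determinism
         ("has_strange_attractor", "is_dissipative")]

def implication_violations(answers):
    """answers: dict mapping predicate name -> bool as asserted by the model."""
    return [(p, q) for p, q in RULES
            if answers.get(p) is True and answers.get(q) is False]

model_answers = {"is_chaotic": True, "is_deterministic": False}
print(implication_violations(model_answers))       # -> [('is_chaotic', 'is_deterministic')]
```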

Source: arXiv cs.AI

NLP/LLMs • Score 85

Universal Conditional Logic: A Formal Language for Prompt Engineering

arXiv:2601.00880v1 Announce Type: new Abstract: We present Universal Conditional Logic (UCL), a mathematical framework for prompt optimization that transforms prompt engineering from heuristic practice into systematic optimization. Through systematic evaluation (N=305, 11 models, 4 iterations), we demonstrate significant token reduction (29.8%, t(10)=6.36, p < 0.001, Cohen's d = 2.01) with corresponding cost savings. UCL's structural overhead function O_s(A) explains version-specific performance differences through the Over-Specification Paradox: beyond threshold S* = 0.509, additional specification degrades performance quadratically. Core mechanisms -- indicator functions (I_i in {0,1}), structural overhead (O_s = gamma * sum(ln C_k)), early binding -- are validated. Notably, optimal UCL configuration varies by model architecture -- certain models (e.g., Llama 4 Scout) require version-specific adaptations (V4.1). This work establishes UCL as a calibratable framework for efficient LLM interaction, with model-family-specific optimization as a key research direction.
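
The overhead formula quoted in the abstract is simple to evaluate directly. The sketch below transcribes O_s = gamma * sum(ln C_k); the gamma value and clause complexities are made up, since the paper's calibration is not reproduced here.

```python
# Direct transcription of the structural overhead function from the abstract,
# O_s = gamma * sum_k ln(C_k); gamma and the C_k values are placeholders.
import math

def structural_overhead(clause_complexities, gamma=0.1):
    return gamma * sum(math.log(c) for c in clause_complexities)

O_s = structural_overhead([3, 5, 8], gamma=0.1)
print(f"O_s = {O_s:.3f}")  # compare against the reported threshold S* = 0.509
```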

Source: arXiv cs.AI

NLP/LLMs • Score 85

MindChat: A Privacy-preserving Large Language Model for Mental Health Support

arXiv:2601.01993v1 Announce Type: new Abstract: Large language models (LLMs) have shown promise for mental health support, yet training such models is constrained by the scarcity and sensitivity of real counseling dialogues. In this article, we present MindChat, a privacy-preserving LLM for mental health support, together with MindCorpus, a synthetic multi-turn counseling dataset constructed via a multi-agent role-playing framework. To synthesize high-quality counseling data, the developed dialogue-construction framework employs a dual closed-loop feedback design to integrate psychological expertise and counseling techniques through role-playing: (i) turn-level critique-and-revision to improve coherence and counseling appropriateness within a session, and (ii) session-level strategy refinement to progressively enrich counselor behaviors across sessions. To mitigate privacy risks under decentralized data ownership, we fine-tune the base model using federated learning with parameter-efficient LoRA adapters and incorporate differentially private optimization to reduce membership and memorization risks. Experiments on synthetic-data quality assessment and counseling capability evaluation show that MindCorpus improves training effectiveness and that MindChat is competitive with existing general and counseling-oriented LLM baselines under both automatic LLM-judge and human evaluation protocols, while exhibiting reduced privacy leakage under membership inference attacks.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning

arXiv:2601.00830v1 Announce Type: new Abstract: When AI systems explain their reasoning step-by-step, practitioners often assume these explanations reveal what actually influenced the AI's answer. We tested this assumption by embedding hints into questions and measuring whether models mentioned them. In a study of over 9,000 test cases across 11 leading AI models, we found a troubling pattern: models almost never mention hints spontaneously, yet when asked directly, they admit noticing them. This suggests models see influential information but choose not to report it. Telling models they are being watched does not help. Forcing models to report hints works, but causes them to report hints even when none exist and reduces their accuracy. We also found that hints appealing to user preferences are especially dangerous: models follow them most often while reporting them least. These findings suggest that simply watching AI reasoning is not enough to catch hidden influences.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Enhancing Temporal Awareness in LLMs for Temporal Point Processes

arXiv:2601.00845v1 Announce Type: new Abstract: Temporal point processes (TPPs) are crucial for analyzing events over time and are widely used in fields such as finance, healthcare, and social systems. These processes are particularly valuable for understanding how events unfold over time, accounting for their irregularity and dependencies. Despite the success of large language models (LLMs) in sequence modeling, applying them to temporal point processes remains challenging. A key issue is that current methods struggle to effectively capture the complex interaction between temporal information and semantic context, which is vital for accurate event modeling. In this context, we introduce TPP-TAL (Temporal Point Processes with Enhanced Temporal Awareness in LLMs), a novel plug-and-play framework designed to enhance temporal reasoning within LLMs. Rather than using the conventional method of simply concatenating event time and type embeddings, TPP-TAL explicitly aligns temporal dynamics with contextual semantics before feeding this information into the LLM. This alignment allows the model to better perceive temporal dependencies and long-range interactions between events and their surrounding contexts. Through comprehensive experiments on several benchmark datasets, it is shown that TPP-TAL delivers substantial improvements in temporal likelihood estimation and event prediction accuracy, highlighting the importance of enhancing temporal awareness in LLMs for continuous-time event modeling. The code is made available at https://github.com/chenlilil/TPP-TAL

Source: arXiv cs.AI

NLP/LLMs • Score 85

Cultural Encoding in Large Language Models: The Existence Gap in AI-Mediated Brand Discovery

arXiv:2601.00869v1 Announce Type: new Abstract: As artificial intelligence systems increasingly mediate consumer information discovery, brands face algorithmic invisibility. This study investigates Cultural Encoding in Large Language Models (LLMs) -- systematic differences in brand recommendations arising from training data composition. Analyzing 1,909 pure-English queries across 6 LLMs (GPT-4o, Claude, Gemini, Qwen3, DeepSeek, Doubao) and 30 brands, we find Chinese LLMs exhibit 30.6 percentage points higher brand mention rates than International LLMs (88.9% vs. 58.3%, p<.001). This disparity persists in identical English queries, indicating training data geography -- not language -- drives the effect. We introduce the Existence Gap: brands absent from LLM training corpora lack "existence" in AI responses regardless of quality. Through a case study of Zhizibianjie (OmniEdge), a collaboration platform with 65.6% mention rate in Chinese LLMs but 0% in International models (p<.001), we demonstrate how Linguistic Boundary Barriers create invisible market entry obstacles. Theoretically, we contribute the Data Moat Framework, conceptualizing AI-visible content as a VRIN strategic resource. We operationalize Algorithmic Omnipresence -- comprehensive brand visibility across LLM knowledge bases -- as the strategic objective for Generative Engine Optimization (GEO). Managerially, we provide an 18-month roadmap for brands to build Data Moats through semantic coverage, technical depth, and cultural localization. Our findings reveal that in AI-mediated markets, the limits of a brand's "Data Boundaries" define the limits of its "Market Frontiers."

Source: arXiv cs.AI

NLP/LLMs • Score 85

CogCanvas: Verbatim-Grounded Artifact Extraction for Long LLM Conversations

The abstract of arXiv:2601.00821v2 presents CogCanvas, a training-free framework that extracts verbatim-grounded artifacts from long conversations, surpassing traditional methods in precision. On benchmarks, CogCanvas achieves the highest accuracy among training-free methods, standing out on complex reasoning tasks.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Context Collapse: In-Context Learning and Model Collapse

This thesis investigates two key phenomena in large language models (LLMs): in-context learning (ICL) and model collapse. We study ICL in a linear transformer with tied weights trained on linear regression tasks, showing that minimizing the in-context loss leads to a phase transition in the learned parameters.

Source: arXiv cs.AI

NLP/LLMs • Score 85

CaveAgent: Transforming LLMs into Stateful Runtime Operators

arXiv:2601.01569v1 Announce Type: new Abstract: LLM-based agents are increasingly capable of complex task execution, yet current agentic systems remain constrained by text-centric paradigms. Traditional approaches rely on procedural JSON-based function calling, which often struggles with long-horizon tasks due to fragile multi-turn dependencies and context drift. In this paper, we present CaveAgent, a framework that transforms the paradigm from "LLM-as-Text-Generator" to "LLM-as-Runtime-Operator." We introduce a Dual-stream Context Architecture that decouples state management into a lightweight semantic stream for reasoning and a persistent, deterministic Python Runtime stream for execution. In addition to leveraging code generation to efficiently resolve interdependent sub-tasks (e.g., loops, conditionals) in a single step, we introduce Stateful Runtime Management in CaveAgent. Distinct from existing code-based approaches that remain text-bound and lack the support for external object injection and retrieval, CaveAgent injects, manipulates, and retrieves complex Python objects (e.g., DataFrames, database connections) that persist across turns. This persistence mechanism acts as a high-fidelity external memory to eliminate context drift, avoid catastrophic forgetting, while ensuring that processed data flows losslessly to downstream applications. Comprehensive evaluations on Tau^2-bench, BFCL and various case studies across representative SOTA LLMs demonstrate CaveAgent's superiority. Specifically, our framework achieves a 10.5% success rate improvement on retail tasks and reduces total token consumption by 28.4% in multi-turn scenarios. On data-intensive tasks, direct variable storage and retrieval reduces token consumption by 59%, allowing CaveAgent to handle large-scale data that causes context overflow failures in both JSON-based and Code-based agents.
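
The persistence property at the heart of the dual-stream design can be illustrated with a toy runtime whose namespace survives across turns. This is a deliberate simplification for intuition, not CaveAgent's implementation.

```python
# Minimal sketch of a persistent Python runtime stream: objects injected or
# created in one turn remain addressable in later turns.
class StatefulRuntime:
    def __init__(self):
        self.ns = {}                               # namespace that survives turns

    def inject(self, name, obj):
        """Host injects a live object (e.g., a DataFrame or DB connection)."""
        self.ns[name] = obj

    def run(self, code):
        """Execute one turn's generated code against the shared namespace."""
        exec(code, self.ns)

    def retrieve(self, name):
        return self.ns.get(name)

rt = StatefulRuntime()
rt.inject("orders", [("a", 3), ("b", 5)])          # stand-in for a DataFrame
rt.run("total = sum(q for _, q in orders)")        # turn 1: compute
rt.run("report = f'total={total}'")                # turn 2: reuse prior state
print(rt.retrieve("report"))                       # -> total=8
```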

Source: arXiv cs.AI

NLP/LLMs • Score 85

Temporal Attack Pattern Detection in Multi-Agent AI Workflows: An Open Framework for Training Trace-Based Security Models

arXiv:2601.00848v1 Announce Type: new Abstract: We present an openly documented methodology for fine-tuning language models to detect temporal attack patterns in multi-agent AI workflows using OpenTelemetry trace analysis. We curate a dataset of 80,851 examples from 18 public cybersecurity sources and 35,026 synthetic OpenTelemetry traces. We apply iterative QLoRA fine-tuning on resource-constrained ARM64 hardware (NVIDIA DGX Spark) through three training iterations with strategic augmentation. Our custom benchmark accuracy improves from 42.86% to 74.29%, a statistically significant 31.4-point gain. Targeted examples addressing specific knowledge gaps outperform indiscriminate scaling. Key contributions include: (1) synthetic trace generation methodology for multi-agent coordination attacks and regulatory violations, (2) empirical evidence that training data composition fundamentally determines behavior, and (3) complete open release of datasets, training scripts, and evaluation benchmarks on HuggingFace. While practical deployment requires human oversight due to false positive rates, this work establishes the first reproducible framework enabling practitioners to build custom agentic security models adapted to their threat landscapes.

Source: arXiv cs.AI

NLP/LLMs • Score 85

MathLedger: A Verifiable Learning Substrate with Ledger-Attested Feedback

Contemporary AI systems achieve extraordinary performance yet remain opaque and unverifiable, creating a trust crisis for safety-critical deployments. We present MathLedger, a substrate for verifiable machine cognition that integrates formal verification, cryptographic attestation, and learning dynamics into a single epistemic loop.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Decomposing LLM Self-Correction: The Accuracy-Correction Paradox and Error Depth Hypothesis

arXiv:2601.00828v1 Announce Type: new Abstract: Large Language Models (LLMs) are widely believed to possess self-correction capabilities, yet recent studies suggest that intrinsic self-correction--where models correct their own outputs without external feedback--remains largely ineffective. In this work, we systematically decompose self-correction into three distinct sub-capabilities: error detection, error localization, and error correction. Through cross-model experiments on GSM8K-Complex (n=500 per model, 346 total errors) with three major LLMs, we uncover a striking Accuracy-Correction Paradox: weaker models (GPT-3.5, 66% accuracy) achieve 1.6x higher intrinsic correction rates than stronger models (DeepSeek, 94% accuracy)--26.8% vs 16.7%. We propose the Error Depth Hypothesis: stronger models make fewer but deeper errors that resist self-correction. Error detection rates vary dramatically across architectures (10% to 82%), yet detection capability does not predict correction success--Claude detects only 10% of errors but corrects 29% intrinsically. Surprisingly, providing error location hints hurts all models. Our findings challenge linear assumptions about model capability and self-improvement, with important implications for the design of self-refinement pipelines.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Toward Auditable Neuro-Symbolic Reasoning in Pathology: SQL as an Explicit Trace of Evidence

arXiv:2601.01875v1 Announce Type: new Abstract: Automated pathology image analysis is central to clinical diagnosis, but clinicians still ask which slide features drive a model's decision and why. Vision-language models can produce natural language explanations, but these are often correlational and lack verifiable evidence. In this paper, we introduce an SQL-centered agentic framework that enables both feature measurement and reasoning to be auditable. Specifically, after extracting human-interpretable cellular features, Feature Reasoning Agents compose and execute SQL queries over feature tables to aggregate visual evidence into quantitative findings. A Knowledge Comparison Agent then evaluates these findings against established pathological knowledge, mirroring how pathologists justify diagnoses from measurable observations. Extensive experiments evaluated on two pathology visual question answering datasets demonstrate our method improves interpretability and decision traceability while producing executable SQL traces that link cellular measurements to diagnostic conclusions.
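
A minimal picture of what an executable SQL evidence trace looks like: cellular measurements sit in a relational table, and the "finding" is a query whose text can be inspected and re-run. The schema, values, and threshold below are invented for illustration.

```python
# Illustrative auditable SQL trace over an invented cell-feature schema;
# the query text itself is the evidence artifact linked to a diagnostic claim.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (slide_id TEXT, cell_type TEXT, area REAL)")
conn.executemany("INSERT INTO cells VALUES (?, ?, ?)",
                 [("s1", "lymphocyte", 40.0), ("s1", "tumor", 210.0),
                  ("s1", "tumor", 190.0)])

finding_sql = """
SELECT cell_type, COUNT(*) AS n, AVG(area) AS mean_area
FROM cells WHERE slide_id = 's1'
GROUP BY cell_type HAVING AVG(area) > 100
"""
print(conn.execute(finding_sql).fetchall())        # -> [('tumor', 2, 200.0)]
```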

Source: arXiv cs.AI

NLP/LLMs • Score 85

Comment on: Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task

The recently published work titled Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task, by Kosmyna et al. (2025), has generated intense debate about artificial intelligence (AI) and human performance. We congratulate Kosmyna et al. on this important research and on collecting a valuable dataset. We offer constructive comments to improve the manuscript's readiness for peer-reviewed publication.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Agentic AI for Autonomous, Explainable, and Real-Time Credit Risk Decision-Making

The significant digitalization of financial services has created an urgent demand for credit risk decision-making systems that are autonomous, transparent, and real-time. This article presents an Agentic AI framework in which AI agents assess credit dynamically, minimizing human intervention while improving the speed and transparency of decisions.

Source: arXiv cs.AI

NLP/LLMs • Score 85

ElecTwit: A Framework for Studying Persuasion in Multi-Agent Social Systems

arXiv:2601.00994v1 Announce Type: new Abstract: This paper introduces ElecTwit, a simulation framework designed to study persuasion within multi-agent systems, specifically emulating the interactions on social media platforms during a political election. By grounding our experiments in a realistic environment, we aimed to overcome the limitations of game-based simulations often used in prior research. We observed the comprehensive use of 25 specific persuasion techniques across most tested LLMs, encompassing a wider range than previously reported. The variations in technique usage and overall persuasion output between models highlight how different model architectures and training can impact the dynamics in realistic social simulations. Additionally, we observed unique phenomena such as "kernel of truth" messages and spontaneous developments with an "ink" obsession, where agents collectively demanded written proof. Our study provides a foundation for evaluating persuasive LLM agents in real-world contexts, ensuring alignment and preventing dangerous outcomes.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Bayesian Orchestration of Multi-LLM Agents for Cost-Aware Sequential Decision-Making

arXiv:2601.01522v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as autonomous decision agents in settings with asymmetric error costs: hiring (missed talent vs wasted interviews), medical triage (missed emergencies vs unnecessary escalation), and fraud detection (approved fraud vs declined legitimate payments). The dominant design queries a single LLM for a posterior over states, thresholds "confidence," and acts; we prove this is inadequate for sequential decisions with costs. We propose a Bayesian, cost-aware multi-LLM orchestration framework that treats LLMs as approximate likelihood models rather than classifiers. For each candidate state, we elicit likelihoods via contrastive prompting, aggregate across diverse models with robust statistics, and update beliefs with Bayes rule under explicit priors as new evidence arrives. This enables coherent belief updating, expected-cost action selection, principled information gathering via value of information, and fairness gains via ensemble bias mitigation. In resume screening with costs of 40000 USD per missed hire, 2500 USD per interview, and 150 USD per phone screen, experiments on 1000 resumes using five LLMs (GPT-4o, Claude 4.5 Sonnet, Gemini Pro, Grok, DeepSeek) reduce total cost by 294000 USD (34 percent) versus the best single-LLM baseline and improve demographic parity by 45 percent (max group gap 22 to 5 percentage points). Ablations attribute 51 percent of savings to multi-LLM aggregation, 43 percent to sequential updating, and 20 percent to disagreement-triggered information gathering, consistent with the theoretical benefits of correct probabilistic foundations.
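
The core loop (a Bayes update from elicited likelihoods, then expected-cost action selection) compresses to a few lines. The prior, likelihoods, and two-action cost matrix below are placeholders that echo the hiring costs quoted in the abstract, not the paper's calibrated values.

```python
# Compressed sketch of cost-aware Bayesian orchestration: LLM-elicited
# likelihoods update a belief over states; the min-expected-cost action wins.
import numpy as np

def bayes_update(prior, likelihoods):
    """prior, likelihoods: arrays over states; returns normalized posterior."""
    post = prior * likelihoods
    return post / post.sum()

# States: candidate is (strong hire, not a fit).
prior = np.array([0.3, 0.7])
llm_likelihoods = np.array([0.8, 0.3])             # aggregated P(evidence | state)
posterior = bayes_update(prior, llm_likelihoods)

# cost[action][state]: rows = (interview, reject).
cost = np.array([[2500.0, 2500.0],                 # interview always costs $2,500
                 [40000.0, 0.0]])                  # rejecting a strong hire: $40,000
expected = cost @ posterior
action = ["interview", "reject"][int(np.argmin(expected))]
print(posterior, expected, action)                 # interview wins here
```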

Source: arXiv cs.AI

NLP/LLMs • Score 85

OpenSocInt: A Multi-modal Training Environment for Human-Aware Social Navigation

arXiv:2601.01939v1 Announce Type: new Abstract: In this paper, we introduce OpenSocInt, an open-source software package providing a simulator for multi-modal social interactions and a modular architecture to train social agents. We describe the software package and showcase its use through an experimental protocol based on the task of social navigation. Our framework allows for exploring the use of different perceptual features, their encoding and fusion, as well as the use of different agents. The software is already publicly available under GPL at https://gitlab.inria.fr/robotlearn/OpenSocInt/.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Digital Twin AI: Opportunities and Challenges from Large Language Models to World Models

arXiv:2601.01321v1 Announce Type: new Abstract: Digital twins, as precise digital representations of physical systems, have evolved from passive simulation tools into intelligent and autonomous entities through the integration of artificial intelligence technologies. This paper presents a unified four-stage framework that systematically characterizes AI integration across the digital twin lifecycle, spanning modeling, mirroring, intervention, and autonomous management. By synthesizing existing technologies and practices, we distill a unified four-stage framework that systematically characterizes how AI methodologies are embedded across the digital twin lifecycle: (1) modeling the physical twin through physics-based and physics-informed AI approaches, (2) mirroring the physical system into a digital twin with real-time synchronization, (3) intervening in the physical twin through predictive modeling, anomaly detection, and optimization strategies, and (4) achieving autonomous management through large language models, foundation models, and intelligent agents. We analyze the synergy between physics-based modeling and data-driven learning, highlighting the shift from traditional numerical solvers to physics-informed and foundation models for physical systems. Furthermore, we examine how generative AI technologies, including large language models and generative world models, transform digital twins into proactive and self-improving cognitive systems capable of reasoning, communication, and creative scenario generation. Through a cross-domain review spanning eleven application domains, including healthcare, aerospace, smart manufacturing, robotics, and smart cities, we identify common challenges related to scalability, explainability, and trustworthiness, and outline directions for responsible AI-driven digital twin systems.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale

arXiv:2601.01330v1 Announce Type: new Abstract: Large Language Models (LLMs) have rapidly advanced, with Gemini-3-Pro setting a new performance milestone. In this work, we explore collective intelligence as an alternative to monolithic scaling, and demonstrate that open-source LLMs' collaboration can surpass Gemini-3-Pro. We first revisit LLM routing and aggregation at scale and identify three key bottlenecks: (1) current train-free routers are limited by a query-based paradigm focusing solely on textual similarity; (2) recent aggregation methods remain largely static, failing to select appropriate aggregators for different tasks; (3) the complementarity of routing and aggregation remains underutilized. To address these problems, we introduce JiSi, a novel framework designed to release the full potential of LLMs' collaboration through three innovations: (1) Query-Response Mixed Routing capturing both semantic information and problem difficulty; (2) Support-Set-based Aggregator Selection jointly evaluating the aggregation and domain capacity of aggregators; (3) Adaptive Routing-Aggregation Switch dynamically leveraging the advantages of routing and aggregation. Comprehensive experiments on nine benchmarks demonstrate that JiSi can surpass Gemini-3-Pro at only 47% of the cost by orchestrating ten open-source LLMs, while outperforming mainstream baselines. It suggests that collective intelligence represents a novel path towards Artificial General Intelligence (AGI).

Source: arXiv cs.AI

NLP/LLMs • Score 85

Aletheia: Quantifying Cognitive Conviction in Reasoning Models via Regularized Inverse Confusion Matrix

arXiv:2601.01532v1 Announce Type: new Abstract: In the progressive journey toward Artificial General Intelligence (AGI), current evaluation paradigms face an epistemological crisis. Static benchmarks measure knowledge breadth but fail to quantify the depth of belief. While Simhi et al. (2025) defined the CHOKE phenomenon in standard QA, we extend this framework to quantify "Cognitive Conviction" in System 2 reasoning models. We propose Project Aletheia, a cognitive physics framework that employs Tikhonov Regularization to invert the judge's confusion matrix. To validate this methodology without relying on opaque private data, we implement a Synthetic Proxy Protocol. Our preliminary pilot study on 2025 baselines (e.g., DeepSeek-R1, OpenAI o1) suggests that while reasoning models act as a "cognitive buffer," they may exhibit "Defensive OverThinking" under adversarial pressure. Furthermore, we introduce the Aligned Conviction Score (S_aligned) to verify that conviction does not compromise safety. This work serves as a blueprint for measuring AI scientific integrity.

Source: arXiv cs.AI

NLP/LLMs • Score 85

AI Agent Systems: Architectures, Applications, and Evaluation

arXiv:2601.01743v1 Announce Type: new Abstract: AI agents -- systems that combine foundation models with reasoning, planning, memory, and tool use -- are rapidly becoming a practical interface between natural-language intent and real-world computation. This survey synthesizes the emerging landscape of AI agent architectures across: (i) deliberation and reasoning (e.g., chain-of-thought-style decomposition, self-reflection and verification, and constraint-aware decision making), (ii) planning and control (from reactive policies to hierarchical and multi-step planners), and (iii) tool calling and environment interaction (retrieval, code execution, APIs, and multimodal perception). We organize prior work into a unified taxonomy spanning agent components (policy/LLM core, memory, world models, planners, tool routers, and critics), orchestration patterns (single-agent vs. multi-agent; centralized vs. decentralized coordination), and deployment settings (offline analysis vs. online interactive assistance; safety-critical vs. open-ended tasks). We discuss key design trade-offs -- latency vs. accuracy, autonomy vs. controllability, and capability vs. reliability -- and highlight how evaluation is complicated by non-determinism, long-horizon credit assignment, tool and environment variability, and hidden costs such as retries and context growth. Finally, we summarize measurement and benchmarking practices (task suites, human preference and utility metrics, success under constraints, robustness and security) and identify open challenges including verification and guardrails for tool actions, scalable memory and context management, interpretability of agent decisions, and reproducible evaluation under realistic workloads.

Source: arXiv cs.AI

NLP/LLMs • Score 90

Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement

arXiv:2601.01562v1 Announce Type: new Abstract: We present Logics-STEM, a state-of-the-art reasoning model fine-tuned on Logics-STEM-SFT-Dataset, a high-quality and diverse dataset at 10M scale that represents one of the largest-scale open-source long chain-of-thought corpora. Logics-STEM targets reasoning tasks in the domains of Science, Technology, Engineering, and Mathematics (STEM), and exhibits exceptional performance on STEM-related benchmarks with an average improvement of 4.68% over the next-best model at 8B scale. We attribute the gains to our data-algorithm co-design engine, where they are jointly optimized to fit a gold-standard distribution behind reasoning. Data-wise, the Logics-STEM-SFT-Dataset is constructed from a meticulously designed data curation engine with 5 stages to ensure the quality, diversity, and scalability, including annotation, deduplication, decontamination, distillation, and stratified sampling. Algorithm-wise, our failure-driven post-training framework leverages targeted knowledge retrieval and data synthesis around model failure regions in the Supervised Fine-tuning (SFT) stage to effectively guide the second-stage SFT or the reinforcement learning (RL) for better fitting the target distribution. The superior empirical performance of Logics-STEM reveals the vast potential of combining large-scale open-source data with carefully designed synthetic data, underscoring the critical role of data-algorithm co-design in enhancing reasoning capabilities through post-training. We make both the Logics-STEM models (8B and 32B) and the Logics-STEM-SFT-Dataset (10M and downsampled 2.2M versions) publicly available to support future research in the open-source community.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Structured Decomposition for LLM Reasoning: Cross-Domain Validation and Semantic Web Integration

arXiv:2601.01609v1 Announce Type: new Abstract: Rule-based reasoning over natural language input arises in domains where decisions must be auditable and justifiable: clinical protocols specify eligibility criteria in prose, evidence rules define admissibility through textual conditions, and scientific standards dictate methodological requirements. Applying rules to such inputs demands both interpretive flexibility and formal guarantees. Large language models (LLMs) provide flexibility but cannot ensure consistent rule application; symbolic systems provide guarantees but require structured input. This paper presents an integration pattern that combines these strengths: LLMs serve as ontology population engines, translating unstructured text into ABox assertions according to expert-authored TBox specifications, while SWRL-based reasoners apply rules with deterministic guarantees. The framework decomposes reasoning into entity identification, assertion extraction, and symbolic verification, with task definitions grounded in OWL 2 ontologies. Experiments across three domains (legal hearsay determination, scientific method-task application, clinical trial eligibility) and eleven language models validate the approach. Structured decomposition achieves statistically significant improvements over few-shot prompting in aggregate, with gains observed across all three domains. An ablation study confirms that symbolic verification provides substantial benefit beyond structured prompting alone. The populated ABox integrates with standard semantic web tooling for inspection and querying, positioning the framework for richer inference patterns that simpler formalisms cannot express.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications

arXiv:2601.01718v1 Announce Type: new Abstract: We introduce Yuan3.0 Flash, an open-source Mixture-of-Experts (MoE) MultiModal Large Language Model featuring 3.7B activated parameters and 40B total parameters, specifically designed to enhance performance on enterprise-oriented tasks while maintaining competitive capabilities on general-purpose tasks. To address the overthinking phenomenon commonly observed in Large Reasoning Models (LRMs), we propose Reflection-aware Adaptive Policy Optimization (RAPO), a novel RL training algorithm that effectively regulates overthinking behaviors. In enterprise-oriented tasks such as retrieval-augmented generation (RAG), complex table understanding, and summarization, Yuan3.0 Flash consistently achieves superior performance. Moreover, it also demonstrates strong reasoning capabilities in domains such as mathematics, science, etc., attaining accuracy comparable to frontier models while requiring only approximately 1/4 to 1/2 of the average tokens. Yuan3.0 Flash has been fully open-sourced to facilitate further research and real-world deployment: https://github.com/Yuan-lab-LLM/Yuan3.0.

Source: arXiv cs.AI

NLP/LLMs • Score 85

Jenius Agent: Towards Experience-Driven Accuracy Optimization in Real-World Scenarios

arXiv:2601.01857v1 Announce Type: new Abstract: As agent systems powered by large language models (LLMs) advance, improving the task performance of an autonomous agent, especially in context understanding, tool usage, and response generation, has become increasingly critical. Although prior studies have advanced the overall design of LLM-based agents, systematic optimization of their internal reasoning and tool-use pipelines remains underexplored. This paper introduces an agent framework grounded in real-world practical experience, with three key innovations: (1) an adaptive prompt generation strategy that aligns with the agent's state and task goals to improve reliability and robustness; (2) a context-aware tool orchestration module that performs tool categorization, semantic retrieval, and adaptive invocation based on user intent and context; and (3) a layered memory mechanism that integrates session memory, task history, and external summaries to improve relevance and efficiency through dynamic summarization and compression. An end-to-end framework named Jenius-Agent has been integrated with three key optimizations, including tools based on the Model Context Protocol (MCP), file input/output (I/O), and execution feedback. The experiments show a 20 percent improvement in task accuracy, along with a reduced token cost, response latency, and invocation failures. The framework is already deployed in Jenius (https://www.jenius.cn), providing a lightweight and scalable solution for robust, protocol-compatible autonomous agents.

Source: arXiv cs.AI

NLP/LLMs • Score 95

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

arXiv:2601.00264v1 Announce Type: new Abstract: Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts. Technical validation demonstrates that this enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment. S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.

Source: arXiv cs.CV

NLP/LLMs • Score 96

Trajectory Guard -- A Lightweight, Sequence-Aware Model for Real-Time Anomaly Detection in Agentic AI

Autonomous LLM agents generate multi-step action plans that can fail due to contextual misalignment or structural incoherence. Existing anomaly detection methods are not suited to this challenge. We present Trajectory Guard, a Siamese Recurrent Autoencoder that learns task-trajectory alignment, enabling unified detection of incorrect plans and malformed plan structures.

Source: arXiv cs.LG

NLP/LLMs • Score 95

A Chain-of-Thought Approach to Semantic Query Categorization in e-Commerce Taxonomies

arXiv:2601.00510v1 Announce Type: cross Abstract: Search in e-Commerce is powered at the core by a structured representation of the inventory, often formulated as a category taxonomy. An important capability in e-Commerce with hierarchical taxonomies is to select a set of relevant leaf categories that are semantically aligned with a given user query. In this scope, we address a fundamental problem of search query categorization in real-world e-Commerce taxonomies. A correct categorization of a query not only provides a way to zoom into the correct inventory space, but opens the door to multiple intent understanding capabilities for a query. A practical and accurate solution to this problem has many applications in e-commerce, including constraining retrieved items and improving the relevance of the search results. For this task, we explore a novel Chain-of-Thought (CoT) paradigm that combines simple tree-search with LLM semantic scoring. Assessing its classification performance on human-judged query-category pairs, relevance tests, and LLM-based reference methods, we find that the CoT approach performs better than a benchmark that uses embedding-based query category predictions. We show how the CoT approach can detect problems within a hierarchical taxonomy. Finally, we also propose LLM-based approaches for query-categorization of the same spirit, but which scale better at the range of millions of queries.
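
The tree-search-plus-scoring recipe can be sketched as a recursive descent in which an LLM relevance call gates expansion of each taxonomy node. The `llm_score` function below is a stub for that call, and the toy taxonomy is invented; the paper's actual CoT prompting is richer than a single score.

```python
# Schematic tree search + semantic scoring for query categorization;
# `llm_score` stands in for an LLM relevance judgment at each node.
def llm_score(query, category):
    """Stub: return a 0-1 relevance score from an LLM judgment."""
    return 1.0 if category.lower() in query.lower() else 0.2

def categorize(query, node, threshold=0.5, path=()):
    """Descend the taxonomy, expanding only children the LLM scores highly."""
    name, children = node
    if llm_score(query, name) < threshold and path:
        return []                                  # prune this subtree
    if not children:
        return [path + (name,)]                    # relevant leaf category
    leaves = []
    for child in children:
        leaves += categorize(query, child, threshold, path + (name,))
    return leaves

taxonomy = ("root", [("shoes", [("running", []), ("boots", [])]),
                     ("electronics", [("phones", [])])])
print(categorize("running shoes for trail", taxonomy))
```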

Source: arXiv cs.CL

NLP/LLMs • Score 95

Knowledge Distillation for Temporal Knowledge Graph Reasoning with Large Language Models

arXiv:2601.00202v1 Announce Type: new Abstract: Reasoning over temporal knowledge graphs (TKGs) is fundamental to improving the efficiency and reliability of intelligent decision-making systems and has become a key technological foundation for future artificial intelligence applications. Despite recent progress, existing TKG reasoning models typically rely on large parameter sizes and intensive computation, leading to high hardware costs and energy consumption. These constraints hinder their deployment on resource-constrained, low-power, and distributed platforms that require real-time inference. Moreover, most existing model compression and distillation techniques are designed for static knowledge graphs and fail to adequately capture the temporal dependencies inherent in TKGs, often resulting in degraded reasoning performance. To address these challenges, we propose a distillation framework specifically tailored for temporal knowledge graph reasoning. Our approach leverages large language models as teacher models to guide the distillation process, enabling effective transfer of both structural and temporal reasoning capabilities to lightweight student models. By integrating large-scale public knowledge with task-specific temporal information, the proposed framework enhances the student model's ability to model temporal dynamics while maintaining a compact and efficient architecture. Extensive experiments on multiple publicly available benchmark datasets demonstrate that our method consistently outperforms strong baselines, achieving a favorable trade-off between reasoning accuracy, computational efficiency, and practical deployability.
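
The abstract does not give the exact objective, but the standard soft-target distillation loss conveys the mechanism: the teacher's distribution over candidate entities is blended with the hard-label loss when training the compact student. Shapes and values below are placeholders.

```python
# Generic response-level knowledge distillation loss (soft teacher targets
# at temperature T plus hard-label cross-entropy); not the paper's exact loss.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(4, 100)                            # student scores over 100 entities
t = torch.randn(4, 100)                            # teacher (LLM-derived) scores
y = torch.randint(0, 100, (4,))                    # ground-truth entity ids
print(kd_loss(s, t, y))
```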

Source: arXiv cs.CL

NLP/LLMs • Score 95

CPPO: Contrastive Perception for Vision Language Policy Optimization

arXiv:2601.00501v1 Announce Type: new Abstract: We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult, requiring extra LLMs, ground-truth data, forced separation of perception from reasoning by policy model, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods, while avoiding extra models, making training more efficient and scalable.
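
One way to picture the detection signal: compare per-token output entropies between clean and perturbed inputs and flag tokens with large shifts as perception tokens. The distributions below are synthetic stand-ins for real model outputs, and the threshold is arbitrary.

```python
# Toy illustration of entropy-shift detection of perception tokens;
# real use would compare two full model runs on clean vs. perturbed images.
import numpy as np

def entropy(p, eps=1e-12):
    return float(-(p * np.log(p + eps)).sum())

def perception_tokens(probs_clean, probs_perturbed, delta=0.5):
    """probs_*: per-token next-token distributions, shape (seq, vocab)."""
    shifts = [abs(entropy(c) - entropy(p))
              for c, p in zip(probs_clean, probs_perturbed)]
    return [i for i, s in enumerate(shifts) if s > delta]

rng = np.random.default_rng(1)
clean = rng.dirichlet(np.ones(50) * 5.0, size=6)   # fairly flat distributions
pert = clean.copy()
pert[2] = rng.dirichlet(np.ones(50) * 0.05)        # token 2 reacts to perturbation
print(perception_tokens(clean, pert))              # likely -> [2]
```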

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Focal-RegionFace: Generating Fine-Grained Multi-attribute Descriptions for Arbitrarily Selected Face Focal Regions

arXiv:2601.00156v1 Announce Type: new Abstract: In this paper, we introduce an underexplored problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, containing facial action units (AUs), emotional states, and age estimation, for arbitrarily selected face regions (termed FaceFocalDesc). We argue that the system's ability to focus on individual facial areas leads to better understanding and control. To achieve this capability, we construct a new multi-attribute description dataset for arbitrarily selected face regions, providing rich region-level annotations and natural language descriptions. Further, we propose a fine-tuned vision-language model based on Qwen2.5-VL, called Focal-RegionFace, for facial state analysis, which incrementally refines its focus on localized facial features through multiple progressive fine-tuning stages, resulting in interpretable age estimation, FAU and emotion detection. Experimental results show that Focal-RegionFace achieves the best performance on the new benchmark in terms of traditional and widely used metrics, as well as newly proposed metrics. This fully verifies its effectiveness and versatility in fine-grained multi-attribute face region-focal analysis scenarios.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark

arXiv:2601.00092v1 Announce Type: new Abstract: 4D spatial intelligence involves perceiving and processing how objects move or change over time. Humans naturally possess 4D spatial intelligence, supporting a broad spectrum of spatial reasoning abilities. To what extent can Multimodal Large Language Models (MLLMs) achieve human-level 4D spatial intelligence? In this work, we present Spatial4D-Bench, a versatile 4D spatial intelligence benchmark designed to comprehensively assess the 4D spatial reasoning abilities of MLLMs. Unlike existing spatial intelligence benchmarks that are often small-scale or limited in diversity, Spatial4D-Bench provides a large-scale, multi-task evaluation benchmark consisting of ~40,000 question-answer pairs covering 18 well-defined tasks. We systematically organize these tasks into six cognitive categories: object understanding, scene understanding, spatial relationship understanding, spatiotemporal relationship understanding, spatial reasoning and spatiotemporal reasoning. Spatial4D-Bench thereby offers a structured and comprehensive benchmark for evaluating the spatial cognition abilities of MLLMs, covering a broad spectrum of tasks that parallel the versatility of human spatial intelligence. We benchmark various state-of-the-art open-source and proprietary MLLMs on Spatial4D-Bench and reveal their substantial limitations in a wide variety of 4D spatial reasoning aspects, such as route planning, action recognition, and physical plausibility reasoning. We hope that the findings provided in this work offer valuable insights to the community and that our benchmark can facilitate the development of more capable MLLMs toward human-level 4D spatial intelligence. More resources can be found on our project page.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

FCMBench: A Comprehensive Multimodal Financial Credit Benchmark for Real-World Applications

As multimodal AI becomes widely used for credit risk assessment and document review, a domain-specific benchmark is urgently needed. We present FCMBench-V1.0, a large-scale multimodal financial credit benchmark covering 18 core certificate types, with 4,043 privacy-compliant images and 8,446 QA samples.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

IntraStyler: Exemplar-based Style Synthesis for Cross-modality Domain Adaptation

arXiv:2601.00212v1 Announce Type: new Abstract: Image-level domain alignment is the de facto approach for unsupervised domain adaptation, where unpaired image translation is used to minimize the domain gap. Prior studies mainly focus on the domain shift between the source and target domains, whereas the intra-domain variability remains under-explored. To address the latter, an effective strategy is to diversify the styles of the synthetic target domain data during image translation. However, previous methods typically require intra-domain variations to be pre-specified for style synthesis, which may be impractical. In this paper, we propose an exemplar-based style synthesis method named IntraStyler, which can capture diverse intra-domain styles without any prior knowledge. Specifically, IntraStyler uses an exemplar image to guide the style synthesis such that the output style matches the exemplar style. To extract the style-only features, we introduce a style encoder to learn styles discriminatively based on contrastive learning. We evaluate the proposed method on the largest public dataset for cross-modality domain adaptation, CrossMoDA 2023. Our experiments show the efficacy of our method in controllable style synthesis and the benefits of diverse synthetic data for downstream segmentation. Code is available at https://github.com/han-liu/IntraStyler.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers

arXiv:2601.00359v1 Announce Type: new Abstract: In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer - an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text-based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time requirements, operating at 26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin. Additionally, we show qualitative results that highlight the effectiveness and possible use cases in real-world applications. Overall, our method serves as a drop-in replacement for traditional segmentation approaches while enabling flexible natural-language querying and seamless integration into 3D mapping pipelines for mobile robotics.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

FlashInfer-Bench: Building the Virtuous Cycle for AI-Driven LLM Systems

Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI-generated kernels into real-world inference systems remains a challenge. FlashInfer-Bench addresses this gap by establishing a standardized, closed-loop framework that connects kernel generation, benchmarking, and deployment.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

FreeText: Training-Free Text Rendering in Diffusion Transformers via Attention Localization and Spectral Glyph Injection

arXiv:2601.00535v1 Announce Type: new Abstract: Large-scale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically require costly retraining or rigid external layout constraints, which can degrade aesthetics and limit flexibility. We propose \textbf{FreeText}, a training-free, plug-and-play framework that improves text rendering by exploiting intrinsic mechanisms of \emph{Diffusion Transformer (DiT)} models. \textbf{FreeText} decomposes the problem into \emph{where to write} and \emph{what to write}. For \emph{where to write}, we localize writing regions by reading token-wise spatial attribution from endogenous image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For \emph{what to write}, we introduce Spectral-Modulated Glyph Injection (SGMI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with modest inference overhead.

Fonte: arXiv cs.CV

NLP/LLMs • Score 92

Attention to Detail: Global-Local Attention for High-Resolution AI-Generated Image Detection

arXiv:2601.00141v1 Announce Type: new Abstract: The rapid development of generative AI has made AI-generated images increasingly realistic and high-resolution. Most AI-generated image detection architectures typically downsample images before inputting them into models, risking the loss of fine-grained details. This paper presents GLASS (Global-Local Attention with Stratified Sampling), an architecture that combines a globally resized view with multiple randomly sampled local crops. These crops are original-resolution regions efficiently selected through spatially stratified sampling and aggregated using attention-based scoring. GLASS can be integrated into vision models to leverage both global and local information in images of any size. Vision Transformer, ResNet, and ConvNeXt models are used as backbones, and experiments show that GLASS outperforms standard transfer learning by achieving higher predictive performance within feasible computational constraints.
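
A minimal sketch of what spatially stratified crop sampling could look like, assuming a regular grid of strata with one random fixed-size crop drawn per stratum; the grid size, crop size, and function names here are our own choices, not GLASS's.

    # Split the image into a grid of strata and draw one random crop per stratum,
    # so local crops cover the full-resolution image without clustering.
    import random

    def stratified_crops(width, height, crop=224, grid=(2, 2), seed=0):
        rng = random.Random(seed)
        boxes = []
        cw, ch = width // grid[0], height // grid[1]
        for gx in range(grid[0]):
            for gy in range(grid[1]):
                x0 = gx * cw + rng.randint(0, max(0, cw - crop))
                y0 = gy * ch + rng.randint(0, max(0, ch - crop))
                boxes.append((x0, y0, x0 + crop, y0 + crop))
        return boxes  # each box is a full-resolution region for the backbone

    print(stratified_crops(4000, 3000))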

Fonte: arXiv cs.CV

NLP/LLMs • Score 94

Explicit Abstention Controls for Predictable Reliability in Video Question Answering

Deploying vision-language models (VLMs) in safety-critical settings requires selective prediction, where systems abstain when uncertain to avoid costly errors. We investigate whether confidence-based abstention offers reliable control over error rates in video question answering and whether this control remains robust under distribution shift.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

ABFR-KAN: Kolmogorov-Arnold Networks for Functional Brain Analysis

arXiv:2601.00416v1 Announce Type: new Abstract: Functional connectivity (FC) analysis, a valuable tool for computer-aided brain disorder diagnosis, traditionally relies on atlas-based parcellation. However, issues relating to selection bias and a lack of regard for subject specificity can arise as a result of such parcellations. Addressing this, we propose ABFR-KAN, a transformer-based classification network that incorporates novel advanced brain function representation components with the power of Kolmogorov-Arnold Networks (KANs) to mitigate structural bias, improve anatomical conformity, and enhance the reliability of FC estimation. Extensive experiments on the ABIDE I dataset, including cross-site evaluation and ablation studies across varying model backbones and KAN configurations, demonstrate that ABFR-KAN consistently outperforms state-of-the-art baselines for autism spectrum disorder (ASD) classification. Our code is available at https://github.com/tbwa233/ABFR-KAN.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

RoLID-11K: A Dashcam Dataset for Small-Object Roadside Litter Detection

arXiv:2601.00398v1 Announce Type: new Abstract: Roadside litter poses environmental, safety and economic challenges, yet current monitoring relies on labour-intensive surveys and public reporting, providing limited spatial coverage. Existing vision datasets for litter detection focus on street-level still images, aerial scenes or aquatic environments, and do not reflect the unique characteristics of dashcam footage, where litter appears extremely small, sparse and embedded in cluttered road-verge backgrounds. We introduce RoLID-11K, the first large-scale dataset for roadside litter detection from dashcams, comprising over 11k annotated images spanning diverse UK driving conditions and exhibiting pronounced long-tail and small-object distributions. We benchmark a broad spectrum of modern detectors, from accuracy-oriented transformer architectures to real-time YOLO models, and analyse their strengths and limitations on this challenging task. Our results show that while CO-DETR and related transformers achieve the best localisation accuracy, real-time models remain constrained by coarse feature hierarchies. RoLID-11K establishes a challenging benchmark for extreme small-object detection in dynamic driving scenes and aims to support the development of scalable, low-cost systems for roadside-litter monitoring. The dataset is available at https://github.com/xq141839/RoLID-11K.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning

arXiv:2601.00791v1 Announce Type: cross Abstract: We present a training-free method for detecting valid mathematical reasoning in large language models through spectral analysis of attention patterns. By treating attention matrices as adjacency matrices of dynamic graphs over tokens, we extract four interpretable spectral diagnostics: the Fiedler value (algebraic connectivity), the high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy. These diagnostics exhibit statistically significant differences between valid and invalid mathematical proofs. Experiments across seven transformer models from four independent architectural families (Meta Llama, Alibaba Qwen, Microsoft Phi, and Mistral AI) demonstrate that this spectral signature produces effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling 85.0--95.6\% classification accuracy under rigorous evaluation, with calibrated thresholds reaching 93--95\% on the full dataset. The method requires no training data, fine-tuning, or learned classifiers: a single threshold on a spectral metric suffices for high accuracy. Through systematic label correction, we discover that the spectral method detects logical coherence rather than compiler acceptance, identifying mathematically valid proofs that formal verifiers reject due to technical failures. We further identify an architectural dependency: Mistral-7B's Sliding Window Attention shifts the discriminative signal from HFER to late-layer Smoothness ($d = 2.09$, $p_{\text{MW}} = 1.16 \times 10^{-48}$), revealing that attention mechanism design affects which spectral features capture reasoning validity. These findings establish spectral graph analysis as a principled framework for reasoning verification with immediate applications to hallucination detection and AI safety monitoring.
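
The diagnostics themselves are standard spectral-graph quantities, so a small sketch can make them concrete; the choice of token signal used for HFER and smoothness below is our assumption, not the paper's.

    # Treat a symmetrized attention matrix as a weighted adjacency matrix and
    # read off the four spectral diagnostics named in the abstract.
    import numpy as np

    def spectral_diagnostics(attn: np.ndarray, hf_fraction: float = 0.5):
        A = (attn + attn.T) / 2                  # symmetrize attention as a graph
        L = np.diag(A.sum(axis=1)) - A           # combinatorial graph Laplacian
        evals, evecs = np.linalg.eigh(L)
        fiedler = evals[1]                       # algebraic connectivity

        p = evals / evals.sum()                  # normalized spectrum
        p = p[p > 0]
        spectral_entropy = float(-(p * np.log(p)).sum())

        # High-frequency energy ratio of a token signal x; a real system would
        # use hidden states, here row sums serve as a simple stand-in.
        x = A.sum(axis=1)
        coeffs = evecs.T @ x                     # graph Fourier transform of x
        k = int(len(evals) * (1 - hf_fraction))
        hfer = float((coeffs[k:] ** 2).sum() / (coeffs ** 2).sum())
        smoothness = float(x @ L @ x)            # graph signal smoothness
        return fiedler, spectral_entropy, hfer, smoothness

    rng = np.random.default_rng(0)
    attn = rng.random((8, 8)); attn /= attn.sum(axis=1, keepdims=True)
    print(spectral_diagnostics(attn))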

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

The Agentic Leash: Extracting Causal-Feedback Fuzzy Cognitive Maps with LLMs

We develop a large language model (LLM) agent that extracts causal-feedback fuzzy cognitive maps (FCMs) from raw text. The causal learning or extraction process is agentic both through the semi-autonomy of the LLM and through the dynamics of the FCM system, which guides the LLM agents in seeking out and processing causal text.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Bio-Inspired Agentic Self-Healing Framework for Resilient Distributed Computing Systems

This paper presents ReCiSt, a bio-inspired self-healing framework designed to achieve resilience in Distributed Computing Systems (DCCS). ReCiSt recasts biological phases as computational layers that perform autonomous fault isolation, causal diagnosis, adaptive recovery, and knowledge consolidation using Language Model (LM)-driven agents.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-Based Multilingual Counterfactual Example Generation

Counterfactuals are minimally edited inputs that flip a model's prediction, serving as a promising approach for explaining model behavior. This study investigates the effectiveness of LLMs at generating multilingual counterfactuals, comparing directly generated counterfactuals with those derived from English translation.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

StockBot 2.0: Vanilla LSTMs Outperform Transformer-based Forecasting for Stock Prices

arXiv:2601.00197v1 Announce Type: cross Abstract: Accurate forecasting of financial markets remains a long-standing challenge due to complex temporal and often latent dependencies, non-linear dynamics, and high volatility. Building on our earlier recurrent neural network framework, we present an enhanced StockBot architecture that systematically evaluates modern attention-based, convolutional, and recurrent time-series forecasting models within a unified experimental setting. While attention-based and transformer-inspired models offer increased modeling flexibility, extensive empirical evaluation reveals that a carefully constructed vanilla LSTM consistently achieves superior predictive accuracy and more stable buy/sell decision-making when trained under a common set of default hyperparameters. These results highlight the robustness and data efficiency of recurrent sequence models for financial time-series forecasting, particularly in the absence of extensive hyperparameter tuning or the availability of sufficient data when discretized to single-day intervals. Additionally, these results underscore the importance of architectural inductive bias in data-limited market prediction tasks.
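
For readers who want the baseline's shape, here is a minimal vanilla LSTM forecaster in PyTorch; the hidden size, lookback window, and single-feature input are illustrative defaults of ours, not StockBot's actual configuration.

    # A plain single-layer LSTM that maps a lookback window to a next-step value.
    import torch
    import torch.nn as nn

    class VanillaLSTM(nn.Module):
        def __init__(self, n_features: int = 1, hidden: int = 64):
            super().__init__()
            self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)     # next-day price prediction

        def forward(self, x):                    # x: (batch, window, n_features)
            out, _ = self.lstm(x)
            return self.head(out[:, -1])         # use only the last time step

    model = VanillaLSTM()
    window = torch.randn(32, 30, 1)              # 32 samples, 30-day lookback
    print(model(window).shape)                   # -> torch.Size([32, 1])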

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Do LLM-Powered Agents Tend to Be Biased Against Humans? Exploring Belief-Dependent Vulnerability

LLM-powered agents can exhibit not only demographic bias but also intergroup bias triggered by minimal 'us' versus 'them' cues. This study investigates how an agent's belief about the presence of humans can influence its behavior, introducing a new attack vector called the Belief Poisoning Attack (BPA).

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Democratizing Electronic-Photonic AI Systems: An Open-Source, AI-Infused Co-Design and Design Automation Toolflow

Photonics is becoming a foundational technology for high-performance AI systems and scientific computing, offering unmatched speed, parallelism, and energy efficiency. However, designing and implementing electronic-photonic AI systems remains challenging due to a steep learning curve. We present a multi-layer co-design and design automation framework to democratize the development of photonic AI systems.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Reasoning in Action: MCTS-Guided Knowledge Retrieval for Large Language Models

Large language models (LLMs) typically improve their performance by retrieving semantically similar information or by strengthening their reasoning capabilities. This paper presents a reasoning-aware knowledge retrieval method that enriches LLMs with information aligned to the logical structure of conversations, going beyond surface-level semantic similarity.

Fonte: arXiv cs.AI

NLP/LLMs • Score 92

Probabilistic Guarantees for Reducing Contextual Hallucinations in LLMs

arXiv:2601.00641v1 Announce Type: new Abstract: Large language models (LLMs) frequently produce contextual hallucinations, where generated content contradicts or ignores information explicitly stated in the prompt. Such errors are particularly problematic in deterministic automation workflows, where inputs are fixed and correctness is unambiguous. We introduce a simple and model-agnostic framework that provides explicit probabilistic guarantees for reducing hallucinations in this setting. We formalize the notion of a specific task, defined by a fixed input and a deterministic correctness criterion, and show that issuing the same prompt in independent context windows yields an exponential reduction in the probability that all model outputs are incorrect. To identify a correct answer among repeated runs, we incorporate an LLM-as-a-judge and prove that the probability that the judged pipeline fails decays at a rate determined by the judge's true- and false-positive probabilities. When the judge is imperfect, we strengthen it through majority vote over independent judge calls, obtaining ensemble-level error rates that decrease exponentially in the number of votes. This yields an explicit bound on the probability that the pipeline selects a hallucinated answer. Experiments on controlled extraction tasks with synthetic noisy judges match these predictions exactly: pipeline failure decreases exponentially with the number of repetitions, and hallucination-selection decreases exponentially with the number of judges in the ensemble. Together, these results provide a lightweight, modular, and theoretically grounded method for driving hallucination probabilities arbitrarily low in fixed-input LLM workflows, without modifying model weights, decoding strategies, or prompt engineering.
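
The core argument is simple enough to check numerically; the probabilities below are illustrative values of ours, not the paper's measurements.

    # If one run is wrong with probability p, n independent runs are all wrong
    # with probability p**n; a majority vote over m independent judges, each
    # wrong with probability q, errs with the binomial tail below.
    from math import comb

    def all_runs_fail(p: float, n: int) -> float:
        return p ** n

    def majority_vote_error(q: float, m: int) -> float:
        k = m // 2 + 1                    # votes needed for a wrong majority
        return sum(comb(m, j) * q**j * (1 - q)**(m - j) for j in range(k, m + 1))

    print([all_runs_fail(0.2, n) for n in (1, 3, 5, 10)])  # 0.2, 8e-3, 3.2e-4, ~1e-7
    print(majority_vote_error(0.3, 5))                     # ~0.163 < 0.3 for one judge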

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Do LLM Chatbots Talk Too Much? The YapBench Benchmark

Large Language Models (LLMs) such as ChatGPT, Claude, and Gemini increasingly act as general-purpose copilots, but they often respond at excessive length to simple requests, increasing cognitive load and inflating token-based inference costs. We present YapBench, a lightweight benchmark for quantifying user-visible overgeneration on prompts where brevity is ideal.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Memory Bank Compression for Continual Adaptation of Large Language Models

arXiv:2601.00756v1 Announce Type: cross Abstract: Large Language Models (LLMs) have become a mainstay for many everyday applications. However, as data evolve their knowledge quickly becomes outdated. Continual learning aims to update LLMs with new information without erasing previously acquired knowledge. Although methods such as full fine-tuning can incorporate new data, they are computationally expensive and prone to catastrophic forgetting, where prior knowledge is overwritten. Memory-augmented approaches address this by equipping LLMs with a memory bank, that is, an external memory module that stores information for future use. However, these methods face a critical limitation: the memory bank grows constantly in real-world scenarios where large-scale data streams arrive. In this paper, we propose MBC, a model that compresses the memory bank through a codebook optimization strategy during online adaptation learning. To ensure stable learning, we also introduce an online resetting mechanism that prevents codebook collapse. In addition, we employ Key-Value Low-Rank Adaptation in the attention layers of the LLM, enabling efficient utilization of the compressed memory representations. Experiments with benchmark question-answering datasets demonstrate that MBC reduces the memory bank size to 0.3% when compared against the most competitive baseline, while maintaining high retention accuracy during online adaptation learning. Our code is publicly available at https://github.com/Thomkat/MBC.
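
One way to picture codebook-style memory compression is a toy vector-quantization pass: store a small codebook plus one index per entry instead of full vectors. This k-means sketch is our illustration and differs from MBC's learned, online codebook with resetting.

    # Toy k-means codebook: memory (N, d) -> codebook (n_codes, d) + codes (N,).
    import numpy as np

    def compress_memory(memory: np.ndarray, n_codes: int = 16, iters: int = 10):
        rng = np.random.default_rng(0)
        codebook = memory[rng.choice(len(memory), n_codes, replace=False)].copy()
        for _ in range(iters):
            dists = np.linalg.norm(memory[:, None] - codebook[None], axis=-1)
            codes = dists.argmin(axis=1)
            for k in range(n_codes):
                members = memory[codes == k]
                if len(members):             # empty clusters keep their old code
                    codebook[k] = members.mean(axis=0)
        return codebook, codes

    mem = np.random.default_rng(1).normal(size=(1000, 32)).astype(np.float32)
    cb, codes = compress_memory(mem)
    stored = cb.size + codes.size            # codebook floats + one index per entry
    print(f"stored numbers: {stored / mem.size:.1%} of the original")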

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Adapting Natural Language Processing Models Across Jurisdictions: A pilot Study in Canadian Cancer Registries

arXiv:2601.00787v1 Announce Type: new Abstract: Population-based cancer registries depend on pathology reports as their primary diagnostic source, yet manual abstraction is resource-intensive and contributes to delays in cancer data. While transformer-based NLP systems have improved registry workflows, their ability to generalize across jurisdictions with differing reporting conventions remains poorly understood. We present the first cross-provincial evaluation of adapting BCCRTron, a domain-adapted transformer model developed at the British Columbia Cancer Registry, alongside GatorTron, a biomedical transformer model, for cancer surveillance in Canada. Our training dataset consisted of approximately 104,000 and 22,000 de-identified pathology reports from the Newfoundland & Labrador Cancer Registry (NLCR) for Tier 1 (cancer vs. non-cancer) and Tier 2 (reportable vs. non-reportable) tasks, respectively. Both models were fine-tuned using complementary synoptic and diagnosis-focused report-section input pipelines. Across NLCR test sets, the adapted models maintained high performance, demonstrating that transformers pretrained in one jurisdiction can be localized to another with modest fine-tuning. To improve sensitivity, we combined the two models using a conservative OR-ensemble, achieving a Tier 1 recall of 0.99 and reducing missed cancers to 24, compared with 48 and 54 for the standalone models. For Tier 2, the ensemble achieved 0.99 recall and reduced missed reportable cancers to 33, compared with 54 and 46 for the individual models. These findings demonstrate that an ensemble combining complementary text representations substantially reduces missed cancers and improves error coverage in cancer-registry NLP. We implement a privacy-preserving workflow in which only model weights are shared between provinces, supporting interoperable NLP infrastructure and a future pan-Canadian foundation model for cancer pathology and registry workflows.
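
The recall benefit of the OR-ensemble follows from a simple observation: a cancer is missed only when both models miss it. The independence assumption and miss rates below are toy values for illustration, not the study's figures.

    # Under (an assumed) independence, the OR-ensemble's miss rate is the
    # product of the standalone miss rates.
    miss_a, miss_b = 0.048, 0.054       # toy standalone miss rates
    miss_or = miss_a * miss_b           # OR-ensemble misses only joint failures
    print(f"OR-ensemble miss rate if independent: {miss_or:.4%}")  # ~0.26%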

Fonte: arXiv cs.CL

NLP/LLMs • Score 89

Sigmoid Head for Quality Estimation under Language Ambiguity

arXiv:2601.00680v1 Announce Type: new Abstract: Language model (LM) probability is not a reliable quality estimator, as natural language is ambiguous. When multiple output options are valid, the model's probability distribution is spread across them, which can misleadingly indicate low output quality. This issue has two causes: (1) LMs' final output activation is softmax, which does not allow multiple correct options to receive high probabilities simultaneously, and (2) LMs' training data consists of single, one-hot encoded references, indicating that there is only one correct option at each output step. We propose training a module for Quality Estimation on top of pre-trained LMs to address these limitations. The module, called Sigmoid Head, is an extra unembedding head with sigmoid activation to tackle the first limitation. To tackle the second limitation, during the negative sampling process to train the Sigmoid Head, we use a heuristic to avoid selecting potentially alternative correct tokens. Our Sigmoid Head is computationally efficient during training and inference. The probability from the Sigmoid Head is a notably better quality signal compared to the original softmax head. As the Sigmoid Head does not rely on human-annotated quality data, it is more robust to out-of-domain settings compared to supervised QE.
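
A sketch of the architectural idea, with shapes and names assumed by us: an extra unembedding head with sigmoid activation lets several valid tokens score high independently, which softmax forbids.

    # Per-token quality scores in [0, 1] from an extra sigmoid unembedding head.
    import torch
    import torch.nn as nn

    class SigmoidHead(nn.Module):
        def __init__(self, d_model: int, vocab_size: int):
            super().__init__()
            self.unembed = nn.Linear(d_model, vocab_size, bias=False)

        def quality(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
            logits = self.unembed(hidden)                 # (batch, seq, vocab)
            probs = torch.sigmoid(logits)                 # options scored independently
            return probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)

    head = SigmoidHead(d_model=16, vocab_size=100)
    h = torch.randn(2, 5, 16); ids = torch.randint(0, 100, (2, 5))
    print(head.quality(h, ids).shape)                     # -> torch.Size([2, 5])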

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

arXiv:2601.00588v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in cost-sensitive and on-device scenarios, and safety guardrails have advanced mainly in English. However, real-world Chinese malicious queries typically conceal intent via homophones, pinyin, symbol-based splitting, and other Chinese-specific patterns. These Chinese-specific adversarial patterns create a safety evaluation gap that is not well captured by existing benchmarks focused on English. This gap is particularly concerning for lightweight models, which may be more vulnerable to such specific adversarial perturbations. To bridge this gap, we introduce the Chinese-Specific Safety Benchmark (CSSBench) that emphasizes these adversarial patterns and evaluates the safety of lightweight LLMs in Chinese. Our benchmark covers six domains that are common in real Chinese scenarios, including illegal activities and compliance, privacy leakage, health and medical misinformation, fraud and hate, adult content, and public and political safety, and organizes queries into multiple task types. We evaluate a set of popular lightweight LLMs and measure over-refusal behavior to assess safety-induced performance degradation. Our results show that Chinese-specific adversarial patterns are a critical challenge for lightweight LLMs. This benchmark offers a comprehensive evaluation of LLM safety in Chinese, assisting robust deployments in practice.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

InfoSynth: Information-Guided Benchmark Synthesis for LLMs

arXiv:2601.00575v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity compared to their seed datasets. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel and diverse benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
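
As one concrete (assumed) instantiation of an entropy-style diversity score, the sketch below measures the entropy of a benchmark's cluster distribution; InfoSynth's actual KL-divergence and entropy metrics are defined differently and are not reproduced here.

    # Diversity as the entropy of cluster assignments over benchmark items:
    # a benchmark concentrated in few clusters scores low.
    import numpy as np

    def cluster_entropy(labels: np.ndarray) -> float:
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log(p)).sum())

    print(cluster_entropy(np.array([0, 0, 1, 2, 2, 2])))  # higher = more diverse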

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak

arXiv:2601.00213v1 Announce Type: cross Abstract: The widespread deployment of large language models (LLMs) has raised growing concerns about their misuse risks and associated safety issues. While prior studies have examined the safety of LLMs in general usage, code generation, and agent-based applications, their vulnerabilities in automated algorithm design remain underexplored. To fill this gap, this study investigates this overlooked safety vulnerability, with a particular focus on intelligent optimization algorithm design, given its prevalent use in complex decision-making scenarios. We introduce MalOptBench, a benchmark consisting of 60 malicious optimization algorithm requests, and propose MOBjailbreak, a jailbreak method tailored for this scenario. Through extensive evaluation of 13 mainstream LLMs including the latest GPT-5 and DeepSeek-V3.1, we reveal that most models remain highly susceptible to such attacks, with an average attack success rate of 83.59% and an average harmfulness score of 4.28 out of 5 on original harmful prompts, and near-complete failure under MOBjailbreak. Furthermore, we assess state-of-the-art plug-and-play defenses that can be applied to closed-source models, and find that they are only marginally effective against MOBjailbreak and prone to exaggerated safety behaviors. These findings highlight the urgent need for stronger alignment techniques to safeguard LLMs against misuse in algorithm design.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

Fast-weight Product Key Memory

arXiv:2601.00671v1 Announce Type: new Abstract: Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse Product Key Memory (PKM) from a static module into a dynamic, "fast-weight" episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR

arXiv:2601.00557v1 Announce Type: new Abstract: Large-scale multilingual ASR (mASR) models such as Whisper achieve strong performance but incur high computational and latency costs, limiting their deployment on resource-constrained edge devices. In this study, we propose a lightweight and language-agnostic multilingual ASR system based on a CTC architecture with domain adaptation. Specifically, we introduce a Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework integrated into an mHuBERT-CTC model, enabling end-to-end decoding via LID-posterior-driven LoRA routing. The hierarchical design consists of a multilingual shared LoRA for learning language-invariant acoustic representations and language-specific LoRA experts for modeling language-dependent characteristics. The proposed routing mechanism removes the need for prior language identity information or explicit language labels during inference, achieving true language-agnostic decoding. Experiments on MSR-86K and the MLC-SLM 2025 Challenge datasets demonstrate that HLoRA achieves competitive performance with state-of-the-art two-stage inference methods using only single-pass decoding, significantly improving decoding efficiency for low-resource mASR applications.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations

arXiv:2601.00454v1 Announce Type: new Abstract: Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S, a training paradigm that fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations rather than complete dialogue histories. We provide a formal complexity analysis showing that M2S reduces training cost from $O(n^2)$ to $O(n)$ for $n$-turn conversations. Empirically, on our training dataset (779 samples, avg. 10.6 turns), M2S requires only 169K tokens compared to 15.7M tokens for the multi-turn baseline -- a 93$\times$ reduction. We evaluate Defensive M2S across three guardrail model families (LlamaGuard, Nemotron, Qwen3Guard) and three compression templates (hyphenize, numberize, pythonize) on SafeDialBench, a comprehensive multi-turn jailbreak benchmark. Our best configuration, Qwen3Guard with hyphenize compression, achieves 93.8% attack detection recall while reducing inference tokens by 94.6% (from 3,231 to 173 tokens per conversation). This represents a 38.9 percentage point improvement over the baseline while dramatically reducing both training and inference costs. Our findings demonstrate that M2S compression can serve as an effective efficiency technique for guardrail deployment, enabling scalable safety screening of long multi-turn conversations.
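
A guess at what a "hyphenize" template might look like: flatten the user turns of a conversation into a single hyphen-bulleted prompt. The paper's exact template wording may differ.

    # Compress a multi-turn conversation into one single-turn prompt, so the
    # guardrail screens O(n) tokens once instead of the full O(n^2) history.
    def hyphenize(turns: list[dict]) -> str:
        user_turns = [t["content"] for t in turns if t["role"] == "user"]
        return ("Please answer the following combined request:\n"
                + "\n".join(f"- {u}" for u in user_turns))

    conv = [{"role": "user", "content": "How do locks work?"},
            {"role": "assistant", "content": "..."},
            {"role": "user", "content": "And how would one pick them?"}]
    print(hyphenize(conv))   # one short prompt instead of the full history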

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach

arXiv:2601.00388v1 Announce Type: new Abstract: Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.
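
The Haversine distance at the heart of the coordinate-aligned reward is standard; the exponential reward shaping below is our assumption for illustration, not necessarily Geo-R's exact formulation.

    # Great-circle distance between predicted and true coordinates, turned into
    # a bounded reward that is 1.0 for an exact prediction.
    from math import radians, sin, cos, asin, sqrt, exp

    def haversine_km(lat1, lon1, lat2, lon2) -> float:
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371.0 * asin(sqrt(a))    # Earth radius ~6371 km

    def reward(pred, truth, scale_km: float = 500.0) -> float:
        return exp(-haversine_km(*pred, *truth) / scale_km)

    print(haversine_km(48.8566, 2.3522, 51.5074, -0.1278))  # Paris -> London, ~344 km
    print(reward((48.8566, 2.3522), (51.5074, -0.1278)))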

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Toward Automated Differential Diagnosis of Skin Diseases Using Deep Learning and Imbalance-Aware Strategies

As dermatological conditions become increasingly common and the availability of dermatologists remains limited, there is a growing need for intelligent tools to support patients and clinicians in the timely and accurate diagnosis of skin diseases. In this project, we develop a deep-learning-based model for the classification and diagnosis of skin conditions.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Exploring the Performance of Large Language Models on Subjective Span Identification Tasks

arXiv:2601.00736v1 Announce Type: new Abstract: Identifying relevant text spans is important for several downstream tasks in NLP, as it contributes to model explainability. While most span identification approaches rely on relatively smaller pre-trained language models like BERT, a few recent approaches have leveraged the latest generation of Large Language Models (LLMs) for the task. Current work has focused on explicit span identification like Named Entity Recognition (NER), while more subjective span identification with LLMs in tasks like Aspect-based Sentiment Analysis (ABSA) has been underexplored. In this paper, we fill this important gap by presenting an evaluation of the performance of various LLMs on text span identification in three popular tasks, namely sentiment analysis, offensive language identification, and claim verification. We explore several LLM strategies like instruction tuning, in-context learning, and chain of thought. Our results indicate that underlying relationships within text aid LLMs in identifying precise text spans.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence

arXiv:2601.00596v1 Announce Type: new Abstract: Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent's capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.
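
One plausible (hypothetical) reading of a journey-coverage-style metric: the fraction of required policy steps the agent completed in order. The function and step names are ours, not JourneyBench's definition.

    # Credit is given only for required steps reached in the prescribed order,
    # so skipping a policy step forfeits credit for everything after it.
    def journey_coverage(required: list[str], traversed: list[str]) -> float:
        idx = 0
        for step in traversed:
            if idx < len(required) and step == required[idx]:
                idx += 1
        return idx / len(required)

    print(journey_coverage(
        ["verify_identity", "check_policy", "issue_refund"],
        ["greet", "verify_identity", "issue_refund"]))  # 0.33: check_policy skipped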

Fonte: arXiv cs.CL

NLP/LLMs • Score 93

Can Large Language Models Still Explain Themselves? Investigating the Impact of Quantization on Self-Explanations

Quantization is widely used to speed up inference and streamline the deployment of large language models (LLMs), but its effects on self-explanations (SEs) remain unexplored. This study investigates the degradation in SE quality and faithfulness caused by quantization, analyzing natural language explanations (NLEs) and counterfactual examples generated by LLMs quantized with three common techniques.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Comparative Efficiency Analysis of Lightweight Transformer Models: A Multi-Domain Empirical Benchmark for Enterprise NLP Deployment

arXiv:2601.00444v1 Announce Type: new Abstract: In the rapidly evolving landscape of enterprise natural language processing (NLP), the demand for efficient, lightweight models capable of handling multi-domain text automation tasks has intensified. This study conducts a comparative analysis of three prominent lightweight Transformer models - DistilBERT, MiniLM, and ALBERT - across three distinct domains: customer sentiment classification, news topic classification, and toxicity and hate speech detection. Utilizing datasets from IMDB, AG News, and the Measuring Hate Speech corpus, we evaluated performance using accuracy-based metrics including accuracy, precision, recall, and F1-score, as well as efficiency metrics such as model size, inference time, throughput, and memory usage. Key findings reveal that no single model dominates all performance dimensions. ALBERT achieves the highest task-specific accuracy in multiple domains, MiniLM excels in inference speed and throughput, and DistilBERT demonstrates the most consistent accuracy across tasks while maintaining competitive efficiency. All results reflect controlled fine-tuning under fixed enterprise-oriented constraints rather than exhaustive hyperparameter optimization. These results highlight trade-offs between accuracy and efficiency, recommending MiniLM for latency-sensitive enterprise applications, DistilBERT for balanced performance, and ALBERT for resource-constrained environments.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Learning Speech Representations with Variational Predictive Coding

arXiv:2601.00100v1 Announce Type: cross Abstract: Despite being the best-known objective for learning speech representations, the HuBERT objective has not been further developed and improved. We argue that it is the lack of an underlying principle that stalls the development, and, in this paper, we show that predictive coding under a variational view is the principle behind the HuBERT objective. Due to its generality, our formulation provides opportunities to improve parameterization and optimization, and we show two simple modifications that bring immediate improvements to the HuBERT objective. In addition, the predictive coding formulation has tight connections to various other objectives, such as APC, CPC, wav2vec, and BEST-RQ. Empirically, the improvement in pre-training brings significant improvements to four downstream tasks: phone classification, f0 tracking, speaker recognition, and automatic speech recognition, highlighting the importance of the predictive coding interpretation.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Rule-Based Approaches to Atomic Sentence Extraction

arXiv:2601.00506v1 Announce Type: new Abstract: Natural language often combines multiple ideas into complex sentences. Atomic sentence extraction, the task of decomposing complex sentences into simpler sentences that each express a single idea, improves performance in information retrieval, question answering, and automated reasoning systems. Previous work has formalized the "split-and-rephrase" task and established evaluation metrics, and machine learning approaches using large language models have improved extraction accuracy. However, these methods lack interpretability and provide limited insight into which linguistic structures cause extraction failures. Although some studies have explored dependency-based extraction of subject-verb-object triples and clauses, no principled analysis has examined which specific clause structures and dependencies lead to extraction difficulties. This study addresses this gap by analyzing how complex sentence structures, including relative clauses, adverbial clauses, coordination patterns, and passive constructions, affect the performance of rule-based atomic sentence extraction. Using the WikiSplit dataset, we implemented dependency-based extraction rules in spaCy, generated 100 gold-standard atomic sentence sets, and evaluated performance using ROUGE and BERTScore. The system achieved ROUGE-1 F1 = 0.6714, ROUGE-2 F1 = 0.478, ROUGE-L F1 = 0.650, and BERTScore F1 = 0.5898, indicating moderate-to-high lexical, structural, and semantic alignment. Challenging structures included relative clauses, appositions, coordinated predicates, adverbial clauses, and passive constructions. Overall, rule-based extraction is reasonably accurate but sensitive to syntactic complexity.
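
A minimal dependency-rule sketch in spaCy (assuming en_core_web_sm is installed) showing the kind of conjunct split such a system performs; a real pipeline needs many more rules, and the output is parser-dependent.

    # Split clause-level coordination ("X did A and Y did B") into candidate
    # atomic sentences using the dependency parse.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def split_conjoined(sentence: str) -> list[str]:
        doc = nlp(sentence)
        root = [t for t in doc if t.dep_ == "ROOT"][0]
        conjs = [t for t in doc if t.dep_ == "conj" and t.head.i == root.i]
        if not conjs:
            return [sentence]
        drop = set()
        for c in conjs:                    # remove each conjunct from the root clause
            drop |= {t.i for t in c.subtree}
        drop |= {t.i for t in doc if t.dep_ == "cc" and t.head.i == root.i}
        atoms = ["".join(t.text_with_ws for t in doc if t.i not in drop).strip()]
        for c in conjs:                    # each conjunct becomes its own sentence
            atoms.append("".join(t.text_with_ws for t in c.subtree).strip())
        return atoms

    print(split_conjoined("Alice founded the company and Bob joined in 2010."))
    # e.g. ['Alice founded the company', 'Bob joined in 2010.'] (parser-dependent)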

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

ECR: Manifold-Guided Semantic Cues for Compact Language Models

arXiv:2601.00543v1 Announce Type: new Abstract: Compact models often lose the structure of their embedding space. The issue shows up when the capacity is tight or the data spans several languages. Such collapse makes it difficult for downstream tasks to build on the resulting representation. Existing compression methods focus on aligning model outputs at a superficial level but fail to preserve the underlying manifold structure. This mismatch often leads to semantic drift in the compact model, causing both task behavior and linguistic properties to deviate from the reference model. To address those issues, we provide a new framework called Embedding Consistency Regulation (ECR). This framework first derives a set of semantic anchors from teacher embeddings (computed once offline). Then, the compact model learns to maintain consistent geometry around these anchors, without relying on matching logits or internal features. ECR adds only a small projection step at inference, without altering the decoding architecture or its runtime behavior. In experiments on a 100K multilingual corpus, ECR consistently stabilizes training and preserves semantic structure across tasks and languages. It also produces a more compact and task-aligned representation space, enabling low-capacity models to learn cleaner manifolds than conventional baselines. ECR works without teacher outputs and is compatible with, but independent of, distillation. Taken together, our results show that ECR helps compact models better follow task requirements and makes them easier to deploy under strict efficiency or privacy limits.

Fonte: arXiv cs.CL

NLP/LLMs • Score 88

BERT-JEPA: Reorganizing CLS Embeddings for Language-Invariant Semantics

arXiv:2601.00366v1 Announce Type: new Abstract: Joint Embedding Predictive Architectures (JEPA) are a novel self-supervised training technique that has shown recent promise across domains. We introduce BERT-JEPA (BEPA), a training paradigm that adds a JEPA training objective to BERT-style models, working to combat a collapsed [CLS] embedding space and turning it into a language-agnostic space. This new structure leads to increased performance across multilingual benchmarks.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Beyond Perfect APIs: A Comprehensive Evaluation of LLM Agents Under Real-World API Complexity

We present WildAGTEval, a benchmark designed to evaluate the function-calling capabilities of large language model (LLM) agents under realistic API complexity. Unlike prior work that assumes an idealized API system, WildAGTEval considers both API specification and execution, offering scenarios of varying complexity to assess LLM performance.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

A Large Empirical Case Study: Go-Explore Adapted for AI Red-Team Testing

Production LLM agents with tool-use capabilities require security testing despite their safety training. We adapt Go-Explore to evaluate GPT-4o-mini across 28 experimental runs, addressing six research questions. Our results show that random-seed variation dominates algorithmic parameters, producing an 8x spread in outcomes.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Infinite-Width Limit of a Single Attention Layer: Analysis via Tensor Programs

In this paper, we rigorously identify the infinite-width limit distribution of variables within a single attention layer using the Tensor Programs framework. We derive the exact form of this limiting law, showing that it deviates fundamentally from Gaussianity. Our numerical experiments validate the theoretical predictions, confirming the theory's effectiveness at finite width and its precise description of finite-head attention.

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

Causality-Inspired Safe Residual Correction for Multivariate Time Series

Although modern multivariate predictors such as Transformers and GNNs perform strongly on benchmarks, they often suffer from systematic errors on specific variables or horizons and, critically, lack guarantees against performance degradation in deployment. To address this safety gap, we propose CRC (Causality-Inspired Safe Residual Correction), a framework designed to guarantee non-degradation.

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective

The continued growth in the scale of deep learning models and training data highlights the critical importance of efficient optimization methods. In this paper, we introduce a unifying framework for analyzing "matrix-aware" preconditioning methods, leading to a new class of optimization methods that demonstrate faster convergence.

Fonte: arXiv stat.ML

NLP/LLMs • Score 96

A Vision- and Knowledge-Enhanced Large Language Model for Generalizable Pedestrian Crossing Behavior Inference

Existing paradigms for inferring pedestrian crossing behavior show limited generalization and inadequate performance at new sites. This study introduces the Pedestrian Crossing LLM (PedX-LLM), an enhanced framework that shifts crossing inference from site-specific patterns to generalizable behavioral reasoning, achieving 82.0% balanced accuracy and outperforming traditional methods.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Building a Neuro-Symbolic Mathematician from First Principles

Large Language Models (LLMs) exhibit persistent logical failures in complex reasoning due to the lack of an internal axiomatic framework. We propose Mathesis, a neuro-symbolic architecture that encodes mathematical states as higher-order hypergraphs and uses a Symbolic Reasoning Kernel (SRK), a differentiable logic engine that maps constraints onto a continuous energy landscape.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Online Fine-Tuning of Decision Transformers with Pure RL Gradients

Decision Transformers (DTs) have emerged as a powerful framework for sequential decision-making, formulating offline reinforcement learning (RL) as a sequence modeling problem. However, extending DTs to online settings with pure RL gradients remains largely unexplored. We identify hindsight return relabeling as a critical obstacle to RL-based fine-tuning.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Language as Mathematical Structure: Examining Semantic Field Theory Against Language Games

arXiv:2601.00448v1 Announce Type: new Abstract: Large language models (LLMs) offer a new empirical setting in which long-standing theories of linguistic meaning can be examined. This paper contrasts two broad approaches: social constructivist accounts associated with language games, and a mathematically oriented framework we call Semantic Field Theory. Building on earlier work by the author, we formalize the notions of lexical fields (Lexfelder) and linguistic fields (Lingofelder) as interacting structures in a continuous semantic space. We then analyze how core properties of transformer architectures, such as distributed representations, attention mechanisms, and geometric regularities in embedding spaces, relate to these concepts. We argue that the success of LLMs in capturing semantic regularities supports the view that language exhibits an underlying mathematical structure, while their persistent limitations in pragmatic reasoning and context sensitivity are consistent with the importance of social grounding emphasized in philosophical accounts of language use. On this basis, we suggest that mathematical structure and language games can be understood as complementary rather than competing perspectives. The resulting framework clarifies the scope and limits of purely statistical models of language and motivates new directions for theoretically informed AI architectures.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining

arXiv:2601.00364v1 Announce Type: new Abstract: Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, details of its contributions remain unclear. We investigate this question by pretraining models from scratch under controlled conditions, comparing the standard web corpus with a monolingual-only version that removes all multilingual documents. Despite constituting only 2% of the corpus, removing bilingual data causes translation performance to drop 56% in BLEU, while behaviour on cross-lingual QA and general reasoning tasks remains stable, with training curves largely overlapping the baseline. To understand this asymmetry, we categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous documents (14%) based on the semantic relevance of content in different languages. We then conduct granular ablations by reintroducing parallel or code-switching data into the monolingual-only corpus. Our experiments reveal that parallel data almost fully restores translation performance (91% of the unfiltered baseline), whereas code-switching contributes minimally. Other cross-lingual tasks remain largely unaffected by either type. These findings reveal that translation critically depends on systematic token-level alignments from parallel data, whereas cross-lingual understanding and reasoning appear to be achievable even without bilingual data.

Fonte: arXiv cs.CL

NLP/LLMs • Score 93

Neural Minimum-Weight Perfect Matching for Quantum Error Correction Codes

Realizing the full potential of quantum computing requires Quantum Error Correction (QEC). QEC reduces error rates by encoding logical information in redundant physical qubits. In this work, we propose a data-driven decoder called Neural Minimum-Weight Perfect Matching (NMWPM), which uses a hybrid architecture to predict dynamic edge weights, demonstrating a significant reduction in Logical Error Rate (LER) compared with standard baselines.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Universal Adaptive Constraint Propagation: Scaling Structured Inference for Large Language Models via Meta-Reinforcement Learning

arXiv:2601.00095v1 Announce Type: new Abstract: Large language models increasingly require structured inference, from JSON schema enforcement to multi-lingual parsing, where outputs must satisfy complex constraints. We introduce MetaJuLS, a meta-reinforcement learning approach that learns universal constraint propagation policies applicable across languages and tasks without task-specific retraining. By formulating structured inference as adaptive constraint propagation and training a Graph Attention Network with meta-learning, MetaJuLS achieves 1.5-2.0x speedups over GPU-optimized baselines while maintaining within 0.2% accuracy of state-of-the-art parsers. On Universal Dependencies across 10 languages and LLM-constrained generation (LogicBench, GSM8K-Constrained), MetaJuLS demonstrates rapid cross-domain adaptation: a policy trained on English parsing adapts to new languages and tasks with 5-10 gradient steps (5-15 seconds) rather than requiring hours of task-specific training. Mechanistic analysis reveals the policy discovers human-like parsing strategies (easy-first) and novel non-intuitive heuristics. By reducing propagation steps in LLM deployments, MetaJuLS contributes to Green AI by directly reducing inference carbon footprint.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Detecting Unobserved Confounders: A Kernelized Regression Approach

Detecting unobserved confounders is crucial for reliable causal inference in observational studies. Existing methods require linearity assumptions or multiple heterogeneous environments, limiting their applicability in nonlinear, single-environment settings. We propose Kernelized Regression Confounder Detection (KRCD), a novel method that uses reproducing kernel Hilbert spaces to model complex dependencies.

Fonte: arXiv stat.ML

NLP/LLMs • Score 93

Mortar: Evolving Mechanics for Automatic Game Design

We present Mortar, a system for autonomously evolving game mechanics for automatic design. Game mechanics define the rules and interactions that govern gameplay, and designing them by hand is a time-consuming process that requires expertise. Mortar combines a quality-diversity algorithm with a large language model to explore a diverse set of mechanics.
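
A toy MAP-Elites-style loop in the spirit of that combination; llm_propose_mechanic() is a hypothetical stand-in for the LLM mutation step, and the fitness and behavior descriptors are invented for illustration.

```python
import random

def llm_propose_mechanic(parent):
    # Stand-in for an LLM-driven mutation of a game mechanic.
    return {k: v + random.uniform(-0.2, 0.2) for k, v in parent.items()}

def evaluate(mechanic):
    fitness = -abs(mechanic["pace"] - 0.5)         # playability proxy
    descriptor = round(mechanic["complexity"], 1)  # behavior-space bin
    return fitness, descriptor

seed = {"pace": 0.1, "complexity": 0.3}
fit, desc = evaluate(seed)
archive = {desc: (fit, seed)}  # one elite per behavior bin

for _ in range(200):
    _, parent = random.choice(list(archive.values()))
    child = llm_propose_mechanic(parent)
    fit, desc = evaluate(child)
    if desc not in archive or fit > archive[desc][0]:
        archive[desc] = (fit, child)  # keep the best mechanic in each niche

print(len(archive), "distinct mechanic niches kept")
```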

Fonte: arXiv cs.AI

NLP/LLMs • Score 93

GRIT -- Geometry-Aware PEFT with K-FAC Preconditioning, Fisher-Guided Reprojection, and Dynamic Rank Adaptation

Parameter-efficient fine-tuning (PEFT) is the standard approach for adapting LLMs, but LoRA and QLoRA are largely geometry-agnostic. We introduce GRIT, a dynamic, curvature-aware LoRA procedure that preserves the LoRA parametrization while improving efficiency and reducing drift along weak directions. GRIT reduces trainable parameters by 46% on average while maintaining quality across prompt styles and data mixtures.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Fine-Tuning Large Language Models for Automated Depression Screening in Nigerian Pidgin: The GENSCORE Pilot Study

Depression is a major contributor to the mental health burden in Nigeria, yet screening coverage is limited by poor access to clinicians, stigma, and language barriers. This study presents a novel approach to automated depression screening using large language models (LLMs) fine-tuned for conversational Nigerian Pidgin.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

An Empirical Evaluation of LLM-Based Approaches to Code Vulnerability Detection: RAG, SFT, and Dual-Agent Systems

The rapid advancement of Large Language Models (LLMs) presents new opportunities for automated software vulnerability detection, a crucial task for securing modern codebases. This paper presents a comparative study of the effectiveness of LLM-based techniques for detecting software vulnerabilities, evaluating three approaches: Retrieval-Augmented Generation (RAG), Supervised Fine-Tuning (SFT), and a Dual-Agent LLM framework.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark

arXiv:2601.00216v1 Announce Type: new Abstract: In medicine, large language models (LLMs) increasingly rely on retrieval-augmented generation (RAG) to ground outputs in up-to-date external evidence. However, current RAG approaches focus primarily on performance improvements while overlooking evidence-based medicine (EBM) principles. This study addresses two key gaps: (1) the lack of PICO alignment between queries and retrieved evidence, and (2) the absence of evidence hierarchy considerations during reranking. We present a generalizable strategy for adapting EBM to graph-based RAG, integrating the PICO framework into knowledge graph construction and retrieval, and proposing a Bayesian-inspired reranking algorithm to calibrate ranking scores by evidence grade without introducing predefined weights. We validated this framework in sports rehabilitation, a literature-rich domain currently lacking RAG systems and benchmarks. We released a knowledge graph (357,844 nodes and 371,226 edges) and a reusable benchmark of 1,637 QA pairs. The system achieved 0.830 nugget coverage, 0.819 answer faithfulness, 0.882 semantic similarity, and 0.788 PICOT match accuracy. In a 5-point Likert evaluation, five expert clinicians rated the system 4.66-4.84 across factual accuracy, faithfulness, relevance, safety, and PICO alignment. These findings demonstrate that the proposed EBM adaptation strategy improves retrieval and answer quality and is transferable to other clinical domains. The released resources also help address the scarcity of RAG datasets in sports rehabilitation.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation

We present JP-TL-Bench, a lightweight, open benchmark designed to guide the iterative development of Japanese-English translation systems. The challenge is often 'which of these two good translations is better?' rather than 'is this translation acceptable?'. This distinction is crucial for Japanese-English, where subtle choices in politeness, implicature, ellipsis, and register strongly affect perceived naturalness.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

HFedMoE: Resource-Aware Heterogeneous Federated Learning with Mixture-of-Experts

While federated learning (FL) enables fine-tuning of large language models (LLMs) without compromising data privacy, the substantial size of an LLM makes on-device training impractical for resource-constrained clients such as mobile devices. Mixture-of-Experts (MoE) models have emerged as a compute-efficient solution, activating only a sparse subset of experts during model training.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Do LLMs Judge Distantly Supervised Named Entity Labels Well? Constructing the JudgeWEL Dataset

arXiv:2601.00411v1 Announce Type: new Abstract: We present judgeWEL, a dataset for named entity recognition (NER) in Luxembourgish, automatically labelled and subsequently verified using large language models (LLM) in a novel pipeline. Building datasets for under-represented languages remains one of the major bottlenecks in natural language processing, where the scarcity of resources and linguistic particularities make large-scale annotation costly and potentially inconsistent. To address these challenges, we propose and evaluate a novel approach that leverages Wikipedia and Wikidata as structured sources of weak supervision. By exploiting internal links within Wikipedia articles, we infer entity types based on their corresponding Wikidata entries, thereby generating initial annotations with minimal human intervention. Because such links are not uniformly reliable, we mitigate noise by employing and comparing several LLMs to identify and retain only high-quality labelled sentences. The resulting corpus is approximately five times larger than the currently available Luxembourgish NER dataset and offers broader and more balanced coverage across entity categories, providing a substantial new resource for multilingual and low-resource NER research.
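
A minimal sketch of the weak-supervision idea: resolve a Wikipedia link target to its Wikidata "instance of" (P31) classes and map those to NER types; the small QID table below is illustrative, not the paper's actual mapping.

```python
INSTANCE_OF_TO_NER = {
    "Q5": "PER",        # human
    "Q515": "LOC",      # city
    "Q6256": "LOC",     # country
    "Q4830453": "ORG",  # business
}

def label_entity(instance_of_qids):
    for qid in instance_of_qids:
        if qid in INSTANCE_OF_TO_NER:
            return INSTANCE_OF_TO_NER[qid]
    return None  # unmapped types are left to the LLM verification stage

# e.g. a link whose Wikidata item is an instance of "country"
print(label_entity(["Q6256"]))  # -> LOC
```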

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

MethConvTransformer: A Deep Learning Framework for Multi-Tissue Alzheimer's Disease Detection

Alzheimer's disease (AD) is a multifactorial neurodegenerative disorder characterized by progressive cognitive decline. MethConvTransformer is a transformer-based deep learning framework that integrates DNA methylation profiles from brain and peripheral tissues, enabling biomarker discovery. The model outperforms conventional machine learning approaches, offering robust epigenetic biomarkers and multi-resolution interpretability.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Toward Better Temporal Structures for Geopolitical Events Forecasting

arXiv:2601.00430v1 Announce Type: new Abstract: Forecasting on geopolitical temporal knowledge graphs (TKGs) through the lens of large language models (LLMs) has recently gained traction. While TKGs and their generalization, hyper-relational temporal knowledge graphs (HTKGs), offer a straightforward structure to represent simple temporal relationships, they lack the expressive power to convey complex facts efficiently. One of the critical limitations of HTKGs is a lack of support for more than two primary entities in temporal facts, which commonly occur in real-world events. To address this limitation, in this work, we study a generalization of HTKGs, Hyper-Relational Temporal Knowledge Generalized Hypergraphs (HTKGHs). We first derive a formalization for HTKGHs, demonstrating their backward compatibility while supporting two complex types of facts commonly found in geopolitical incidents. Then, utilizing this formalization, we introduce the htkgh-polecat dataset, built upon the global event database POLECAT. Finally, we benchmark and analyze popular LLMs on the relation prediction task, providing insights into their adaptability and capabilities in complex forecasting scenarios.
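
A minimal data-structure reading of such a fact (our sketch, not the paper's exact schema): an n-ary relation with any number of primary entities plus qualifier key-value pairs and a timestamp.

```python
from dataclasses import dataclass, field

@dataclass
class TemporalHyperFact:
    relation: str
    primary_entities: list            # n-ary participants, not just (head, tail)
    qualifiers: dict = field(default_factory=dict)
    timestamp: str = ""

fact = TemporalHyperFact(
    relation="consult",
    primary_entities=["France", "Germany", "Poland"],  # three primary actors
    qualifiers={"topic": "border policy"},
    timestamp="2023-04-12",
)
print(fact)
```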

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

A Dynamic Bayesian Optimization Framework for Instruction Tuning in Partial Differential Equation Discovery

Large Language Models (LLMs) show promise for equation discovery, but their outputs are highly sensitive to prompt formulation, a phenomenon we call instruction brittleness. To address this, we propose NeuroSymBO, which reframes prompt engineering as a sequential decision problem.
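
One way to make "prompt engineering as a sequential decision problem" concrete is a bandit over instruction templates; the UCB loop below is a simplification of the Bayesian-optimization view, and score_discovery() is a hypothetical evaluator of the recovered equation.

```python
import math
import random

templates = ["Derive the PDE governing {data}.",
             "List candidate differential operators for {data}.",
             "Propose and verify a governing equation for {data}."]
counts = [0] * len(templates)
values = [0.0] * len(templates)

def score_discovery(prompt: str) -> float:
    return random.random()  # stand-in: run the LLM, score the recovered PDE

for t in range(1, 61):
    ucb = [values[i] + math.sqrt(2 * math.log(t) / counts[i])
           if counts[i] else float("inf") for i in range(len(templates))]
    i = ucb.index(max(ucb))
    reward = score_discovery(templates[i])
    counts[i] += 1
    values[i] += (reward - values[i]) / counts[i]  # running mean reward

print("best template:", templates[values.index(max(values))])
```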

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

MotionPhysics: Learnable Motion Distillation for Text-Guided Simulation

arXiv:2601.00504v1 Announce Type: new Abstract: Accurately simulating existing 3D objects and a wide variety of materials often demands expert knowledge and time-consuming physical parameter tuning to achieve the desired dynamic behavior. We introduce MotionPhysics, an end-to-end differentiable framework that infers plausible physical parameters from a user-provided natural language prompt for a chosen 3D scene of interest, removing the need for guidance from ground-truth trajectories or annotated videos. Our approach first utilizes a multimodal large language model to estimate material parameter values, which are constrained to lie within plausible ranges. We further propose a learnable motion distillation loss that extracts robust motion priors from pretrained video diffusion models while minimizing appearance and geometry inductive biases to guide the simulation. We evaluate MotionPhysics across more than thirty scenarios, including real-world, human-designed, and AI-generated 3D objects, spanning a wide range of materials such as elastic solids, metals, foams, sand, and both Newtonian and non-Newtonian fluids. We demonstrate that MotionPhysics produces visually realistic dynamic simulations guided by natural language, surpassing the state of the art while automatically determining physically plausible parameters. The code and project page are available at: https://wangmiaowei.github.io/MotionPhysics.github.io/.

Fonte: arXiv cs.CV

NLP/LLMs • Score 90

An AI Monkey Gets Grapes for Sure -- Sphere Neural Networks for Reliable Decision-Making

This paper compares three methodological categories of neural reasoning: LLM reasoning, supervised-learning-based reasoning, and explicit model-based reasoning. We show that reasoning via supervised learning is less appealing than reasoning by explicit model construction, and we propose a new version of Sphere Neural Networks that enables reliable decision-making.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

The Trojan Horse in the Vocabulary: Subtle Sabotage of LLM Composition

The open-weights LLM ecosystem is increasingly defined by model-composition techniques that remix capabilities from diverse sources. A critical prerequisite for applying these methods is tokenizer transplantation, which aligns incompatible vocabularies into a shared embedding space. We demonstrate that this interoperability step introduces a supply-chain vulnerability.
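
For context, a naive transplant of the kind being attacked can be sketched in a few lines: embeddings of tokens shared by both vocabularies are copied over, everything else is mean-initialized. The arrays are toy stand-ins.

```python
import numpy as np

src_vocab = {"the": 0, "cat": 1, "##s": 2}
tgt_vocab = {"the": 0, "dog": 1, "cat": 2}
src_emb = np.random.randn(len(src_vocab), 8)  # donor embedding matrix

# default: initialize every target row to the donor's mean embedding
tgt_emb = np.tile(src_emb.mean(axis=0), (len(tgt_vocab), 1))
for token, j in tgt_vocab.items():
    if token in src_vocab:            # shared tokens carry over directly
        tgt_emb[j] = src_emb[src_vocab[token]]

print(tgt_emb.shape)  # (3, 8): target vocabulary now lives in donor space
```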

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Reinforcement Learning with Function Approximation for Non-Markovian Processes

We study reinforcement learning methods with linear function approximation under non-Markovian state and cost processes. We first consider the policy evaluation method and show that the algorithm converges under suitable ergodicity conditions. Moreover, we show that the limit corresponds to the fixed point of a joint operator composed of an orthogonal projection and the Bellman operator of an auxiliary Markov decision process.
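
In the standard notation for linear policy evaluation (feature matrix Phi, stationary-distribution weight matrix D, Bellman operator T of the auxiliary Markov process), the fixed point referenced above takes the familiar projected-Bellman form:

```latex
\Phi\theta^{*} \;=\; \Pi\, T\!\left(\Phi\theta^{*}\right),
\qquad
\Pi \;=\; \Phi\left(\Phi^{\top} D\, \Phi\right)^{-1}\Phi^{\top} D .
```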

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Toward Large-Scale Photonics-Powered AI Systems: From Physical Design Automation to System-Algorithm Co-Exploration

In this work, we identify three essential considerations for realizing practical photonic AI systems at scale: (1) support for the dynamic tensor operations of modern models; (2) systematic management of conversion, control, and data-movement overheads; and (3) robustness under hardware non-idealities. We develop a photonic AI design-support tool spanning early exploration through physical realization.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Hear the Heartbeat in Phases: Physiologically Grounded Phase-Aware ECG Biometrics

Electrocardiography (ECG) is used for identity authentication on wearable devices thanks to its individual-specific characteristics and inherent liveness. We propose a Hierarchical Phase-Aware Fusion (HPAF) framework that explicitly avoids feature entanglement through a three-stage design.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Talk Less, Verify More: Improving LLM Assistants with Semantic Checks and Execution Feedback

arXiv:2601.00224v1 Announce Type: new Abstract: As large language model (LLM) assistants become increasingly integrated into enterprise workflows, their ability to generate accurate, semantically aligned, and executable outputs is critical. However, current conversational business analytics (CBA) systems often lack built-in verification mechanisms, leaving users to manually validate potentially flawed results. This paper introduces two complementary verification techniques: Q*, which performs reverse translation and semantic matching between code and user intent, and Feedback+, which incorporates execution feedback to guide code refinement. Embedded within a generator-discriminator framework, these mechanisms shift validation responsibilities from users to the system. Evaluations on three benchmark datasets, Spider, Bird, and GSM8K, demonstrate that both Q* and Feedback+ reduce error rates and task completion time. The study also identifies reverse translation as a key bottleneck, highlighting opportunities for future improvement. Overall, this work contributes a design-oriented framework for building more reliable, enterprise-grade GenAI systems capable of trustworthy decision support.
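
A hedged sketch of that generator-discriminator loop; the llm callable is a stand-in for any chat-completion client, and Feedback+'s execution signal is reduced to a textual critique here.

```python
def verify_and_refine(llm, user_intent: str, max_rounds: int = 3) -> str:
    code = llm(f"Write SQL for: {user_intent}")
    for _ in range(max_rounds):
        # Q*-style check: reverse-translate the code, compare with the intent
        paraphrase = llm(f"In one sentence, what does this SQL do?\n{code}")
        verdict = llm("Do these describe the same request? Answer yes or no.\n"
                      f"A: {user_intent}\nB: {paraphrase}")
        if verdict.strip().lower().startswith("yes"):
            return code
        # otherwise feed the mismatch back and regenerate
        code = llm(f"The SQL does not match the request.\nRequest: {user_intent}"
                   f"\nSQL: {code}\nReturn corrected SQL only.")
    return code

# demo with a canned stub in place of a real model
canned = iter(["SELECT * FROM users", "lists all users", "yes"])
print(verify_and_refine(lambda prompt: next(canned), "show every user"))
```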

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

DA-DPO: Difficulty-Aware, Cost-Efficient Preference Optimization for Reducing Hallucinations in MLLMs

Direct Preference Optimization (DPO) has shown great potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing approaches often suffer from overfitting due to difficulty imbalance in the preference data. We propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework that rebalances the learning process.
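
A minimal sketch of how a difficulty signal can be folded into the standard DPO objective; the specific weighting below is illustrative, not the paper's scheme.

```python
import torch
import torch.nn.functional as F

def weighted_dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                      difficulty, beta=0.1):
    # log-ratio margin between the policy and the frozen reference model
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    per_sample = -F.logsigmoid(margin)
    # up-weight hard pairs, down-weight pairs the model already separates
    weights = difficulty / difficulty.mean()
    return (weights * per_sample).mean()

# toy batch of summed log-probabilities for two preference pairs
t = torch.tensor
loss = weighted_dpo_loss(t([-10.0, -12.0]), t([-13.0, -12.5]),
                         t([-11.0, -12.0]), t([-12.0, -12.4]),
                         difficulty=t([0.9, 0.3]))
print(loss)
```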

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Ask, Clarify, Optimize: Human-LLM Agent Collaboration for Smarter Inventory Control

Inventory management remains a challenge for many small and medium-sized enterprises that lack the expertise to implement advanced optimization methods. This paper investigates whether Large Language Models (LLMs) can help close this gap, proposing a hybrid framework that strictly separates semantic reasoning from mathematical computation.
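
The separation can be illustrated with a textbook inventory formula: a language model maps free text to structured parameters, and a deterministic routine does the arithmetic. extract_params() below is a hard-coded stand-in for the LLM step; the EOQ formula itself is standard inventory theory.

```python
import math

def extract_params(description: str) -> dict:
    # An LLM would map free text to these fields; hard-coded values keep
    # the sketch self-contained.
    return {"annual_demand": 12000, "order_cost": 50.0, "holding_cost": 2.4}

def economic_order_quantity(p: dict) -> float:
    # classic EOQ: sqrt(2 * demand * ordering cost / holding cost)
    return math.sqrt(2 * p["annual_demand"] * p["order_cost"]
                     / p["holding_cost"])

p = extract_params("We sell about 1,000 units a month; ordering costs ...")
print(f"Order about {economic_order_quantity(p):.0f} units per replenishment")
```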

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Robust Uncertainty Quantification for Factual Generation of Large Language Models

arXiv:2601.00348v1 Announce Type: new Abstract: The rapid advancement of large language model (LLM) technology has facilitated its integration into various domains of professional and daily life. However, the persistent challenge of LLM hallucination has emerged as a critical limitation, significantly compromising the reliability and trustworthiness of AI-generated content. This challenge has garnered significant attention within the scientific community, prompting extensive research efforts in hallucination detection and mitigation strategies. Current methodological frameworks reveal a critical limitation: traditional uncertainty quantification approaches demonstrate effectiveness primarily within conventional question-answering paradigms, yet exhibit notable deficiencies when confronted with non-canonical or adversarial questioning strategies. This performance gap raises substantial concerns regarding the dependability of LLM responses in real-world applications requiring robust critical thinking capabilities. This study aims to fill this gap by proposing an uncertainty quantification scenario in the task of generation with multiple facts. We have meticulously constructed a set of trap questions containing fake names. Based on this scenario, we propose a novel and robust uncertainty quantification method (RU). A series of experiments has been conducted to verify its effectiveness. The results show that the constructed set of trap questions performs excellently. Moreover, when compared with the baseline methods on four different models, our proposed method demonstrates strong performance, with an average increase of 0.1-0.2 in ROCAUC values over the best-performing baseline method, providing new insights and methods for addressing the hallucination issue of LLMs.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

TotalFM: An Organ-Separated Framework for 3D-CT Vision Foundation Models

arXiv:2601.00260v1 Announce Type: new Abstract: While foundation models in radiology are expected to be applied to various clinical tasks, computational cost constraints remain a major challenge when training on 3D-CT volumetric data. In this study, we propose TotalFM, a radiological foundation model that efficiently learns the correspondence between 3D-CT images and linguistic expressions based on the concept of organ separation, utilizing a large-scale dataset of 140,000 series. By automating the creation of organ volume and finding-sentence pairs through segmentation techniques and Large Language Model (LLM)-based radiology report processing, and by combining self-supervised pre-training via VideoMAE with contrastive learning using volume-text pairs, we aimed to balance computational efficiency and representation capability. In zero-shot organ-wise lesion classification tasks, the proposed model achieved higher F1 scores in 83% (5/6) of organs compared to CT-CLIP and 64% (9/14) of organs compared to Merlin. These results suggest that the proposed model exhibits high generalization performance in a clinical evaluation setting using actual radiology report sentences. Furthermore, in zero-shot finding-wise lesion classification tasks, our model achieved a higher AUROC in 83% (25/30) of finding categories compared to Merlin. We also confirmed performance comparable to existing Vision-Language Models (VLMs) in radiology report generation tasks. Our results demonstrate that the organ-separated learning framework can serve as a realistic and effective design guideline for the practical implementation of 3D-CT foundation models.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Pat-DEVAL: Chain-of-Legal-Thought Evaluation for Patent Description

arXiv:2601.00166v1 Announce Type: new Abstract: Patent descriptions must deliver comprehensive technical disclosure while meeting strict legal standards such as enablement and written description requirements. Although large language models have enabled end-to-end automated patent drafting, existing evaluation approaches fail to assess long-form structural coherence and statutory compliance specific to descriptions. We propose Pat-DEVAL, the first multi-dimensional evaluation framework dedicated to patent description bodies. Leveraging the LLM-as-a-judge paradigm, Pat-DEVAL introduces Chain-of-Legal-Thought (CoLT), a legally-constrained reasoning mechanism that enforces sequential patent-law-specific analysis. Experiments validated by patent expert on our Pap2Pat-EvalGold dataset demonstrate that Pat-DEVAL achieves a Pearson correlation of 0.69, significantly outperforming baseline metrics and existing LLM evaluators. Notably, the framework exhibits a superior correlation of 0.73 in Legal-Professional Compliance, proving that the explicit injection of statutory constraints is essential for capturing nuanced legal validity. By establishing a new standard for ensuring both technical soundness and legal compliance, Pat-DEVAL provides a robust methodological foundation for the practical deployment of automated patent drafting systems.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

From Transformers to LLMs: A Systematic Survey of Efficiency Considerations in NLP

arXiv:2406.16893v2 Announce Type: replace Abstract: The emergence of Transformer-based Large Language Models (LLMs) has substantially augmented the capabilities of Natural Language Processing (NLP), thereby intensifying the demand for computational resources. Therefore, enhancing efficiency along factors like computational requirements, energy consumption, carbon footprint, and financial cost has become a vital area of research. This motivates us to conduct a systematic literature review on Transformer-based LLMs in NLP from the perspective of efficiency. In this survey of 312 articles published between the years 2011 and 2025, efficiency-improvement endeavors are systematically discussed, targeting aspects such as data curation, model design, model downsizing, and dynamic inferencing. This is augmented with efficiency considerations in model adaptation strategies like pre-training, fine-tuning, prompt engineering, and Retrieval-Augmented Generation (RAG). Furthermore, a statistical analysis of the articles has been performed, followed by an in-depth evaluation of the efficiency and efficacy of more than 30 renowned NLP models on 13 evaluation benchmarks. This paper offers valuable insights for researchers, professionals, and scholars, and explores the trend of research toward sustainable practices in NLP.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

Physio-DPO: Aligning Large Language Models with the Protein Energy Landscape to Eliminate Structural Hallucinations

arXiv:2601.00647v1 Announce Type: new Abstract: Large Protein Language Models have shown strong potential for generative protein design, yet they frequently produce structural hallucinations, generating sequences with high linguistic likelihood that fold into thermodynamically unstable conformations. Existing alignment approaches such as Direct Preference Optimization are limited in this setting, as they model preferences as binary labels and ignore the continuous structure of the physical energy landscape. We propose Physio-DPO, a physics-informed alignment framework that grounds protein language models in thermodynamic stability. Physio-DPO introduces a magnitude-aware objective that scales optimization updates according to the energy gap between native structures and physics-perturbed hard negatives. Experiments show that Physio-DPO consistently outperforms strong baselines including SFT, PPO, and standard DPO, reducing self-consistency RMSD to 1.28 Å and increasing foldability to 92.8%. Qualitative analysis further demonstrates that Physio-DPO effectively mitigates structural hallucinations by recovering biophysical interactions such as hydrophobic core packing and hydrogen bond networks.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

SSI-GAN: Swin-Inspired Semi-Supervised Generative Adversarial Networks for Neural Spike Classification

Mosquitoes are the main vectors of arboviral diseases. Manually classifying their neural spike patterns is laborious and expensive. To address the scarcity of labeled data, we propose a novel Generative Adversarial Network (GAN) architecture called SSI-GAN, which achieved 99.93% classification accuracy with only 3% labeled data.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

RIMRULE: Improving Tool-Using Language Agents via MDL-Guided Rule Learning

arXiv:2601.00086v1 Announce Type: new Abstract: Large language models (LLMs) often struggle to use tools reliably in domain-specific settings, where APIs may be idiosyncratic, under-documented, or tailored to private workflows. This highlights the need for effective adaptation to task-specific tools. We propose RIMRULE, a neuro-symbolic approach for LLM adaptation based on dynamic rule injection. Compact, interpretable rules are distilled from failure traces and injected into the prompt during inference to improve task performance. These rules are proposed by the LLM itself and consolidated using a Minimum Description Length (MDL) objective that favors generality and conciseness. Each rule is stored in both natural language and a structured symbolic form, supporting efficient retrieval at inference time. Experiments on tool-use benchmarks show that this approach improves accuracy on both seen and unseen tools without modifying LLM weights. It outperforms prompting-based adaptation methods and complements finetuning. Moreover, rules learned from one LLM can be reused to improve others, including long reasoning LLMs, highlighting the portability of symbolic knowledge across architectures.
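
A toy rendering of the MDL idea: a rule's cost is its own description length plus the cost of the failure traces it leaves unexplained, so the objective naturally favors general, concise rules. The explains() predicate and byte costs are invented for illustration.

```python
def mdl_cost(rule: str, failures: list, explains) -> float:
    rule_bits = 8 * len(rule)  # crude code length of the rule itself
    residual = [f for f in failures if not explains(rule, f)]
    residual_bits = 8 * sum(len(f) for f in residual)  # unexplained traces
    return rule_bits + residual_bits

failures = ["timeout when date lacks timezone", "timeout when date is epoch"]
candidates = ["always pass timezone-aware ISO dates to the booking API",
              "retry the call twice"]
explains = lambda rule, failure: "date" in rule and "date" in failure
best = min(candidates, key=lambda r: mdl_cost(r, failures, explains))
print(best)  # the general date rule explains both failures cheaply
```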

Fonte: arXiv cs.CL

NLP/LLMs • Score 93

When Small Models Are Right for the Wrong Reasons: Process Verification for Trustworthy Agents

Deploying small language models (7-9B parameters) as autonomous agents requires trusting their reasoning, not just their outputs. We reveal a critical reliability crisis: 50-69% of these models' correct answers contain fundamentally flawed reasoning, a 'Right for the Wrong Reasons' phenomenon invisible to standard accuracy metrics.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Spike-and-Wave Discharge (SWD) Detection Using a 1-Dimensional Residual UNet

Manually labeling events in electroencephalography (EEG) recordings is time-consuming, especially when recordings run continuously for weeks to months. A method for automatically labeling relevant EEG events reduces the manual workload. In this study, we compare the performance of 14 machine learning classifiers on a manually annotated dataset, finding a 1D UNet to be the most effective for labeling SWDs.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning

arXiv:2601.00215v1 Announce Type: cross Abstract: Reinforcement learning (RL) has emerged as a promising approach for eliciting reasoning chains before generating final answers. However, multimodal large language models (MLLMs) generate reasoning that lacks integration of visual information. This limits their ability to solve problems that demand accurate visual perception, such as visual puzzles. We show that visual perception is the key bottleneck in such tasks: converting images into textual descriptions significantly improves performance, yielding gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7. To address this, we investigate reward-driven RL as a mechanism to unlock long visual reasoning in open-source MLLMs without requiring costly supervision. We design and evaluate six reward functions targeting different reasoning aspects, including image understanding, thinking steps, and answer accuracy. Using group relative policy optimization (GRPO), our approach explicitly incentivizes longer, structured reasoning and mitigates bypassing of visual information. Experiments on Qwen-2.5-VL-7B achieve 5.56% improvements over the base model, with consistent gains across both in-domain and out-of-domain settings.
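
At the core of GRPO is a group-relative advantage: several rollouts of the same prompt are scored, then standardized within the group, removing the need for a learned value network. A minimal version:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, rollouts_per_prompt)
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)  # per-group standardization

# e.g. four rollouts of one prompt scored by mixed reward terms
group = torch.tensor([[0.2, 0.9, 0.4, 0.9]])
print(grpo_advantages(group))
```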

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Intelligent Traffic Surveillance for Real-Time Vehicle Detection, License Plate Recognition, and Speed Estimation

arXiv:2601.00344v1 Announce Type: new Abstract: Speeding is a major contributor to road fatalities, particularly in developing countries such as Uganda, where road safety infrastructure is limited. This study proposes a real-time intelligent traffic surveillance system tailored to such regions, using computer vision techniques to address vehicle detection, license plate recognition, and speed estimation. The study collected a rich dataset using a speed gun, a Canon camera, and a mobile phone to train the models. License plate detection using YOLOv8 achieved a mean average precision (mAP) of 97.9%. For character recognition of the detected license plate, the CNN model achieved a character error rate (CER) of 3.85%, while the transformer model significantly reduced the CER to 1.79%. Speed estimation used source and target regions of interest, yielding a margin of error within 10 km/h. Additionally, a database was established to correlate user information with vehicle detection data, enabling automated ticket issuance by SMS through Africa's Talking API. This system addresses critical traffic management needs in resource-constrained environments and shows potential to reduce road accidents through automated traffic enforcement in developing countries where such interventions are urgently needed.
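
The two-region speed estimate reduces to timing a vehicle between a source and a target region of known ground separation; the numbers below are illustrative.

```python
def estimate_speed_kmh(frame_enter: int, frame_exit: int,
                       fps: float, gap_meters: float) -> float:
    seconds = (frame_exit - frame_enter) / fps   # dwell time between regions
    return (gap_meters / seconds) * 3.6          # m/s -> km/h

print(estimate_speed_kmh(frame_enter=120, frame_exit=156,
                         fps=30, gap_meters=20.0))  # -> 60.0 km/h
```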

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

FaithSCAN: Single-Pass Model-Based Hallucination Detection for Faithful Visual Question Answering

Faithfulness hallucinations in VQA occur when vision-language models produce fluent but visually ungrounded answers, undermining their reliability in safety-critical applications. We propose FaithSCAN, a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, overcoming the efficiency and detection-performance limitations of existing methods.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Neural Chains and Discrete Dynamical Systems

In this work, we examine the analogy between machine learning (ML) applications based on the transformer architecture without self-attention, termed neural chains, and discrete dynamical systems associated with discretized versions of neural integral and partial differential equations (NIE, PDE). We present a comparative analysis of the numerical solution of the Burgers and Eikonal equations via standard numerical discretization and PINN learning.
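
The analogy rests on reading a residual, attention-free block as one explicit integration step of a continuous flow, with the layer index playing the role of discretized time:

```latex
x_{k+1} \;=\; x_k + h\, f_{\theta_k}(x_k)
\quad\longleftrightarrow\quad
\frac{\mathrm{d}x}{\mathrm{d}t} = f_{\theta}\bigl(x(t)\bigr),
\qquad t = kh .
```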

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Application Research of a Deep Learning Model Integrating CycleGAN and YOLO in PCB Infrared Defect Detection

arXiv:2601.00237v1 Announce Type: new Abstract: This paper addresses the critical bottleneck of infrared (IR) data scarcity in Printed Circuit Board (PCB) defect detection by proposing a cross-modal data augmentation framework integrating CycleGAN and YOLOv8. Unlike conventional methods relying on paired supervision, we leverage CycleGAN to perform unpaired image-to-image translation, mapping abundant visible-light PCB images into the infrared domain. This generative process synthesizes high-fidelity pseudo-IR samples that preserve the structural semantics of defects while accurately simulating thermal distribution patterns. Subsequently, we construct a heterogeneous training strategy that fuses generated pseudo-IR data with limited real IR samples to train a lightweight YOLOv8 detector. Experimental results demonstrate that this method effectively enhances feature learning under low-data conditions. The augmented detector significantly outperforms models trained on limited real data alone and approaches the performance benchmarks of fully supervised training, proving the efficacy of pseudo-IR synthesis as a robust augmentation strategy for industrial inspection.

Fonte: arXiv cs.CV

NLP/LLMs • Score 92

From Scratch to Fine-Tuned: A Comparative Study of Transformer Training Strategies for Legal Machine Translation

arXiv:2512.18593v1 Announce Type: new Abstract: In multilingual nations like India, access to legal information is often hindered by language barriers, as much of the legal and judicial documentation remains in English. Legal Machine Translation (L-MT) offers a scalable solution to this challenge by enabling accurate and accessible translations of legal documents. This paper presents our work for the JUST-NLP 2025 Legal MT shared task, focusing on English-Hindi translation using Transformer-based approaches. We experiment with 2 complementary strategies, fine-tuning a pre-trained OPUS-MT model for domain-specific adaptation and training a Transformer model from scratch using the provided legal corpus. Performance is evaluated using standard MT metrics, including SacreBLEU, chrF++, TER, ROUGE, BERTScore, METEOR, and COMET. Our fine-tuned OPUS-MT model achieves a SacreBLEU score of 46.03, significantly outperforming both baseline and from-scratch models. The results highlight the effectiveness of domain adaptation in enhancing translation quality and demonstrate the potential of L-MT systems to improve access to justice and legal transparency in multilingual contexts.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Data-Conscious Contribution via Community-Driven Chain-of-Thought Distillation

The current era of AI development places heavy emphasis on training large models on ever-larger datasets. This paradigm has spawned new product categories such as LLM chatbots, but it has also raised concerns about data privacy and consumer choice. This paper addresses data portability and user autonomy in the context of LLMs that 'reason' using chain-of-thought (CoT) traces.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Stable and Efficient Single-Rollout Reinforcement Learning for Multimodal Reasoning

Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm for improving the reasoning capabilities of Multimodal Large Language Models (MLLMs). We introduce MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves stable optimization and effective performance in multimodal reasoning.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

NL2CA: Auto-Formalizing Cognitive Decision-Making from Natural Language Using an Unsupervised CriticNL2LTL Framework

Cognitive computational models offer a formal, interpretable way to characterize human deliberation and decision-making, but their development remains labor-intensive. In this paper, we propose NL2CA, a novel method for auto-formalizing cognitive decision-making rules from natural-language descriptions of human experience.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Toward Reasoning-Preserving Unlearning in Multimodal Large Language Models

Machine unlearning aims to erase requested data from trained models without full retraining. For Reasoning Multimodal Large Language Models (RMLLMs), this is challenging because intermediate steps can leak sensitive information. We present RMLLMU-Bench, the first benchmark for RMLLM unlearning that evaluates reasoning leakage and reasoning retention.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset

arXiv:2512.17915v1 Announce Type: new Abstract: The recently published Loquacious dataset aims to be a replacement for established English automatic speech recognition (ASR) datasets such as LibriSpeech or TED-Lium. The main goal of the Loquacious dataset is to provide properly defined training and test partitions across many acoustic and language domains, with an open license suitable for both academia and industry. To further promote the benchmarking and usability of this new dataset, we present additional resources in the form of n-gram language models (LMs), a grapheme-to-phoneme (G2P) model and pronunciation lexica, with open and public access. Utilizing those additional resources we show experimental results across a wide range of ASR architectures with different label units and topologies. Our initial experimental results indicate that the Loquacious dataset offers a valuable study case for a variety of common challenges in ASR.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Toward Efficient Agents: A Co-Design of Inference Architecture and System

The rapid development of agents based on large language models (LLMs) has opened new possibilities for autonomous multi-turn reasoning and tool-based decision-making. However, their real-world deployment is hindered by severe inefficiencies that arise not from isolated model inference but from systemic latency accumulated across reasoning loops, context growth, and heterogeneous tool interactions.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

A Hybrid Inductive-Transductive Network for Traffic Flow Imputation at Unsampled Locations

Accurately imputing traffic flow at unsensed locations is challenging. We propose HINT, a Hybrid Inductive-Transductive Network, which uses an inductive-transductive training strategy to treat speed as a transductive signal while learning flow inductively. HINT consistently outperforms inductive baselines on three real-world datasets.

Fonte: arXiv cs.LG

NLP/LLMs • Score 92

An Agentic AI Framework for General Medical Student Skills Training

Advances in large language models offer strong potential for enhancing virtual simulated patients (VSPs) in medical education, providing scalable alternatives to resource-intensive traditional methods. We present an agentic framework for training general medical students' skills that unifies configurable vignette generation, controlled patient dialogue, and structured standards-based feedback.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital

arXiv:2512.18658v1 Announce Type: new Abstract: Before closing venture capital financing rounds, lawyers conduct diligence that includes tying out the capitalization table: verifying that every security (for example, shares, options, warrants) and issuance term (for example, vesting schedules, acceleration triggers, transfer restrictions) is supported by large sets of underlying legal documentation. While LLMs continue to improve on legal benchmarks, specialized legal workflows, such as capitalization tie-out, remain out of reach even for strong agentic systems. The task requires multi-document reasoning, strict evidence traceability, and deterministic outputs that current approaches fail to reliably deliver. We characterize capitalization tie-out as an instance of a real-world benchmark for legal AI, analyze and compare the performance of existing agentic systems, and propose a world model architecture toward tie-out automation, and more broadly as a foundation for applied legal intelligence.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

ChronoDreamer: An Action-Conditioned World Model as an Online Simulator for Robotic Planning

We present ChronoDreamer, an action-conditioned world model for contact-rich robotic manipulation. Given a history of egocentric RGB frames, contact maps, actions, and joint states, ChronoDreamer predicts future video frames, contact distributions, and joint angles through a spatiotemporal transformer trained with MaskGIT-style masked prediction.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

LLM-based Few-Shot Early Rumor Detection with Imitation Agent

arXiv:2512.18352v1 Announce Type: new Abstract: Early Rumor Detection (EARD) aims to identify the earliest point at which a claim can be accurately classified based on a sequence of social media posts. This is especially challenging in data-scarce settings. While Large Language Models (LLMs) perform well in few-shot NLP tasks, they are not well-suited for time-series data and are computationally expensive for both training and inference. In this work, we propose a novel EARD framework that combines an autonomous agent and an LLM-based detection model, where the agent acts as a reliable decision-maker for early time point determination, while the LLM serves as a powerful rumor detector. This approach offers the first solution for few-shot EARD, necessitating only the training of a lightweight agent and allowing the LLM to remain training-free. Extensive experiments on four real-world datasets show our approach boosts performance across LLMs and surpasses existing EARD methods in accuracy and earliness.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers

Transformers can implement both generalizable algorithms (e.g., induction heads) and simple positional shortcuts (e.g., memorizing fixed output positions). In this work, we study how the choice of pretraining data distribution steers a shallow transformer toward one behavior or the other, analyzing gradient-based training of a single-layer transformer.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

From Word to World: Can Large Language Models Implicitly Serve as Text-Based World Models?

Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive and hard to scale. This study investigates whether large language models (LLMs) can improve learning efficiency in text-based environments, presenting a three-level framework for evaluating LLM-based world models.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

GeoSense-AI: Fast Location Inference from Crisis Microblogs

arXiv:2512.18225v1 Announce Type: new Abstract: This paper presents an applied AI pipeline for real-time geolocation from noisy microblog streams, unifying statistical hashtag segmentation, part-of-speech-driven proper-noun detection, dependency parsing around disaster lexicons, lightweight named-entity recognition, and gazetteer-grounded disambiguation to infer locations directly from text rather than sparse geotags. The approach operationalizes information extraction under streaming constraints, emphasizing low-latency NLP components and efficient validation against geographic knowledge bases to support situational awareness during emergencies. In head-to-head comparisons with widely used NER toolkits, the system attains strong F1 while being engineered for orders-of-magnitude faster throughput, enabling deployment in live crisis informatics settings. A production map interface demonstrates end-to-end AI functionality (ingest, inference, and visualization), surfacing locational signals at scale for floods, outbreaks, and other fast-moving events. By prioritizing robustness to informal text and streaming efficiency, GeoSense-AI illustrates how domain-tuned NLP and knowledge grounding can elevate emergency response beyond conventional geotag reliance.
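
Statistical hashtag segmentation is typically a dynamic program over unigram log-probabilities; a minimal version with a made-up frequency table:

```python
import math

FREQ = {"kampala": 120, "flood": 300, "alert": 250, "in": 5000}
TOTAL = sum(FREQ.values())

def logp(word: str) -> float:
    return math.log(FREQ.get(word, 0.5) / TOTAL)  # smoothed unigram score

def segment(tag: str, max_len: int = 12):
    best = [(0.0, [])]  # best[i] = (score, words) for the prefix tag[:i]
    for i in range(1, len(tag) + 1):
        cands = [(best[j][0] + logp(tag[j:i]), best[j][1] + [tag[j:i]])
                 for j in range(max(0, i - max_len), i)]
        best.append(max(cands))
    return best[-1][1]

print(segment("kampalafloodalert"))  # -> ['kampala', 'flood', 'alert']
```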

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Sophia: A Persistent Artificial Life Agent Framework

The development of LLMs has elevated AI agents from task-specific tools to long-lived decision-making entities. However, most architectures remain static and reactive, limited to manually defined scenarios. We propose a third stratum, System 3, which oversees the agent's narrative identity and long-term adaptation, culminating in Sophia, a 'Persistent Agent' wrapper that integrates a continuous self-improvement loop.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Rethinking Multi-Agent Intelligence Through the Lens of Small-World Networks

Large language models (LLMs) have enabled multi-agent systems (MAS) in which multiple agents argue, critique, and coordinate to solve complex tasks, making communication topology a fundamental design choice. In this work, we revisit classical theory on small-world (SW) networks and investigate how SW connectivity can serve as a design principle for MAS.
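
The design principle is easy to instantiate: draw the agents' communication graph from a Watts-Strogatz model, which combines high clustering with short path lengths. A sketch using networkx, with illustrative parameters:

```python
import networkx as nx

n_agents, k_neighbors, rewire_p = 12, 4, 0.1
topology = nx.connected_watts_strogatz_graph(n_agents, k_neighbors, rewire_p)

def visible_messages(agent_id: int) -> list:
    # each agent only reads messages from its graph neighbors per round
    return list(topology.neighbors(agent_id))

print("avg shortest path:", nx.average_shortest_path_length(topology))
print("avg clustering:", nx.average_clustering(topology))
```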

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts

arXiv:2512.18608v1 Announce Type: new Abstract: Automated masking of Personally Identifiable Information (PII) is critical for privacy-preserving conversational systems. While current frontier large language models demonstrate strong PII masking capabilities, concerns about data handling and computational costs motivate exploration of whether lightweight models can achieve comparable performance. We compare encoder-decoder and decoder-only architectures by fine-tuning T5-small and Mistral-Instruct-v0.3 on English datasets constructed from the AI4Privacy benchmark. We create different dataset variants to study label standardization and PII representation, covering 24 standardized PII categories and higher-granularity settings. Evaluation using entity-level and character-level metrics, type accuracy, and exact match shows that both lightweight models achieve performance comparable to frontier LLMs for PII masking tasks. Label normalization consistently improves performance across architectures. Mistral achieves higher F1 and recall with greater robustness across PII types but incurs significantly higher generation latency. T5, while less robust in conversational text, offers more controllable structured outputs and lower inference cost, motivating its use in a real-time Discord bot for real-world PII redaction. Evaluation on live messages reveals performance degradation under informal inputs. These results clarify trade-offs between accuracy, robustness, and computational efficiency, demonstrating that lightweight models can provide effective PII masking while addressing data handling concerns associated with frontier LLMs.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

ASTIF: Adaptive Semantic-Temporal Integration for Cryptocurrency Price Forecasting

Financial time-series forecasting is an information-fusion challenge, yet most existing models rely on static architectures that struggle to integrate heterogeneous knowledge sources. We propose ASTIF, an intelligent hybrid system that adapts its forecasting strategy in real time through confidence-based meta-learning, integrating complementary components to improve forecast accuracy.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

NEURO-GUARD: Neuro-Symbolic Generalization and Unbiased Adaptive Routing for Diagnostics -- Explainable Medical AI

Accurate, interpretable image-based diagnosis remains a central challenge in medical AI, especially in data-limited settings and critical clinical decisions. We present NEURO-GUARD, a novel knowledge-guided framework that integrates Vision Transformers (ViTs) with language-driven reasoning, improving performance and robustness across domains.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

ESearch-R1: Learning Cost-Aware MLLM Agents for Interactive Embodied Search via Reinforcement Learning

Multimodal Large Language Models (MLLMs) have empowered embodied agents with remarkable planning and reasoning abilities. However, when facing ambiguous natural-language instructions, current agents often fail to balance the high cost of physical exploration against the cognitive cost of human interaction. To close this gap, we propose ESearch-R1, a cost-aware embodied reasoning framework.

Fonte: arXiv cs.AI

NLP/LLMs • Score 93

External Hippocampus: Topological Cognitive Maps for Guiding Large Language Model Reasoning

This paper proposes the External Hippocampus framework, which models language-model reasoning from a cognitive-dynamics perspective as the flow of information energy through semantic space. The framework builds topological cognitive maps via dimensionality-reduction projection, enabling precise navigation and intervention in the energy flow at test time without substantial computational requirements.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

FC-MIR: A Mobile Screen Awareness Framework for Intent-Aware Recommendation Based on Frame-Compressed Multimodal Trajectory Reasoning

Identifying user intent from mobile-interface operation trajectories is crucial for advancing UI understanding and enabling task-automation agents. We propose the FC-MIR framework, which uses keyframe sampling and adaptive concatenation to reduce visual redundancy and increase inference efficiency, integrating state-of-the-art MLLMs for trajectory summarization and intent prediction.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

MEEA: Mere-Exposure-Effect-Based Adversarial Optimization for Jailbreaking LLMs

The rapid advancement of large language models (LLMs) has intensified concerns about the robustness of their safety alignment. We propose MEEA (Mere Exposure Effect Attack), a psychology-inspired automated framework for evaluating safety robustness in multi-turn interactions using the mere-exposure effect. Our experiments show that MEEA consistently achieves higher attack success rates on models such as GPT-4 and Claude-3.5.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Large Language Models as Discounted Bayesian Filters

Large Language Models (LLMs) demonstrate strong few-shot generalization through in-context learning, but their reasoning in dynamic, stochastic environments remains opaque. We introduce a Bayesian filtering framework for evaluating online inference in LLMs, revealing how belief updates behave like exponential-forgetting filters.
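
The "exponential forgetting" reading has a compact form for a Bernoulli belief: temper the Beta pseudo-counts by a factor gamma before each observation (gamma = 1 recovers exact Bayesian filtering). This is the generic discounted filter, not the paper's fitted model.

```python
def discounted_update(alpha: float, beta: float, obs: int, gamma: float = 0.9):
    # decay the prior's pseudo-counts, then condition on the new observation
    alpha, beta = gamma * alpha, gamma * beta
    return alpha + obs, beta + (1 - obs)

alpha, beta = 1.0, 1.0             # uniform prior over a coin's bias
for obs in [1, 1, 0, 1, 0, 0, 0]:  # a drifting binary stream
    alpha, beta = discounted_update(alpha, beta, obs)
    print(f"P(heads) = {alpha / (alpha + beta):.2f}")
```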

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Observer, Not Player: Simulating Theory of Mind in LLMs Through Game Observation

We present an interactive framework for evaluating whether large language models (LLMs) exhibit genuine 'understanding' in a simple yet strategic environment. Using the game Rock-Paper-Scissors (RPS) as an example, our system positions the LLM as an Observer whose task is to identify the strategies in play and articulate the reasoning behind that judgment.

Fonte: arXiv cs.AI

NLP/LLMs • Score 89

Neologism Learning as a Parameter-Efficient Alternative to Fine-Tuning for Model Steering

arXiv:2512.18551v1 Announce Type: new Abstract: In language modeling, neologisms are new tokens trained to represent a concept not already included in a given model's vocabulary. Neologisms can be used to encourage specific behavior in models, for example by appending prompts with "Give me a neologism answer." Behavioral steering can also be achieved through fine-tuning, albeit with more compute and less flexibility: learning a neologism only trains d parameters and allows the user to still access the model's default behavior. We compare the performance of neologism learning against low-rank adaptation (LoRA) fine-tuning, finding that neologisms outperform fine-tuned models under a matched training setup (same data and hyperparameters). We also investigate self-verbalizations of neologisms, and observe that the model will occasionally make up its own new words when asked about a neologism.
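
The "d parameters" claim is concrete in code: append one embedding row for the new token and let gradients touch only that row. A self-contained PyTorch sketch:

```python
import torch
import torch.nn as nn

vocab, d = 100, 32
emb = nn.Embedding(vocab + 1, d)   # one extra row for the neologism
NEW_ID = vocab

mask = torch.zeros_like(emb.weight)
mask[NEW_ID] = 1.0
emb.weight.register_hook(lambda g: g * mask)  # zero grads for old tokens

opt = torch.optim.Adam(emb.parameters(), lr=1e-2)
target = torch.randn(d)            # stand-in for a real training signal
for _ in range(100):
    loss = ((emb(torch.tensor([NEW_ID])) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("rows that moved:", (emb.weight.grad.abs().sum(1) > 0).sum().item())
```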

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Programmatic Rule Generation for Document Forgery Detection Using Large Language Models

Document forgery poses a growing threat to legal, economic, and governmental processes, demanding ever more sophisticated verification mechanisms. This work investigates how large language models (LLMs) can be adapted to generate rule-based plausibility checks for forgery detection using limited hardware resources.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction

Large language models (LLMs) frequently generate hallucinated content lacking factual or contextual grounding, limiting their reliability in critical applications. LLM-CAS frames real-time hallucination correction as a hierarchical reinforcement learning problem, enabling adaptive corrections without permanent parameter modification.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

DramaBench: A Six-Dimensional Evaluation Framework for Dramatic Script Continuation

Dramatic script continuation requires models to maintain character consistency, advance the plot coherently, and preserve dramatic structure, capabilities that existing benchmarks do not evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating dramatic script continuation across six independent dimensions.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

DACE For Railway Acronym Disambiguation

arXiv:2512.18357v1 Announce Type: new Abstract: Acronym Disambiguation (AD) is a fundamental challenge in technical text processing, particularly in specialized sectors where high ambiguity complicates automated analysis. This paper addresses AD within the context of the TextMine'26 competition on French railway documentation. We present DACE (Dynamic Prompting, Retrieval Augmented Generation, Contextual Selection, and Ensemble Aggregation), a framework that enhances Large Language Models through adaptive in-context learning and external domain knowledge injection. By dynamically tailoring prompts to acronym ambiguity and aggregating ensemble predictions, DACE mitigates hallucination and effectively handles low-resource scenarios. Our approach secured the top rank in the competition with an F1 score of 0.9069.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models

arXiv:2512.19004v1 Announce Type: new Abstract: Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization. We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.
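
A toy sketch of the confidence-based remasking idea (the threshold, shapes, and confidence source are assumptions; the paper's mechanism operates inside the diffusion decoder): prior tokens the auxiliary model is unsure about are returned to the mask state rather than committed to.

```python
# Toy confidence-based remasking for a warm-started diffusion LM:
# low-confidence injected prior tokens go back to [MASK] ("prior skepticism").
import numpy as np

MASK = -1
rng = np.random.default_rng(0)

prior_tokens = rng.integers(0, 50_000, size=16)   # draft from a small auxiliary model
prior_conf = rng.uniform(size=16)                 # per-token prior confidence (assumed)

def remask(tokens, conf, threshold=0.5):
    out = tokens.copy()
    out[conf < threshold] = MASK                  # distrust low-confidence priors
    return out

x0 = remask(prior_tokens, prior_conf)
print(f"kept {np.sum(x0 != MASK)}/16 prior tokens; the rest start as [MASK]")
```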

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Convolutional Neural Operator-Based Transfer Learning for Solving PDEs

The convolutional neural operator is a recently proposed CNN-based architecture that guarantees structure-preserving continuous-discrete equivalence and enables genuine, alias-free learning of PDE solution operators. This neural operator has been shown to outperform reference models such as DeepONet and the Fourier neural operator in surrogate accuracy in certain cases.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Benchmarking Neural Surrogates on Realistic Spatio-Temporal Multiphysics Flows

Predicting multiphysics dynamics is computationally expensive and challenging due to the severe coupling of heterogeneous, multiscale physical processes. We present REALM (REalistic AI Learning for Multiphysics), a rigorous benchmarking framework for testing neural surrogates on challenging reactive flows, with 11 high-fidelity datasets and a standardized training and evaluation protocol.

Fonte: arXiv cs.LG

NLP/LLMs • Score 92

Graph-O1 : Monte Carlo Tree Search with Reinforcement Learning for Text-Attributed Graph Reasoning

arXiv:2512.17912v1 Announce Type: new Abstract: Text-attributed graphs, where nodes and edges contain rich textual information, are widely used across diverse domains. A central challenge in this setting is question answering, which requires jointly leveraging unstructured text and the structured relational signals within the graph. Although Large Language Models (LLMs) have made significant advances in natural language understanding, their direct use for reasoning over text-attributed graphs remains limited. Retrieval-augmented generation methods that operate purely on text often treat passages as isolated units, ignoring the interconnected structure of the graph. Conversely, graph-based RAG methods that serialize large subgraphs into long textual sequences quickly become infeasible due to LLM context-length constraints, resulting in fragmented reasoning and degraded accuracy. To overcome these limitations, we introduce Graph-O1, an agentic GraphRAG framework that enables LLMs to conduct stepwise, interactive reasoning over graphs. Our approach integrates Monte Carlo Tree Search (MCTS) with end-to-end reinforcement learning, allowing the model to selectively explore and retrieve only the most informative subgraph components. The reasoning procedure is framed as a multi-turn interaction between the agent and the graph environment, and the agent is trained through a unified reward mechanism. Extensive experiments across multiple LLM backbones demonstrate that Graph-O1 consistently surpasses state-of-the-art baselines, producing answers that are more accurate, reliable, and interpretable.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression

arXiv:2512.17920v1 Announce Type: new Abstract: Large language models (LLMs) exhibit degraded performance under prompt compression, but the mechanisms remain poorly understood. We introduce the Compression-Decay Comprehension Test (CDCT), a benchmark that independently measures constraint compliance (CC) and semantic accuracy (SA) across compression levels. We evaluate 9 frontier LLMs across 8 concepts using 5 compression levels from extreme (c=0.0, ~2 words) to none (c=1.0, ~135 words). A three-judge LLM jury achieves almost perfect inter-rater agreement on CC (Fleiss' κ = 0.90). We observe a universal U-curve pattern in constraint compliance (97.2% prevalence), with violations peaking at medium compression (c=0.5, ~27 words). Counterintuitively, models perform better at extreme compression than medium lengths. The dimensions are statistically orthogonal (r=0.193, p=0.084), with constraint effects 2.9x larger than semantic effects. Experimental validation via RLHF ablation confirms our constraint salience hypothesis: removing "helpfulness" signals improves CC by 598% on average (71/72 trials, p<0.001), with 79% achieving perfect compliance. This demonstrates that RLHF-trained helpfulness behaviors are the dominant cause of constraint violations at medium compression. Reasoning models outperform efficient models by 27.5% (Cohen's d=0.96). Our findings reveal a fundamental tension between RLHF alignment and instruction-following, providing actionable guidelines for improving deployed systems.

Fonte: arXiv cs.CL

NLP/LLMs • Score 93

CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs

Weight-only quantization is widely used to mitigate the memory-bound nature of LLM inference. Codebook-based methods achieve high accuracy in extremely low-bit regimes (e.g., 2 bits). We present CodeGEMM, a codebook-centric GEMM kernel that replaces dequantization with precomputed inner products, improving computational efficiency and memory-subsystem utilization.
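
A minimal NumPy sketch of the codebook-centric idea, under assumed shapes (a shared codebook of K centroid sub-vectors, weights stored as 2-bit indices; the real kernel differs): the activation/centroid inner products are precomputed once per input block, so the matrix multiply reduces to table lookups and sums instead of weight dequantization.

```python
import numpy as np

# Assumed toy shapes: 2-bit codebook (K=4 centroids) over sub-vectors of length g.
rng = np.random.default_rng(0)
d_in, d_out, g, K = 64, 32, 8, 4
codebook = rng.normal(size=(K, g))                   # shared centroid sub-vectors
codes = rng.integers(0, K, size=(d_out, d_in // g))  # quantized weight indices

x = rng.normal(size=d_in)

# Dequantization-based reference: materialize W, then compute W @ x.
W = codebook[codes].reshape(d_out, d_in)
ref = W @ x

# Codebook-centric path: precompute <centroid, x_block> once per block,
# then the GEMM reduces to table lookups and sums (no dequantized weights).
x_blocks = x.reshape(-1, g)                          # (d_in/g, g)
lut = x_blocks @ codebook.T                          # (d_in/g, K) inner products
out = lut[np.arange(codes.shape[1]), codes].sum(axis=1)

assert np.allclose(ref, out)
```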

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset

arXiv:2512.18533v1 Announce Type: new Abstract: The proliferation of linguistically subtle political disinformation poses a significant challenge to automated fact-checking systems. Despite increasing emphasis on complex neural architectures, the empirical limits of text-only linguistic modeling remain underexplored. We present a systematic diagnostic evaluation of nine machine learning algorithms on the LIAR benchmark. By isolating lexical features (Bag-of-Words, TF-IDF) and semantic embeddings (GloVe), we uncover a hard "Performance Ceiling", with fine-grained classification not exceeding a Weighted F1-score of 0.32 across models. Crucially, a simple linear SVM (Accuracy: 0.624) matches the performance of pre-trained Transformers such as RoBERTa (Accuracy: 0.620), suggesting that model capacity is not the primary bottleneck. We further diagnose a massive "Generalization Gap" in tree-based ensembles, which achieve more than 99% training accuracy but collapse to approximately 25% on test data, indicating reliance on lexical memorization rather than semantic inference. Synthetic data augmentation via SMOTE yields no meaningful gains, confirming that the limitation is semantic (feature ambiguity) rather than distributional. These findings indicate that for political fact-checking, increasing model complexity without incorporating external knowledge yields diminishing returns.
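
For reference, the kind of text-only lexical baseline the study describes (TF-IDF features plus a linear SVM) fits in a few lines of scikit-learn; the toy texts and binary labels below are placeholders, not the LIAR data or the paper's exact preprocessing.

```python
# Sketch of a TF-IDF + linear SVM fake-news baseline (placeholder data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

texts = ["the senator voted against the bill", "crime rates tripled last year"] * 50
labels = ["true", "false"] * 50   # LIAR actually has six fine-grained classes

X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, random_state=0)
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(X_tr, y_tr)
print("weighted F1:", f1_score(y_te, clf.predict(X_te), average="weighted"))
```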

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Feature-Enhanced Graph Neural Networks for Classifying Synthetic Graph Generative Models: A Benchmarking Study

The ability to discriminate between graph generative models is fundamental to understanding complex structural patterns in synthetic graphs and in the real-world structures they emulate. This work investigates the classification of synthetic graph families using a hybrid approach that combines Graph Neural Networks (GNNs) with graph-theoretic features.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

The Condition Number as a Scale-Invariant Proxy for Information Encoding in Neural Units

This paper explores the relationship between the condition number of a neural network's weight tensor and the extent of the information encoded by the associated processing unit, from an information-theoretic perspective. It argues that a high condition number can indicate that the unit has learned to selectively amplify and compress information.
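
A worked example of the quantity in question: the condition number κ(W) = σ_max / σ_min of a weight matrix, computed from its singular values, which is unchanged when W is rescaled.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))

s = np.linalg.svd(W, compute_uv=False)   # singular values, descending
kappa = s[0] / s[-1]
print(f"cond(W)  = {kappa:.2f}")
print(f"cond(3W) = {np.linalg.cond(3 * W):.2f}")   # scale-invariant: same value
```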

Fonte: arXiv stat.ML

NLP/LLMs • Score 96

DeliveryBench: Can Agents Turn a Profit in the Real World?

arXiv:2512.19234v1 Announce Type: new. Abstract: LLMs and VLMs are increasingly used as embodied agents, but existing benchmarks focus on simple short-horizon tasks and struggle to capture the rich, realistic constraints that shape real-world decision-making. To close this gap, we propose DeliveryBench, a city-scale embodied benchmark based on the real-world food-delivery profession.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Teaching and Critiquing Conceptualization and Operationalization in NLP

NLP researchers often invoke abstract concepts such as 'interpretability', 'bias', 'reasoning', and 'stereotypes' without defining them. This paper describes a seminar created for students to explore questions of conceptualization and operationalization, built around an interdisciplinary reading list and an emphasis on discussion and critique.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

On the Universality of Transformer Architectures: How Much Attention Is Enough?

Transformers are crucial in many AI fields, such as large language models, computer vision, and reinforcement learning. This work examines universality in Transformers, reviews recent progress, and identifies key directions for future theoretical research.

Fonte: arXiv cs.LG

NLP/LLMs • Score 93

Secret Mixtures of Experts Inside Your LLM

arXiv:2512.18452v1 Announce Type: new. This paper investigates the MLP layers of dense LLMs, proposing that these layers secretly perform sparse computation and are well approximated by sparsely activated Mixture-of-Experts (MoE) layers. We empirically validate this hypothesis on pre-trained LLMs, showing that the activation distribution is crucial to the results.
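
A toy illustration of the hypothesis (the grouping, gating rule, and sizes are all assumptions, not the paper's construction): treat contiguous blocks of a dense MLP's hidden neurons as "experts", keep only the most activated blocks, and compare against the full dense output.

```python
import numpy as np

# Assumed toy setup: GELU MLP, neurons grouped into E contiguous "experts",
# top-k experts kept per input.
rng = np.random.default_rng(0)
d, h, E, k = 32, 256, 8, 2
W1 = rng.normal(size=(h, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, h)) / np.sqrt(h)
gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

x = rng.normal(size=d)
a = gelu(W1 @ x)                        # hidden activations of the dense MLP
dense_out = W2 @ a

# Group neurons into experts and keep only the k most activated groups.
groups = a.reshape(E, h // E)
scores = np.linalg.norm(groups, axis=1)
mask = np.zeros(E)
mask[np.argsort(scores)[-k:]] = 1.0
sparse_out = W2 @ (groups * mask[:, None]).reshape(h)

err = np.linalg.norm(dense_out - sparse_out) / np.linalg.norm(dense_out)
print(f"relative error with {k}/{E} experts: {err:.3f}")
```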

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

HARBOR: A Holistic, Adaptive Risk Assessment Model for Behavioral Healthcare

Risk assessment in behavioral healthcare remains challenging due to the multimodal nature of patient data and the temporal dynamics of mood and affective disorders. In this work we present HARBOR, a behavioral-health-aware language model designed to predict a discrete mood-and-risk score, termed the Harbor Risk Score (HRS).

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

MemEvolve: Meta-Evolution of Agent Memory Systems

arXiv:2512.18746v1 Announce Type: new Abstract: Self-evolving memory systems are unprecedentedly reshaping the evolutionary paradigm of large language model (LLM)-based agents. Prior work has predominantly relied on manually engineered memory architectures to store trajectories, distill experience, and synthesize reusable tools, enabling agents to evolve on the fly within environment interactions. However, this paradigm is fundamentally constrained by the staticity of the memory system itself: while memory facilitates agent-level evolving, the underlying memory architecture cannot be meta-adapted to diverse task contexts. To address this gap, we propose MemEvolve, a meta-evolutionary framework that jointly evolves agents' experiential knowledge and their memory architecture, allowing agent systems not only to accumulate experience but also to progressively refine how they learn from it. To ground MemEvolve in prior research and foster openness in future self-evolving systems, we introduce EvolveLab, a unified self-evolving memory codebase that distills twelve representative memory systems into a modular design space (encode, store, retrieve, manage), providing both a standardized implementation substrate and a fair experimental arena. Extensive evaluations on four challenging agentic benchmarks demonstrate that MemEvolve achieves (I) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to 17.06%; and (II) strong cross-task and cross-LLM generalization, designing memory architectures that transfer effectively across diverse benchmarks and backbone models.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations

arXiv:2512.18906v1 Announce Type: new Abstract: Over the years, automatic MT metrics have hillclimbed benchmarks and presented strong and sometimes human-level agreement with human ratings. Yet they remain black-box, offering little insight into their decision-making and often failing under real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R's evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R's reasoning captures translation-relevant information and is practically useful.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

A Large Language Model Based Method for Complex Logical Reasoning over Knowledge Graphs

arXiv:2512.19092v1 Announce Type: new Abstract: Reasoning over knowledge graphs (KGs) with first-order logic (FOL) queries is challenging due to the inherent incompleteness of real-world KGs and the compositional complexity of logical query structures. Most existing methods rely on embedding entities and relations into continuous geometric spaces and answer queries via differentiable set operations. While effective for simple query patterns, these approaches often struggle to generalize to complex queries involving multiple operators, deeper reasoning chains, or heterogeneous KG schemas. We propose ROG (Reasoning Over knowledge Graphs with large language models), an ensemble-style framework that combines query-aware KG neighborhood retrieval with large language model (LLM)-based chain-of-thought reasoning. ROG decomposes complex FOL queries into sequences of simpler sub-queries, retrieves compact, query-relevant subgraphs as contextual evidence, and performs step-by-step logical inference using an LLM, avoiding the need for task-specific embedding optimization. Experiments on standard KG reasoning benchmarks demonstrate that ROG consistently outperforms strong embedding-based baselines in terms of mean reciprocal rank (MRR), with particularly notable gains on high-complexity query types. These results suggest that integrating structured KG retrieval with LLM-driven logical reasoning offers a robust and effective alternative for complex KG reasoning tasks.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?

arXiv:2512.19117v1 Announce Type: new Abstract: This paper proposes an epistemological shift in the analysis of large generative models, replacing the category ''Large Language Models'' (LLM) with that of ''Large Discourse Models'' (LDM), and then with that of Artificial Discursive Agent (ADA). The theoretical framework is based on an ontological triad distinguishing three regulatory instances: the apprehension of the phenomenal regularities of the referential world, the structuring of embodied cognition, and the structural-linguistic sedimentation of the utterance within a socio-historical context. LDMs, operating on the product of these three instances (the document), model the discursive projection of a portion of human experience reified by the learning corpus. The proposed program aims to replace the ''fascination/fear'' dichotomy with public trials and procedures that make the place, uses, and limits of artificial discursive agents in contemporary social space decipherable, situating this approach within a perspective of governance and co-regulation involving the State, industry, civil society, and academia.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

InstructNet: A Novel Approach to Multi-Label Instruction Classification through Advanced Deep Learning

People turn to search engines for a wide range of topics and items, from everyday essentials to more specialized objects. This study uses 'How To' articles to determine multi-label instruction categories, employing transformer-based deep neural architectures such as XLNet and BERT and achieving 97.30% accuracy with the XLNet architecture.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

MoE-TransMov: A Transformer-Based Model for Next Point-of-Interest (POI) Prediction in Familiar and Unfamiliar Movements

Accurately predicting the next point of interest (POI) in human mobility trajectories is crucial for location-based services, enabling more timely and personalized recommendations. We propose MoE-TransMov, a Transformer-based model with a Mixture-of-Experts (MoE) architecture that captures distinct mobility patterns across different movement contexts, improving prediction accuracy.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

TraCeR: Transformer-Based Competing-Risk Analysis with Longitudinal Covariates

Survival analysis is a critical tool for modeling time-to-event data. Recent deep learning models have relaxed several modeling assumptions, but incorporating longitudinal covariates remains challenging. We present TraCeR, a transformer-based survival analysis framework that handles longitudinal covariates and improves model calibration.

Fonte: arXiv cs.LG

NLP/LLMs • Score 92

MDToC: Metacognitive Dynamic Tree of Concepts for Boosting Mathematical Problem-Solving in Large Language Models

Despite advances in mathematical reasoning capabilities, Large Language Models (LLMs) still struggle to verify calculations when using established prompting techniques. We present MDToC, a three-phase approach that builds a tree of concepts, develops accuracy-verified calculations for each concept, and uses majority voting to adjudicate among competing solutions.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Toward Scalable and Valid Conditional Independence Testing with Spectral Representations

Conditional independence (CI) is central to causal inference, feature selection, and graphical modeling, yet it is often impossible to test without additional assumptions. Existing CI tests rely on restrictive structural conditions, limiting their validity on real-world data. This work explores whether representation learning can help overcome these limitations.

Fonte: arXiv stat.ML

NLP/LLMs • Score 92

Toward Human-Centered AI-Assisted Terminology Work

arXiv:2512.18859v1 Announce Type: new Abstract: The rapid diffusion of generative artificial intelligence is transforming terminology work. While this technology promises gains in efficiency, its unstructured adoption risks weakening professional autonomy, amplifying bias, and eroding linguistic and conceptual diversity. This paper argues that a human-centered approach to artificial intelligence has become a necessity for terminology work. Building on research in artificial intelligence and translation studies, it proposes a human-centered framework that conceptualizes artificial intelligence as a means of amplifying the terminologist's capabilities, rather than replacing them. The framework is organized around three interrelated dimensions: the augmented terminologist, ethical AI, and human-centered design. Together, these dimensions emphasize the compatibility of high automation with strong human control, the central role of terminologists in bias mitigation, and the importance of designing AI tools and workflows around the needs, values, and well-being of the terminologist. The paper concludes by stressing that current choices in AI adoption will shape not only terminological practice, but also the preservation of accuracy, adequacy, and diversity in terminology and specialized knowledge.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

LLMs on Drugs: Language Models Are Few-Shot Consumers

arXiv:2512.18546v1 Announce Type: new Abstract: Large language models (LLMs) are sensitive to the personas imposed on them at inference time, yet prompt-level "drug" interventions have never been benchmarked rigorously. We present the first controlled study of psychoactive framings on GPT-5-mini using ARC-Challenge. Four single-sentence prompts -- LSD, cocaine, alcohol, and cannabis -- are compared against a sober control across 100 validation items per condition, with deterministic decoding, full logging, Wilson confidence intervals, and Fisher exact tests. Control accuracy is 0.45; alcohol collapses to 0.10 (p = 3.2e-8), cocaine to 0.21 (p = 4.9e-4), LSD to 0.19 (p = 1.3e-4), and cannabis to 0.30 (p = 0.041), largely because persona prompts disrupt the mandated "Answer: " template. Persona text therefore behaves like a "few-shot consumable" that can destroy reliability without touching model weights. All experimental code, raw results, and analysis scripts are available at https://github.com/lexdoudkin/llms-on-drugs.
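
The Wilson interval used for the per-condition accuracies is straightforward to reproduce; a sketch of the standard formula (not taken from the paper's repository):

```python
# Wilson 95% confidence interval for a binomial accuracy estimate.
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_interval(45, 100))   # control: 0.45 accuracy over 100 items
print(wilson_interval(10, 100))   # "alcohol" condition: 0.10
```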

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications

arXiv:2505.14918v2 Announce Type: replace-cross Abstract: This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.

Fonte: arXiv stat.ML

NLP/LLMs • Score 89

KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction

arXiv:2512.17917v1 Announce Type: new Abstract: As the context length of current large language models (LLMs) rapidly increases, the memory demand for the Key-Value (KV) cache is becoming a bottleneck for LLM deployment and batch processing. Traditional KV cache compression methods typically involve permanently evicting or irreversibly merging "less important" tokens with low attention scores. This approach results in the unrecoverable loss of token information, which we call Contextual Amnesia, significantly degrading the model's information retrieval capability. To address this issue, we propose KVReviver, a reversible KV cache compression method based on the sketch algorithm. This method allows reconstructing compressed tokens from an additional data structure, thus enabling full-scale computation within limited memory. Experiments showed that in 2k-length contexts, it requires only 10% of the KV cache budget while maintaining identical end-to-end inference accuracy. For 32k-length contexts, it achieves equivalent or comparable accuracy (~2% accuracy loss) using merely 25% of the KV cache budget.
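
A toy count-sketch illustrating the reconstruction idea (the hashing scheme, width, and depth are assumptions; KVReviver's actual structure is more elaborate): evicted KV vectors are folded into a small signed table and later "revived" with bounded collision noise.

```python
import numpy as np

# Toy count-sketch store/reconstruct for evicted KV vectors (illustrative only).
rng = np.random.default_rng(0)
d, n_tokens, width, depth = 64, 200, 512, 4

sketch = np.zeros((depth, width, d))
hashes = rng.integers(0, width, size=(depth, n_tokens))
signs = rng.choice([-1.0, 1.0], size=(depth, n_tokens))

kv = rng.normal(size=(n_tokens, d))
for r in range(depth):
    for t in range(n_tokens):            # insert every evicted token's vector
        sketch[r, hashes[r, t]] += signs[r, t] * kv[t]

def revive(t: int) -> np.ndarray:
    # Median across rows suppresses collision noise from other tokens.
    est = [signs[r, t] * sketch[r, hashes[r, t]] for r in range(depth)]
    return np.median(est, axis=0)

err = np.linalg.norm(revive(7) - kv[7]) / np.linalg.norm(kv[7])
print(f"relative reconstruction error: {err:.2f}")
```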

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Propose, Solve, Verify: Self-Play Through Formal Verification

Training models purely through self-play (without human data) has been a long-standing goal in AI, but its effectiveness for training large language models remains uncertain, especially for code generation. We study self-play in the context of verified code generation, where formal verification provides reliable correctness signals.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

MSC-180: A Benchmark for Automated Formal Theorem Proving Drawn from the Mathematics Subject Classification

Automated Theorem Proving (ATP) is a central research direction in artificial intelligence for achieving formal reasoning and verification. We propose MSC-180, a benchmark based on the MSC2020 mathematics subject classification comprising 180 formal verification problems spanning undergraduate and graduate levels, intended to evaluate and drive the development of AI systems with genuine mathematical reasoning abilities.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Intelligent Human-Machine Partnership for Manufacturing: Enhancing Warehouse Planning through Simulation-Based Knowledge Graphs and LLM Collaboration

Manufacturing planners face complex operational challenges that demand collaboration between human expertise and intelligent systems. Our framework integrates Knowledge Graphs with Large Language Model (LLM)-based agents to empower manufacturing professionals, enabling natural-language interaction with operational data and improving analysis and decision-making.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Vox Deorum: A Hybrid LLM Architecture for 4X / Grand Strategy Game AI -- Lessons from Civilization V

The ability of Large Language Models to reason in natural language makes them promising for 4X and grand strategy games, enabling more natural human-AI interaction. However, the complexity of these games and factors such as latency and cost can hinder real-world LLM deployment. We present Vox Deorum, a hybrid LLM+X architecture validated through 2,327 complete games.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Reflective Confidence: Correcting Reasoning Failures via Online Self-Correction

Large language models (LLMs) have shown strong performance on complex reasoning tasks using techniques such as chain-of-thought and self-consistency. However, ensemble-based approaches, especially self-consistency, often incur substantial computational overhead. We propose reflective confidence, a new reasoning framework that turns low-confidence signals into reflection triggers.
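
A minimal sketch of the control flow implied above, with a hypothetical `generate` function standing in for an LLM call that returns text plus a mean token log-probability (the threshold and prompts are assumptions): a single reflection pass fires only when confidence is low, instead of always sampling a self-consistency ensemble.

```python
import math

def generate(prompt: str) -> tuple[str, float]:
    # Stand-in for an LLM call; returns (answer, mean token logprob).
    return "x = 4", math.log(0.42)

def answer_with_reflection(question: str, threshold: float = 0.6) -> str:
    answer, mean_logprob = generate(question)
    confidence = math.exp(mean_logprob)
    if confidence >= threshold:
        return answer                     # confident: accept the first pass
    # Low confidence triggers one reflection pass instead of an ensemble.
    critique_prompt = (f"Question: {question}\nDraft answer: {answer}\n"
                       "Check each step, fix any error, and answer again.")
    revised, _ = generate(critique_prompt)
    return revised

print(answer_with_reflection("Solve 2x + 1 = 9 for x."))
```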

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

IntelliCode: A Multi-Agent LLM Tutoring System with Centralized Learner Modeling

LLM-based tutors are typically single-turn assistants that lack persistent representations of learner knowledge, making long-term pedagogical support difficult. We present IntelliCode, a multi-agent LLM tutoring system that integrates mastery estimates, misconceptions, review schedules, and engagement signals into a centralized, versioned learner state.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Population-Evolve: A Parallel, Evolutionary Sampling Method for Mathematical Reasoning in LLMs

Test-time scaling has emerged in recent years as a promising direction for enhancing the reasoning capabilities of Large Language Models. In this work we propose Population-Evolve, a training-free method inspired by Genetic Algorithms that optimizes LLM reasoning by maintaining a dynamic population of candidate solutions.
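
A toy skeleton of such a GA-style loop (the `propose`, `mutate`, and `score` oracles are stand-ins for LLM sampling, revision prompts, and a verifier; none of this is the paper's code):

```python
import random

def propose(problem: str) -> str:
    return f"candidate-{random.randint(0, 999)}"   # stand-in for LLM sampling

def mutate(solution: str, problem: str) -> str:
    return solution + "'"                          # stand-in for a revision prompt

def score(solution: str, problem: str) -> float:
    return random.random()                         # stand-in for a verifier/reward

def population_evolve(problem: str, pop_size=8, generations=5) -> str:
    population = [propose(problem) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda s: score(s, problem), reverse=True)
        survivors = ranked[: pop_size // 2]                 # selection
        children = [mutate(random.choice(survivors), problem)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children                   # next generation
    return max(population, key=lambda s: score(s, problem))

print(population_evolve("Prove that 2 + 2 = 4."))
```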

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Automatic Adaptation to Concept Complexity and Subjective Natural Concepts: A Chunking-Based Cognitive Model

A central problem in cognitive science concerns the fundamental psychological processes underlying the formation and retrieval of multiple types of concepts in short- and long-term memory (STM and LTM, respectively). We propose that chunking mechanisms play an essential role and show how the CogAct computational model grounds concept learning in fundamental cognitive processes and structures.

Fonte: arXiv cs.AI

NLP/LLMs • Score 93

Can Abstract Concepts from LLMs Improve SLM Performance?

Large language models (LLMs) excel at diverse tasks, but deploying them on resource-constrained devices remains challenging. We investigate the transferability of high-level concepts extracted from larger models to small language models (SLMs) at inference time, demonstrating performance improvements across a wide range of tasks.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

CORE: Concept-Oriented Reinforcement for Closing the Definition-Application Gap in Mathematical Reasoning

Large language models (LLMs) often solve challenging mathematical exercises yet fail to apply the relevant concept when a problem demands genuine understanding. We present CORE (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Tool-Augmented Hybrid Reasoning with Distillation for Bilingual Mathematical Problem Solving

Bilingual mathematical problem solving requires a clear link between linguistic reasoning and symbolic computation. This paper presents HERALD (Hybrid Ensemble Reasoning with Adaptive Learning and Distillation), a framework that integrates reasoning and computation using NuminaMath-7B-TIR, GPT-4o, and Mistral-7B, offering a practical solution for multilingual mathematical reasoning with improved accuracy and clarity.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Gabliteration: Adaptive Multi-Directional Neural Weight Modification for Selective Behavioral Alteration in Large Language Models

We present Gabliteration, a new neural weight modification technique that advances beyond traditional abliteration methods by implementing adaptive multi-directional projections with regularized layer selection. Our approach overcomes fundamental limitations of existing methods, preserving model quality while modifying specific behavioral patterns.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

Developers often struggle to specify correct training labels and rewards. We propose recontextualization, which reduces how often language models game training signals by performing misbehaviors that those signals mistakenly reinforce.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

CoPE: A Small Language Model for Steerable and Scalable Content Labeling

arXiv:2512.18027v1 Announce Type: new Abstract: This paper details the methodology behind CoPE, a policy-steerable small language model capable of fast and accurate content labeling. We present a novel training curriculum called Contradictory Example Training that enables the model to learn policy interpretation rather than mere policy memorization. We also present a novel method for generating content policies, called Binocular Labeling, which enables rapid construction of unambiguous training datasets. When evaluated across seven different harm areas, CoPE exhibits equal or superior accuracy to frontier models at only 1% of their size. We openly release a 9 billion parameter version of the model that can be run on a single consumer-grade GPU. Models like CoPE represent a paradigm shift for classifier systems. By turning an ML task into a policy writing task, CoPE opens up new design possibilities for the governance of online platforms.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

$\eta(3,4)$ 'Attention' in Cognitive Agents: Ontology-Free Knowledge Representations with Promise-Theoretic Semantics

The semantics and dynamics of 'attention' are closely related to promise-theoretic notions developed for autonomous agents and can readily be expressed in a promise framework. This makes it possible to bridge between vectorized Machine Learning and Knowledge Graph representations without relying implicitly on language models.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Understanding Chain-of-Thought in Large Language Models via Topological Data Analysis

With the development of large language models (LLMs) and the introduction of long chain-of-thought techniques, the reasoning ability of LLMs on complex problems has improved significantly. This work analyzes reasoning-chain quality from a structural perspective, using persistent homology from Topological Data Analysis (TDA) to map reasoning steps and extract topological features.
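
One plausible instantiation of such a pipeline, assuming the `ripser` package for persistent homology and a placeholder step-embedding function (the paper's actual features and encoder are not specified here):

```python
import numpy as np
from ripser import ripser   # assumed dependency: pip install ripser

def embed(step: str) -> np.ndarray:
    # Stand-in embedding; in practice use a sentence encoder.
    rng = np.random.default_rng(abs(hash(step)) % 2**32)
    return rng.normal(size=32)

steps = ["Let n be even.", "Then n = 2k.", "So n^2 = 4k^2.", "Hence 4 | n^2."]
X = np.stack([embed(s) for s in steps])

dgms = ripser(X, maxdim=1)["dgms"]       # H0/H1 persistence diagrams
h0_lifetimes = dgms[0][:, 1] - dgms[0][:, 0]
print("H0 lifetimes (cluster structure of the chain):", h0_lifetimes)
```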

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Vibe Reasoning: Extracting Mathematical Capabilities from Frontier AI -- A Case Study on IMO 2025 Problem 6

We present Vibe Reasoning, a human-AI collaborative paradigm for solving complex mathematical problems. Our key insight is that frontier AI models already possess the knowledge needed to solve challenging problems but do not know how, what, or when to apply it. This work demonstrates the approach on IMO 2025 Problem 6.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Q-KVComm: Efficient Multi-Agent Communication via Adaptive KV Cache Compression

Multi-agent Large Language Model (LLM) systems face a critical bottleneck: redundant transmission of contextual information between agents consumes excessive bandwidth and computational resources. We present Q-KVComm, a novel protocol that enables direct transmission of compressed key-value (KV) cache representations between LLM agents.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

SRS-Stories: Vocabulary-constrained multilingual story generation for language learning

arXiv:2512.18362v1 Announce Type: new Abstract: In this paper, we use large language models to generate personalized stories for language learners, using only the vocabulary they know. The generated texts are specifically written to teach the user new vocabulary by simply reading stories where it appears in context, while at the same time seamlessly reviewing recently learned vocabulary. The generated stories are enjoyable to read and the vocabulary reviewing/learning is optimized by a Spaced Repetition System. The experiments are conducted in three languages: English, Chinese and Polish, evaluating three story generation methods and three strategies for enforcing lexical constraints. The results show that the generated stories are more grammatical, coherent, and provide better examples of word usage than texts generated by the standard constrained beam search approach.
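
For context, here is a sketch of the SM-2-style update a Spaced Repetition System typically uses to schedule vocabulary reviews (the paper does not specify its exact SRS variant; grades run 0-5):

```python
from dataclasses import dataclass

@dataclass
class Card:
    interval: int = 1        # days until next review
    repetitions: int = 0
    easiness: float = 2.5

def sm2_update(card: Card, grade: int) -> Card:
    if grade < 3:                        # failed recall: restart the schedule
        return Card(interval=1, repetitions=0, easiness=card.easiness)
    ef = max(1.3, card.easiness + 0.1 - (5 - grade) * (0.08 + (5 - grade) * 0.02))
    reps = card.repetitions + 1
    interval = 1 if reps == 1 else 6 if reps == 2 else round(card.interval * ef)
    return Card(interval=interval, repetitions=reps, easiness=ef)

word = Card()
for grade in (5, 4, 3):                  # three successful reviews
    word = sm2_update(word, grade)
    print(word)
```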

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

On Finding Inconsistencies in Documents

arXiv:2512.18601v1 Announce Type: new Abstract: Professionals in academia, law, and finance audit their documents because inconsistencies can result in monetary, reputational, and scientific costs. Language models (LMs) have the potential to dramatically speed up this auditing process. To understand their abilities, we introduce a benchmark, FIND (Finding INconsistencies in Documents), where each example is a document with an inconsistency inserted manually by a domain expert. Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies. Surprisingly, gpt-5 also found undiscovered inconsistencies present in the original documents. For example, on 50 arXiv papers, we judged 136 out of 196 of the model's suggestions to be legitimate inconsistencies missed by the original authors. However, despite these findings, even the best models miss almost half of the inconsistencies in FIND, demonstrating that inconsistency detection is still a challenging task.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Research on a hybrid LSTM-CNN-Attention model for text-based web content classification

arXiv:2512.18475v1 Announce Type: new Abstract: This study presents a hybrid deep learning architecture that integrates LSTM, CNN, and an Attention mechanism to enhance the classification of web content based on text. Pretrained GloVe embeddings are used to represent words as dense vectors that preserve semantic similarity. The CNN layer extracts local n-gram patterns and lexical features, while the LSTM layer models long-range dependencies and sequential structure. The integrated Attention mechanism enables the model to focus selectively on the most informative parts of the input sequence. A 5-fold cross-validation setup was used to assess the robustness and generalizability of the proposed solution. Experimental results show that the hybrid LSTM-CNN-Attention model achieved outstanding performance, with an accuracy of 0.98, precision of 0.94, recall of 0.92, and F1-score of 0.93. These results surpass the performance of baseline models based solely on CNNs, LSTMs, or transformer-based classifiers such as BERT. The combination of neural network components enabled the model to effectively capture both fine-grained text structures and broader semantic context. Furthermore, the use of GloVe embeddings provided an efficient and effective representation of textual data, making the model suitable for integration into systems with real-time or near-real-time requirements. The proposed hybrid architecture demonstrates high effectiveness in text-based web content classification, particularly in tasks requiring both syntactic feature extraction and semantic interpretation. By combining presented mechanisms, the model addresses the limitations of individual architectures and achieves improved generalization. These findings support the broader use of hybrid deep learning approaches in NLP applications, especially where complex, unstructured textual data must be processed and classified with high reliability.
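
A minimal PyTorch rendering of the described architecture (layer sizes, the kernel width, and the bidirectional LSTM are assumptions; the paper's exact configuration and GloVe initialization are not reproduced here):

```python
import torch
import torch.nn as nn

class HybridClassifier(nn.Module):
    def __init__(self, vocab=20_000, emb=100, conv=128, hidden=128, classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)          # init from GloVe in practice
        self.conv = nn.Conv1d(emb, conv, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, classes)

    def forward(self, tokens):                         # tokens: (B, T)
        x = self.embed(tokens)                         # (B, T, emb)
        x = torch.relu(self.conv(x.transpose(1, 2)))   # local n-gram features
        h, _ = self.lstm(x.transpose(1, 2))            # long-range dependencies
        w = torch.softmax(self.attn(h), dim=1)         # attention over timesteps
        ctx = (w * h).sum(dim=1)                       # weighted sentence vector
        return self.out(ctx)

logits = HybridClassifier()(torch.randint(0, 20_000, (2, 50)))
print(logits.shape)   # torch.Size([2, 4])
```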

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Keep It Light! Simplifying Image Clustering via Text-Free Adapters

In the context of pre-trained models, effective classification can be achieved with lightweight readout layers. This work demonstrates that, in deep clustering, performance competitive with more complex methods can be obtained using a highly simplified, text-free training pipeline. Our approach, Simple Clustering via Pre-trained models (SCP), leverages feature representations from pre-trained vision models together with positive data pairs.

Fonte: arXiv stat.ML

NLP/LLMs • Score 96

Parameter-Efficient Fine-Tuning for HAR: Integrating LoRA and QLoRA into Transformer Models

Human activity recognition (HAR) is a fundamental task in pervasive computing. This work investigates parameter-efficient fine-tuning techniques, specifically Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA), as scalable alternatives to full model fine-tuning for HAR, demonstrating competitive performance with fewer trainable parameters and lower memory usage.
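
A compact sketch of the LoRA mechanism referenced above, wrapping a frozen linear layer with a trainable low-rank update (the rank and scaling are assumed defaults; QLoRA additionally quantizes the frozen base weights):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} of {total}")
```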

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Toward Evaluating Privacy Vulnerabilities in Selective Forgetting with Large Language Models

Rapid advances in artificial intelligence (AI) have centered on learning from data to build informed learning systems. As these systems are deployed in critical areas, ensuring their privacy and alignment with human values is essential. Selective forgetting, or machine unlearning, emerges as a promising approach but also raises significant privacy concerns, especially in sensitive domains.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Automatic Solver-Agnostic Problem Formulation via LLMs for Expensive Simulation-Driven Design

In expensive simulation-driven design, translating ambiguous design requirements into a mathematical optimization formulation is a bottleneck for optimizing product performance. We propose APF, a framework for automated, solver-agnostic problem formulation via LLMs that converts engineers' natural-language requirements into executable optimization models.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

SAP: Syntactic Attention Pruning for Transformer-based Language Models

arXiv:2512.19125v1 Announce Type: new Abstract: This paper introduces Syntactic Attention Pruning (SAP), a novel method for effectively pruning attention heads in Transformer models. Unlike conventional approaches that rely solely on mathematical analysis of model weights and activations, SAP incorporates both the syntactic structure and attention patterns of sentences to guide the pruning process. By leveraging these linguistic features, SAP not only achieves performance comparable to state-of-the-art methods but also enhances the interpretability of model behavior. To further improve robustness, we propose Candidate Filtering (CF), a mechanism that prioritizes heads based on their contribution to model performance, mitigating degradation during pruning. Experimental results indicate that SAP effectively preserves critical heads of a high density of strong attention values, outperforming existing head pruning strategies in retrain-free settings. These findings position SAP as a promising foundation for a new direction in model compression research, offering high flexibility for pruning across all transformer-based language models.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Evaluating the Challenges of LLMs in Real-world Medical Follow-up: A Comparative Study and An Optimized Framework

arXiv:2512.18999v1 Announce Type: new Abstract: When applied directly in an end-to-end manner to medical follow-up tasks, Large Language Models (LLMs) often suffer from uncontrolled dialog flow and inaccurate information extraction due to the complexity of follow-up forms. To address this limitation, we designed and compared two follow-up chatbot systems: an end-to-end LLM-based system (control group) and a modular pipeline with structured process control (experimental group). Experimental results show that while the end-to-end approach frequently fails on lengthy and complex forms, our modular method-built on task decomposition, semantic clustering, and flow management-substantially improves dialog stability and extraction accuracy. Moreover, it reduces the number of dialogue turns by 46.73% and lowers token consumption by 80% to 87.5%. These findings highlight the necessity of integrating external control mechanisms when deploying LLMs in high-stakes medical follow-up scenarios.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

Mitigating Spurious Correlations in NLI via LLM-Synthesized Counterfactuals and Dynamic Balanced Sampling

arXiv:2512.18462v1 Announce Type: new Abstract: Natural Language Inference (NLI) models frequently rely on spurious correlations rather than semantic reasoning. Existing mitigation strategies often incur high annotation costs or trigger catastrophic forgetting during fine-tuning. We propose an automated, scalable pipeline to address these limitations. First, we introduce Log-Frequency LMI (LF-LMI) to accurately detect semantic artifacts. Second, we generate a high-quality synthetic contrast set via an LLM-synthesis pipeline with multi-judge verification. Finally, we introduce Dynamic Balanced Sampling, a training strategy that rotates the original data distribution to prevent forgetting. Our method improves consistency on a challenging benchmark from 63.5% to 81.0% while maintaining 88.4% in-domain accuracy, significantly outperforming naive fine-tuning.
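
A toy version of LMI-style artifact scoring, p(w, c) · log(p(c|w) / p(c)), over synthetic (word, label) counts; the paper's LF-LMI adds a log-frequency correction whose exact form is not reproduced here:

```python
import math
from collections import Counter

# Synthetic (word, label) co-occurrence counts; "not" is a classic NLI artifact.
pairs = ([("not", "contradiction")] * 30 + [("not", "entailment")] * 5
         + [("cat", "entailment")] * 10 + [("cat", "contradiction")] * 8)

n = len(pairs)
joint = Counter(pairs)
word_c = Counter(w for w, _ in pairs)
label_c = Counter(c for _, c in pairs)

def lmi(word: str, label: str) -> float:
    p_wc = joint[(word, label)] / n
    p_c_given_w = joint[(word, label)] / word_c[word]
    p_c = label_c[label] / n
    return p_wc * math.log(p_c_given_w / p_c)

print(f"LMI(not, contradiction) = {lmi('not', 'contradiction'):.4f}")
print(f"LMI(cat, entailment)    = {lmi('cat', 'entailment'):.4f}")
```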

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

MoE Pathfinder: Trajectory-Driven Expert Pruning

Mixture-of-Experts (MoE) architectures in large language models (LLMs) achieve state-of-the-art performance across diverse tasks but face practical challenges such as deployment complexity and low activation efficiency. Expert pruning has emerged as a promising way to reduce computational overhead and simplify the deployment of MoE models.

Fonte: arXiv cs.LG

NLP/LLMs • Score 92

Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction

arXiv:2512.18880v1 Announce Type: new Abstract: Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Learning to Prioritize IT Tickets: A Comparative Evaluation of Embedding-based Approaches and Fine-Tuned Transformer Models

arXiv:2512.17916v1 Announce Type: new Abstract: Prioritizing service tickets in IT Service Management (ITSM) is critical for operational efficiency but remains challenging due to noisy textual inputs, subjective writing styles, and pronounced class imbalance. We evaluate two families of approaches for ticket prioritization: embedding-based pipelines that combine dimensionality reduction, clustering, and classical classifiers, and a fine-tuned multilingual transformer that processes both textual and numerical features. Embedding-based methods exhibit limited generalization across a wide range of thirty configurations, with clustering failing to uncover meaningful structures and supervised models highly sensitive to embedding quality. In contrast, the proposed transformer model achieves substantially higher performance, with an average F1-score of 78.5% and weighted Cohen's kappa values of nearly 0.80, indicating strong alignment with true labels. These results highlight the limitations of generic embeddings for ITSM data and demonstrate the effectiveness of domain-adapted transformer architectures for operational ticket prioritization.

Fonte: arXiv cs.CL

NLP/LLMs • Score 94

LLM Agents Implement an NLG System from Scratch: Building Interpretable Rule-Based RDF-to-Text Generators

arXiv:2512.18360v1 Announce Type: new Abstract: We present a novel neurosymbolic framework for RDF-to-text generation, in which the model is "trained" through collaborative interactions among multiple LLM agents rather than traditional backpropagation. The LLM agents produce rule-based Python code for a generator for the given domain, based on RDF triples only, with no in-domain human reference texts. The resulting system is fully interpretable, requires no supervised training data, and generates text nearly instantaneously using only a single CPU. Our experiments on the WebNLG and OpenDialKG data show that outputs produced by our approach reduce hallucination, with only slight fluency penalties compared to finetuned or prompted language models.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

Narrative Consolidation: Formulating a New Task for Unifying Multi-Perspective Accounts

arXiv:2512.18041v1 Announce Type: new Abstract: Processing overlapping narrative documents, such as legal testimonies or historical accounts, often aims not for compression but for a unified, coherent, and chronologically sound text. Standard Multi-Document Summarization (MDS), with its focus on conciseness, fails to preserve narrative flow. This paper formally defines this challenge as a new NLP task: Narrative Consolidation, where the central objectives are chronological integrity, completeness, and the fusion of complementary details. To demonstrate the critical role of temporal structure in this task, we introduce Temporal Alignment Event Graph (TAEG), a graph structure that explicitly models chronology and event alignment. By applying a standard centrality algorithm to TAEG, our method functions as a version selection mechanism, choosing the most central representation of each event in its correct temporal position. In a study on the four Biblical Gospels, this structure-focused approach guarantees perfect temporal ordering (Kendall's Tau of 1.000) by design and dramatically improves content metrics (e.g., +357.2% in ROUGE-L F1). The success of this baseline method validates the formulation of Narrative Consolidation as a relevant task and establishes that an explicit temporal backbone is a fundamental component for its resolution.
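
The Kendall's Tau claim is easy to sanity-check: an output that preserves the gold chronological order of events scores exactly 1.0 (scipy implementation; the event lists below are placeholders):

```python
from scipy.stats import kendalltau

true_order = [1, 2, 3, 4, 5, 6]   # gold chronological positions of events
system_out = [1, 2, 3, 4, 5, 6]   # order preserved by the temporal backbone
shuffled = [2, 1, 3, 6, 4, 5]     # a summarizer that reorders events

tau_ok, _ = kendalltau(true_order, system_out)
tau_bad, _ = kendalltau(true_order, shuffled)
print(tau_ok, tau_bad)            # 1.0 and a value < 1.0
```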

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

arXiv:2512.18399v1 Announce Type: new Abstract: Tokenization is a critical preprocessing step for large language models (LLMs), directly impacting training efficiency and downstream performance. General-purpose tokenizers trained predominantly on English and Latin-script languages exhibit suboptimal performance on morphologically rich languages such as Arabic, resulting in inflated token sequences and reduced compression efficiency. In this work, we present AraToken, an Arabic-optimized tokenizer built on SentencePiece Unigram algorithm with a comprehensive normalization pipeline addressing Arabic-specific orthographic variations including Alif variants, diacritics, and Arabic-Indic numerals. We systematically compare BPE, WordPiece, and SentencePiece algorithms across multiple configurations, demonstrating that SentencePiece with normalization achieves 18% lower fertility (1.199 vs 1.35 tokens/word) compared to unnormalized baselines. Furthermore, we introduce the Language Extension Pipeline (LEP), a method for integrating the optimized tokenizer into Qwen3-0.6B through vocabulary extension with mean subtoken initialization and selective transformer layer unfreezing. Our experiments show that LEP reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. We release our tokenizer, training scripts, and model checkpoints to facilitate Arabic NLP research.
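
A sketch of the two measurable pieces named above, a small Arabic normalization pass and the fertility metric (tokens per word); the replacements shown are a subset of AraToken's full pipeline, and the whitespace tokenizer is only a stand-in for the trained SentencePiece model:

```python
import unicodedata

def normalize_arabic(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)
    for alif in "أإآ":
        text = text.replace(alif, "ا")                # unify Alif variants
    text = "".join(ch for ch in text
                   if not ("\u064B" <= ch <= "\u0652"))  # strip diacritics
    digits = {ord(d): str(i) for i, d in enumerate("٠١٢٣٤٥٦٧٨٩")}
    return text.translate(digits)                     # Arabic-Indic digits -> ASCII

def fertility(tokenize, corpus: list[str]) -> float:
    tokens = sum(len(tokenize(s)) for s in corpus)
    words = sum(len(s.split()) for s in corpus)
    return tokens / words

corpus = [normalize_arabic("أهلاً وسهلاً ١٢٣")]
print(corpus, fertility(str.split, corpus))           # trivial whitespace tokenizer
```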

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Training LLMs with LogicReward for Faithful and Rigorous Reasoning

arXiv:2512.18196v1 Announce Type: new Abstract: Although LLMs exhibit strong reasoning capabilities, existing training methods largely depend on outcome-based feedback, which can produce correct answers with flawed reasoning. Prior work introduces supervision on intermediate steps but still lacks guarantees of logical soundness, which is crucial in high-stakes scenarios where logical consistency is paramount. To address this, we propose LogicReward, a novel reward system that guides model training by enforcing step-level logical correctness with a theorem prover. We further introduce Autoformalization with Soft Unification, which reduces natural language ambiguity and improves formalization quality, enabling more effective use of the theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks with simple training procedures. Further analysis shows that LogicReward enhances reasoning faithfulness, improves generalizability to unseen tasks such as math and commonsense reasoning, and provides a reliable reward signal even without ground-truth labels. We will release all data and code at https://llm-symbol.github.io/LogicReward.

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

LIR$^3$AG: A Lightweight Rerank Reasoning Strategy Framework for Retrieval-Augmented Generation

arXiv:2512.18329v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) effectively enhances Large Language Models (LLMs) by incorporating retrieved external knowledge into the generation process. Reasoning models improve LLM performance in multi-hop QA tasks, which require integrating and reasoning over multiple pieces of evidence across different documents to answer a complex question. However, they often introduce substantial computational costs, including increased token consumption and inference latency. To better understand and mitigate this trade-off, we conduct a comprehensive study of reasoning strategies for reasoning models in RAG multi-hop QA tasks. Our findings reveal that reasoning models adopt structured strategies to integrate retrieved and internal knowledge, primarily following two modes: Context-Grounded Reasoning, which relies directly on retrieved content, and Knowledge-Reconciled Reasoning, which resolves conflicts or gaps using internal knowledge. To this end, we propose a novel Lightweight Rerank Reasoning Strategy Framework for RAG (LiR$^3$AG) to enable non-reasoning models to transfer reasoning strategies by restructuring retrieved evidence into coherent reasoning chains. LiR$^3$AG reduces output-token overhead by 98% on average and inference time by 58.6%, while improving an 8B non-reasoning model's F1 by 6.2% to 22.5%, surpassing the performance of a 32B reasoning model in RAG and offering a practical and efficient path forward for RAG systems.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

FASTRIC: A Prompt Specification Language for Verifiable LLM Interactions

Large Language Models (LLMs) execute complex interaction protocols but lack formal specifications for verifying execution against designer intent. We present FASTRIC, a Prompt Specification Language that makes implicit Finite State Machines (FSMs) explicit in natural-language prompts, enabling conformance verification through execution-trace analysis.
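
In the spirit of the described verification, here is a toy FSM conformance checker over an interaction trace (the states and events are invented for illustration; FASTRIC's specification language is richer):

```python
# Transition table: (state, event) -> next state.
FSM = {
    ("greet", "user_replied"): "collect_info",
    ("collect_info", "info_complete"): "confirm",
    ("confirm", "user_confirmed"): "done",
}

def conforms(trace: list[str], start: str = "greet", accept: str = "done") -> bool:
    state = start
    for event in trace:
        nxt = FSM.get((state, event))
        if nxt is None:
            return False           # event not allowed in this state
        state = nxt
    return state == accept

print(conforms(["user_replied", "info_complete", "user_confirmed"]))  # True
print(conforms(["info_complete"]))                                    # False
```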

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

A Solver-in-the-Loop Framework for Improving LLMs at Answer Set Programming for Logic Puzzle Solving

The rise of large language models (LLMs) has sparked interest in coding assistants. This paper presents a novel ASP-solver-in-the-loop approach to solver-guided instruction tuning, focusing on code generation for Answer Set Programming (ASP) to solve complex combinatorial search problems.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs

arXiv:2512.17189v1 Announce Type: new Abstract: Medical Vision-Language Models (MedVLMs) show immense promise in clinical applicability. However, their reliability is hindered by hallucinations, where models often fail to derive answers from visual evidence, instead relying on learned textual priors. Existing mitigation strategies for MedVLMs have distinct limitations: training-based methods rely on costly expert annotations, limiting scalability, while training-free interventions like contrastive decoding, though data-efficient, apply a global, untargeted correction whose effects in complex real-world clinical settings can be unreliable. To address these challenges, we introduce Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our module leverages an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model's focus onto specified regions, reinforcing anatomical understanding and suppressing factually incorrect outputs. Extensive experiments across diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, demonstrate our method's effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv:2512.16978v1 Announce Type: new Abstract: Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both and while some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and graded rubric for interpretable, and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in our LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: https://github.com/mbzuai-oryx/longshot.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning

arXiv:2512.17227v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) demonstrate significant potential but remain brittle in complex, long-chain visual reasoning tasks. A critical failure mode is "visual forgetting", where models progressively lose visual grounding as reasoning extends, a phenomenon aptly described as "think longer, see less". We posit this failure stems from current training paradigms prematurely entangling two distinct cognitive skills: (1) abstract logical reasoning ("how-to-think") and (2) strategic visual perception ("when-to-look"). This creates a foundational cold-start deficiency -- weakening abstract reasoning -- and a strategic perception deficit, as models lack a policy for when to perceive. In this paper, we propose a novel curriculum-based framework to disentangle these skills. First, we introduce a disentangled Supervised Fine-Tuning (SFT) curriculum that builds a robust abstract reasoning backbone on text-only data before anchoring it to vision with a novel Perception-Grounded Chain-of-Thought (PG-CoT) paradigm. Second, we resolve the strategic perception deficit by formulating timing as a reinforcement learning problem. We design a Pivotal Perception Reward that teaches the model when to look by coupling perceptual actions to linguistic markers of cognitive uncertainty (e.g., "wait", "verify"), thereby learning an autonomous grounding policy. Our contributions include the formalization of these two deficiencies and the development of a principled, two-stage framework to address them, transforming the model from a heuristic-driven observer to a strategic, grounded reasoner. Code: https://github.com/gaozilve-max/learning-when-to-look

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Sample, Don't Search: Rethinking Test-Time Alignment for Language Models

arXiv:2504.03790v2 Announce Type: replace-cross Abstract: Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). A practical solution to aligning language models at test time using additional computation without degradation, our approach expands the limits of the capability that can be obtained from off-the-shelf language models without further training.
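
A minimal way to picture reward-tilted test-time sampling is an independence Metropolis sampler: resample a full completion and accept it with probability min(1, exp(beta * (r' - r))). The `generate` and `reward` stubs below are hypothetical placeholders; the paper's MCMC proposals for text are more refined than full resampling.

```python
# Simplified Metropolis-style sketch of sampling from a base LM tilted by
# exp(beta * reward). Stubs stand in for the LM and the reward model.
import math
import random


def generate(prompt: str) -> str:
    return prompt + " answer-" + str(random.randint(0, 9))  # stub LM sample


def reward(prompt: str, completion: str) -> float:
    return -abs(len(completion) - 20)  # stub reward model


def tilted_sample(prompt: str, steps: int = 100, beta: float = 1.0) -> str:
    current = generate(prompt)
    r_cur = reward(prompt, current)
    for _ in range(steps):
        proposal = generate(prompt)
        r_prop = reward(prompt, proposal)
        # Accept with prob min(1, exp(beta * (r_prop - r_cur))); with an
        # independence proposal from the base LM this targets the tilted
        # distribution p(x) * exp(beta * r(x)).
        if math.log(random.random() + 1e-12) < beta * (r_prop - r_cur):
            current, r_cur = proposal, r_prop
    return current


print(tilted_sample("Q: 2+2?"))
```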

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

Towards Human-Guided, Data-Centric LLM Co-Pilots

arXiv:2501.10321v3 Announce Type: replace-cross Abstract: Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the disconnect between the needs of domain experts and translating these needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in complex real-world settings where raw data often contains complex issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel, multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets we demonstrate CliMB-DC's ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines for handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains -- healthcare, finance, social sciences and more -- to actively participate in driving real-world impact using ML.

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

Studying the Effects of Collaboration in Interactive Theme Discovery Systems

arXiv:2408.09030v4 Announce Type: replace Abstract: NLP-assisted solutions have gained considerable traction to support qualitative data analysis. However, there does not exist a unified evaluation framework that can account for the many different settings in which qualitative researchers may employ them. In this paper, we take a first step in this direction by proposing an evaluation framework to study the way in which different tools may result in different outcomes depending on the collaboration strategy employed. Specifically, we study the impact of synchronous vs. asynchronous collaboration using two different NLP-assisted qualitative research tools and present a comprehensive analysis of significant differences in the consistency, cohesiveness, and correctness of their outputs.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

The Stochastic Occupation Kernel (SOCK) Method for Learning Stochastic Differential Equations

arXiv:2505.11622v2 Announce Type: replace Abstract: We present a novel kernel-based method for learning multivariate stochastic differential equations (SDEs). The method follows a two-step procedure: we first estimate the drift term function, then the (matrix-valued) diffusion function given the drift. Occupation kernels are integral functionals on a reproducing kernel Hilbert space (RKHS) that aggregate information over a trajectory. Our approach leverages vector-valued occupation kernels for estimating the drift component of the stochastic process. For diffusion estimation, we extend this framework by introducing operator-valued occupation kernels, enabling the estimation of an auxiliary matrix-valued function as a positive semi-definite operator, from which we readily derive the diffusion estimate. This enables us to avoid common challenges in SDE learning, such as intractable likelihoods, by optimizing a reconstruction-error-based objective. We propose a simple learning procedure that retains strong predictive accuracy while using Fenchel duality to promote efficiency. We validate the method on simulated benchmarks and a real-world dataset of Amyloid imaging in healthy and Alzheimer's disease subjects.

Fonte: arXiv stat.ML

NLP/LLMs • Score 93

UniRel-R1: RL-Tuned LLM Reasoning for Relational Question Answering over Knowledge Graphs

Knowledge Graph Question Answering (KGQA) has traditionally focused on entity-centric queries that return a single answer entity. In this work, we introduce relation-centric KGQA, where the answer is a subgraph capturing the semantic connections between entities. We present UniRel-R1, a unified framework that integrates subgraph selection, multi-step graph pruning, and an LLM fine-tuned with reinforcement learning.

Fonte: arXiv cs.AI

NLP/LLMs • Score 92

Towards Sharp Minimax Risk Bounds for Operator Learning

arXiv:2512.17805v1 Announce Type: cross Abstract: We develop a minimax theory for operator learning, where the goal is to estimate an unknown operator between separable Hilbert spaces from finitely many noisy input-output samples. For uniformly bounded Lipschitz operators, we prove information-theoretic lower bounds together with matching or near-matching upper bounds, covering both fixed and random designs under Hilbert-valued Gaussian noise and Gaussian white noise errors. The rates are controlled by the spectrum of the covariance operator of the measure that defines the error metric. Our setup is very general and allows for measures with unbounded support. A key implication is a curse of sample complexity which shows that the minimax risk for generic Lipschitz operators cannot decay at any algebraic rate in the sample size. We obtain essentially sharp characterizations when the covariance spectrum decays exponentially and provide general upper and lower bounds in slower-decay regimes.

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

Generalized infinite dimensional Alpha-Procrustes based geometries

arXiv:2511.09801v2 Announce Type: replace Abstract: This work extends the recently introduced Alpha-Procrustes family of Riemannian metrics for symmetric positive definite (SPD) matrices by incorporating generalized versions of the Bures-Wasserstein (GBW), Log-Euclidean, and Wasserstein distances. While the Alpha-Procrustes framework has unified many classical metrics in both finite- and infinite- dimensional settings, it previously lacked the structural components necessary to realize these generalized forms. We introduce a formalism based on unitized Hilbert-Schmidt operators and an extended Mahalanobis norm that allows the construction of robust, infinite-dimensional generalizations of GBW and Log-Hilbert-Schmidt distances. Our approach also incorporates a learnable regularization parameter that enhances geometric stability in high-dimensional comparisons. Preliminary experiments reproducing benchmarks from the literature demonstrate the improved performance of our generalized metrics, particularly in scenarios involving comparisons between datasets of varying dimension and scale. This work lays a theoretical and computational foundation for advancing robust geometric methods in machine learning, statistical inference, and functional data analysis.

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

Spatially-informed transformers: Injecting geostatistical covariance biases into self-attention for spatio-temporal forecasting

arXiv:2512.17696v1 Announce Type: cross Abstract: The modeling of high-dimensional spatio-temporal processes presents a fundamental dichotomy between the probabilistic rigor of classical geostatistics and the flexible, high-capacity representations of deep learning. While Gaussian processes offer theoretical consistency and exact uncertainty quantification, their prohibitive computational scaling renders them impractical for massive sensor networks. Conversely, modern transformer architectures excel at sequence modeling but inherently lack a geometric inductive bias, treating spatial sensors as permutation-invariant tokens without a native understanding of distance. In this work, we propose a spatially-informed transformer, a hybrid architecture that injects a geostatistical inductive bias directly into the self-attention mechanism via a learnable covariance kernel. By formally decomposing the attention structure into a stationary physical prior and a non-stationary data-driven residual, we impose a soft topological constraint that favors spatially proximal interactions while retaining the capacity to model complex dynamics. We demonstrate the phenomenon of "Deep Variography", where the network successfully recovers the true spatial decay parameters of the underlying process end-to-end via backpropagation. Extensive experiments on synthetic Gaussian random fields and real-world traffic benchmarks confirm that our method outperforms state-of-the-art graph neural networks. Furthermore, rigorous statistical validation confirms that the proposed method delivers not only superior predictive accuracy but also well-calibrated probabilistic forecasts, effectively bridging the gap between physics-aware modeling and data-driven learning.
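
A sketch of the central mechanism, under assumed shapes and an assumed exponential kernel: a stationary covariance prior over sensor distances is added in log space to the content-based attention logits, so the softmax favors spatially proximal interactions.

```python
# Toy numpy sketch of attention with a geostatistical bias. The exponential
# kernel and length scale are illustrative stand-ins for the learnable kernel.
import numpy as np


def spatial_attention(q, k, v, coords, length_scale=1.0):
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                 # content-based term
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    prior = np.exp(-dist / length_scale)          # stationary covariance kernel
    logits = logits + np.log(prior + 1e-9)        # soft spatial bias, log space
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # row-wise softmax
    return w @ v


rng = np.random.default_rng(0)
n, d = 5, 8
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
coords = rng.uniform(size=(n, 2))                 # 2-D sensor locations
print(spatial_attention(q, k, v, coords).shape)   # (5, 8)
```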

Fonte: arXiv stat.ML

NLP/LLMs • Score 95

CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning

arXiv:2512.17312v1 Announce Type: new Abstract: Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a balanced and adaptive tool-call reward, which balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism of executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

Understanding Generalization in Role-Playing Models via Information Theory

arXiv:2512.17270v1 Announce Type: new Abstract: Role-playing models (RPMs) are widely used in real-world applications but underperform when deployed in the wild. This degradation can be attributed to distribution shifts, including user, character, and dialogue compositional shifts. Existing methods like LLM-as-a-judge fall short in providing a fine-grained diagnosis of how these shifts affect RPM generalization, and formal frameworks to characterize RPM generalization behaviors are thus lacking. To bridge these gaps, we introduce an information-theoretic metric, named reasoning-based effective mutual information difference (R-EMID), to measure RPM performance degradation in an interpretable way. We also derive an upper bound on R-EMID to predict the worst-case generalization performance of RPMs and theoretically reveal how various shifts contribute to the RPM performance degradation. Moreover, we propose a co-evolving reinforcement learning framework to adaptively model the connection among user, character, and dialogue context and thus enhance the estimation of dialogue response generation probability, which is critical for calculating R-EMID. Finally, we evaluate the generalization performance of various RPMs using R-EMID, finding that user shift poses the highest risk among all shifts and reinforcement learning is the most effective approach for enhancing RPM generalization.

Fonte: arXiv cs.LG

NLP/LLMs • Score 92

Perturb Your Data: Paraphrase-Guided Training Data Watermarking

arXiv:2512.17075v1 Announce Type: new Abstract: Training data detection is critical for enforcing copyright and data licensing, as Large Language Models (LLM) are trained on massive text corpora scraped from the internet. We present SPECTRA, a watermarking approach that makes training data reliably detectable even when it comprises less than 0.001% of the training corpus. SPECTRA works by paraphrasing text using an LLM and assigning a score based on how likely each paraphrase is, according to a separate scoring model. A paraphrase is chosen so that its score closely matches that of the original text, to avoid introducing any distribution shifts. To test whether a suspect model has been trained on the watermarked data, we compare its token probabilities against those of the scoring model. We demonstrate that SPECTRA achieves a consistent p-value gap of over nine orders of magnitude when detecting data used for training versus data not used for training, which is greater than all baselines tested. SPECTRA equips data owners with a scalable, deploy-before-release watermark that survives even large-scale LLM training.
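
The admission-by-score-matching step can be sketched in a few lines: among candidate paraphrases, pick the one whose scoring-model score is closest to the original's, so the watermark avoids introducing a distribution shift. The `score` stub below is a placeholder for the scoring model's log-probability, and the candidate list would come from an LLM paraphraser.

```python
# Hedged sketch of SPECTRA-style score-matched paraphrase selection,
# following the abstract's description. `score` is a stub.
def score(text: str) -> float:
    return float(len(text))  # stub: replace with scoring-model log-prob


def pick_watermarked(original: str, paraphrases: list) -> str:
    # Choose the paraphrase whose score is closest to the original's,
    # to avoid a detectable shift in the training distribution.
    target = score(original)
    return min(paraphrases, key=lambda p: abs(score(p) - target))


orig = "The cat sat on the mat."
cands = ["A cat was sitting on the mat.", "On the mat, the cat sat."]
print(pick_watermarked(orig, cands))
```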

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

XLM: A Python package for non-autoregressive language models

arXiv:2512.17065v1 Announce Type: new Abstract: In recent years, there has been a resurgence of interest in non-autoregressive text generation in the context of general language modeling. Unlike the well-established autoregressive language modeling paradigm, which has a plethora of standard training and inference libraries, implementations of non-autoregressive language modeling have largely been bespoke, making it difficult to perform systematic comparisons of different methods. Moreover, each non-autoregressive language model typically requires its own data collation, loss, and prediction logic, making it challenging to reuse common components. In this work, we present the XLM python package, which is designed to make implementing small non-autoregressive language models faster, with a secondary goal of providing a suite of small pre-trained models (through a companion xlm-models package) that can be used by the research community. The code is available at https://github.com/dhruvdcoder/xlm-core.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos

arXiv:2512.17229v1 Announce Type: new Abstract: Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which can also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending the model's context length, they may miss useful information or incur considerable computation. In fact, when answering given questions, only a small amount of crucial information is required. Therefore, we propose an efficient question-aware memory mechanism, enabling MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies this task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy is employed by introducing a few special memory tokens to achieve purposeful compression. This allows models to effectively seek critical clues while reducing visual tokens. Then, since the history context can significantly influence subsequent reasoning, we recurrently aggregate and store these memory tokens to update the history context, which is reused for subsequent sub-segments. Furthermore, to more effectively measure a model's long video understanding ability, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset, which features grounding critical and concrete clues scattered throughout entire videos. Experimental results demonstrate our method enables MLLMs with a limited context length of 32K to efficiently process 100K tokens (3600 frames, an hour-long video sampled at 1fps), requiring only 2 minutes and 37GB of GPU memory. Evaluation results across multiple long video benchmarks illustrate our method can more effectively seek critical clues from massive information.
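
A schematic of the recurrent memory loop, with an invented similarity-based compressor standing in for the learned question-aware compression: each sub-segment is merged with the running memory and re-compressed to a fixed number of memory tokens, so the history footprint stays bounded.

```python
# Illustrative recurrent memory over video sub-segments; the compressor is a
# stub, and all sizes are invented for the example.
import numpy as np

rng = np.random.default_rng(0)


def compress(tokens, question_vec, n_memory=4):
    # Stub: keep the tokens most similar to the question as "memory tokens".
    sims = tokens @ question_vec
    idx = np.argsort(sims)[-n_memory:]
    return tokens[idx]


question = rng.normal(size=16)
memory = np.empty((0, 16))
for _ in range(6):                        # iterate over video sub-segments
    segment = rng.normal(size=(128, 16))  # 128 visual tokens per segment
    context = np.vstack([memory, segment])
    memory = compress(context, question)  # update the compact history
print(memory.shape)                       # (4, 16): bounded memory footprint
```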

Fonte: arXiv cs.CV

NLP/LLMs • Score 89

ProCache: Constraint-Aware Feature Caching with Selective Computation for Accelerating Diffusion Transformers

Diffusion Transformers (DiTs) have achieved state-of-the-art performance in generative modeling, but their high computational cost hinders real-time deployment. ProCache is a dynamic feature-caching framework that addresses existing limitations, offering a training-free acceleration solution with a constraint-aware caching pattern and a selective computation module.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

FORMSpoT: A Decade of Tree-Level, Country-Scale Forest Monitoring

arXiv:2512.17021v1 Announce Type: new Abstract: The recent decline of the European forest carbon sink highlights the need for spatially explicit and frequently updated forest monitoring tools. Yet, existing satellite-based disturbance products remain too coarse to detect changes at the scale of individual trees, typically below 100 m$^{2}$. Here, we introduce FORMSpoT (Forest Mapping with SPOT Time series), a decade-long (2014-2024) nationwide mapping of forest canopy height at 1.5 m resolution, together with annual disturbance polygons (FORMSpoT-$\Delta$) covering mainland France. Canopy heights were derived from annual SPOT-6/7 composites using a hierarchical transformer model (PVTv2) trained on high-resolution airborne laser scanning (ALS) data. To enable robust change detection across heterogeneous acquisitions, we developed a dedicated post-processing pipeline combining co-registration and spatio-temporal total variation denoising. Validation against ALS revisits across 19 sites and 5,087 National Forest Inventory plots shows that FORMSpoT-$\Delta$ substantially outperforms existing disturbance products. In mountainous forests, where disturbances are small and spatially fragmented, FORMSpoT-$\Delta$ achieves an F1-score of 0.44, an order of magnitude higher than existing benchmarks. By enabling tree-level monitoring of forest dynamics at national scale, FORMSpoT-$\Delta$ provides a unique tool to analyze management practices, detect early signals of forest decline, and better quantify carbon losses from subtle disturbances such as thinning or selective logging. These results underscore the critical importance of sustaining very high-resolution satellite missions like SPOT and open-data initiatives such as DINAMIS for monitoring forests under climate change.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

Toward Explainable Conversational AI for Early Diagnosis with Large Language Models

Healthcare systems worldwide face problems such as inefficient diagnosis, rising costs, and limited access to specialists. This study presents a diagnostic chatbot powered by a Large Language Model (LLM), using GPT-4o and explainable AI techniques, that interacts with patients to extract and normalize symptoms and rank potential diagnoses.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Deep Learning-Based Surrogate Creep Modelling in Inconel 625: A High-Temperature Alloy Study

arXiv:2512.17477v1 Announce Type: new Abstract: Time-dependent deformation, particularly creep, in high-temperature alloys such as Inconel 625 is a key factor in the long-term reliability of components used in aerospace and energy systems. Although Inconel 625 shows excellent creep resistance, finite-element creep simulations in tools such as ANSYS remain computationally expensive, often requiring tens of minutes for a single 10,000-hour run. This work proposes deep learning based surrogate models to provide fast and accurate replacements for such simulations. Creep strain data was generated in ANSYS using the Norton law under uniaxial stresses of 50 to 150 MPa and temperatures of 700 to 1000 $^\circ$C, and this temporal dataset was used to train two architectures: a BiLSTM Variational Autoencoder for uncertainty-aware and generative predictions, and a BiLSTM Transformer hybrid that employs self-attention to capture long-range temporal behavior. Both models act as surrogate predictors, with the BiLSTM-VAE offering probabilistic output and the BiLSTM-Transformer delivering high deterministic accuracy. Performance is evaluated using RMSE, MAE, and $R^2$. Results show that the BiLSTM-VAE provides stable and reliable creep strain forecasts, while the BiLSTM-Transformer achieves strong accuracy across the full time range. Latency tests indicate substantial speedup: while each ANSYS simulation requires 30 to 40 minutes for a given stress-temperature condition, the surrogate models produce predictions within seconds. The proposed framework enables rapid creep assessment for design optimization and structural health monitoring, and provides a scalable solution for high-temperature alloy applications.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping

arXiv:2512.17570v1 Announce Type: new Abstract: SSD-offloaded training offers a practical and promising approach to making LLM training cost-effective. Building on gradient accumulation with micro-batches, this paper introduces GreedySnake, a new SSD-offloaded training system that employs vertical scheduling, which executes all microbatches of a layer before proceeding to the next. Compared to existing systems that use horizontal scheduling (i.e., executing micro-batches sequentially), GreedySnake achieves higher training throughput with smaller batch sizes, bringing the system much closer to the ideal scenario predicted by the roofline model. To further mitigate the I/O bottleneck, GreedySnake overlaps part of the optimization step with the forward pass of the next iteration. Experimental results on A100 GPUs show that GreedySnake achieves saturated training throughput improvements over ZeRO-Infinity: 1.96x on 1 GPU and 1.93x on 4 GPUs for GPT-65B, and 2.53x on 1 GPU for GPT-175B. The code is open-sourced at https://github.com/npz7yyk/GreedySnake
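
The scheduling difference is easiest to see as loop order. In the illustrative count below (not a measurement), horizontal scheduling re-fetches every layer's weights for each micro-batch, while vertical scheduling fetches each layer once and runs all micro-batches through it before moving on.

```python
# Schematic contrast between horizontal and vertical scheduling for
# SSD-offloaded training with gradient accumulation. Counts are illustrative.
LAYERS, MICRO_BATCHES = 4, 8


def horizontal_fetches() -> int:
    fetches = 0
    for _mb in range(MICRO_BATCHES):      # micro-batch outer loop
        for _layer in range(LAYERS):      # every layer re-fetched per batch
            fetches += 1
    return fetches


def vertical_fetches() -> int:
    fetches = 0
    for _layer in range(LAYERS):          # layer outer loop (GreedySnake-style)
        fetches += 1                      # fetch the layer once...
        for _mb in range(MICRO_BATCHES):  # ...run all micro-batches through it
            pass
    return fetches


print(horizontal_fetches(), "vs", vertical_fetches())  # 32 vs 4
```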

Fonte: arXiv cs.LG

NLP/LLMs • Score 93

A lightweight Spatial-Temporal Graph Neural Network for Long-term Time Series Forecasting

arXiv:2512.17453v1 Announce Type: new Abstract: We propose Lite-STGNN, a lightweight spatial-temporal graph neural network for long-term multivariate forecasting that integrates decomposition-based temporal modeling with learnable sparse graph structure. The temporal module applies trend-seasonal decomposition, while the spatial module performs message passing with low-rank Top-$K$ adjacency learning and conservative horizon-wise gating, enabling spatial corrections that enhance a strong linear baseline. Lite-STGNN achieves state-of-the-art accuracy on four benchmark datasets for horizons up to 720 steps, while being parameter-efficient and substantially faster to train than transformer-based methods. Ablation studies show that the spatial module yields 4.6% improvement over the temporal baseline, Top-$K$ enhances locality by 3.3%, and learned adjacency matrices reveal domain-specific interaction dynamics. Lite-STGNN thus offers a compact, interpretable, and efficient framework for long-term multivariate time series forecasting.
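
A minimal numpy sketch of the decomposition step, assuming a moving-average trend filter with an illustrative kernel size; the residual after removing the trend serves as the seasonal component.

```python
# Trend-seasonal decomposition via a moving average; kernel size is an
# assumed hyperparameter, not the paper's setting.
import numpy as np


def decompose(series: np.ndarray, kernel: int = 25):
    pad = kernel // 2
    padded = np.pad(series, (pad, pad), mode="edge")
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")
    seasonal = series - trend          # residual after removing the trend
    return trend, seasonal


t = np.arange(200, dtype=float)
x = 0.05 * t + np.sin(2 * np.pi * t / 24)   # linear trend + daily cycle
trend, seasonal = decompose(x)
print(trend.shape, seasonal.shape)           # (200,) (200,)
```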

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Data Augmentation Supporting a Conversational Agent Designed for Smoking Cessation Support Groups

arXiv:2512.17092v1 Announce Type: new Abstract: Online support groups for smoking cessation are economical and accessible, yet they often face challenges with low user engagement and stigma. The use of an automatic conversational agent would improve engagement by ensuring that all user comments receive a timely response. We address the challenge of insufficient high-quality data by employing a two-level data augmentation strategy: synthetic data augmentation and real data augmentation. First, we fine-tuned an open-source LLM to classify posts from our existing smoking cessation support groups and identify intents with low F1 scores (the harmonic mean of precision and recall). Then, for these intents, we generate additional synthetic data using prompt engineering with the GPT model, with an average of 87% of the generated synthetic posts deemed high quality by human annotators. Overall, the synthetic augmentation process resulted in 43% of the original posts being selected for augmentation, followed by 140% synthetic expansion of these posts. Additionally, we scraped more than 10,000 real posts from a related online support context, of which 73% were validated as good quality by human annotators. Each synthetic or scraped post underwent rigorous validation involving human reviewers to ensure quality and relevance. The validated new data, combined with the original support group posts, formed an augmented dataset used to retrain the intent classifier. Performance evaluation of the retrained model demonstrated a 32% improvement in F1, confirming the effectiveness of our data augmentation approach. Synthetic and real post augmentation led to similar performance improvements. This study provides a replicable framework for enhancing conversational agent performance in domains where data scarcity is a critical issue.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Learning What to Write: Write-Gated KV for Efficient Long-Context Inference

arXiv:2512.17452v1 Announce Type: new Abstract: Long-context LLM inference is bottlenecked by the quadratic attention complexity and linear KV cache growth. Prior approaches mitigate this via post-hoc selection or eviction but overlook the root inefficiency: indiscriminate writing to persistent memory. In this paper, we formalize KV cache management as a causal system of three primitives: KV Admission, Selection, and Eviction. We instantiate KV Admission via Write-Gated KV, a lightweight mechanism that learns to predict token utility before it enters the cache. By filtering out low-utility states early to maintain a compact global cache alongside a sliding local cache, Write-Gated KV reduces memory usage by 46-57% and delivers 3.03-3.45$\times$ prefill and 1.89-2.56$\times$ decode speedups on Llama model with negligible accuracy loss, all while remaining compatible with FlashAttention and paged-KV systems. These results demonstrate that learning what to write, is a principled and practical recipe for efficient long-context inference. Code is available at https://github.com/EMCLab-Sinica/WG-KV .
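
The admission primitive can be caricatured as follows: a utility gate decides whether a token's KV entry is written to the persistent global cache, while a sliding window retains recent tokens unconditionally. The threshold, window size, and length-based utility stub are invented for illustration; this is not the WG-KV implementation.

```python
# Toy sketch of KV admission with a global cache plus sliding local cache.
from collections import deque


class WriteGatedCache:
    def __init__(self, window: int = 4, threshold: float = 0.5):
        self.local = deque(maxlen=window)   # sliding local cache
        self.global_cache = []              # compact persistent cache
        self.threshold = threshold

    def utility(self, token: str) -> float:
        # Stub for the learned gate; here, longer tokens are "more useful".
        return min(1.0, len(token) / 10)

    def write(self, token: str):
        self.local.append(token)            # recent tokens always kept
        if self.utility(token) >= self.threshold:
            self.global_cache.append(token)  # admit only high-utility tokens


cache = WriteGatedCache()
for tok in "the quick transformer attends to informative tokens".split():
    cache.write(tok)
print(list(cache.local), cache.global_cache)
```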

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Adversarially Robust Detection of Harmful Online Content: A Computational Design Science Approach

arXiv:2512.17367v1 Announce Type: new Abstract: Social media platforms are plagued by harmful content such as hate speech, misinformation, and extremist rhetoric. Machine learning (ML) models are widely adopted to detect such content; however, they remain highly vulnerable to adversarial attacks, wherein malicious users subtly modify text to evade detection. Enhancing adversarial robustness is therefore essential, requiring detectors that can defend against diverse attacks (generalizability) while maintaining high overall accuracy. However, simultaneously achieving both optimal generalizability and accuracy is challenging. Following the computational design science paradigm, this study takes a sequential approach that first proposes a novel framework (Large Language Model-based Sample Generation and Aggregation, LLM-SGA) by identifying the key invariances of textual adversarial attacks and leveraging them to ensure that a detector instantiated within the framework has strong generalizability. Second, we instantiate our detector (Adversarially Robust Harmful Online Content Detector, ARHOCD) with three novel design components to improve detection accuracy: (1) an ensemble of multiple base detectors that exploits their complementary strengths; (2) a novel weight assignment method that dynamically adjusts weights based on each sample's predictability and each base detector's capability, with weights initialized using domain knowledge and updated via Bayesian inference; and (3) a novel adversarial training strategy that iteratively optimizes both the base detectors and the weight assignor. We addressed several limitations of existing adversarial robustness enhancement research and empirically evaluated ARHOCD across three datasets spanning hate speech, rumor, and extremist content. Results show that ARHOCD offers strong generalizability and improves detection accuracy under adversarial conditions.
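
The dynamic weight assignment can be sketched as a Bayesian update: treat the ensemble weights as a posterior over detectors and multiply in each detector's likelihood of being correct on a sample. The uniform prior, assumed accuracy, and outcomes below are invented for illustration.

```python
# Toy Bayesian re-weighting of an ensemble of base detectors.
def bayes_update(weights, correct, acc=0.8):
    # Likelihood of each observed outcome under an assumed detector accuracy.
    like = [acc if c else 1 - acc for c in correct]
    post = [w * l for w, l in zip(weights, like)]
    total = sum(post)
    return [p / total for p in post]


weights = [1 / 3, 1 / 3, 1 / 3]           # prior (could encode domain knowledge)
outcomes = [[True, True, False], [True, False, False]]
for correct in outcomes:                   # stream of labeled samples
    weights = bayes_update(weights, correct)
print([round(w, 3) for w in weights])      # mass shifts toward detector 0
```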

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Task Schema and Binding: A Double Dissociation Study of In-Context Learning

arXiv:2512.17325v1 Announce Type: new Abstract: We provide causal mechanistic validation that in-context learning (ICL) decomposes into two separable mechanisms: Task Schema (abstract task type recognition) and Binding (specific input-output associations). Through activation patching experiments across 9 models from 7 Transformer families plus Mamba (370M-13B parameters), we establish three key findings: (1) double dissociation: Task Schema transfers at 100% via late MLP patching, while Binding transfers at 62% via residual stream patching, proving separable mechanisms; (2) prior-schema trade-off: schema reliance inversely correlates with prior knowledge (Spearman rho = -0.596, p < 0.001, N = 28 task-model pairs); (3) architecture generality: the mechanism operates across all tested architectures, including the non-Transformer Mamba. These findings offer a mechanistic account of the ICL puzzle that contrasts with prior views treating ICL as a monolithic mechanism (whether retrieval-based, gradient descent-like, or purely Bayesian). By establishing that Schema and Binding are neurally dissociable -- not merely behavioral modes -- we provide causal evidence for dual-process theories of ICL. Models rely on Task Schema when prior knowledge is absent, but prior knowledge interferes through attentional mis-routing (72.7% recency bias) rather than direct output competition (0%). This explains why arbitrary mappings succeed (zero prior leads to full Schema reliance) while factual overrides fail -- and reveals that the true bottleneck is attentional, not output-level. Practical implications: understanding these dual mechanisms enables more efficient prompt engineering -- reliable schema transfer reduces required demonstrations for novel tasks, while prior-aware design can mitigate the 38% binding failure rate in high-prior scenarios, improving ICL system reliability in production deployments.
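
For readers unfamiliar with activation patching, the toy sketch below shows the mechanics on a two-layer network: cache a module's output from a source run, then splice it into a target run via a forward hook. Real experiments patch specific MLP and residual-stream sites inside LLMs; the tiny model here only makes the mechanism concrete.

```python
# Minimal activation-patching sketch with PyTorch forward hooks.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
site = model[0]                       # the layer whose output we patch
cache = {}

# Source run: record the activation at the chosen site (hook returns None,
# so the output is left unchanged).
handle = site.register_forward_hook(lambda m, i, o: cache.update(act=o.detach()))
model(torch.randn(1, 4))
handle.remove()

# Target run: a hook that returns a value replaces the module's output,
# splicing the cached source activation into this forward pass.
handle = site.register_forward_hook(lambda m, i, o: cache["act"])
patched_logits = model(torch.randn(1, 4))
handle.remove()
print(patched_logits)
```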

Fonte: arXiv cs.LG

NLP/LLMs • Score 92

EMAG: Self-Rectifying Diffusion Sampling with Exponential Moving Average Guidance

In diffusion and flow-matching generative models, guidance techniques are widely used to improve sample quality and consistency. We propose EMAG, a mechanism that modifies attention at inference time in diffusion transformers, producing harder, semantically faithful negatives and improving quality and human preference score (HPS) by +0.46 over CFG.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

The Effect of Negation on CLIP in Medical Imaging: Limitations of Contrastive Language-Image Pretraining

arXiv:2512.17121v1 Announce Type: new Abstract: Large vision-language models like CLIP are increasingly used in medical imaging tasks due to their ability to align images and text without the need for extensive labeled data. This makes them particularly useful for applications like image retrieval, report generation, and classification in clinical settings. A potential issue to this approach is that CLIP-based models often under perform when interpreting negated phrases, which is especially problematic in the context of medical diagnosing. In this study, we evaluate the Stanford AIMI CheXagent model on its ability to correctly retrieve chest X-ray images using prompts with and without negation. The goal of this project is to understand where this model fails and then use it as a base model to improve its retrieval accuracy by fine tuning methods outlined in previous work. Results from this study show improvement in handling of negation in the CLIP model with a slight decrease in accuracy of positive prompt evaluation. Alongside retrieval accuracy, we examined internal model behavior through token attribution, t-SNE projection, and attention-head ablation to better characterize how each fine tuning approach reshaped the text encoders representation of negated clinical language. Through this work, we hope to better understand the internal behavior of CLIP and improve its handling of negation using clinically relevant language for improving its reliability in medical AI devices.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Dynamic Tool Dependency Retrieval for Efficient Function Calling

arXiv:2512.17052v1 Announce Type: new Abstract: Function calling agents powered by Large Language Models (LLMs) select external tools to automate complex tasks. On-device agents typically use a retrieval module to select relevant tools, improving performance and reducing context length. However, existing retrieval methods rely on static and limited inputs, failing to capture multi-step tool dependencies and evolving task context. This limitation often introduces irrelevant tools that mislead the agent, degrading efficiency and accuracy. We propose Dynamic Tool Dependency Retrieval (DTDR), a lightweight retrieval method that conditions on both the initial query and the evolving execution context. DTDR models tool dependencies from function calling demonstrations, enabling adaptive retrieval as plans unfold. We benchmark DTDR against state-of-the-art retrieval methods across multiple datasets and LLM backbones, evaluating retrieval precision, downstream task accuracy, and computational efficiency. Additionally, we explore strategies to integrate retrieved tools into prompts. Our results show that dynamic tool retrieval improves function calling success rates between $23\%$ and $104\%$ compared to state-of-the-art static retrievers.
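
A toy version of dependency-aware retrieval: mine tool-to-tool transition counts from function-calling demonstrations, then rank candidate tools by those counts given the last executed tool, plus a crude lexical match to the query. All names and the scoring mix are illustrative assumptions, not DTDR's method in detail.

```python
# Sketch of retrieval conditioned on both the query and the evolving
# execution context (here, just the last executed tool).
from collections import defaultdict


class DynamicToolRetriever:
    def __init__(self, demos):
        self.trans = defaultdict(lambda: defaultdict(int))
        for trace in demos:  # count observed tool-to-tool transitions
            for a, b in zip(trace, trace[1:]):
                self.trans[a][b] += 1

    def retrieve(self, query, last_tool, tools, k=2):
        def score(tool):
            dep = self.trans[last_tool][tool] if last_tool else 0
            lexical = sum(w in tool for w in query.lower().split())
            return dep + 0.5 * lexical
        return sorted(tools, key=score, reverse=True)[:k]


demos = [["search_flights", "book_flight", "send_email"],
         ["search_flights", "book_flight"]]
retriever = DynamicToolRetriever(demos)
print(retriever.retrieve("book a flight", "search_flights",
                         ["send_email", "book_flight", "get_weather"]))
```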

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

Accelerating Multi-modal LLM Game Performance via Input Prediction and Error Correction

Real-time sequential control agents often face inference latency. Even modest per-step planning delays can destabilize control and degrade overall performance. We propose a speculate-and-correct framework that adapts the predict-and-verify philosophy of speculative execution to model-based control with TD-MPC2.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs

arXiv:2512.17319v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) demonstrate strong perception and reasoning performance on existing remote sensing (RS) benchmarks. However, most prior benchmarks rely on low-resolution imagery, and some high-resolution benchmarks suffer from flawed reasoning-task designs. We show that text-only LLMs can perform competitively with multimodal vision-language models on RS reasoning tasks without access to images, revealing a critical mismatch between current benchmarks and the intended evaluation of visual understanding. To enable faithful assessment, we introduce RSHR-Bench, a super-high-resolution benchmark for RS visual understanding and reasoning. RSHR-Bench contains 5,329 full-scene images with a long side of at least 4,000 pixels, with up to about 3 x 10^8 pixels per image, sourced from widely used RS corpora and UAV collections. We design four task families: multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation. These tasks cover nine perception categories and four reasoning types, supporting multi-turn and multi-image dialog. To reduce reliance on language priors, we apply adversarial filtering with strong LLMs followed by rigorous human verification. Overall, we construct 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or verified single-image evaluation VQA pairs. Evaluations across open-source, closed-source, and RS-specific VLMs reveal persistent performance gaps in super-high-resolution scenarios. Code: https://github.com/Yunkaidang/RSHR

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Can Synthetic Images Serve as Effective and Efficient Class Prototypes?

Vision-Language Models (VLMs) have demonstrated strong performance on zero-shot image classification tasks. However, existing methods such as Contrastive Language-Image Pre-training (CLIP) depend on annotated text-image pairs, which raises the cost and precision requirements of preparing high-quality datasets. We present the LGCLIP framework, which uses a Large Language Model (LLM) to generate class-specific prompts, enabling lightweight and efficient synthesis of reference images.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

MMRAG-RFT: Two-Stage Reinforcement Fine-Tuning for Explainable Multi-modal Retrieval-Augmented Generation

Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly reliable generation by integrating external multi-modal knowledge, demonstrating impressive performance in complex scenarios. However, existing methods fail to make explicit the reasoning logic behind retrieval and answer generation. We propose introducing reinforcement learning to strengthen the reasoning capabilities of multi-modal language models.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Realistic Threat Perception Drives Intergroup Conflict: A Causal and Dynamic Analysis Using Generative Agent Simulations

Human conflict is often attributed to threats to material conditions and symbolic values, but how these interact and which predominates remains unclear. We use simulations of agents driven by large language models (LLMs) in virtual societies to explore these dynamics.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Solomonoff-Inspired Hypothesis Ranking with LLMs for Prediction under Uncertainty

Reasoning under uncertainty is a fundamental challenge in AI, especially in real-world tasks where data-scarce problems demand systematic generalization. We propose a Solomonoff-inspired method that weights LLM-generated hypotheses by simplicity and predictive fit, yielding conservative, uncertainty-aware predictions.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

V-Agent: An Interactive Video Search System Using Vision-Language Models

We present V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversation. By fine-tuning a vision-language model (VLM) on a small video preference dataset and augmenting it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

InfoTok: Adaptive Discrete Video Tokenization via Information-Theoretic Compression

Accurate and efficient discrete video tokenization is essential for processing long video sequences. This paper presents InfoTok, a principled framework for adaptive video tokenization, proving that existing training methods are suboptimal and introducing a new ELBO-based algorithm that approaches the theoretical optimum.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Vision-Language Model Guided Image Restoration

arXiv:2512.17292v1 Announce Type: new Abstract: Many image restoration (IR) tasks require both pixel-level fidelity and high-level semantic understanding to recover realistic photos with fine-grained details. However, previous approaches often struggle to effectively leverage both the visual and linguistic knowledge. Recent efforts have attempted to incorporate Vision-language models (VLMs), which excel at aligning visual and textual features, into universal IR. Nevertheless, these methods fail to utilize the linguistic priors to ensure semantic coherence during the restoration process. To address this issue, in this paper, we propose the Vision-Language Model Guided Image Restoration (VLMIR) framework, which leverages the rich vision-language priors of VLMs, such as CLIP, to enhance IR performance through improved visual perception and semantic understanding. Our approach consists of two stages: VLM-based feature extraction and diffusion-based image restoration. In the first stage, we extract complementary visual and linguistic representations of input images by condensing the visual perception and high-level semantic priors through VLMs. Specifically, we align the embeddings of captions from low-quality and high-quality images using a cosine similarity loss with LoRA fine-tuning, and employ a degradation predictor to decompose degradation and clean image content embeddings. These complementary visual and textual embeddings are then integrated into a diffusion-based model via cross-attention mechanisms for enhanced restoration. Extensive experiments and ablation studies demonstrate that VLMIR achieves superior performance across both universal and degradation-specific IR tasks, underscoring the critical role of integrated visual and linguistic knowledge from VLMs in advancing image restoration capabilities.
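
The caption-alignment step reduces to a cosine-similarity loss between the embeddings of captions for low-quality and high-quality versions of an image. The sketch below uses random tensors in place of the LoRA-adapted CLIP text encoder, so it only illustrates the loss shape, not the full pipeline.

```python
# Cosine-similarity alignment loss between LQ- and HQ-caption embeddings.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
e_lq = torch.randn(8, 512, requires_grad=True)  # LQ-caption embeddings (stub)
e_hq = torch.randn(8, 512)                      # HQ-caption embeddings (stub)

# Pull each LQ caption embedding toward its HQ counterpart.
loss = (1 - F.cosine_similarity(e_lq, e_hq, dim=-1)).mean()
loss.backward()        # gradients would flow into the LoRA adapters
print(float(loss))
```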

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

Incorporating Noise-Level Error Embeddings to Improve LLM-Assisted Robustness in Persian Speech Recognition

Automatic Speech Recognition (ASR) systems suffer significant performance degradation in noisy environments, especially for low-resource languages such as Persian. This study presents a robust noise-aware ASR error-correction framework that combines multiple hypotheses with noise-aware modeling, demonstrating substantial improvements in Word Error Rate (WER).

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

arXiv:2512.17131v1 Announce Type: cross Abstract: We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method in its primal averaging formulation that addresses key limitations of recent averaging-based optimizers such as single-worker DiLoCo and Schedule-Free (SF) in the non-distributed setting. These two recent algorithmic approaches improve the performance of base optimizers, such as AdamW, through different iterate averaging strategies. Schedule-Free explicitly maintains a uniform average of past weights, while single-worker DiLoCo performs implicit averaging by periodically aggregating trajectories, called pseudo-gradients, to update the model parameters. However, single-worker DiLoCo's periodic averaging introduces a two-loop structure, increasing its memory requirements and number of hyperparameters. GPA overcomes these limitations by decoupling the interpolation constant in the primal averaging formulation of Nesterov. This decoupling enables GPA to smoothly average iterates at every step, generalizing and improving upon single-worker DiLoCo. Empirically, GPA consistently outperforms single-worker DiLoCo while removing the two-loop structure, simplifying hyperparameter tuning, and reducing its memory overhead to a single additional buffer. On the Llama-160M model, GPA provides a 24.22% speedup in terms of steps to reach the baseline (AdamW's) validation loss. Likewise, GPA achieves speedups of 12% and 27% on small and large batch setups, respectively, to attain AdamW's validation accuracy on the ImageNet ViT workload. Furthermore, we prove that for any base optimizer with regret bounded by $O(\sqrt{T})$, where $T$ is the number of iterations, GPA can match or exceed the convergence guarantee of the original optimizer, depending on the choice of interpolation constants.
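
A minimal sketch of the decoupled primal-averaging loop on a toy quadratic, with plain SGD standing in for the base optimizer: the fast iterate z takes the optimizer step, and the model weights x interpolate toward z with a constant c at every step. The objective, learning rate, and interpolation constant are illustrative, not the paper's settings.

```python
# Toy per-step primal averaging with a decoupled interpolation constant.
def gpa_loop(steps: int = 200, lr: float = 0.1, c: float = 0.05) -> float:
    z = 5.0          # fast iterate updated by the base optimizer
    x = 5.0          # averaged iterate actually used as "the model"
    for _ in range(steps):
        grad = 2 * x              # gradient of f(x) = x**2, taken at the
                                  # averaged point for simplicity
        z -= lr * grad            # base-optimizer (plain SGD) step on z
        x = (1 - c) * x + c * z   # smooth per-step averaging toward z
    return x


print(gpa_loop())  # approaches the minimizer 0.0
```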

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Lights, Camera, Consistency: A Multi-Stage Pipeline for AI Video Stories with Stable Characters

Generating long, cohesive video stories with consistent characters is a significant challenge for current text-to-video AI. We present a method that approaches video generation the way a filmmaker would, using a pipeline that involves writing a detailed script and generating consistent visuals for each character.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Globally Optimal Solution to the Generalized Relative Pose Estimation Problem using Affine Correspondences

arXiv:2512.17188v1 Announce Type: new Abstract: Mobile devices equipped with a multi-camera system and an inertial measurement unit (IMU) are widely used nowadays, such as self-driving cars. The task of relative pose estimation using visual and inertial information has important applications in various fields. To improve the accuracy of relative pose estimation of multi-camera systems, we propose a globally optimal solver using affine correspondences to estimate the generalized relative pose with a known vertical direction. First, a cost function about the relative rotation angle is established after decoupling the rotation matrix and translation vector, which minimizes the algebraic error of geometric constraints from affine correspondences. Then, the global optimization problem is converted into two polynomials with two unknowns based on the characteristic equation and its first derivative is zero. Finally, the relative rotation angle can be solved using the polynomial eigenvalue solver, and the translation vector can be obtained from the eigenvector. Besides, a new linear solution is proposed when the relative rotation is small. The proposed solver is evaluated on synthetic data and real-world datasets. The experiment results demonstrate that our method outperforms comparable state-of-the-art methods in accuracy.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Mitty: Diffusion-Based Human-to-Robot Video Generation

Learning directly from human demonstration videos is an important milestone for scalable, generalizable robot learning. We present Mitty, a Diffusion Transformer that enables video In-Context Learning for Human2Robot video generation, building on a pretrained video diffusion model and requiring no action labels.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

Can Large Reasoning Models Improve Accuracy on Mathematical Tasks by Using Flawed Thinking?

Chain-of-thought (CoT) prompting has become central to mathematical reasoning in large language models, yet models remain vulnerable to early errors. We investigate whether training on intentionally flawed reasoning traces can teach models to detect and recover from such errors without degrading standard problem-solving ability.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Adversarial VR: An Open-Source Testbed for Evaluating Adversarial Robustness in VR Cybersickness Detection and Mitigation

Automated deep learning (DL)-based cybersickness detection methods can improve user comfort and interaction. However, these systems are susceptible to adversarial attacks, which can degrade model performance and disrupt the immersive experience. This paper presents Adversarial-VR, a real-time testbed for evaluating cybersickness detection and mitigation strategies under adversarial conditions.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

A Women's Health Benchmark for Large Language Models

As large language models (LLMs) become primary sources of health information for millions, their accuracy on women's health remains critically unexplored. We present the Women's Health Benchmark (WHB), the first benchmark that evaluates LLM performance specifically on women's health.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

Atom: Efficient On-Device Video-Language Pipelines Through Modular Reuse

arXiv:2512.17108v1 Announce Type: new Abstract: Recent advances in video-language models have enabled powerful applications like video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging due to redundant model loads and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and reuses them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency without sacrificing performance. On commodity smartphones, Atom achieves 27--33% faster execution compared to non-reuse baselines, with only marginal performance drop ($\leq$ 2.3 Recall@1 in retrieval, $\leq$ 1.5 CIDEr in captioning). These results position Atom as a practical, scalable approach for efficient video-language understanding on edge devices.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images

arXiv:2512.17306v1 Announce Type: new Abstract: Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

Systemic Risk Radar: A Multilayer Graph Framework for Early Warning of Market Crises

Financial crises emerge when structural vulnerabilities accumulate across sectors, markets, and investor behavior. The Systemic Risk Radar (SRR) models financial markets as multilayer graphs to detect early signals of systemic fragility and transitions toward crash regimes. We evaluate SRR on three major crises: the Dot-com crash, the Global Financial Crisis, and the COVID-19 shock.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Predictive Modeling of Maritime Radar Data Using Transformer Architecture

arXiv:2512.17098v1 Announce Type: new Abstract: Maritime autonomous systems require robust predictive capabilities to anticipate vessel motion and environmental dynamics. While transformer architectures have revolutionized AIS-based trajectory prediction and demonstrated feasibility for sonar frame forecasting, their application to maritime radar frame prediction remains unexplored, creating a critical gap given radar's all-weather reliability for navigation. This survey systematically reviews predictive modeling approaches relevant to maritime radar, with emphasis on transformer architectures for spatiotemporal sequence forecasting, where existing representative methods are analyzed according to data type, architecture, and prediction horizon. Our review shows that, while the literature has demonstrated transformer-based frame prediction for sonar sensing, no prior work addresses transformer-based maritime radar frame prediction, thereby defining a clear research gap and motivating a concrete research direction for future work in this area.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

arXiv:2512.17012v1 Announce Type: new Abstract: Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model

arXiv:2512.17313v1 Announce Type: new Abstract: Despite the impressive zero-shot capabilities of Vision-Language Models (VLMs), they often struggle in downstream tasks with distribution shifts from the pre-training data. Few-Shot Adaptation (FSA-VLM) has emerged as a key solution, typically using Parameter-Efficient Fine-Tuning (PEFT) to adapt models with minimal data. However, these PEFT methods are constrained by their reliance on fixed, handcrafted prompts, which are often insufficient to understand the semantics of classes. While some studies have proposed leveraging image-induced prompts to provide additional clues for classification, they introduce prohibitive computational overhead at inference. Therefore, we introduce Auxiliary Descriptive Knowledge (ADK), a novel framework that efficiently enriches text representations without compromising efficiency. ADK first leverages a Large Language Model to generate a rich set of descriptive prompts for each class offline. These pre-computed features are then deployed in two ways: (1) as Compositional Knowledge, an averaged representation that provides rich semantics, especially beneficial when class names are ambiguous or unfamiliar to the VLM; and (2) as Instance-Specific Knowledge, where a lightweight, non-parametric attention mechanism dynamically selects the most relevant descriptions for a given image. This approach provides two additional types of knowledge alongside the handcrafted prompt, thereby facilitating category distinction across various domains. Also, ADK acts as a parameter-free, plug-and-play component that enhances existing PEFT methods. Extensive experiments demonstrate that ADK consistently boosts the performance of multiple PEFT baselines, setting a new state-of-the-art across various scenarios.
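
A minimal sketch of the two knowledge types, using random stand-in embeddings: Compositional Knowledge averages all precomputed description embeddings for a class, while Instance-Specific Knowledge uses parameter-free attention to weight descriptions by similarity to the image. The fusion rule at the end is our assumption, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
D = rng.normal(size=(10, 64))   # 10 LLM-generated description embeddings, dim 64
v = rng.normal(size=64)         # query image feature

# (1) Compositional Knowledge: average of all description embeddings.
compositional = D.mean(axis=0)

# (2) Instance-Specific Knowledge: non-parametric attention that weights
# descriptions by similarity to the image; no learned parameters involved.
weights = softmax(D @ v / np.sqrt(D.shape[1]))
instance_specific = weights @ D

class_text_feature = (compositional + instance_specific) / 2  # simple fusion (assumption)
print(class_text_feature.shape)  # (64,)
```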

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

Rotterdam artery-vein segmentation (RAV) dataset

arXiv:2512.17322v1 Announce Type: new Abstract: Purpose: To provide a diverse, high-quality dataset of color fundus images (CFIs) with detailed artery-vein (A/V) segmentation annotations, supporting the development and evaluation of machine learning algorithms for vascular analysis in ophthalmology. Methods: CFIs were sampled from the longitudinal Rotterdam Study (RS), encompassing a wide range of ages, devices, and capture conditions. Images were annotated using a custom interface that allowed graders to label arteries, veins, and unknown vessels on separate layers, starting from an initial vessel segmentation mask. Connectivity was explicitly verified and corrected using connected component visualization tools. Results: The dataset includes 1024x1024-pixel PNG images in three modalities: original RGB fundus images, contrast-enhanced versions, and RGB-encoded A/V masks. Image quality varied widely, including challenging samples typically excluded by automated quality assessment systems, but judged to contain valuable vascular information. Conclusion: This dataset offers a rich and heterogeneous source of CFIs with high-quality segmentations. It supports robust benchmarking and training of machine learning models under real-world variability in image quality and acquisition settings. Translational Relevance: By including connectivity-validated A/V masks and diverse image conditions, this dataset enables the development of clinically applicable, generalizable machine learning tools for retinal vascular analysis, potentially improving automated screening and diagnosis of systemic and ocular diseases.

Fonte: arXiv cs.CV

NLP/LLMs • Score 95

CheXPO-v2: Preference Optimization for Chest X-ray VLMs with Knowledge Graph Consistency

arXiv:2512.17213v1 Announce Type: new Abstract: Medical Vision-Language Models (VLMs) are prone to hallucinations, compromising clinical reliability. While reinforcement learning methods like Group Relative Policy Optimization (GRPO) offer a low-cost alignment solution, their reliance on sparse, outcome-based rewards inadvertently encourages models to "overthink" -- generating verbose, convoluted, and unverifiable Chain-of-Thought reasoning to justify answers. This focus on outcomes obscures factual errors and poses significant safety risks. To address this, we propose CheXPO-v2, a novel alignment framework that shifts from outcome to process supervision. Our core innovation is a Knowledge Graph Consistency Reward mechanism driven by Entity-Relation Matching. By explicitly parsing reasoning steps into structured "Disease, Relation, Anatomy" triplets, we provide fine-grained supervision that penalizes incoherent logic and hallucinations at the atomic level. Integrating this with a hard-example mining strategy, our approach significantly outperforms GRPO and state-of-the-art models on benchmarks like MIMIC-CXR-VQA. Crucially, CheXPO-v2 achieves new state-of-the-art accuracy using only 5k samples, demonstrating exceptional data efficiency while producing clinically sound and verifiable reasoning. The project source code is publicly available at: https://github.com/ecoxial2007/CheX-Phi4MM.
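
A toy version of the entity-relation matching reward, assuming triplets have already been parsed from the reasoning steps upstream; the precision-style formula is an illustrative assumption, not the paper's exact reward.

```python
# Knowledge-graph consistency reward via entity-relation matching (sketch).
def kg_consistency_reward(pred_triplets, ref_triplets):
    """Fraction of predicted (disease, relation, anatomy) triplets found in
    the reference graph; hallucinated triplets lower the reward."""
    pred, ref = set(pred_triplets), set(ref_triplets)
    if not pred:
        return 0.0
    return len(pred & ref) / len(pred)  # precision-style reward (assumption)

ref = [("effusion", "located_in", "left pleura"), ("edema", "absent_in", "lungs")]
pred = [("effusion", "located_in", "left pleura"), ("pneumonia", "located_in", "rll")]
print(kg_consistency_reward(pred, ref))  # 0.5: one matched, one hallucinated
```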

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval

Large Language Model (LLM) agents increasingly rely on long-term memory and Retrieval-Augmented Generation (RAG) to persist experiences and refine future performance. This work presents MemoryGraft, an indirect injection attack that compromises agent behavior by implanting malicious experiences into its long-term memory.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

DAVE: A VLM Vision Encoder for Document Understanding and Web Agents

arXiv:2512.17221v1 Announce Type: new Abstract: While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

GB-DQN: Gradient Boosted DQN Models for Non-stationary Reinforcement Learning

arXiv:2512.17034v1 Announce Type: new Abstract: Non-stationary environments pose a fundamental challenge for deep reinforcement learning, as changes in dynamics or rewards invalidate learned value functions and cause catastrophic forgetting. We propose Gradient-Boosted Deep Q-Networks (GB-DQN), an adaptive ensemble method that addresses model drift through incremental residual learning. Instead of retraining a single Q-network, GB-DQN constructs an additive ensemble in which each new learner is trained to approximate the Bellman residual of the current ensemble after drift. We provide theoretical results showing that each boosting step reduces the empirical Bellman residual and that the ensemble converges to the post-drift optimal value function under standard assumptions. Experiments across a diverse set of control tasks with controlled dynamics changes demonstrate faster recovery, improved stability, and greater robustness compared to DQN and common non-stationary baselines.
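
A tabular toy of the additive-ensemble idea (the paper uses Q-networks, not tables): after a drift, a new learner is fit to the Bellman residual of the frozen ensemble instead of retraining the original learner. Transitions, learning rate, and sizes are illustrative.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
ensemble = [np.zeros((n_states, n_actions))]  # pre-drift Q-function stand-in

def q(s, a):
    # Ensemble Q-value: additive sum over all learners.
    return sum(m[s, a] for m in ensemble)

def fit_residual_learner(transitions):
    # New learner approximates the Bellman residual of the current ensemble.
    learner = np.zeros((n_states, n_actions))
    for s, a, r, s2 in transitions:
        target = r + gamma * max(q(s2, b) for b in range(n_actions))
        learner[s, a] += 0.5 * (target - q(s, a))  # step toward the residual
    return learner

post_drift = [(0, 1, 1.0, 1), (1, 0, 0.0, 2), (2, 1, 2.0, 0)]
for _ in range(50):                        # boosting rounds on post-drift data
    ensemble.append(fit_residual_learner(post_drift))
print(q(0, 1))  # adapted value, original learner left untouched
```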

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

ResSVD: Residual Compensated SVD for Large Language Model Compression

arXiv:2505.20112v3 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated impressive capabilities in a wide range of downstream natural language processing tasks. Nevertheless, their considerable sizes and memory demands hinder practical deployment, underscoring the importance of developing efficient compression strategies. Singular value decomposition (SVD) decomposes a matrix into orthogonal components, enabling efficient low-rank approximation. This is particularly suitable for LLM compression, where weight matrices often exhibit significant redundancy. However, current SVD-based methods neglect the residual matrix from truncation, resulting in significant truncation loss. Additionally, compressing all layers of the model results in severe performance degradation. To overcome these limitations, we propose ResSVD, a new post-training SVD-based LLM compression method. Specifically, we leverage the residual matrix generated during the truncation process to reduce truncation loss. Moreover, under a fixed overall compression ratio, we selectively compress the last few layers of the model, which mitigates error propagation and significantly improves the performance of compressed models. Comprehensive evaluations of ResSVD on diverse LLM families and multiple benchmark datasets indicate that ResSVD consistently achieves superior performance over existing counterpart methods, demonstrating its practical effectiveness.
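
A minimal numpy sketch of the residual-compensation idea under an assumed rank split: truncate the weight matrix, keep the residual explicitly, and add back a low-rank approximation of it. ResSVD's layer-selection strategy is omitted here.

```python
import numpy as np

def truncate(M, r):
    """Best rank-r approximation of M via truncated SVD."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * S[:r]) @ Vt[:r]

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128))           # stand-in weight matrix

r1, r2 = 16, 8                            # rank split (an assumption here)
W1 = truncate(W, r1)                      # plain truncated SVD
R = W - W1                                # residual matrix from truncation
W_res = W1 + truncate(R, r2)              # add low-rank residual compensation

print("rank-r1 only  :", np.linalg.norm(W - W1))
print("with residual :", np.linalg.norm(W - W_res))  # strictly smaller loss
```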

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Towards Safer Chatbots: Automated Policy Compliance Evaluation of Custom GPTs

arXiv:2502.01436v3 Announce Type: replace Abstract: User-configured chatbots built on top of large language models are increasingly available through centralized marketplaces such as OpenAI's GPT Store. While these platforms enforce usage policies intended to prevent harmful or inappropriate behavior, the scale and opacity of customized chatbots make systematic policy enforcement challenging. As a result, policy-violating chatbots continue to remain publicly accessible despite existing review processes. This paper presents a fully automated method for evaluating the compliance of Custom GPTs with its marketplace usage policy using black-box interaction. The method combines large-scale GPT discovery, policy-driven red-teaming prompts, and automated compliance assessment using an LLM-as-a-judge. We focus on three policy-relevant domains explicitly addressed in OpenAI's usage policies: Romantic, Cybersecurity, and Academic GPTs. We validate our compliance assessment component against a human-annotated ground-truth dataset, achieving an F1 score of 0.975 for binary policy violation detection. We then apply the method in a large-scale empirical study of 782 Custom GPTs retrieved from the GPT Store. The results show that 58.7% of the evaluated GPTs exhibit at least one policy-violating response, with substantial variation across policy domains. A comparison with the base models (GPT-4 and GPT-4o) indicates that most violations originate from model-level behavior, while customization tends to amplify these tendencies rather than create new failure modes. Our findings reveal limitations in current review mechanisms for user-configured chatbots and demonstrate the feasibility of scalable, behavior-based policy compliance evaluation.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering

arXiv:2512.17396v1 Announce Type: cross Abstract: In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

LLM-as-a-qualitative-judge: automating error analysis in natural language generation

arXiv:2506.09147v4 Announce Type: replace Abstract: Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be done to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that instance-specific issues output by LLM-as-a-qualitative-judge match those annotated by humans in 2/3 cases, and that LLM-as-a-qualitative-judge is capable of producing error type reports resembling the reports composed by human annotators. We also demonstrate in a case study how the use of LLM-as-a-qualitative-judge can substantially improve NLG systems performance. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks reveals significant limitations. To overcome these challenges, we investigate more stable and effective advantage-estimation strategies and introduce turn-PPO, a variant that operates on a turn-level MDP formulation.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents

arXiv:2503.10689v2 Announce Type: replace Abstract: Recent advances in large language models (LLMs) have led to a growing interest in developing LLM-based agents for automating web tasks. However, these agents often struggle with even simple tasks on real-world websites due to their limited capability to understand and process complex web page structures. In this work, we introduce LCoW, a framework for Learning language models to Contextualize complex Web pages into a more comprehensible form, thereby enhancing decision making by LLM agents. LCoW decouples web page understanding from decision making by training a separate contextualization module to transform complex web pages into comprehensible format, which are then utilized by the decision-making agent. We demonstrate that our contextualization module effectively integrates with LLM agents of various scales to significantly enhance their decision-making capabilities in web automation tasks. Notably, LCoW improves the success rates of closed-source LLMs (e.g., Gemini-1.5-flash, GPT-4o, Claude-3.5-Sonnet) by an average of 15.6%, and demonstrates a 23.7% average improvement in success rates for open-source LMs (e.g., Llama-3.1-8B, Llama-3.1-70B) on the WorkArena benchmark. Moreover, the Gemini-1.5-flash agent with LCoW achieves state-of-the-art results on the WebShop benchmark, outperforming human experts. The relevant code materials are available at our project page: https://lcowiclr2025.github.io.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens

arXiv:2512.17375v1 Announce Type: new Abstract: Reward models and LLM-as-a-Judge systems are central to modern post-training pipelines such as RLHF, DPO, and RLAIF, where they provide scalar feedback and binary decisions that guide model selection and RL-based fine-tuning. We show that these judge systems exhibit a recurring vulnerability: short sequences of low-perplexity control tokens can flip many binary evaluations from correct "No" judgments to incorrect "Yes" judgments by steering the last-layer logit gap. These control tokens are patterns that a policy model could plausibly generate during post-training, and thus represent realistic reward-hacking risks rather than worst-case adversarial strings. Our method, AdvJudge-Zero, uses the model's next-token distribution and beam-search exploration to discover diverse control-token sequences from scratch, and our analysis shows that the induced hidden-state perturbations concentrate in a low-rank "soft mode" that is anti-aligned with the judge's refusal direction. Empirically, these tokens cause very high false positive rates when large open-weight and specialized judge models score incorrect answers on math and reasoning benchmarks. Finally, we show that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives while preserving evaluation quality.

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

On the Role of Contextual Information and Ego States in the Behavior of LLM Agents for Transactional Analysis Dialogues

LLM-powered agents are used in many areas, from customer support to education, with growing interest in their ability to act in more human-like ways. This paper proposes a Multi-Agent System (MAS) inspired by Transactional Analysis (TA) theory, in which each agent is split into three ego states: Parent, Adult, and Child, enriching the response process with a contextual information retrieval mechanism.

Fonte: arXiv cs.AI

NLP/LLMs • Score 93

Compression is Routing: Reconstruction Error as an Intrinsic Signal for Modular Language Models

Large Language Models (LLMs) face challenges such as context-length limits and high inference costs. This paper proposes the architectural philosophy of "Compression is Routing" and presents an 87M-parameter Transformer Autoencoder that achieves 64x sequence compression. Experimental results show a performance discrepancy that validates reconstruction error as an "Intrinsic Distribution Fingerprint".

Fonte: arXiv cs.LG

NLP/LLMs • Score 96

ScoutGPT: Capturing Player Impact from Team Action Sequences Using a GPT-Based Framework

Transfers play a key role in a football club's success, yet predicting whether a transfer will succeed remains difficult because on-pitch performance is strongly context-dependent. To address this gap, we introduce EventGPT, a player-conditioned, value-aware next-event prediction model built on a GPT-style autoregressive transformer.

Fonte: arXiv cs.AI

NLP/LLMs • Score 96

PAACE: An Automated, Planning-Aware Context Engineering Framework

Large Language Models (LLMs) are increasingly used in complex workflows that involve planning, tool use, reflection, and interaction with external knowledge systems. This work presents PAACE, a unified framework for optimizing the evolving state of LLM agents through next-task relevance modeling, planning-structure analysis, instruction co-refinement, and function-preserving compression.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters

arXiv:2505.14886v2 Announce Type: replace Abstract: Winning competitive debates requires sophisticated reasoning and argument skills. Competitive debate poses unique challenges: (1) time constraints force debaters to make strategic choices about which points to pursue rather than covering all possible arguments; (2) the persuasiveness of the debate relies on the back-and-forth interaction between arguments, which a single final game status cannot evaluate. To address these challenges, we propose TreeDebater, a novel debate framework that excels in competitive debate. We introduce two tree structures: the Rehearsal Tree and the Debate Flow Tree. The Rehearsal Tree anticipates attacks and defenses to evaluate the strength of a claim, while the Debate Flow Tree tracks the debate status to identify the active actions. TreeDebater allocates its time budget among candidate actions and uses the speech time controller and feedback from the simulated audience to revise its statement. Human evaluation at both the stage level and the debate level shows that TreeDebater outperforms the state-of-the-art multi-agent debate system, with a +15.6% improvement in stage-level persuasiveness with DeepSeek and a +10% win in debate-level opinion shift. Further investigation shows that TreeDebater adopts better strategies, devoting its limited time to important debate actions in line with the strategies of human debate experts.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Generating Completions for Broca's Aphasic Sentences Using Large Language Models

arXiv:2412.17669v2 Announce Type: replace Abstract: Broca's aphasia is a type of aphasia characterized by non-fluent, effortful and agrammatic speech production with relatively good comprehension. Since traditional aphasia treatment methods are often time-consuming, labour-intensive, and do not reflect real-world conversations, applying natural language processing based approaches such as Large Language Models (LLMs) could potentially contribute to improving existing treatment approaches. To address this issue, we explore the use of sequence-to-sequence LLMs for completing Broca's aphasic sentences. We first generate synthetic Broca's aphasic data using a rule-based system designed to mirror the linguistic characteristics of Broca's aphasic speech. Using this synthetic data (without authentic aphasic samples), we then fine-tune four pre-trained LLMs on the task of completing agrammatic sentences. We evaluate our fine-tuned models on both synthetic and authentic Broca's aphasic data. We demonstrate LLMs' capability for reconstructing agrammatic sentences, with the models showing improved performance with longer input utterances. Our result highlights the LLMs' potential in advancing communication aids for individuals with Broca's aphasia and possibly other clinical populations.
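
A hypothetical rule-based corruption in the spirit of the paper's synthetic data generator: drop function words to mimic agrammatic output, yielding (agrammatic, fluent) pairs for fine-tuning a seq2seq completion model. The word list and rule below are illustrative assumptions, not the paper's exact system.

```python
import re

# Illustrative function-word list; a real system would use a fuller inventory.
FUNCTION_WORDS = {"the", "a", "an", "is", "are", "was", "were", "to",
                  "of", "in", "on", "at", "has", "have", "had", "will"}

def to_agrammatic(sentence: str) -> str:
    """Keep content words, drop function words and auxiliaries."""
    tokens = re.findall(r"[a-zA-Z']+", sentence.lower())
    kept = [t for t in tokens if t not in FUNCTION_WORDS]
    return " ".join(kept)

fluent = "The boy is kicking the ball in the garden"
print(to_agrammatic(fluent), "->", fluent)
# boy kicking ball garden -> The boy is kicking the ball in the garden
```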

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

arXiv:2512.17419v1 Announce Type: cross Abstract: Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on Python-based bug fixes. We introduce SWE-Bench++, an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike synthetic approaches, our pipeline harvests live pull requests to cover both bug fixes and feature requests across 11 languages. SWE-Bench++ turns GitHub pull requests (PRs) into reproducible, execution-based tasks via four stages: programmatic sourcing, environment synthesis, test oracle extraction, and quality assurance. A final hint-guided trajectory synthesis step converts instances that strong models fail on into training trajectories. Our initial benchmark consists of 11,133 instances from 3,971 repositories across 11 languages. On a subset of 1,782 instances of this benchmark, today's strongest models perform as follows: claude-sonnet-4.5 achieves 36.20% pass@10, gpt-5-2025-08-07 34.57%, gemini/gemini-2.5-pro 24.92%, and gpt-4o 16.89%. We further demonstrate the utility of our dataset by showing that fine-tuning on SWE-Bench++ instances yields measurable improvements on the SWE-bench Multilingual benchmark. SWE-Bench++ provides a scalable, multilingual benchmark for evaluating and improving repository-level code generation.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Reinforcement Learning for a Self-Improving Agent with a Skill Library

Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in complex reasoning and multi-step interactions, but they struggle to continuously improve and adapt in new environments. We propose a Reinforcement Learning (RL) approach that enhances agents' self-improvement capabilities with a skill library.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Democratizing Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling

arXiv:2512.17326v1 Announce Type: new Abstract: Vision-language models (VLMs) have the potential to become co-pilots for pathologists. However, most VLMs either focus on small regions of interest within whole-slide images, provide only static slide-level outputs, or rely on data that is not publicly available, limiting reproducibility. Furthermore, training data containing WSIs paired with detailed clinical reports is scarce, restricting progress toward transparent and generalisable VLMs. We address these limitations with three main contributions. First, we introduce Polysome, a standardised tool for synthetic instruction generation. Second, we apply Polysome to the public HISTAI dataset, generating HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, we use HISTAI-Instruct to train ANTONI-α, a VLM capable of visual-question answering (VQA). We show that ANTONI-α outperforms MedGemma on WSI-level VQA tasks of tissue identification, neoplasm detection, and differential diagnosis. We also compare the performance of multiple incarnations of ANTONI-α trained with different amounts of data. All methods, data, and code are publicly available.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

Large Language Models as Pokémon Battle Agents: Strategic Play and Content Generation

arXiv:2512.17308v1 Announce Type: new Abstract: Strategic decision-making in Pokémon battles presents a unique testbed for evaluating large language models. Pokémon battles demand reasoning about type matchups, statistical trade-offs, and risk assessment, skills that mirror human strategic thinking. This work examines whether Large Language Models (LLMs) can serve as competent battle agents, capable of both making tactically sound decisions and generating novel, balanced game content. We developed a turn-based Pokémon battle system where LLMs select moves based on battle state rather than pre-programmed logic. The framework captures essential Pokémon mechanics: type effectiveness multipliers, stat-based damage calculations, and multi-Pokémon team management. Through systematic evaluation across multiple model architectures, we measured win rates, decision latency, type-alignment accuracy, and token efficiency. These results suggest LLMs can function as dynamic game opponents without domain-specific training, offering a practical alternative to reinforcement learning for turn-based strategic games. The dual capability of tactical reasoning and content creation positions LLMs as both players and designers, with implications for procedural generation and adaptive difficulty systems in interactive entertainment.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

ShareChat: A Dataset of Chatbot Conversations in the Wild

arXiv:2512.17843v1 Announce Type: new Abstract: While Large Language Models (LLMs) have evolved into distinct platforms with unique interface designs and capabilities, existing public datasets treat models as generic text generators, stripping away the interface context that actively shapes user interaction. To address this limitation, we present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms: ChatGPT, Claude, Gemini, Perplexity, and Grok. ShareChat distinguishes itself by preserving native platform affordances often lost in standard logs, including reasoning traces, source links, and code artifacts, while spanning 101 languages over the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. We demonstrate the dataset's multifaceted utility through three representative analyses: (1) analyzing conversation completeness to measure user intent satisfaction; (2) evaluating source citation behaviors in content generation; and (3) conducting temporal analysis to track evolving usage patterns. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild.

Fonte: arXiv cs.CL

NLP/LLMs • Score 89

LookAhead Tuning: Safer Language Models via Partial Answer Previews

arXiv:2503.19041v4 Announce Type: replace Abstract: Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often compromises their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning. The method introduces two simple strategies that modify training data by previewing partial answer prefixes, thereby minimizing perturbations to the model's initial token distributions and maintaining its built-in safety mechanisms. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs.
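
A sketch of the answer-preview idea on one training example: the first few answer tokens are surfaced in the prompt so that fine-tuning barely perturbs the model's initial token distribution. The template wording below is an assumption, not the paper's exact prompt format.

```python
def lookahead_example(instruction: str, answer: str, k: int = 6) -> dict:
    """Build a training pair whose prompt previews the first k answer tokens."""
    preview = " ".join(answer.split()[:k])
    prompt = (f"{instruction}\n"
              f"(Your answer will begin with: '{preview} ...')")
    return {"prompt": prompt, "completion": answer}

ex = lookahead_example(
    "Summarize the safety policy in one sentence.",
    "The policy requires refusing harmful requests while staying helpful.",
)
print(ex["prompt"])
```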

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Investigating the Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the capacity to autonomously conceive, investigate, and reason across scientific domains, is still missing. We present an operational definition of SGI grounded in the Practical Inquiry Model (PIM) and operationalize it through four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning.

Fonte: arXiv cs.AI

NLP/LLMs • Score 92

Linear Personality Probing and Steering in LLMs: A Big Five Study

arXiv:2512.17639v1 Announce Type: new Abstract: Large language models (LLMs) exhibit distinct and consistent personalities that greatly impact trust and engagement. While this means that personality frameworks would be highly valuable tools to characterize and control LLMs' behavior, current approaches remain either costly (post-training) or brittle (prompt engineering). Probing and steering via linear directions has recently emerged as a cheap and efficient alternative. In this paper, we investigate whether linear directions aligned with the Big Five personality traits can be used for probing and steering model behavior. Using Llama 3.3 70B, we generate descriptions of 406 fictional characters and their Big Five trait scores. We then prompt the model with these descriptions and questions from the Alpaca questionnaire, allowing us to sample hidden activations that vary along personality traits in known, quantifiable ways. Using linear regression, we learn a set of per-layer directions in activation space, and test their effectiveness for probing and steering model behavior. Our results suggest that linear directions aligned with trait-scores are effective probes for personality detection, while their steering capabilities strongly depend on context, producing reliable effects in forced-choice tasks but limited influence in open-ended generation or when additional context is present in the prompt.
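
A toy probe-and-steer loop on synthetic activations: fit a least-squares direction that predicts a trait score from hidden states, then shift a hidden state along that direction. Dimensions, noise level, and steering strength are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 406                                # hidden size, characters
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

H = rng.normal(size=(n, d))                    # sampled hidden activations
scores = H @ true_dir + 0.1 * rng.normal(size=n)  # toy Big Five trait scores

# Probe: least-squares direction predicting the trait from activations.
w, *_ = np.linalg.lstsq(H, scores, rcond=None)
print("probe/trait alignment:", true_dir @ w / np.linalg.norm(w))  # close to 1.0

# Steer: shift an activation along the learned direction to raise the trait.
h = rng.normal(size=d)
alpha = 3.0                                    # steering strength (assumption)
h_steered = h + alpha * w / np.linalg.norm(w)
print("trait readout before/after:", h @ w, h_steered @ w)
```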

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports

arXiv:2512.17776v1 Announce Type: new Abstract: As large language models (LLMs) advance, deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert reporting, evaluations that rely heavily on LLM judges can fail to capture issues that require expert judgment, and source verification typically covers only a limited subset of explicitly cited statements rather than report-wide factual reliability. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimensions) operationalized into 130 fine-grained rubric items. DEER further provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies all claims across the entire report, including both cited and uncited ones, and quantifies external-evidence quality. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Toward Ethical AI Through Bayesian Uncertainty in Neural Question Answering

arXiv:2512.17677v1 Announce Type: new Abstract: We explore Bayesian reasoning as a means to quantify uncertainty in neural networks for question answering. Starting with a multilayer perceptron on the Iris dataset, we show how posterior inference conveys confidence in predictions. We then extend this to language models, applying Bayesian inference first to a frozen head and finally to LoRA-adapted transformers, evaluated on the CommonsenseQA benchmark. Rather than aiming for state-of-the-art accuracy, we compare Laplace approximations against maximum a posteriori (MAP) estimates to highlight uncertainty calibration and selective prediction. This allows models to abstain when confidence is low. An "I don't know" response not only improves interpretability but also illustrates how Bayesian methods can contribute to more responsible and ethical deployment of neural question-answering systems.
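
A minimal sketch of the selective-prediction step: average class probabilities over posterior samples (e.g., drawn from a Laplace approximation) and abstain when predictive confidence falls below a threshold. The threshold and sample values are illustrative.

```python
import numpy as np

def predictive(prob_samples, threshold=0.7):
    """Bayesian model average over posterior samples, with abstention."""
    p = prob_samples.mean(axis=0)
    if p.max() < threshold:
        return "I don't know", p
    return int(p.argmax()), p

# 5 posterior samples of class probabilities for one 4-choice question.
confident = np.array([[0.8, 0.1, 0.05, 0.05]] * 5)
uncertain = np.array([[0.4, 0.3, 0.2, 0.1], [0.2, 0.4, 0.3, 0.1],
                      [0.3, 0.2, 0.4, 0.1], [0.25, 0.25, 0.25, 0.25],
                      [0.3, 0.3, 0.2, 0.2]])
print(predictive(confident)[0])   # 0
print(predictive(uncertain)[0])   # "I don't know"
```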

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs

arXiv:2512.17206v1 Announce Type: new Abstract: Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection for diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable control over the (vision-) language model's strategic behavior, thereby achieving consistent performance gains over standard RL methods.

Fonte: arXiv cs.CV

NLP/LLMs • Score 96

QSMOTE-PGM/kPGM: Imbalanced Dataset Classification Based on QSMOTE and kPGM

Quantum-inspired machine learning (QiML) uses mathematical structures from quantum theory to enhance classical algorithms, with a focus on inner-product structures in high-dimensional feature spaces. This work presents a unified theoretical and empirical comparison of PGM- and kPGM-based classifiers, analyzing their performance under synthetic oversampling scenarios using Quantum SMOTE (QSMOTE) variants.

Fonte: arXiv cs.LG

NLP/LLMs • Score 95

Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?

arXiv:2512.17394v1 Announce Type: new Abstract: Theory of Mind (ToM), the ability to attribute beliefs, desires, and emotions to others, is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, where human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse theory of mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Subjective Question Generation and Answer Evaluation using NLP

arXiv:2512.17289v1 Announce Type: new Abstract: Natural Language Processing (NLP) is one of the most revolutionary technologies today. It uses artificial intelligence to understand human text and spoken words. It is used for text summarization, grammar checking, sentiment analysis, and advanced chatbots, and has many more potential use cases. Furthermore, it has also made its mark on the education sector. Much research has already been conducted on objective question generation; however, automated subjective question generation and answer evaluation are still works in progress. An automated system to generate subjective questions and evaluate the answers can help teachers assess student work and enhance the student's learning experience by allowing them to self-assess their understanding after reading an article or a chapter of a book. This research aims to improve current NLP models or develop a novel one for automated subjective question generation and answer evaluation from text input.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

CIFE: Code Instruction-Following Evaluation

arXiv:2512.17387v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment: developers also expect adherence to explicit requirements for robustness, formatting, and security. Existing benchmarks primarily assess correctness through test-case execution, offering limited insight into how reliably models follow such constraints. We introduce a benchmark of 1,000 Python tasks, each paired with an average of 7 developer-specified constraints spanning 13 categories. Constraints are curated through a four-stage human-LLM pipeline to ensure they are atomic, relevant, and objective. We evaluate 14 open- and closed-source models using complementary adherence metrics and propose the C2A Score, a composite measure that jointly captures correctness and constraint compliance. Results reveal a substantial gap between partial and strict satisfaction: while strong models achieve over 90% partial adherence, strict adherence remains between 39% and 66%. These findings highlight that trustworthy code generation requires not only correctness but also consistent adherence to developer intent.
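
The exact C2A formula is defined in the paper; purely as an illustration of coupling the two axes, a harmonic-mean combination (our assumption, not the paper's definition) penalizes models that are strong on one axis but weak on the other.

```python
def c2a_score(pass_rate: float, adherence: float) -> float:
    """Illustrative composite of correctness and constraint compliance
    (harmonic mean; the real C2A formula may differ)."""
    if pass_rate + adherence == 0:
        return 0.0
    return 2 * pass_rate * adherence / (pass_rate + adherence)

print(c2a_score(pass_rate=0.92, adherence=0.45))  # correct but non-compliant
print(c2a_score(pass_rate=0.92, adherence=0.95))  # strong on both axes
```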

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Bangla MedER: Multi-BERT Ensemble Approach for the Recognition of Bangla Medical Entity

arXiv:2512.17769v1 Announce Type: new Abstract: Medical Entity Recognition (MedER) is an essential NLP task for extracting meaningful entities from the medical corpus. Nowadays, MedER-based research outcomes can remarkably contribute to the development of automated systems in the medical sector, ultimately enhancing patient care and outcomes. While extensive research has been conducted on MedER in English, low-resource languages like Bangla remain underexplored. Our work aims to bridge this gap. For Bangla medical entity recognition, this study first examined a number of transformer models, including BERT, DistilBERT, ELECTRA, and RoBERTa. We also propose a novel Multi-BERT Ensemble approach that outperformed all baseline models with the highest accuracy of 89.58%. Notably, it provides an 11.80% accuracy improvement over the single-layer BERT model, demonstrating its effectiveness for this task. A major challenge in MedER for low-resource languages is the lack of annotated datasets. To address this issue, we developed a high-quality dataset tailored for the Bangla MedER task. The dataset was used to evaluate the effectiveness of our model through multiple performance metrics, demonstrating its robustness and applicability. Our findings highlight the potential of Multi-BERT Ensemble models in improving MedER for Bangla and set the foundation for further advancements in low-resource medical NLP.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models

arXiv:2512.17385v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness heavily relies on supervised training with extensive labeled (e.g., question-answering pairs) or unlabeled datasets (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, not even unlabeled code snippets. We introduce problem-space probing, test-understanding probing, solution-space probing, and knowledge consolidation and reinforcement to probe the internal knowledge and confidence patterns present in LLMs. Further, IPC identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation to train UCoder (a coder with unsupervised learning). We validate the proposed approach across multiple code benchmarks, demonstrating that unsupervised methods can achieve competitive performance compared to supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation tasks, opening new directions for training code LLMs in resource-constrained scenarios.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators

arXiv:2512.17267v1 Announce Type: new Abstract: Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too-slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward to equal effect as a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
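
A sketch of the composition step on random stand-in data (requires numpy and scipy): regress candidate metric outputs onto a small pool of human ratings, then score the learned composite by Kendall correlation, mirroring the under-100-feedback-points regime described above.

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n, m = 80, 6                        # <100 feedback points, 6 retrieved metrics
X = rng.normal(size=(n, m))         # per-example scores from candidate metrics
human = X @ rng.normal(size=m) + 0.3 * rng.normal(size=n)  # toy human signal

w, *_ = np.linalg.lstsq(X, human, rcond=None)   # regression weights
composite = X @ w                               # composed automatic metric

tau, _ = kendalltau(composite, human)
print("Kendall tau vs. human:", tau)
```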

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

When the Gold Standard isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content

arXiv:2512.17738v1 Announce Type: new Abstract: User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation particularly challenging: what counts as a "good" translation depends on the level of standardness desired in the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. Through a case study on large language models (LLMs), we show that translation scores are highly sensitive to prompts with explicit translation instructions for UGC, and that they improve when these align with the dataset's guidelines. We argue that when preserving UGC style is important, fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL

Deploying accurate, enterprise-grade Text-to-SQL systems faces a difficult trilemma of cost, security, and performance. Current solutions force companies to choose between expensive proprietary Large Language Models (LLMs) and underperforming Small Language Models (SLMs). We propose Struct-SQL, a novel Knowledge Distillation (KD) framework that trains an SLM to emulate a powerful LLM.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Confidence-Credibility Aware Weighted Ensembles of Small LLMs Outperform Large LLMs in Emotion Detection

arXiv:2512.17630v1 Announce Type: new Abstract: This paper introduces a confidence-weighted, credibility-aware ensemble framework for text-based emotion detection, inspired by Condorcet's Jury Theorem (CJT). Unlike conventional ensembles that often rely on homogeneous architectures, our approach combines architecturally diverse small transformer-based large language models (sLLMs) - BERT, RoBERTa, DistilBERT, DeBERTa, and ELECTRA, each fully fine-tuned for emotion classification. To preserve error diversity, we minimize parameter convergence while taking advantage of the unique biases of each model. A dual-weighted voting mechanism integrates both global credibility (validation F1 score) and local confidence (instance-level probability) to dynamically weight model contributions. Experiments on the DAIR-AI dataset demonstrate that our credibility-confidence ensemble achieves a macro F1 score of 93.5 percent, surpassing state-of-the-art benchmarks and significantly outperforming large-scale LLMs, including Falcon, Mistral, Qwen, and Phi, even after task-specific Low-Rank Adaptation (LoRA). With only 595M parameters in total, our small LLMs ensemble proves more parameter-efficient and robust than models up to 7B parameters, establishing that carefully designed ensembles of small, fine-tuned models can outperform much larger LLMs in specialized natural language processing (NLP) tasks such as emotion detection.
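
A minimal sketch of the dual-weighted vote: each model's instance-level class probabilities (local confidence) are scaled by its validation macro-F1 (global credibility) before summing. The probabilities and F1 values below are illustrative.

```python
import numpy as np

def ensemble_predict(probs, credibility):
    """probs: (n_models, n_classes) for one instance; credibility: (n_models,).
    Votes are weighted by credibility x confidence, then summed per class."""
    weighted = credibility[:, None] * probs
    return int(weighted.sum(axis=0).argmax())

probs = np.array([[0.55, 0.45],    # BERT
                  [0.30, 0.70],    # RoBERTa
                  [0.48, 0.52]])   # DeBERTa
credibility = np.array([0.91, 0.93, 0.89])   # validation macro-F1 scores
print(ensemble_predict(probs, credibility))  # class 1 wins the weighted vote
```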

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

Governance-Aware Hybrid Fine-Tuning for Multilingual Large Language Models

arXiv:2512.17344v1 Announce Type: new Abstract: We present a governance-aware hybrid fine-tuning framework for multilingual, low-resource adaptation of large language models. The core algorithm combines gradient-aligned low-rank updates with structured orthogonal transformations through layer-wise mixing and introduces unitary constraints in selected sub-layers to stabilize deep optimization. In tandem with lightweight, label-free data governance steps, including language identification, near-duplicate removal, and quality filtering, the framework targets accuracy, calibration, and cross-language parity under tight compute budgets. Across XNLI and FLORES, the hybrid approach delivers consistent gains over strong PEFT baselines while maintaining directional balance and improving probability calibration, as shown in Tables II and III. It is more resilient to lightweight orthographic variants, as shown in Table IV, and benefits additively from simple governance steps, as shown in Table V. Training footprint measurements indicate modest overhead and a favorable cost-quality frontier, as shown in Table VI and Figure 2. Together, these results show that hybrid and unitary PEFT provide a stable and accessible path to resource-efficient multilingual adaptation when paired with practical data governance.

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding

arXiv:2512.17220v1 Announce Type: new Abstract: Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.
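
A toy of mindscape-conditioned retrieval: mix a global summary embedding into the query embedding before cosine search over chunks. The mixing weight and the mean-pooled stand-in for the hierarchical summary are assumptions, not the paper's exact construction.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x)

rng = np.random.default_rng(0)
chunks = rng.normal(size=(50, 64))              # chunk embeddings
mindscape = normalize(chunks.mean(axis=0))      # stand-in for summary embedding
query = normalize(rng.normal(size=64))

beta = 0.3                                      # mixing weight (assumption)
enriched = normalize((1 - beta) * query + beta * mindscape)

# Cosine similarity against every chunk; retrieve the top 3.
scores = chunks @ enriched / np.linalg.norm(chunks, axis=1)
print("top-3 chunks:", np.argsort(-scores)[:3])
```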

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora

arXiv:2512.17756v1 Announce Type: new Abstract: Comprehension of ancient texts plays an important role in archaeology and in understanding Chinese history and civilization. The rapid development of large language models calls for benchmarks that can evaluate their comprehension of ancient characters. Existing Chinese benchmarks mostly target modern Chinese and transmitted documents in ancient Chinese, while excavated documents in ancient Chinese remain uncovered. To meet this need, we propose AncientBench, which aims to evaluate the comprehension of ancient characters, especially in the scenario of excavated documents. AncientBench is divided into four dimensions, corresponding to the four competencies of ancient character comprehension: glyph comprehension, pronunciation comprehension, meaning comprehension, and contextual comprehension. The benchmark also contains ten tasks, including radical, phonetic radical, homophone, cloze, translation, and more, providing a comprehensive framework for evaluation. We convened archaeological researchers to conduct experimental evaluations, proposed a model for ancient Chinese as a baseline, and conducted extensive experiments on the currently best-performing large language models. The experimental results reveal the great potential of large language models in ancient textual scenarios as well as the gap with human performance. Our research aims to promote the development and application of large language models in the fields of archaeology and the ancient Chinese language.
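
To make the four-dimension, multi-task structure concrete, here is a hypothetical sketch of how such a benchmark could be organized and scored per dimension. The task-to-dimension mapping and the exact-match scoring are assumptions based only on the task names in the abstract.

```python
# Hypothetical organization: each competency dimension groups example tasks.
BENCH = {
    "glyph":         ["radical", "phonetic_radical"],
    "pronunciation": ["homophone"],
    "meaning":       ["cloze"],
    "contextual":    ["translation"],
}

def evaluate(model, items):
    """Score a model per dimension.

    model : callable taking a prompt string and returning a string
    items : list of dicts with 'dimension', 'task', 'prompt', 'answer'
    """
    per_dim = {}
    for it in items:
        pred = model(it["prompt"])
        hit = pred.strip() == it["answer"].strip()   # exact-match (assumed)
        counts = per_dim.setdefault(it["dimension"], [0, 0])
        counts[0] += hit
        counts[1] += 1
    return {dim: correct / total for dim, (correct, total) in per_dim.items()}
```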

Fonte: arXiv cs.CL

NLP/LLMs • Score 95

Simulstream: Open-Source Toolkit for Evaluation and Demonstration of Streaming Speech-to-Text Translation Systems

arXiv:2512.17648v1 Announce Type: new Abstract: Streaming Speech-to-Text Translation (StreamST) requires producing translations concurrently with incoming speech, imposing strict latency constraints and demanding models that balance partial-information decision-making with high translation quality. Research efforts on the topic have so far relied on the SimulEval repository, which is no longer maintained and does not support systems that revise their outputs. In addition, it was designed to simulate the processing of short segments rather than long-form audio streams, and it does not provide an easy method to showcase systems in a demo. As a solution, we introduce simulstream, the first open-source framework dedicated to unified evaluation and demonstration of StreamST systems. Designed for long-form speech processing, it supports not only incremental decoding approaches but also re-translation methods, enabling their comparison within the same framework in terms of both quality and latency. It also offers an interactive web interface to demo any system built within the tool.
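
A toy version of the evaluation loop such a toolkit needs is sketched below: the system returns its full current hypothesis after each chunk, so incremental and re-translation systems share one interface while quality, lag, and output revisions are tracked. This is not simulstream's actual API; the `system.process` interface and `quality_fn` are assumptions.

```python
import time

def stream_evaluate(system, audio_chunks, reference, quality_fn):
    """Evaluate a streaming ST system that may revise earlier output.

    system      : object whose .process(chunk) returns the FULL current
                  hypothesis (so re-translation fits the same interface)
    quality_fn  : e.g., a BLEU/COMET wrapper taking (hypothesis, reference)
    """
    hypotheses, delays = [], []
    start = time.monotonic()
    for chunk in audio_chunks:
        hyp = system.process(chunk)              # may rewrite earlier output
        elapsed = time.monotonic() - start
        hypotheses.append(hyp)
        delays.append(elapsed)
    return {
        "quality": quality_fn(hypotheses[-1], reference),
        "avg_lag_s": sum(delays) / len(delays),
        # Count 'flicker': steps where the new hypothesis rewrote old text.
        "revisions": sum(not new.startswith(old)
                         for old, new in zip(hypotheses, hypotheses[1:])),
    }
```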

Fonte: arXiv cs.CL

NLP/LLMs • Score 92

Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers

arXiv:2512.17351v1 Announce Type: new Abstract: Understanding architectural differences in language models is challenging, especially at academic-scale pretraining (e.g., 1.3B parameters, 100B tokens), where results are often dominated by noise and randomness. To overcome this, we introduce controlled synthetic pretraining tasks that isolate and evaluate core model capabilities. Within this framework, we discover Canon layers: lightweight architectural components, named after the musical term "canon", that promote horizontal information flow across neighboring tokens. Canon layers compute weighted sums of nearby token representations and integrate seamlessly into Transformers, linear attention, state-space models, or any sequence architecture. We present 12 key results, including how Canon layers enhance reasoning depth (e.g., by 2x), reasoning breadth, knowledge manipulation, and more. They lift weak architectures like NoPE to match RoPE, and linear attention to rival SOTA linear models like Mamba2/GDN, validated both through synthetic tasks and real-world academic-scale pretraining. This synthetic playground offers an economical, principled path to isolate core model capabilities often obscured at academic scales. Equipped with infinite high-quality data, it may even predict how future architectures will behave as training pipelines improve, e.g., through better data curation or RL-based post-training, unlocking deeper reasoning and hierarchical inference.
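
Since Canon layers are described as weighted sums of nearby token representations, one plausible minimal implementation is a causal, learnable weighted sum over a small window, added residually. The window size, per-offset per-channel weights, and residual form below are assumptions, not the paper's exact definition.

```python
import torch
import torch.nn as nn

class CanonLayer(nn.Module):
    """One plausible reading of a Canon layer: each token adds a learned,
    channel-wise weighted sum of the previous `window` token
    representations (causal, so token t only sees tokens t-1..t-window)."""

    def __init__(self, d_model: int, window: int = 4):
        super().__init__()
        self.window = window
        # One weight per (relative offset, channel); initialized to zero
        # so the layer starts as the identity (pure residual).
        self.mix = nn.Parameter(torch.zeros(window, d_model))

    def forward(self, x):                        # x: (batch, seq, d_model)
        out = x
        for k in range(1, self.window + 1):
            shifted = torch.zeros_like(x)
            shifted[:, k:] = x[:, :-k]           # token t sees token t-k
            out = out + self.mix[k - 1] * shifted
        return out

# Drop-in usage: interleave with any sequence block (attention, SSM, ...).
layer = CanonLayer(d_model=64)
y = layer(torch.randn(2, 16, 64))                # (2, 16, 64)
```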

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

PILAR: Personalizing Augmented Reality Interactions with Human-Centric, Trustworthy LLM-Based Explanations for Everyday Use Cases

Augmented reality (AR) systems driven by artificial intelligence (AI) are becoming increasingly integrated into everyday life, raising the need for real-time explanations. PILAR is a new framework that uses a pretrained large language model (LLM) to generate personalized, contextual explanations, improving the user experience in AI-based AR systems.

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience

arXiv:2512.17260v1 Announce Type: new Abstract: Large language models have recently made significant progress in generating rigorous mathematical proofs. In contrast, utilizing LLMs for theorem proving in formal languages (such as Lean) remains challenging and computationally expensive, particularly when addressing problems at the undergraduate level and beyond. In this work, we present Seed-Prover 1.5, a formal theorem-proving model trained via large-scale agentic reinforcement learning, alongside an efficient test-time scaling (TTS) workflow. Through extensive interactions with Lean and other tools, the model continuously accumulates experience during the RL process, substantially enhancing the capability and efficiency of formal theorem proving. Furthermore, leveraging recent advancements in natural language proving, our TTS workflow efficiently bridges the gap between natural and formal languages. Compared to state-of-the-art methods, Seed-Prover 1.5 achieves superior performance with a smaller compute budget. It solves 88% of PutnamBench (undergraduate-level), 80% of Fate-H (graduate-level), and 33% of Fate-X (PhD-level) problems. Notably, using our system, we solved 11 out of 12 problems from Putnam 2025 within 9 hours. Our findings suggest that scaling learning from experience, driven by high-quality formal feedback, holds immense potential for the future of formal mathematical reasoning.
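
The interplay of natural-language drafting, formalization, and Lean feedback can be pictured as a simple loop. The sketch below is entirely hypothetical: `draft_proof`, `formalize`, `lean_check`, and `refine` are stand-in callables, not Seed-Prover's interfaces, and the real workflow is far more elaborate.

```python
def prove(theorem, draft_proof, formalize, lean_check, refine, budget=16):
    """Hypothetical test-time loop bridging natural and formal proving.

    draft_proof : theorem -> natural-language proof sketch
    formalize   : (theorem, sketch) -> Lean candidate proof
    lean_check  : candidate -> (ok: bool, errors: str)  # formal feedback
    refine      : (theorem, sketch, candidate, errors) -> new candidate
    """
    sketch = draft_proof(theorem)            # reason in natural language first
    candidate = formalize(theorem, sketch)   # translate the sketch into Lean
    for _ in range(budget):
        ok, errors = lean_check(candidate)   # high-quality formal feedback
        if ok:
            return candidate                 # machine-verified proof
        # 'Learning from experience' at test time: feed errors back in.
        candidate = refine(theorem, sketch, candidate, errors)
    return None                              # budget exhausted, unproven
```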

Fonte: arXiv cs.CL

NLP/LLMs • Score 96

When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation

arXiv:2512.17083v1 Announce Type: cross Abstract: Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of prior work, evaluation practice in dialogue topic segmentation remains dominated by strict boundary matching and F1-based metrics, even as modern LLM-based conversational systems increasingly rely on segmentation to manage conversation history beyond the model's fixed context window, where unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation objective for dialogue topic segmentation that treats boundary density and segment coherence as primary criteria, alongside window-tolerant F1 (W-F1). Through extensive cross-dataset empirical evaluation, we show that reported performance differences across dialogue segmentation benchmarks are driven not by model quality but by annotation granularity mismatches and sparse boundary labels. This indicates that many reported improvements arise from evaluation artifacts rather than improved boundary detection. We evaluate multiple structurally distinct dialogue segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Across these settings, we observe high segment coherence combined with extreme oversegmentation relative to sparse labels, producing misleadingly low exact-match F1 scores. We show that topic segmentation is best understood as selecting an appropriate granularity rather than predicting a single correct boundary set. We operationalize this view by explicitly separating boundary scoring from boundary selection.
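
Window-tolerant F1 admits a compact reference implementation. The sketch below uses greedy one-to-one matching of predicted to gold boundaries within a tolerance window; the paper's exact matching scheme may differ.

```python
def window_tolerant_f1(pred, gold, window=2):
    """W-F1 for segmentation boundaries (utterance indices): a predicted
    boundary counts as correct if an unmatched gold boundary lies within
    `window` positions. Greedy one-to-one matching."""
    unused = sorted(gold)
    tp = 0
    for p in sorted(pred):
        for g in unused:
            if abs(p - g) <= window:
                unused.remove(g)             # each gold boundary matched once
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Oversegmentation vs. sparse labels: strict exact-match F1 is 0 here,
# while W-F1 credits the two near-misses (prints 0.8).
print(window_tolerant_f1(pred=[3, 9, 15], gold=[4, 14], window=2))
```

The example makes the paper's point concrete: a segmenter that is off by one utterance on every boundary scores zero under strict matching, which says more about label granularity than about boundary detection quality.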

Fonte: arXiv cs.AI

NLP/LLMs • Score 95

AVM: Towards Structure-Preserving Neural Response Modeling in the Visual Cortex Across Stimuli and Individuals

arXiv:2512.16948v1 Announce Type: new Abstract: While deep learning models have shown strong performance in simulating neural responses, they often fail to clearly separate stable visual encoding from condition-specific adaptation, which limits their ability to generalize across stimuli and individuals. We introduce the Adaptive Visual Model (AVM), a structure-preserving framework that enables condition-aware adaptation through modular subnetworks, without modifying the core representation. AVM keeps a Vision Transformer-based encoder frozen to capture consistent visual features, while independently trained modulation paths account for neural response variations driven by stimulus content and subject identity. We evaluate AVM in three experimental settings, including stimulus-level variation, cross-subject generalization, and cross-dataset adaptation, all of which involve structured changes in inputs and individuals. Across two large-scale mouse V1 datasets, AVM outperforms the state-of-the-art V1T model by approximately 2% in predictive correlation, demonstrating robust generalization, interpretable condition-wise modulation, and high architectural efficiency. Specifically, AVM achieves a 9.1% improvement in explained variance (FEVE) under the cross-dataset adaptation setting. These results suggest that AVM provides a unified framework for adaptive neural modeling across biological and experimental conditions, offering a scalable solution under structural constraints. Its design may inform future approaches to cortical modeling in both neuroscience and biologically inspired AI systems.
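
The structure-preserving split, a frozen core encoder plus small condition-specific modulation paths, can be sketched with FiLM-style scale-and-shift modulation as one plausible choice; the modulation form, shapes, and embedding-per-condition scheme below are assumptions rather than AVM's actual design.

```python
import torch
import torch.nn as nn

class AdaptiveVisualModel(nn.Module):
    """Sketch: a frozen encoder supplies stable visual features, while
    independently trained per-condition modulation paths (here, FiLM-style
    scale/shift per stimulus condition or subject) adapt the neural
    readout without touching the core representation."""

    def __init__(self, encoder: nn.Module, d_feat: int, n_neurons: int,
                 n_conditions: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad_(False)          # core representation is frozen
        # One modulation path per condition/subject, trained independently.
        self.scale = nn.Embedding(n_conditions, d_feat)
        self.shift = nn.Embedding(n_conditions, d_feat)
        nn.init.ones_(self.scale.weight)     # start as identity modulation
        nn.init.zeros_(self.shift.weight)
        self.readout = nn.Linear(d_feat, n_neurons)

    def forward(self, images, condition_ids):
        feats = self.encoder(images)         # assumed shape: (B, d_feat)
        feats = feats * self.scale(condition_ids) + self.shift(condition_ids)
        return self.readout(feats)           # predicted neural responses
```

Adapting to a new subject or dataset then means training only new embedding rows and, if needed, a new readout, which is one way to read the abstract's claim of generalization under structural constraints.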

Fonte: arXiv cs.CV