Evaluation / Benchmarks

Browse the articles under this label, with translations into PT-BR.

Articles

🧪 Evaluation / Benchmarks • 369 article(s) found

NLP/LLMs • Score 85

Enhancing Temporal Awareness in LLMs for Temporal Point Processes

arXiv:2601.00845v1. Abstract: Temporal point processes (TPPs) are crucial for analyzing events over time and are widely used in fields such as finance, healthcare, and social systems. These processes are particularly valuable for understanding how events unfold over time, accounting for their irregularity and dependencies. Despite the success of large language models (LLMs) in sequence modeling, applying them to temporal point processes remains challenging. A key issue is that current methods struggle to effectively capture the complex interaction between temporal information and semantic context, which is vital for accurate event modeling. In this context, we introduce TPP-TAL (Temporal Point Processes with Enhanced Temporal Awareness in LLMs), a novel plug-and-play framework designed to enhance temporal reasoning within LLMs. Rather than using the conventional method of simply concatenating event time and type embeddings, TPP-TAL explicitly aligns temporal dynamics with contextual semantics before feeding this information into the LLM. This alignment allows the model to better perceive temporal dependencies and long-range interactions between events and their surrounding contexts. Through comprehensive experiments on several benchmark datasets, it is shown that TPP-TAL delivers substantial improvements in temporal likelihood estimation and event prediction accuracy, highlighting the importance of enhancing temporal awareness in LLMs for continuous-time event modeling. The code is made available at https://github.com/chenlilil/TPP-TAL

Source: arXiv cs.AI

Articles

Enhancing Temporal Awareness in LLMs for Temporal Point Processes

Decomposing LLM Self-Correction: The Accuracy-Correction Paradox and Error Depth Hypothesis

Aletheia: Quantifying Cognitive Conviction in Reasoning Models via Regularized Inverse Confusion Matrix

PsychEval: A Multi-Session and Multi-Therapy Benchmark for High-Realism AI Psychological Counselor

AI Agent Systems: Architectures, Applications, and Evaluation

Reinforcement Learning Enhanced Multi-hop Reasoning for Temporal Knowledge Question Answering

Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement

Temporal Attack Pattern Detection in Multi-Agent AI Workflows: An Open Framework for Training Trace-Based Security Models

ChaosBench-Logic: A Benchmark for Logical and Symbolic Reasoning on Chaotic Dynamical Systems

Accelerating Monte-Carlo Tree Search with Optimized Posterior Policies

Universal Conditional Logic: A Formal Language for Prompt Engineering

Agentic AI for Autonomous, Explainable, and Real-Time Credit Risk Decision-Making

CogCanvas: Verbatim-Grounded Artifact Extraction for Long LLM Conversations

OmniNeuro: A Multimodal HCI Framework for Explainable BCI Feedback via Generative AI and Sonification

Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning

COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs

Jenius Agent: Towards Experience-Driven Accuracy Optimization in Real-World Scenarios

Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale

Higher-Order Action Regularization in Deep Reinforcement Learning: From Continuous Control to Building Energy Management

Reading Between the Lines: Deconfounding Causal Estimates using Text Embeddings and Deep Learning

Can Large Language Models Solve Engineering Equations? A Systematic Comparison of Direct Prediction and Solver-Assisted Approaches

Theory Trace Card: Theory-Driven Socio-Cognitive Evaluation of LLMs

MathLedger: A Verifiable Learning Substrate with Ledger-Attested Feedback

Toward Auditable Neuro-Symbolic Reasoning in Pathology: SQL as an Explicit Trace of Evidence

Comment on: Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Tasks

Clinical Knowledge Graph Construction and Evaluation with Multi-LLMs via Retrieval-Augmented Generation

CaveAgent: Transforming LLMs into Stateful Runtime Operators

Improving Behavioral Alignment in LLM Social Simulations via Context Formation and Navigation

KGCE: Knowledge-Augmented Dual-Graph Evaluator for Cross-Platform Educational Agent Benchmarking with Multimodal Language Models

Semantic Alignment of Multilingual Knowledge Graphs via Contextualized Vector Projections

A New Benchmark for the Appropriate Evaluation of RTL Code Optimization

ElecTwit: A Framework for Studying Persuasion in Multi-Agent Social Systems

CNC-TP: Classifier Nominal Concept Based on Top-Pertinent Attributes

LooC: Effective Low-Dimensional Codebook for Compositional Vector Quantization

GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval

Intelligent Traffic Surveillance for Real-Time Vehicle Detection, License Plate Recognition, and Speed Estimation

FlashInfer-Bench: Building the Virtuous Cycle for AI-Driven LLM Systems

FaithSCAN: Model-Driven Single-Pass Hallucination Detection for Faithful Visual Question Answering

NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos

CropNeRF: A Neural Radiance Field-Based Framework for Crop Counting

Memory Bank Compression for Continual Adaptation of Large Language Models

Application Research of a Deep Learning Model Integrating CycleGAN and YOLO in PCB Infrared Defect Detection

DichroGAN: Towards Restoration of in-air Colours of Seafloor from Satellite Imagery

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

Focal-RegionFace: Generating Fine-Grained Multi-attribute Descriptions for Arbitrarily Selected Face Focal Regions

SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting

ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Patching

Towards Syn-to-Real IQA: A Novel Perspective on Reshaping Synthetic Data Distributions

A Spatially Masked Adaptive Gated Network for multimodal post-flood water extent mapping using SAR and incomplete multispectral data

Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark

Modeling 24-Hour ECG Signals to Predict Heart Failure Risk with Explainable AI

Toward Better Temporal Structures for Geopolitical Events Forecasting

Context-Aware Pesticide Recommendation via Few-Shot Pest Recognition for Precision Agriculture

Bio-inspired Agentic Self-healing Framework for Resilient Distributed Computing Systems

Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation

FCMBench: A Comprehensive Financial Credit Multimodal Benchmark for Real-World Applications

IntraStyler: Exemplar-based Style Synthesis for Cross-modality Domain Adaptation

FreeText: Training-Free Text Rendering in Diffusion Transformers via Attention Localization and Spectral Glyph Injection

ABFR-KAN: Kolmogorov-Arnold Networks for Functional Brain Analysis

Pat-DEVAL: Chain-of-Legal-Thought Evaluation for Patent Description

RoLID-11K: A Dashcam Dataset for Small-Object Roadside Litter Detection

Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning

TeleDoCTR: Domain-Specific and Contextual Troubleshooting for Telecommunications

BHaRNet: Reliability-Aware Body-Hand Modality Expertized Networks for Fine-grained Skeleton Action Recognition

StockBot 2.0: Vanilla LSTMs Outperform Transformer-based Forecasting for Stock Prices

Reasoning in Action: MCTS-Driven Knowledge Retrieval for Large Language Models

Adapting Natural Language Processing Models Across Jurisdictions: A Pilot Study in Canadian Cancer Registries

CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

InfoSynth: Information-Guided Benchmark Synthesis for LLMs

Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak

Fast-weight Product Key Memory

A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR

Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations

Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach

Towards Automated Differential Diagnosis of Skin Diseases Using Deep Learning and Imbalance-Aware Strategies

Exploring the Performance of Large Language Models on Subjective Span Identification Tasks

Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence

Comparative Efficiency Analysis of Lightweight Transformer Models: A Multi-Domain Empirical Benchmark for Enterprise NLP Deployment

Rule-Based Approaches to Atomic Sentence Extraction