
Welcome to TempoEval

Evaluate Temporal Reasoning in RAG Systems

TempoEval is a comprehensive framework for evaluating the temporal reasoning capabilities of RAG (Retrieval-Augmented Generation) systems with 20+ specialized metrics.



Why TempoEval?

Traditional RAG evaluation metrics, such as precision and recall over text overlap, do not capture time, a critical dimension for many real-world applications.

The Temporal Gap

Query: "Who was the U.S. president in 1999?"

  • Retrieved: "President Clinton" ✅
  • Retrieved: "President Bush" ❌

Both results are semantically similar (each names a president), but only Clinton held office in 1999. Standard metrics can't tell the difference.
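
To make the gap concrete, here is a minimal sketch that scores both retrieved snippets against the query with a plain token-overlap measure (Jaccard similarity). Jaccard is only a stand-in for "text overlap" metrics in general, and the helper names `tokens` and `jaccard` are hypothetical; none of this is part of the TempoEval API.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; digits are kept, so '1999' is a token."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

query = "Who was president in 1999?"
clinton = jaccard(query, "President Clinton")
bush = jaccard(query, "President Bush")

# Both snippets share exactly one token ("president") with the query,
# so the overlap score cannot separate the right answer from the wrong one.
print(clinton == bush)
```

The two scores are identical because the only signal the measure sees is the shared word "president"; the year in the query never enters the comparison.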

The Problem with Current Metrics

  • Wrong Time Period


    Standard metrics score high whenever keywords match, even when the retrieved documents come from the wrong year.

  • Date Hallucinations


    Traditional evaluation often misses fabricated dates in generated answers.

  • Temporal Confusion


    Confusing "before" and "after" or getting event ordering wrong goes undetected.

  • Expensive LLM Calls


    LLM-as-judge is slow, costly, and non-deterministic for temporal evaluation.
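
As an illustration of the "Date Hallucinations" problem, the sketch below flags years that appear in a generated answer but in none of the retrieved contexts. It is a deterministic, regex-only check written for this page; the function names and the four-digit-year pattern are assumptions, not TempoEval's own implementation.

```python
import re

# Assumption: a "date" is a four-digit year between 1000 and 2099.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def years(text: str) -> set[int]:
    """All four-digit years mentioned in the text."""
    return {int(y) for y in YEAR.findall(text)}

def unsupported_years(answer: str, contexts: list[str]) -> set[int]:
    """Years stated in the answer that no retrieved context mentions."""
    supported = set().union(*(years(c) for c in contexts))
    return years(answer) - supported

contexts = ["The collapse of Lehman Brothers in 2008 triggered..."]
answer = "Lehman Brothers collapsed in 2007."

print(unsupported_years(answer, contexts))  # {2007}
```

A check like this needs no LLM call and returns the same result on every run, which is the trade-off the last bullet above points at.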


Key Features

  • Focus Time Extraction


    Extract the temporal "focus" of queries, documents, and answers using Regex, HeidelTime, or LLMs.

    Learn about Focus Time

  • Multi-Layer Metrics


    Evaluate across three layers: Retrieval (12 metrics), Generation (5 metrics), and Reasoning (4 metrics).

    Explore Metrics

  • TempoScore


    A single composite score combining precision, recall, faithfulness, and coherence.

    TempoScore Details

  • Benchmark Datasets


    Built-in support for TEMPO, TimeQA, TimeBench, and SituatedQA datasets.

    Work with Datasets

  • Easy Integration


    Simple API that works with any retriever or LLM. Three modes: Focus Time, LLM-as-judge, or Gold labels.

    Quick Start

  • Examples


    Explore our comprehensive examples to learn how to use TempoEval effectively.

    View Examples
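
The TempoScore card above says the composite combines precision, recall, faithfulness, and coherence, but this page does not state how. The sketch below assumes a weighted arithmetic mean purely for illustration; the function name `composite_score`, the weighting scheme, and the sample values are hypothetical, and the actual TempoScore formula may differ.

```python
from typing import Optional

def composite_score(
    components: dict[str, float],
    weights: Optional[dict[str, float]] = None,
) -> float:
    """Weighted arithmetic mean of component scores, each expected in [0, 1].

    Assumption: equal weights when none are given. This is NOT the
    documented TempoScore formula, only one plausible way to combine scores.
    """
    if weights is None:
        weights = {name: 1.0 for name in components}
    total_weight = sum(weights[name] for name in components)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(components[name] * weights[name] for name in components)
    return weighted_sum / total_weight

# Made-up component values, for illustration only.
score = composite_score(
    {"precision": 0.5, "recall": 1.0, "faithfulness": 0.75, "coherence": 0.75}
)
print(score)  # 0.75
```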


Quick Example

from tempoeval.core import extract_qft, extract_dft
from tempoeval.metrics import TemporalPrecision

# Define query and documents
query = "What happened during the 2008 financial crisis?"
documents = [
    "The collapse of Lehman Brothers in 2008 triggered...",  # ✅ Relevant (2008)
    "The COVID-19 pandemic started in 2019...",              # ❌ Irrelevant (2019)
]

# Extract Focus Times
qft = extract_qft(query)  # {2008}
dfts = [extract_dft(doc) for doc in documents]  # [{2008}, {2019}]

# Compute Metric
metric = TemporalPrecision(use_focus_time=True)
score = metric.compute(qft=qft, dfts=dfts, k=2)

print(f"Temporal Precision@2: {score}")  # 0.5 (1/2 relevant)
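
To show what the example above computes without installing anything, here is a self-contained sketch of the same idea. It assumes a focus time is simply the set of four-digit years found by a regex, and that a document counts as temporally relevant when its years intersect the query's years. The names `focus_years` and `temporal_precision_at_k` are hypothetical and do not reflect how `extract_qft`, `extract_dft`, or `TemporalPrecision` are implemented.

```python
import re

# Assumption: focus time = set of four-digit years (1000-2099) in the text.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def focus_years(text: str) -> set[int]:
    """Regex-only stand-in for focus time extraction."""
    return {int(y) for y in YEAR.findall(text)}

def temporal_precision_at_k(qft: set[int], dfts: list[set[int]], k: int) -> float:
    """Fraction of the top-k documents whose years overlap the query's years."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for dft in dfts[:k] if qft & dft)
    return hits / k

query = "What happened during the 2008 financial crisis?"
documents = [
    "The collapse of Lehman Brothers in 2008 triggered...",
    "The COVID-19 pandemic started in 2019...",
]

qft = focus_years(query)                         # {2008}
dfts = [focus_years(doc) for doc in documents]   # [{2008}, {2019}]
score = temporal_precision_at_k(qft, dfts, k=2)

print(f"Temporal Precision@2: {score}")  # 0.5
```

Note that the "19" in "COVID-19" is not picked up as a year, because the pattern only matches four-digit numbers.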

Three Evaluation Modes

| Mode | Input | Pros | Cons |
|------|-------|------|------|
| Focus Time | QFT & DFT | ⚡ Fast, 💰 Free, 🎯 Interpretable | Requires extraction |
| LLM-as-Judge | Query & Docs | 🧠 Flexible, No gold labels needed | 🐌 Slow, 💸 Expensive |
| Gold Labels | IDs & Gold | 📊 Standard IR benchmark | Requires annotations |
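
For contrast with the Focus Time mode, the Gold Labels row corresponds to the usual IR setup: retrieved document IDs are compared against annotated relevant IDs. The sketch below is a generic precision@k over IDs, with a hypothetical function name and made-up IDs; it is not the TempoEval call signature for this mode.

```python
def precision_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that appear in the gold set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in gold_ids)
    return hits / k

# Made-up IDs: the ranker returned d1, d7, d3; annotators marked d1 and d3.
retrieved = ["d1", "d7", "d3"]
gold = {"d1", "d3"}

p_at_2 = precision_at_k(retrieved, gold, k=2)
print(p_at_2)  # 0.5
```

The computation is trivial; the cost of this mode lies in producing the gold annotations, which is the "Requires annotations" entry in the table.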

Community & Support


What's Next?

  1. Install TempoEval - Get up and running in 2 minutes
  2. Quick Start - Your first evaluation in 5 minutes
  3. Understand Focus Time - Learn the core concept
  4. Explore Metrics - Choose the right metrics for your use case