Welcome to TempoEval

Why TempoEval?

Traditional RAG evaluation metrics, such as precision and recall computed over text overlap, do not capture time, a critical dimension for many real-world applications.

The Temporal Gap

Query: "Who was the U.S. president in 1999?"

Retrieved: "President Clinton" ✅
Retrieved: "President Bush" ❌

Both are semantically similar (both presidents), but temporally distinct. Standard metrics can't tell the difference.
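
The gap can be made concrete with a small, self-contained sketch. It does not use TempoEval; the two documents, their year spans, and the helper names `tokens` and `keyword_overlap` are hypothetical and chosen only for illustration. A simple keyword-overlap score rates both documents the same, while a check against the period each document covers separates them.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_overlap(query: str, doc: str) -> int:
    """Number of distinct query tokens that also appear in the document."""
    return len(tokens(query) & tokens(doc))

query = "Who was the U.S. president in 1999?"
query_year = 1999

# Hypothetical documents, each paired with the (start, end) years it covers.
docs = [
    ("Bill Clinton was president from 1993 to 2001.", (1993, 2001)),
    ("George W. Bush was president from 2001 to 2009.", (2001, 2009)),
]

# Both documents share exactly the tokens "was" and "president" with the query.
overlaps = [keyword_overlap(query, text) for text, _ in docs]

# Only the first document covers the year the query asks about.
in_period = [start <= query_year <= end for _, (start, end) in docs]

print(overlaps)   # [2, 2]
print(in_period)  # [True, False]
```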

The Problem with Current Metrics

Wrong Time Period

Standard metrics score high when keywords match, even if the retrieved documents come from the wrong year.

Date Hallucinations

Traditional evaluation often misses fabricated dates in generated answers.

Temporal Confusion

Confusing "before" with "after", or getting the order of events wrong, goes undetected.

Expensive LLM Calls

Using an LLM as a judge for temporal evaluation is slow, costly, and non-deterministic.
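
As a rough illustration of how a deterministic check can catch date hallucinations without an LLM call, the sketch below flags years that appear in a generated answer but in none of the retrieved contexts. It is not TempoEval's implementation; the function names, the year pattern (1500 to 2099), and the sample answer are assumptions made for this example.

```python
import re

# Assumption: four-digit years between 1500 and 2099 are the only dates of interest.
YEAR_PATTERN = re.compile(r"\b(?:1[5-9]\d{2}|20\d{2})\b")

def extract_years(text: str) -> set[int]:
    """Return every four-digit year mentioned in the text."""
    return {int(y) for y in YEAR_PATTERN.findall(text)}

def unsupported_years(answer: str, contexts: list[str]) -> set[int]:
    """Years stated in the answer that no retrieved context mentions."""
    supported: set[int] = set()
    for context in contexts:
        supported |= extract_years(context)
    return extract_years(answer) - supported

contexts = ["The collapse of Lehman Brothers in 2008 triggered a global financial crisis."]
# Hypothetical generated answer; the second year is not backed by the context.
answer = "Lehman Brothers collapsed in 2008, and the crisis ended in 2011."

flagged = unsupported_years(answer, contexts)
print(flagged)  # {2011}
```

A check like this only covers explicit years; relative expressions such as "two years later" would need a proper temporal tagger.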

Key Features

Focus Time Extraction

Extract the temporal "focus" of queries, documents, and answers using Regex, HeidelTime, or LLMs.

Learn about Focus Time

Multi-Layer Metrics

Evaluate across three layers: Retrieval (12 metrics), Generation (5 metrics), and Reasoning (4 metrics).

Explore Metrics

TempoScore

A single composite score combining precision, recall, faithfulness, and coherence.

TempoScore Details

Benchmark Datasets

Built-in support for TEMPO, TimeQA, TimeBench, and SituatedQA datasets.

Work with Datasets

Easy Integration

Simple API that works with any retriever or LLM. Three modes: Focus Time, LLM-as-Judge, or Gold Labels.

Quick Start

Examples

Explore our comprehensive examples to learn how to use TempoEval effectively.

View Examples
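
To give a feel for the regex option mentioned under Focus Time Extraction, here is a minimal sketch of year-level extraction. It is an assumption about what such an extractor could look like, not TempoEval's own code: `extract_focus_years` is a hypothetical helper, it only recognizes four-digit years from 1500 to 2099, and it expands simple ranges written as "2007 to 2009" or "2007-2009".

```python
import re

# Assumption: a "focus time" is represented as a set of calendar years.
YEAR = r"(?:1[5-9]\d{2}|20\d{2})"
RANGE_PATTERN = re.compile(rf"\b({YEAR})\s*(?:-|to)\s*({YEAR})\b")
YEAR_PATTERN = re.compile(rf"\b{YEAR}\b")

def extract_focus_years(text: str) -> set[int]:
    """Collect explicit years and expand year ranges into every year they span."""
    years: set[int] = set()
    for start, end in RANGE_PATTERN.findall(text):
        low, high = sorted((int(start), int(end)))
        years.update(range(low, high + 1))
    years.update(int(y) for y in YEAR_PATTERN.findall(text))
    return years

print(extract_focus_years("What happened during the 2008 financial crisis?"))  # {2008}
print(sorted(extract_focus_years("The recession lasted from 2007 to 2009.")))  # [2007, 2008, 2009]
```

A regex approach is fast and deterministic but misses implicit references ("the dot-com era"), which is where a tagger such as HeidelTime or an LLM would be the better fit.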

Quick Example

from tempoeval.core import extract_qft, extract_dft
from tempoeval.metrics import TemporalPrecision

# Define query and documents
query = "What happened during the 2008 financial crisis?"
documents = [
    "The collapse of Lehman Brothers in 2008 triggered...",  # ✅ Relevant (2008)
    "The COVID-19 pandemic started in 2019...",              # ❌ Irrelevant (2019)
]

# Extract Focus Times
qft = extract_qft(query)  # {2008}
dfts = [extract_dft(doc) for doc in documents]  # [{2008}, {2019}]

# Compute Metric
metric = TemporalPrecision(use_focus_time=True)
score = metric.compute(qft=qft, dfts=dfts, k=2)

print(f"Temporal Precision@2: {score}")  # 0.5 (1/2 relevant)
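
For intuition about where the 0.5 comes from, the following plain-Python sketch recomputes the score by hand. It assumes that a document counts as temporally relevant when its focus years overlap the query's focus years and that the score is divided by k, as in standard precision@k; TempoEval's actual definition may differ, and `temporal_precision_at_k` is a hypothetical name used only here.

```python
def temporal_precision_at_k(qft: set[int], dfts: list[set[int]], k: int) -> float:
    """Fraction of the top-k documents whose focus years overlap the query's."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A document is a hit if it shares at least one year with the query focus time.
    hits = sum(1 for dft in dfts[:k] if qft & dft)
    return hits / k

qft = {2008}              # focus time of the query
dfts = [{2008}, {2019}]   # focus times of the two retrieved documents

print(temporal_precision_at_k(qft, dfts, k=2))  # 0.5
print(temporal_precision_at_k(qft, dfts, k=1))  # 1.0
```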

Three Evaluation Modes

Mode	Input	Pros	Cons
Focus Time	QFT & DFT	⚡ Fast, 💰 Free, 🎯 Interpretable	Requires extraction
LLM-as-Judge	Query & Docs	🧠 Flexible, No gold labels needed	🐌 Slow, 💸 Expensive
Gold Labels	IDs & Gold	📊 Standard IR benchmark	Requires annotations

Community & Support

GitHub

Star us on GitHub and contribute to the project.

DataScienceUIBK/tempoeval

PyPI

Install via pip and start evaluating in minutes.

PyPI Package

Documentation

Comprehensive guides, tutorials, and API reference.

Get Started

Contact

Have questions? Reach out to the team.

Contact Us

What's Next?

Install TempoEval - Get up and running in 2 minutes
Quick Start - Your first evaluation in 5 minutes
Understand Focus Time - Learn the core concept
Explore Metrics - Choose the right metrics for your use case