Welcome to TempoEval
Why TempoEval?
Traditional RAG evaluation metrics (such as precision and recall over text overlap) fail to capture time, a critical dimension for many real-world applications.
The Temporal Gap
Query: "Who was president in 1999?"
- Retrieved: "President Clinton" ✅
- Retrieved: "President Bush" ❌
Both results are semantically similar (each names a president) but temporally distinct, and standard metrics cannot tell them apart.
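To make the gap concrete, here is a minimal sketch in plain Python. It does not use TempoEval: the `token_overlap` and `covers_query_year` helpers and the two passages are illustrative assumptions, and the "temporal check" is just a naive year-range test.

```python
import re

# Four-digit years between 1000 and 2099 (a simplifying assumption).
YEAR = re.compile(r"\b(?:1[0-9]{3}|20[0-9]{2})\b")

def token_overlap(query: str, passage: str) -> float:
    """Fraction of query tokens that also occur in the passage."""
    q = set(re.findall(r"\w+", query.lower()))
    p = set(re.findall(r"\w+", passage.lower()))
    return len(q & p) / len(q) if q else 0.0

def covers_query_year(query: str, passage: str) -> bool:
    """True if every year in the query lies inside the passage's year span."""
    q_years = [int(y) for y in YEAR.findall(query)]
    p_years = [int(y) for y in YEAR.findall(passage)]
    if not q_years or not p_years:
        return False
    lo, hi = min(p_years), max(p_years)
    return all(lo <= y <= hi for y in q_years)

query = "Who was president in 1999?"
passages = [
    "President Clinton, in office 1993 to 2001",
    "President Bush, in office 2001 to 2009",
]

for p in passages:
    # Both passages share the same two query tokens ("president", "in"),
    # so the lexical score is identical; only the year check separates them.
    print(f"{token_overlap(query, p):.1f}", covers_query_year(query, p))
```

The lexical score is 0.4 for both passages, while the year check accepts only the first one.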
The Problem with Current Metrics
- Wrong Time Period - Standard metrics score high if keywords match, even when retrieving documents from the wrong year.
- Date Hallucinations - Traditional evaluation often misses fabricated dates in generated answers.
- Temporal Confusion - Confusing "before" and "after" or getting event ordering wrong goes undetected.
- Expensive LLM Calls - LLM-as-judge is slow, costly, and non-deterministic for temporal evaluation.
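As an illustration of the "Date Hallucinations" problem, a cheap and deterministic check is to flag any year in a generated answer that never appears in the retrieved context. The sketch below is an assumption for illustration only, not TempoEval's implementation: `unsupported_years` is a hypothetical helper, and the answer deliberately contains a made-up end date.

```python
import re

# Four-digit years between 1000 and 2099 (a simplifying assumption).
YEAR = re.compile(r"\b(?:1[0-9]{3}|20[0-9]{2})\b")

def unsupported_years(answer: str, contexts: list[str]) -> list[str]:
    """Return years mentioned in the answer but in none of the contexts."""
    supported = {y for ctx in contexts for y in YEAR.findall(ctx)}
    return sorted(set(YEAR.findall(answer)) - supported)

contexts = ["The collapse of Lehman Brothers in 2008 triggered a global crisis."]
# The "2011" claim is fabricated on purpose to show what gets flagged.
answer = "Lehman Brothers collapsed in 2008, and the crisis ended in 2011."

print(unsupported_years(answer, contexts))  # ['2011']
```

A check like this only catches explicit years. Relative expressions such as "three years later" would need a real temporal tagger.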
Key Features
- Focus Time Extraction - Extract the temporal "focus" of queries, documents, and answers using Regex, HeidelTime, or LLMs.
- Multi-Layer Metrics - Evaluate across three layers: Retrieval (12 metrics), Generation (5 metrics), and Reasoning (4 metrics).
- TempoScore - A single composite score combining precision, recall, faithfulness, and coherence.
- Benchmark Datasets - Built-in support for TEMPO, TimeQA, TimeBench, and SituatedQA datasets.
- Easy Integration - Simple API that works with any retriever or LLM. Three modes: Focus Time, LLM-as-judge, or Gold labels.
- Examples - Explore our comprehensive examples to learn how to use TempoEval effectively.
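To give a feel for the regex flavor of focus time extraction, the sketch below pulls explicit years and simple year ranges out of a string and returns them as a set. It is a rough approximation under stated assumptions: `extract_focus_years` is a hypothetical name, and TempoEval's own extractors may handle many more expression types (months, decades, relative dates).

```python
import re

# Four-digit years between 1000 and 2099 (a simplifying assumption).
YEAR = r"(?:1[0-9]{3}|20[0-9]{2})"
SINGLE = re.compile(rf"\b{YEAR}\b")
# Ranges written as "2007 to 2009" or "2007-2009".
RANGE = re.compile(rf"\b({YEAR})\s*(?:-|to)\s*({YEAR})\b")

def extract_focus_years(text: str) -> set[int]:
    """Collect explicit years and expand year ranges into every year covered."""
    years = {int(y) for y in SINGLE.findall(text)}
    for start, end in RANGE.findall(text):
        years.update(range(int(start), int(end) + 1))
    return years

print(extract_focus_years("What happened during the 2008 financial crisis?"))
print(sorted(extract_focus_years("The recession lasted from 2007 to 2009.")))
```

Representing focus time as a set of years keeps later comparisons simple: overlap between a query and a document is a set intersection.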
Quick Example
from tempoeval.core import extract_qft, extract_dft
from tempoeval.metrics import TemporalPrecision

# Define query and documents
query = "What happened during the 2008 financial crisis?"
documents = [
    "The collapse of Lehman Brothers in 2008 triggered...",  # ✅ Relevant (2008)
    "The COVID-19 pandemic started in 2019...",  # ❌ Irrelevant (2019)
]

# Extract Focus Times
qft = extract_qft(query)  # {2008}
dfts = [extract_dft(doc) for doc in documents]  # [{2008}, {2019}]

# Compute Metric
metric = TemporalPrecision(use_focus_time=True)
score = metric.compute(qft=qft, dfts=dfts, k=2)
print(f"Temporal Precision@2: {score}")  # 0.5 (1/2 relevant)
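For intuition about where the 0.5 comes from, the following library-free sketch reproduces the arithmetic, assuming a document counts as temporally relevant when its focus time shares at least one year with the query's. The function `temporal_precision_at_k` is hypothetical, and the actual `TemporalPrecision` metric may define relevance differently.

```python
def temporal_precision_at_k(qft: set[int], dfts: list[set[int]], k: int) -> float:
    """Share of the top-k documents whose focus time overlaps the query's."""
    top_k = dfts[:k]
    if not top_k:
        return 0.0
    # Assumption: any shared year makes the document temporally relevant.
    hits = sum(1 for dft in top_k if qft & dft)
    return hits / len(top_k)

qft = {2008}             # focus time of the query
dfts = [{2008}, {2019}]  # focus times of the ranked documents

print(temporal_precision_at_k(qft, dfts, k=2))  # 0.5
```

Only the first document overlaps with `{2008}`, so one of the two retrieved documents is relevant.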
Three Evaluation Modes
| Mode | Input | Pros | Cons |
|---|---|---|---|
| Focus Time | QFT & DFT | ⚡ Fast, 💰 Free, 🎯 Interpretable | Requires extraction |
| LLM-as-Judge | Query & Docs | 🧠 Flexible, No gold labels needed | 🐌 Slow, 💸 Expensive |
| Gold Labels | IDs & Gold | 📊 Standard IR benchmark | Requires annotations |
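The Gold Labels row corresponds to classic IR scoring over document IDs. As a rough sketch of what that input looks like (the IDs and both helper functions below are illustrative assumptions, not TempoEval's API), precision and recall at k can be computed as follows:

```python
def precision_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Share of the top-k retrieved IDs that appear in the gold set."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in gold_ids) / len(top_k)

def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Share of the gold IDs that appear among the top-k retrieved IDs."""
    if not gold_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & gold_ids) / len(gold_ids)

# Hypothetical IDs for illustration only.
retrieved_ids = ["doc_lehman_2008", "doc_covid_2019", "doc_bailout_2008"]
gold_ids = {"doc_lehman_2008", "doc_bailout_2008"}

print(precision_at_k(retrieved_ids, gold_ids, k=2))  # 0.5
print(recall_at_k(retrieved_ids, gold_ids, k=3))     # 1.0
```

This mode needs no extraction step, but it relies entirely on annotated gold IDs being available.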
Community & Support
- GitHub - Star us on GitHub and contribute to the project.
- PyPI - Install via pip and start evaluating in minutes.
- Documentation - Comprehensive guides, tutorials, and API reference.
- Contact - Have questions? Reach out to the team.
What's Next?
- Install TempoEval - Get up and running in 2 minutes
- Quick Start - Your first evaluation in 5 minutes
- Understand Focus Time - Learn the core concept
- Explore Metrics - Choose the right metrics for your use case