Tutorial 1: Evaluating Retrieval¶
In this tutorial, we will evaluate a Retrieval system's ability to find temporally relevant documents.
We will: 1. Set up a simple document corpus. 2. Run a retriever (BM25). 3. Evaluate the results using Focus Time.
1. Setup Data¶
Let's imagine we are building a RAG system for History Question Answering. We have a query about the 2008 Financial Crisis, but our database contains documents from various years.
from tempoeval.core import extract_qft, extract_dft
# The User Query
query = "What caused the 2008 financial crisis?"
# The Corpus (Documents)
documents = [
"The collapse of Lehman Brothers in 2008 marked the beginning...", # Relevant (2008)
"The 1929 Wall Street Crash was the most devastating...", # Irrelevant (1929)
"In 2020, the COVID-19 pandemic caused a global recession...", # Irrelevant (2020)
"The Housing and Economic Recovery Act of 2008 was enacted..." # Relevant (2008)
]
doc_ids = ["doc1", "doc2", "doc3", "doc4"]
2. Run Retrieval¶
We'll use TempoEval's built-in BM25Retriever.
from tempoeval.retrieval import BM25Retriever
retriever = BM25Retriever()
results = retriever.retrieve(
queries=[query],
query_ids=["q1"],
documents=documents,
doc_ids=doc_ids,
k=2
)
# Output: {'q1': [{'doc_id': 'doc1', 'score': 1.5}, {'doc_id': 'doc4', 'score': 1.2}]}
retrieved_ids = [r['doc_id'] for r in results['q1']]
print(f"Retrieved: {retrieved_ids}")
3. Evaluate with Focus Time¶
Traditional evaluation needs "Gold Labels" (human-annotated relevance). But we can use Focus Time to automatically judge relevance based on the years mentioned.
from tempoeval.metrics import TemporalPrecision
# Calculate Focus Times
# Simple Regex is fast and free!
qft = extract_qft(query) # {2008}
dfts = [extract_dft(doc) for doc in documents]
# Re-map stored DFTs to retrieved docs
# (In a real app, you'd cache these DFTs)
retrieved_dfts = [
dfts[doc_ids.index(did)]
for did in retrieved_ids
]
# Compute Metric
metric = TemporalPrecision(use_focus_time=True)
score = metric.compute(
qft=qft,
dfts=retrieved_dfts,
k=2
)
print(f"Temporal Precision@2: {score}")
Result Analysis¶
If the retriever pulled doc1 (2008) and doc4 (2008), the intersection with the query 2008 is perfect.
Score: 1.0
If it pulled doc2 (1929), the intersection is empty. The score would drop.
Next Steps¶
Now that you've evaluated retrieval, let's look at generation.