Temporal Faithfulness
Generation Metric
Temporal Faithfulness measures whether the temporal claims in the generated answer are grounded in the retrieved documents. It detects temporal hallucinations, where the model fabricates dates or time periods not mentioned in the context.
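Formula (Focus Time Mode)
\[
\text{Temporal Faithfulness} = \frac{\left| AFT \cap \bigcup_{i=1}^{K} DFT_i \right|}{\left| AFT \right|}
\]
Where: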
- \(AFT\) = Answer Focus Time (years in the generated answer)
- \(DFT_i\) = Document Focus Time for document \(i\)
- \(K\) = number of retrieved documents
In simple terms: What fraction of years mentioned in the answer appear in the retrieved documents?
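The plain-language description above can be sketched with Python sets. This is a minimal illustration, not the library's implementation: the function name `focus_time_faithfulness` is hypothetical, and returning `None` for an answer with no years is an assumption about how the undefined case could be handled.

```python
from typing import Iterable, Optional, Set


def focus_time_faithfulness(aft: Set[int], dfts: Iterable[Set[int]]) -> Optional[float]:
    """Fraction of answer years that appear in at least one retrieved document."""
    if not aft:
        # Assumption: with no years in the answer the ratio is undefined.
        return None
    # Union of DFT_1 ... DFT_K
    document_years = set().union(*dfts)
    # Answer years that are grounded, divided by all answer years
    return len(aft & document_years) / len(aft)


grounded = focus_time_faithfulness({2008, 2009}, [{2008}, {2009}])
partial = focus_time_faithfulness({2007, 2008}, [{2008}, {2009}])
print(grounded, partial)  # 1.0 0.5
```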
Formula (LLM Mode)
Where the LLM judges each temporal claim as SUPPORTED, PARTIALLY_SUPPORTED, NOT_SUPPORTED, or CONTRADICTED based on the context.
Example (Focus Time Mode)
Retrieved documents:
- Doc 1: "In 2008, Lehman Brothers collapsed." → DFT = {2008}
- Doc 2: "The 2009 stimulus package..." → DFT = {2009}
Answer: "The crisis occurred in 2008 and continued into 2009." → AFT = {2008, 2009}
Union of document years: {2008, 2009}
Temporal Faithfulness = 2/2 = 1.0 (all answer years are grounded)
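To make the DFT and AFT sets in this example concrete, here is a naive sketch that pulls four-digit years out of text with a regular expression. It is only a stand-in: the helper `naive_focus_time` and its pattern are assumptions made for illustration, and the library's own extraction may handle ranges, decades, or relative expressions that this does not.

```python
import re
from typing import Set

# Assumption: a "year" is any standalone four-digit number from 1000 to 2099.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def naive_focus_time(text: str) -> Set[int]:
    """Collect the distinct years mentioned explicitly in the text."""
    return {int(year) for year in YEAR_PATTERN.findall(text)}


doc_years = naive_focus_time("In 2008, Lehman Brothers collapsed.")
answer_years = naive_focus_time("The crisis occurred in 2008 and continued into 2009.")
print(sorted(doc_years), sorted(answer_years))  # [2008] [2008, 2009]
```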
Example (Hallucination)
Retrieved documents: DFT = {2008, 2009}
Answer (hallucinated): "The crisis started in 2007 and ended in 2010." → AFT = {2007, 2010}
Temporal Faithfulness = 0/2 = 0.0 (no answer years are grounded in context)
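When the score is low, it can help to see which years caused it. The sketch below lists the answer years that no retrieved document mentions. The helper name `ungrounded_years` is hypothetical and is not part of the library.

```python
from typing import Iterable, Set


def ungrounded_years(aft: Set[int], dfts: Iterable[Set[int]]) -> Set[int]:
    """Answer years that do not appear in any retrieved document."""
    return aft - set().union(*dfts)


missing = ungrounded_years({2007, 2010}, [{2008}, {2009}])
print(sorted(missing))  # [2007, 2010]
```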
Inputs
- `answer` - Generated answer text
- `contexts` or `retrieved_docs` - List of retrieved document texts
- Optional: `aft` and `dfts` if already extracted
Output
- Range: [0, 1], higher is better
Code Example (Focus Time Mode)
from tempoeval.core import extract_aft, extract_dft
from tempoeval.metrics import TemporalFaithfulness

# Generated answer and retrieved contexts
answer = "The crisis occurred in 2008 and continued into 2009."
contexts = [
    "In 2008, Lehman Brothers collapsed.",
    "The 2009 stimulus package helped recovery.",
]

# Extract Focus Times
aft = extract_aft(answer)
dfts = [extract_dft(ctx) for ctx in contexts]

# Compute metric
metric = TemporalFaithfulness()
score = metric.compute(aft=aft, dfts=dfts)
print(f"Temporal Faithfulness: {score}")  # 1.0
Code Example (LLM Mode)
import asyncio

from tempoeval.llm import OpenAIProvider
from tempoeval.metrics import TemporalFaithfulness


async def main():
    # Attach an LLM provider so the metric can judge temporal claims
    llm = OpenAIProvider(model="gpt-4o")
    metric = TemporalFaithfulness()
    metric.llm = llm

    # acompute is awaited, so it has to run inside an async function
    score = await metric.acompute(
        answer="The crisis occurred in 2008 and continued into 2009.",
        contexts=["In 2008, Lehman Brothers collapsed.", "The 2009 stimulus..."],
    )
    print(f"Temporal Faithfulness: {score}")


asyncio.run(main())
LLM Prompt
The LLM mode uses a detailed prompt to:
- Extract temporal claims from the answer (dates, durations, sequences)
- Verify each claim against the retrieved context
- Classify as SUPPORTED, PARTIALLY_SUPPORTED, NOT_SUPPORTED, or CONTRADICTED
- Calculate score using the formula above
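The last step, turning per-claim verdicts into a single score, can be sketched as a weighted average. The weights below are assumptions chosen for illustration only (full credit for SUPPORTED, half for PARTIALLY_SUPPORTED, none otherwise); the library's actual formula may weight the verdicts differently, for example by penalizing CONTRADICTED claims. The names `VERDICT_WEIGHTS` and `score_from_verdicts` are hypothetical.

```python
from typing import Dict, List, Optional

# Assumed weights, for illustration only; not taken from the library.
VERDICT_WEIGHTS: Dict[str, float] = {
    "SUPPORTED": 1.0,
    "PARTIALLY_SUPPORTED": 0.5,
    "NOT_SUPPORTED": 0.0,
    "CONTRADICTED": 0.0,
}


def score_from_verdicts(verdicts: List[str]) -> Optional[float]:
    """Average the assumed weight of each claim's verdict."""
    if not verdicts:
        # Assumption: no temporal claims means the score is undefined.
        return None
    return sum(VERDICT_WEIGHTS[v] for v in verdicts) / len(verdicts)


example = score_from_verdicts(["SUPPORTED", "PARTIALLY_SUPPORTED", "CONTRADICTED"])
print(example)  # 0.5
```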
Temporal Hallucination
A Temporal Faithfulness score below 0.5 indicates serious hallucination: most of the years or time periods mentioned in the answer are not supported by the retrieved documents.
When to use Faithfulness
Use Temporal Faithfulness to:
- Detect hallucinated dates in generated answers
- Ensure RAG systems stay grounded in retrieved documents
- Evaluate if the retrieval step provided sufficient temporal coverage
Focus Time vs LLM Mode
- Focus Time Mode: Fast, checks if answer years appear in documents
- LLM Mode: Slower, but verifies specific temporal claims and their evidence