Duration Accuracy¶
Duration Accuracy evaluates whether duration or time span claims in the answer are correct. It supports LLM or regex-based verification.
Inputs¶
answergold_durations(optional)
Output¶
- Range: [0, 1], higher is better.
Prompt (LLM mode)¶
## Task
Evaluate the correctness of duration/time span claims in the answer.
## Answer
{answer}
## Gold Durations (if provided)
{gold_durations}
## Instructions
1. Extract all duration claims from the answer (e.g., "lasted 5 years", "for 3 months")
2. Compare with gold durations if provided
3. Check if durations are consistent with any dates mentioned
## Output (JSON)
{
"duration_claims": [
{
"claim": "duration claim text",
"extracted_duration": "5 years",
"associated_event": "event this duration refers to",
"gold_duration": "actual duration if known",
"is_correct": true or false,
"error_magnitude": 0
}
],
"total_claims": 0,
"correct_claims": 0,
"accuracy_score": 0.0
}
Examples¶
from tempoeval.metrics import DurationAccuracy
metric = DurationAccuracy(use_llm=True)
metric.llm = llm
score = metric.compute(answer="...", gold_durations={"Event": "5 years"})
Synchronous usage
Use compute(...) for sync calls and acompute(...) for async.