TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems

*Equal Contribution
¹Google Cloud AI Research, ²University of Kentucky


Abstract

We introduce TFRBench, the first benchmark designed to evaluate the reasoning capabilities of forecasting systems. Traditionally, time-series forecasting has been evaluated solely on numerical accuracy, treating foundation models as “black boxes.” Unlike existing benchmarks, TFRBench provides a protocol for evaluating the reasoning generated by forecasting systems, specifically their analysis of cross-channel dependencies, trends, and external events. To enable this, we propose a systematic multi-agent framework that uses an iterative verification loop to synthesize numerically grounded reasoning traces. Spanning ten datasets across five domains, our evaluation confirms that this reasoning is causally effective and useful for evaluation: prompting LLMs with our generated traces significantly improves forecasting accuracy compared to direct numerical prediction (e.g., avg. ∼40.2% → 56.6%), validating the quality of our reasoning. Conversely, benchmarking experiments reveal that off-the-shelf LLMs consistently struggle with both reasoning (lower LLM-as-a-Judge scores) and numerical forecasting, frequently failing to capture domain-specific dynamics. TFRBench thus establishes a new standard for interpretable, reasoning-based evaluation in time-series forecasting.

TFRBench Leaderboard

Comprehensive evaluations across all datasets and reasoning metrics.

Datasets

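Domain            | Dataset                | Source               | Total Samples | Reference Samples | Freq.   | Window (in-out) | Channels | Series
------------------|------------------------|----------------------|---------------|-------------------|---------|-----------------|----------|-------
Energy            | Solar                  | GIFT-Eval            | 274           | 193               | Daily   | 96-96           | 1        | 137
Energy            | Electricity            | GIFT-Eval            | 1346          | 353               | Daily   | 96-96           | 1        | 81
Sales             | Car-parts              | GIFT-Eval            | 1037          | 692               | Monthly | 25-25           | 1        | 1037
Sales             | Hierarchical Sales     | GIFT-Eval            | 1370          | 1076              | Daily   | 96-96           | 1        | 81
Web/CloudOps      | Bitbrains Fast Storage | GIFT-Eval            | 1153          | 678               | Hourly  | 96-96           | 2        | 197
Web/CloudOps      | Web Traffic            | Monash archive       | 1214          | 710               | Daily   | 96-96           | 1        | 162
Transportation    | Traffic                | Autoformer et al.    | 1104          | 449               | Hourly  | 96-96           | 1        | 6
Transportation    | NYC Taxi               | nyc.gov              | 1000          | 391               | Hourly  | 96-96           | 1        | 1
Economics/Finance | Amazon pricing         | Yahoo Finance/nasdaq | 512           | 328               | Daily   | 14-7            | 5        | 1
Economics/Finance | Apple pricing          | Yahoo Finance/nasdaq | 809           | 509               | Daily   | 14-7            | 5        | 1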

TFRBench provides a multi-domain suite with diverse temporal frequencies and dimensionalities, including univariate to multivariate series.
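To make the "Window (in-out)" column concrete, the following is a minimal sketch of how a series could be cut into input/target pairs. The function name `make_windows` and the `stride` parameter are illustrative assumptions; the page does not state how samples are actually drawn from each series.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], n_in: int, n_out: int, stride: int = 1
) -> List[Tuple[List[float], List[float]]]:
    """Slice a series into (history, target) pairs of lengths n_in and n_out.

    Hypothetical helper: the stride and sampling scheme are assumptions,
    not the benchmark's documented procedure.
    """
    if n_in <= 0 or n_out <= 0 or stride <= 0:
        raise ValueError("n_in, n_out and stride must be positive")
    windows = []
    last_start = len(series) - (n_in + n_out)
    for start in range(0, last_start + 1, stride):
        history = list(series[start:start + n_in])
        target = list(series[start + n_in:start + n_in + n_out])
        windows.append((history, target))
    return windows


# A "14-7" window as listed for the pricing datasets:
# 14 observed steps followed by 7 steps to forecast.
toy_series = [float(i) for i in range(30)]
samples = make_windows(toy_series, n_in=14, n_out=7, stride=7)
```

For a multivariate dataset (for example the five-channel pricing data), the same slicing would be applied to every channel with identical start indices so that the channels stay aligned.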

TFRBench Overall Methodology Overview

Overview of the data creation system. The pipeline integrates context retrieval, assumption verification, and a feedback loop to align reasoning with forecasting. The system orchestrates agents to iteratively refine reasoning traces until the forecast outperforms the naive baseline (MASE < 1.0) and the LLM verifier assigns a high score.
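The acceptance rule described above can be sketched as follows. This is only an illustration under stated assumptions: `mase` follows the standard definition (forecast MAE scaled by the in-sample MAE of a lag-`season` naive forecast), while `is_accepted`, `refine`, the `propose` and `judge` callables, and `judge_threshold` are hypothetical names. The page does not specify the verifier's score scale, its threshold, or the number of refinement rounds.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def mase(history: Sequence[float], actual: Sequence[float],
         forecast: Sequence[float], season: int = 1) -> float:
    """Mean absolute scaled error relative to a lag-`season` naive forecast."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equal length")
    if len(history) <= season:
        raise ValueError("history must be longer than the seasonal lag")
    naive_errors = [abs(history[t] - history[t - season])
                    for t in range(season, len(history))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive in-sample error is zero; MASE is undefined")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


def is_accepted(mase_value: float, judge_score: float,
                judge_threshold: float) -> bool:
    """Accept only if the forecast beats the naive baseline AND the
    verifier score clears a threshold (the threshold is an assumption)."""
    return mase_value < 1.0 and judge_score >= judge_threshold


def refine(propose: Callable[[Optional[str]], Tuple[str, List[float]]],
           judge: Callable[[str], float],
           history: Sequence[float], actual: Sequence[float],
           judge_threshold: float,
           max_rounds: int = 5) -> Optional[Tuple[str, List[float]]]:
    """Hypothetical loop: ask for a (trace, forecast) pair, score it,
    and feed the scores back until the pair is accepted."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        trace, forecast = propose(feedback)
        score = mase(history, actual, forecast)
        verdict = judge(trace)
        if is_accepted(score, verdict, judge_threshold):
            return trace, forecast
        feedback = f"MASE={score:.3f}, judge={verdict:.2f}"
    return None  # no candidate passed within max_rounds


history = [10.0, 12.0, 11.0, 13.0, 12.0]
actual = [13.0, 14.0]
example_mase = mase(history, actual, forecast=[12.5, 13.0])
```

In this toy case the naive in-sample error is 1.5 and the forecast MAE is 0.75, so the forecast passes the MASE < 1.0 gate. Whether it is kept would still depend on the verifier's score.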

Reasoning Output Example

Reasoning Example (Economics/Finance Domain)
Apple Stock Forecast Plot

[Close]:

  • Cross-Channel Analysis: The Close price forecast must be logically consistent with the other price channels, falling strictly between the High and Low forecasts for the day. It is strongly influenced by the Open price and serves as the primary anchor for the next day's Open. Significant movements in the Close price should be accompanied by a corresponding increase in the Volume forecast, signaling market conviction.
  • Trend Component Plan: Following the recent sharp decline, model a flattening of the underlying trend. This represents a period of price stabilization. The trend should not be extrapolated downwards aggressively; instead, it should reflect the potential for the price to re-test recent low points as part of a bottoming process.
  • Seasonality Component Plan: Project the established weekly (day-of-the-week) seasonality pattern forward. This pattern captures the regular, recurring investor behaviors observed throughout a typical trading week.
  • Residuals & Event Plan: The "U.S. Senate Bombing (1983-11-07)" introduced significant market uncertainty.
    • U.S. Senate Bombing (1983-11-07): Incorporate a sharp, negative residual shock (a downward adjustment of around 2-4% of the price) on and for a few days after this date to model the resulting fear and selling pressure.
    • Non-Impactful Events: The "Martin Luther King, Jr. Day Bill Signed" and "Jesse Jackson Announces Presidential Candidacy" events are considered to have a negligible impact on market residuals.

[High]:

  • Cross-Channel Analysis: The High forecast must be the highest value of all price channels for any given day, acting as a ceiling for the Open, Low, and Close. The spread between the High and Low is a measure of volatility and is expected to widen on days with higher forecasted Volume, reflecting greater market activity and uncertainty.
  • Trend Component Plan: The trend for the High price will closely follow the stabilized trend modeled for the other price channels (Open, Close, Low), reflecting a pause in the recent sharp downtrend.
  • Seasonality Component Plan: Continue to apply the established weekly seasonality pattern, which reflects recurring intraday peaks in buying pressure that vary by the day of the week.
  • Residuals & Event Plan: The market uncertainty following the "U.S. Senate Bombing (1983-11-07)" will increase intraday volatility.
    • U.S. Senate Bombing (1983-11-07): Introduce moderate positive residuals (upward adjustments of 1-3% of the price) on the days following the event to model higher-than-expected price spikes as the market reacts with choppiness.

[Low]:

  • Cross-Channel Analysis: The Low forecast must be the lowest value of all price channels for a given day, serving as the floor for the Open, High, and Close. A widening gap between High and Low, driven by a lower Low, indicates increased selling pressure and should be correlated with higher forecasted Volume.
  • Trend Component Plan: The trend for the Low price will move in lockstep with the other price channels. Model a stabilization of the trend, indicating that the intense downward pressure from the recent sell-off is subsiding but that the risk of re-testing recent lows remains.
  • Seasonality Component Plan: Project the existing weekly seasonality pattern forward, as intraday selling pressure often reaches its peak at predictable times during the trading week.
  • Residuals & Event Plan: The "U.S. Senate Bombing (1983-11-07)" is expected to heighten investor fear.
    • U.S. Senate Bombing (1983-11-07): Apply a significant negative residual shock (a downward adjustment of 3-5% of the price) on and immediately after this date to capture the increased selling pressure pushing the daily low further down.

[Open]:

  • Cross-Channel Analysis: The Open price is fundamentally anchored to the previous day's Close. Forecasted gaps between the prior Close and the Open signal overnight sentiment shifts and suggest a more volatile session ahead, which should be reflected in a wider High-Low range and higher Volume for that day.
  • Trend Component Plan: The trend for the Open will be nearly identical to the Close price's trend. It should reflect the same stabilization pattern after the recent sharp price decline.
  • Seasonality Component Plan: Maintain the established day-of-the-week seasonality pattern, as certain days may systematically open higher or lower relative to the previous close due to recurring news cycles or investor behavior.
  • Residuals & Event Plan: The primary impact of the "U.S. Senate Bombing (1983-11-07)" would be felt at the market open on the following day.
    • U.S. Senate Bombing (1983-11-07): Apply a large negative residual on the morning of 1983-11-08 to model the market gapping down as it digests the overnight news.

[Volume]:

  • Cross-Channel Analysis: Volume is positively correlated with volatility (the High-Low spread). A forecast for a stable or narrow trading range in the price channels should correspond with a forecast for lower Volume. Conversely, a large price movement or wide trading range implies higher Volume.
  • Trend Component Plan: The underlying trend for Volume should show a sharp decay from the recent massive spike. Model a mean-reversion process where trading activity returns toward more normal historical levels, though potentially remaining slightly elevated compared to the pre-spike baseline.
  • Seasonality Component Plan: Project the typical weekly seasonality forward. This often includes higher volume at the beginning and end of the week (Monday/Friday) and lower volume mid-week.
  • Residuals & Event Plan:
    • U.S. Senate Bombing (1983-11-07): This event will drive a surge in trading activity. Incorporate a large positive residual shock (a spike of 1.5x to 2.5x the recent average) to Volume around this date to reflect fear-based trading.
    • Veterans Day (1983-11-11): Apply a moderate negative residual to model the lighter-than-expected trading volume often associated with holidays.
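One way to read the plan above numerically is sketched below: each channel is built from a trend, a seasonal term, and a percentage residual shock on event days, and the price channels are then checked against the cross-channel constraints (Low as floor, High as ceiling). The functions `compose_forecast` and `ohlc_violations`, the additive-plus-percentage form, and all numbers are illustrative assumptions, not the benchmark's implementation or data.

```python
from typing import Dict, List, Optional, Sequence


def compose_forecast(trend: Sequence[float], seasonal: Sequence[float],
                     shock_pct: Optional[Dict[int, float]] = None) -> List[float]:
    """Combine components as (trend + seasonal) * (1 + shock).

    shock_pct maps a forecast step to a fractional adjustment, e.g.
    -0.03 for a 3% downward residual shock. This functional form is
    an assumption made for illustration.
    """
    if len(trend) != len(seasonal):
        raise ValueError("trend and seasonal must have the same length")
    shocks = shock_pct or {}
    return [(t + s) * (1.0 + shocks.get(i, 0.0))
            for i, (t, s) in enumerate(zip(trend, seasonal))]


def ohlc_violations(open_: Sequence[float], high: Sequence[float],
                    low: Sequence[float], close: Sequence[float]) -> List[int]:
    """Return the steps where Low <= Open, Close <= High does not hold.

    Non-strict inequalities are used here, which is slightly looser
    than the "strictly between" wording in the example.
    """
    if not (len(open_) == len(high) == len(low) == len(close)):
        raise ValueError("all channels must have the same length")
    return [t for t, (o, h, l, c) in enumerate(zip(open_, high, low, close))
            if not (l <= min(o, c) and max(o, c) <= h)]


# Flat (stabilized) trend, small weekly pattern, 3% negative shock at step 1.
close_forecast = compose_forecast(
    trend=[100.0, 100.0, 100.0],
    seasonal=[0.5, -0.5, 0.0],
    shock_pct={1: -0.03},
)

# Toy two-day forecast in which the second Close exceeds the High.
violations = ohlc_violations(
    open_=[100.0, 99.0],
    high=[101.0, 100.0],
    low=[99.0, 98.0],
    close=[100.5, 100.5],
)
```

A check like `ohlc_violations` is cheap to run on any multivariate price forecast and flags the days where the cross-channel reasoning and the numbers disagree.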

BibTeX

@inproceedings{tfrbench,
    title={TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems},
    author={Md Atik Ahamed and Mihir Parmar and Palash Goyal and Yiwen Song and Long T. Le and Qiang Cheng and Chun-Liang Li and Hamid Palangi and Jinsung Yoon and Tomas Pfister},
    year={2026},
    eprint={2604.05364},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2604.05364}, 
}