We use the market Brier as an estimate of question difficulty for market questions.
For models that do not use tools and are not given the crowd forecast, this metric is not apt: they face additional difficulty from the lack of context for forecasting.
Hence, use 2fe to estimate question difficulty for the Baseline leaderboard.
Consequently, drop Baseline models from the Tournament leaderboard.
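
For reference, a minimal sketch of the market Brier used as a per-question difficulty estimate, assuming binary questions that resolve to 0 or 1. The function name `market_brier` and the sample questions are hypothetical and not taken from the codebase; the 2fe estimate used for the Baseline leaderboard is not shown here.

```python
def market_brier(market_prob: float, outcome: int) -> float:
    """Squared error of the market's probability against the resolved outcome (0 or 1)."""
    return (market_prob - outcome) ** 2


# Hypothetical questions: (question id, market probability, resolved outcome).
questions = [
    ("q1", 0.9, 1),  # market was confident and right
    ("q2", 0.5, 0),  # market was uncertain
    ("q3", 0.2, 1),  # market leaned the wrong way
]

# A higher market Brier is read as a harder question.
difficulty = {qid: market_brier(p, y) for qid, p, y in questions}

for qid, score in sorted(difficulty.items(), key=lambda kv: kv[1]):
    print(f"{qid}: {score:.2f}")
```

Under this reading, a question the market got nearly right counts as easy and one it got badly wrong counts as hard, which is only a fair yardstick for models that had access to comparable context.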