Description
The public dataset format for notes-00000.tsv appears to have changed: the file now contains 24 columns instead of 23. As a result, the scoring pipeline fails early in tsv_parser() (scoring/src/scoring/process_data.py) with:
ValueError: Invalid input: Expected 23 columns, but got 24
This prevents running python main.py with the latest downloaded Community Notes data.
To Reproduce
- Set up the scorer environment as described in the repo README (create venv, install requirements.txt, etc.). 
- Download the latest daily datasets (notes/ratings/status/userEnrollment), e.g. https://ton.twimg.com/birdwatch-public-data/2026/01/11/notes/notes-00000.zip
- Run python main.py
- See the exception above.
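A quick way to confirm the column count of a downloaded archive before running the scorer is to read only its header line. This is a minimal sketch, not part of the repo: the helper name first_header_in_zip is hypothetical, and it assumes the zip holds a single tab-separated file encoded as UTF-8.

```python
import io
import zipfile


def first_header_in_zip(zip_source):
    """Return the column names from the header line of the first archive member.

    zip_source may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        # Assumption: the archive contains exactly one TSV (e.g. notes-00000.tsv).
        member = archive.namelist()[0]
        with archive.open(member) as handle:
            first_line = io.TextIOWrapper(handle, encoding="utf-8").readline()
    # Strip the line ending, then split on tabs to get the column names.
    return first_line.rstrip("\r\n").split("\t")
```

For example, len(first_header_in_zip("notes-00000.zip")) would show whether the local copy has 23 or 24 columns.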
Expected behavior
The scorer should be able to parse the latest notes-00000.tsv format (24 columns), or at least tolerate extra columns (e.g., ignore unknown columns) while still requiring the known/required ones.
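The tolerant behavior described above could look roughly like the following. This is only a sketch using the standard library, not the repo's tsv_parser(): the function name parse_tsv_tolerant and the short REQUIRED_COLUMNS list are hypothetical, and it assumes fields are tab-separated and never quoted.

```python
import csv
import io

# Hypothetical subset of required columns, for illustration only.
REQUIRED_COLUMNS = ["noteId", "summary", "isMediaNote"]


def parse_tsv_tolerant(text, required_columns):
    """Parse TSV text, keeping required columns and ignoring unknown ones.

    Raises ValueError if any required column is missing from the header.
    """
    # Assumption: fields are never quoted, so quote characters are kept literally.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = reader.fieldnames or []
    missing = [c for c in required_columns if c not in header]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # Extra columns (such as isCollaborativeNote) are dropped here.
    return [{c: row[c] for c in required_columns} for row in reader]
```

With this approach a new trailing column does not break parsing, while a missing required column still fails loudly.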
Screenshots
N/A (stack trace below). (I can add a screenshot of the error output if helpful.)
Environment
uname -a
Linux gl-login6.arc-ts.umich.edu 4.18.0-553.85.1.el8_10.x86_64 #1 SMP Thu Nov 13 12:55:12 EST 2025 x86_64 x86_64 x86_64 GNU/Linux
python 3.10.4
pandas==2.2.2
Repo commit: d2f2ea3 (main)
Additional context
The schema change seems to be an extra trailing column in notes-00000.tsv.
Header before (23 columns):
noteId noteAuthorParticipantId createdAtMillis tweetId classification believable harmful validationDifficulty misleadingOther misleadingFactualError misleadingManipulatedMedia misleadingOutdatedInformation misleadingMissingImportantContext misleadingUnverifiedClaimAsFact misleadingSatire notMisleadingOther notMisleadingFactuallyCorrect notMisleadingOutdatedButNotWhenWritten notMisleadingClearlySatire notMisleadingPersonalOpinion trustworthySources summary isMediaNote
Header after (24 columns):
noteId noteAuthorParticipantId createdAtMillis tweetId classification believable harmful validationDifficulty misleadingOther misleadingFactualError misleadingManipulatedMedia misleadingOutdatedInformation misleadingMissingImportantContext misleadingUnverifiedClaimAsFact misleadingSatire notMisleadingOther notMisleadingFactuallyCorrect notMisleadingOutdatedButNotWhenWritten notMisleadingClearlySatire notMisleadingPersonalOpinion trustworthySources summary isMediaNote isCollaborativeNote
So the new column is: isCollaborativeNote (added as the last column).
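The comparison above can be reproduced programmatically. The sketch below assumes both header lines are tab-separated, as in the raw TSV files; header_diff is a hypothetical helper, not a function in the repo.

```python
def header_diff(expected_header, actual_header):
    """Return (added, removed) column names between two TSV header lines."""
    expected = expected_header.rstrip("\r\n").split("\t")
    actual = actual_header.rstrip("\r\n").split("\t")
    # Columns present in the new file but not in the expected schema.
    added = [c for c in actual if c not in expected]
    # Columns the expected schema has but the new file lacks.
    removed = [c for c in expected if c not in actual]
    return added, removed
```

Applied to the two headers in this issue, it should report isCollaborativeNote as the only added column and no removed columns.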
Full traceback:
Traceback (most recent call last):
File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 90, in tsv_parser
raise ValueError(f"Expected {len(columns)} columns, but got {num_fields}")
ValueError: Expected 23 columns, but got 24
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/main.py", line 31, in <module>
main()
File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/runner.py", line 278, in main
return _run_scorer(args=args, dataLoader=dataLoader, extraScoringArgs=extraScoringArgs)
File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/pandas_utils.py", line 722, in _inner
retVal = main(*args, **kwargs)
File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/runner.py", line 201, in _run_scorer
notes, ratings, statusHistory, userEnrollment = dataLoader.get_data()
File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 699, in get_data
notes, ratings, noteStatusHistory, userEnrollment = read_from_tsv(
File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 193, in read_from_tsv
notes = tsv_reader(
File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 167, in tsv_reader
return tsv_reader_single(
File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 145, in tsv_reader_single
return tsv_parser(handle.read(), mapping, columns, header, convertNAToNone=convertNAToNone)
File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 137, in tsv_parser
raise ValueError(f"Invalid input: {e}")
ValueError: Invalid input: Expected 23 columns, but got 24