notes-00000.tsv now has 24 columns; tsv_parser() expects 23 and fails with ValueError #396

@2vitalik

The public dataset format for notes-00000.tsv appears to have changed: the file now contains 24 columns instead of 23. As a result, the scoring pipeline fails early in tsv_parser() (scoring/src/scoring/process_data.py) with:

ValueError: Invalid input: Expected 23 columns, but got 24

This prevents running python main.py with the latest downloaded Community Notes data.

To Reproduce

  1. Set up the scorer environment as described in the repo README (create venv, install requirements.txt, etc.). 
  2. Download the latest daily datasets (notes/ratings/status/userEnrollment), e.g. https://ton.twimg.com/birdwatch-public-data/2026/01/11/notes/notes-00000.zip
  3. Run python main.py
  4. See the exception above.

Expected behavior
The scorer should be able to parse the latest notes-00000.tsv format (24 columns), or at least tolerate extra columns (e.g., ignore unknown columns) while still requiring the known/required ones.
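A minimal sketch of the "tolerate extra columns" behavior described above, assuming the header row is present in the file. The helper name `parse_tsv_tolerant` and its signature are hypothetical and are not the repo's `tsv_parser()`; it only illustrates checking for required columns by name rather than by count.

```python
import csv
import io


def parse_tsv_tolerant(raw, required_columns):
    """Hypothetical helper: parse TSV text, keep required columns, ignore extras."""
    # QUOTE_NONE treats quote characters as ordinary text, so free-text
    # fields such as `summary` are not re-interpreted by the csv module.
    reader = csv.DictReader(
        io.StringIO(raw, newline=""), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    header = reader.fieldnames or []

    # Fail only when a known/required column is absent.
    missing = [c for c in required_columns if c not in header]
    if missing:
        raise ValueError(f"Invalid input: missing required columns {missing}")

    # Unknown columns (e.g. isCollaborativeNote) are silently dropped.
    return [{c: row[c] for c in required_columns} for row in reader]


if __name__ == "__main__":
    sample = "noteId\tsummary\tisMediaNote\tisCollaborativeNote\n1\thello\t0\t0\n"
    print(parse_tsv_tolerant(sample, ["noteId", "summary", "isMediaNote"]))
```

With this approach a newly appended column would not break parsing, while a removed or renamed required column would still raise a `ValueError`.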

Screenshots
N/A. The full stack trace is included below; I can add a screenshot of the error output if helpful.

Environment

uname -a

Linux gl-login6.arc-ts.umich.edu 4.18.0-553.85.1.el8_10.x86_64 #1 SMP Thu Nov 13 12:55:12 EST 2025 x86_64 x86_64 x86_64 GNU/Linux

python 3.10.4

pandas==2.2.2

Repo commit: d2f2ea3 (main)

Additional context
The schema change seems to be an extra trailing column in notes-00000.tsv.

Header before (23 columns):

noteId noteAuthorParticipantId createdAtMillis tweetId classification believable harmful validationDifficulty misleadingOther misleadingFactualError misleadingManipulatedMedia misleadingOutdatedInformation misleadingMissingImportantContext misleadingUnverifiedClaimAsFact misleadingSatire notMisleadingOther notMisleadingFactuallyCorrect notMisleadingOutdatedButNotWhenWritten notMisleadingClearlySatire notMisleadingPersonalOpinion trustworthySources summary isMediaNote

Header after (24 columns):

noteId noteAuthorParticipantId createdAtMillis tweetId classification believable harmful validationDifficulty misleadingOther misleadingFactualError misleadingManipulatedMedia misleadingOutdatedInformation misleadingMissingImportantContext misleadingUnverifiedClaimAsFact misleadingSatire notMisleadingOther notMisleadingFactuallyCorrect notMisleadingOutdatedButNotWhenWritten notMisleadingClearlySatire notMisleadingPersonalOpinion trustworthySources summary isMediaNote isCollaborativeNote

So the new column is: isCollaborativeNote (added as the last column).
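The header comparison above can be reproduced with a short script. This is only a sketch: `added_columns` is a hypothetical helper, and the column lists are copied from the two headers quoted in this issue.

```python
OLD_HEADER = [
    "noteId", "noteAuthorParticipantId", "createdAtMillis", "tweetId",
    "classification", "believable", "harmful", "validationDifficulty",
    "misleadingOther", "misleadingFactualError", "misleadingManipulatedMedia",
    "misleadingOutdatedInformation", "misleadingMissingImportantContext",
    "misleadingUnverifiedClaimAsFact", "misleadingSatire", "notMisleadingOther",
    "notMisleadingFactuallyCorrect", "notMisleadingOutdatedButNotWhenWritten",
    "notMisleadingClearlySatire", "notMisleadingPersonalOpinion",
    "trustworthySources", "summary", "isMediaNote",
]
NEW_HEADER = OLD_HEADER + ["isCollaborativeNote"]


def added_columns(old_header, new_header):
    """Hypothetical helper: names in new_header that old_header lacks, in order."""
    known = set(old_header)
    return [name for name in new_header if name not in known]


def is_append_only(old_header, new_header):
    """True when new_header starts with old_header unchanged (columns only appended)."""
    return new_header[: len(old_header)] == old_header


if __name__ == "__main__":
    print(len(OLD_HEADER), len(NEW_HEADER))
    print(added_columns(OLD_HEADER, NEW_HEADER))
    print(is_append_only(OLD_HEADER, NEW_HEADER))
```

To check a downloaded file, the first line of `notes-00000.tsv` could be split on tabs and passed in place of `NEW_HEADER`.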

Full traceback:

Traceback (most recent call last):
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 90, in tsv_parser
    raise ValueError(f"Expected {len(columns)} columns, but got {num_fields}")
ValueError: Expected 23 columns, but got 24

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/main.py", line 31, in <module>
    main()
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/runner.py", line 278, in main
    return _run_scorer(args=args, dataLoader=dataLoader, extraScoringArgs=extraScoringArgs)
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/pandas_utils.py", line 722, in _inner
    retVal = main(*args, **kwargs)
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/runner.py", line 201, in _run_scorer
    notes, ratings, statusHistory, userEnrollment = dataLoader.get_data()
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 699, in get_data
    notes, ratings, noteStatusHistory, userEnrollment = read_from_tsv(
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 193, in read_from_tsv
    notes = tsv_reader(
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 167, in tsv_reader
    return tsv_reader_single(
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 145, in tsv_reader_single
    return tsv_parser(handle.read(), mapping, columns, header, convertNAToNone=convertNAToNone)
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 137, in tsv_parser
    raise ValueError(f"Invalid input: {e}")
ValueError: Invalid input: Expected 23 columns, but got 24
