notes-00000.tsv now has 24 columns; tsv_parser() expects 23 and fails with ValueError

The public dataset format for `notes-00000.tsv` appears to have changed: the file now contains 24 columns instead of 23. As a result, the scoring pipeline fails early in `tsv_parser()` (`scoring/src/scoring/process_data.py`) with:

> ValueError: Invalid input: Expected 23 columns, but got 24

This prevents running `python main.py` with the latest downloaded Community Notes data.

**To Reproduce**
1. Set up the scorer environment as described in the repo README (create venv, install requirements.txt, etc.).
2. Download the latest daily datasets (notes/ratings/status/userEnrollment), e.g. https://ton.twimg.com/birdwatch-public-data/2026/01/11/notes/notes-00000.zip
3. Run `python main.py`
4. See the exception above.

**Expected behavior**
The scorer should be able to parse the latest `notes-00000.tsv` format (24 columns), or at least tolerate extra columns (e.g., ignore unknown columns) while still requiring the known/required ones.
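
To illustrate the "tolerate extra columns" behavior, here is a minimal stdlib-only sketch. It is not the repo's `tsv_parser()`; the function name `read_known_columns` and its signature are hypothetical, and it returns plain dicts rather than a DataFrame. The idea is to validate by column name instead of by column count: fail only when a required column is missing, and silently drop anything unknown.

```python
import csv
import io


def read_known_columns(tsv_text, required_columns):
    """Parse TSV text, keeping only required_columns and ignoring extras.

    Hypothetical helper for illustration; not part of the scorer codebase.
    """
    # QUOTE_NONE treats quote characters in free-text fields literally.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    header = reader.fieldnames or []

    # Still strict about what the scorer needs: every known column must exist.
    missing = [c for c in required_columns if c not in header]
    if missing:
        raise ValueError(f"Invalid input: missing required columns {missing}")

    # Unknown columns (e.g. a new trailing one) are dropped here.
    return [{c: row[c] for c in required_columns} for row in reader]
```

With this approach, a file that gains a trailing column such as `isCollaborativeNote` would still load, while a file that drops or renames a known column would still fail loudly.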

**Screenshots**
N/A (stack trace below). (I can add a screenshot of the error output if helpful.)

**Environment**
> uname -a

Linux gl-login6.arc-ts.umich.edu 4.18.0-553.85.1.el8_10.x86_64 #1 SMP Thu Nov 13 12:55:12 EST 2025 x86_64 x86_64 x86_64 GNU/Linux

python 3.10.4

pandas==2.2.2

Repo commit: d2f2ea3bccadd75289802a06b0f4f4d327cd84e7 (main)

**Additional context**
The schema change seems to be an extra trailing column in `notes-00000.tsv`.

**Header before (23 columns):**
> noteId	noteAuthorParticipantId	createdAtMillis	tweetId	classification	believable	harmful	validationDifficulty	misleadingOther	misleadingFactualError	misleadingManipulatedMedia	misleadingOutdatedInformation	misleadingMissingImportantContext	misleadingUnverifiedClaimAsFact	misleadingSatire	notMisleadingOther	notMisleadingFactuallyCorrect	notMisleadingOutdatedButNotWhenWritten	notMisleadingClearlySatire	notMisleadingPersonalOpinion	trustworthySources	summary	isMediaNote

**Header after (24 columns):**
> noteId	noteAuthorParticipantId	createdAtMillis	tweetId	classification	believable	harmful	validationDifficulty	misleadingOther	misleadingFactualError	misleadingManipulatedMedia	misleadingOutdatedInformation	misleadingMissingImportantContext	misleadingUnverifiedClaimAsFact	misleadingSatire	notMisleadingOther	notMisleadingFactuallyCorrect	notMisleadingOutdatedButNotWhenWritten	notMisleadingClearlySatire	notMisleadingPersonalOpinion	trustworthySources	summary	isMediaNote	isCollaborativeNote

So the new column is: `isCollaborativeNote` (added as the last column).
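
For anyone who wants to confirm the difference on their own download, a small sketch like the following compares a header line against the 23 column names listed above. The helper name `diff_header` is hypothetical, and the expected list is copied from the "before" header in this issue rather than from the repo's constants.

```python
# The 23 column names from the "before" header in this issue.
EXPECTED_NOTE_COLUMNS = [
    "noteId", "noteAuthorParticipantId", "createdAtMillis", "tweetId",
    "classification", "believable", "harmful", "validationDifficulty",
    "misleadingOther", "misleadingFactualError", "misleadingManipulatedMedia",
    "misleadingOutdatedInformation", "misleadingMissingImportantContext",
    "misleadingUnverifiedClaimAsFact", "misleadingSatire", "notMisleadingOther",
    "notMisleadingFactuallyCorrect", "notMisleadingOutdatedButNotWhenWritten",
    "notMisleadingClearlySatire", "notMisleadingPersonalOpinion",
    "trustworthySources", "summary", "isMediaNote",
]


def diff_header(expected_columns, header_line):
    """Return (extra, missing) column names for a tab-separated header line."""
    actual = header_line.rstrip("\r\n").split("\t")
    expected = set(expected_columns)
    extra = [c for c in actual if c not in expected]
    missing = [c for c in expected_columns if c not in actual]
    return extra, missing


if __name__ == "__main__":
    # Assumes notes-00000.tsv has been unzipped into the working directory.
    with open("notes-00000.tsv", encoding="utf-8") as handle:
        extra, missing = diff_header(EXPECTED_NOTE_COLUMNS, handle.readline())
    print("extra:", extra)
    print("missing:", missing)
```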

Full traceback:
```
Traceback (most recent call last):
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 90, in tsv_parser
    raise ValueError(f"Expected {len(columns)} columns, but got {num_fields}")
ValueError: Expected 23 columns, but got 24

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/main.py", line 31, in <module>
    main()
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/runner.py", line 278, in main
    return _run_scorer(args=args, dataLoader=dataLoader, extraScoringArgs=extraScoringArgs)
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/pandas_utils.py", line 722, in _inner
    retVal = main(*args, **kwargs)
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/runner.py", line 201, in _run_scorer
    notes, ratings, statusHistory, userEnrollment = dataLoader.get_data()
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 699, in get_data
    notes, ratings, noteStatusHistory, userEnrollment = read_from_tsv(
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 193, in read_from_tsv
    notes = tsv_reader(
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 167, in tsv_reader
    return tsv_reader_single(
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 145, in tsv_reader_single
    return tsv_parser(handle.read(), mapping, columns, header, convertNAToNone=convertNAToNone)
  File "/nfs/turbo/si-presnick1/csmr-communitynotes/communitynotes/scoring/src/scoring/process_data.py", line 137, in tsv_parser
    raise ValueError(f"Invalid input: {e}")
ValueError: Invalid input: Expected 23 columns, but got 24
```
