The PTB dataset downloaded using the script in the manual has the wrong number of unique words #12

@SplashCloud

The PTB dataset downloaded using the script in the manual fails the assert ntokens == 10000 check in test_ptb_dataset. The total number of unique words is 8481 instead of the expected 10000 (I believe my code is correct). When I download the data from this url instead, the same test passes. I then compared the datasets (train.txt, test.txt, valid.txt) from the two sources using diff, and they do differ in train.txt.
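For anyone who wants to reproduce the count, here is a minimal sketch of how the vocabulary size of each file could be checked and compared. It assumes tokens are whitespace-separated and that one `<eos>` marker is added per line, which is a common convention for PTB loaders but may not match what the test actually does. The function names and the file paths in the usage note are hypothetical.

```python
def vocab_from_lines(lines, eos="<eos>"):
    """Return the set of unique whitespace-separated tokens in `lines`."""
    tokens = set()
    for line in lines:
        tokens.update(line.split())
        # Assumption: the loader appends one end-of-sentence marker per line.
        tokens.add(eos)
    return tokens


def vocab_from_file(path, eos="<eos>"):
    """Return the set of unique tokens in the text file at `path`."""
    with open(path, encoding="utf-8") as f:
        return vocab_from_lines(f, eos=eos)


def compare_vocab(vocab_a, vocab_b):
    """Return (tokens only in vocab_a, tokens only in vocab_b)."""
    return vocab_a - vocab_b, vocab_b - vocab_a
```

Calling `len(vocab_from_file("data/ptb/train.txt"))` on each copy of train.txt (the path is a placeholder) should show which download gives 10000, and `compare_vocab` on the two resulting sets lists the words that appear in only one of them.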
