There is a problem with the PTB dataset downloaded using the script in the manual: it fails the assert ntokens == 10000 check in test_ptb_dataset. The total number of unique words is 8481, which is less than 10000 (I think my code is right). I also downloaded the data from this url, and that copy passes the test above. I then compared the datasets (train.txt, test.txt, valid.txt) from the two sources using diff, and they do differ on train.txt.
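
For anyone who wants to reproduce the count, here is a minimal sketch of how the unique-token check and the file comparison could be done. It is not the code from test_ptb_dataset: the function names `count_unique_tokens` and `files_identical` are hypothetical, tokens are assumed to be whitespace-separated, and whether an end-of-sentence marker counts toward the vocabulary is left as an optional `eos_token` parameter because that is also an assumption.

```python
from pathlib import Path


def count_unique_tokens(paths, eos_token=None):
    """Count distinct whitespace-separated tokens across the given files.

    Hypothetical helper. If eos_token is set, it is appended to every line
    before counting (an assumption about how the vocabulary is built).
    """
    vocab = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                if eos_token is not None:
                    tokens.append(eos_token)
                vocab.update(tokens)
    return len(vocab)


def files_identical(path_a, path_b):
    """Return True if the two files have exactly the same bytes."""
    return Path(path_a).read_bytes() == Path(path_b).read_bytes()
```

Calling `count_unique_tokens` on train.txt from each source, and `files_identical` on each pair of train.txt, test.txt and valid.txt, should show which copy is short of 10000 and which files actually differ.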