Empty jsonl file after conversion to DPR format #13

@jblagoja

I am following the step-by-step guide on how to prepare the data, train the retriever and reader, and evaluate the model.

The issue is that the following command produces an empty jsonl file:

relik data convert-to-dpr \
  data/blink/processed/blink-dev-kilt-relik-windowed.jsonl \
  data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl \
  data/kb/wikipedia/documents.jsonl \
  --title-map data/kb/wikipedia/title_map.json

The same happens with the train/dev/test subsets of the BLINK dataset as well as with the AIDA dataset.

I noticed that if I remove or comment out the following lines in the convert_to_dpr function in relik/cli/data.py:

if len(positive_pssgs) == 0:
    continue

then the output jsonl file is filled with data like the following (a sample from the generated file blink-dev-kilt-relik-windowed-dpr.jsonl for the document with id=0):

{"id": "0_0", "doc_topic": "{", "question": "{"id": "blink-dev-0", "input": "On 8 October 1806, Napoleon's troops first entered Prussian territory and battles took", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_31", "doc_topic": "{", "question": ""On 8 October 1806, Napoleon's troops first entered Prussian territory and battles took place on the banks of the Saale between the forces of Napoleon I of France and", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_119", "doc_topic": "{", "question": "place on the banks of the Saale between the forces of Napoleon I of France and Frederick William III of Prussia. Napoleon spent the night of 8 October in Schloss Ebersdorf", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_198", "doc_topic": "{", "question": "Frederick William III of Prussia. Napoleon spent the night of 8 October in Schloss Ebersdorf. This (a precursor to the Battle of Schleiz on 9 October and the [", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_290", "doc_topic": "{", "question": ". This (a precursor to the Battle of Schleiz on 9 October and the [START_ENT] Battle of Jena-Auerstadt [END_ENT] on 14 October), was", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_357", "doc_topic": "{", "question": "START_ENT] Battle of Jena-Auerstadt [END_ENT] on 14 October), was the first battle of the War of the Fourth Coalition.", "output"", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_423", "doc_topic": "{", "question": "the first battle of the War of the Fourth Coalition.", "output": [{"answer": "Battle of Jena\u2013Auerstedt", "", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_486", "doc_topic": "{", "question": ": [{"answer": "Battle of Jena\u2013Auerstedt", "provenance": [{"wikipedia_id": "295160", "title"", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_529", "doc_topic": "{", "question": "provenance": [{"wikipedia_id": "295160", "title": "Battle of Jena\u2013Auerstedt"}]}], "meta"", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_577", "doc_topic": "{", "question": ": "Battle of Jena\u2013Auerstedt"}]}], "meta": {"mention": "Battle of Jena-Auerstadt", "left_context", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_617", "doc_topic": "{", "question": ": {"mention": "Battle of Jena-Auerstadt", "left_context": "On 8 October 1806, Napoleon's troops first entered Prussian territory and", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_672", "doc_topic": "{", "question": "": "On 8 October 1806, Napoleon's troops first entered Prussian territory and battles took place on the banks of the Saale between the forces of Napoleon I of", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_750", "doc_topic": "{", "question": "battles took place on the banks of the Saale between the forces of Napoleon I of France and Frederick William III of Prussia. Napoleon spent the night of 8 October in", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_831", "doc_topic": "{", "question": "France and Frederick William III of Prussia. Napoleon spent the night of 8 October in Schloss Ebersdorf. This (a precursor to the Battle of Schleiz on 9 October and", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_917", "doc_topic": "{", "question": "Schloss Ebersdorf. This (a precursor to the Battle of Schleiz on 9 October and the", "right_context": "on 14 October), was the first", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_979", "doc_topic": "{", "question": "on 9 October and the", "right_context": "on 14 October), was the first battle of the War of the Fourth Coalition."}}", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}

Note that the properties "positive_ctxs", "negative_ctxs", and "hard_negative_ctxs" are empty.
This is the case for every document and every line in the jsonl file; please confirm whether this is expected.
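To quantify this rather than eyeball the file, a minimal sketch like the one below could count how many records have no positives. The helper name `summarize_dpr_file` is my own and not part of relik; it only assumes that each line is a JSON object with a "positive_ctxs" key, as in the sample above.

```python
import json
from pathlib import Path


def summarize_dpr_file(path):
    """Return (total records, records whose "positive_ctxs" is missing or empty)."""
    total = 0
    without_positives = 0
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            record = json.loads(line)
            total += 1
            if not record.get("positive_ctxs"):
                without_positives += 1
    return total, without_positives


# Example call (path taken from the command above):
# summarize_dpr_file("data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl")
```

If both numbers are equal, every window was written without a positive passage, which matches what I see.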

I assume this comes from the following lines in the convert_to_dpr function in relik/cli/data.py:

for idx, entity in enumerate(sentence["window_labels"]):
    entity = entity[2]
    ...
    if entity in documents:
        doc = documents.get_document_from_text(entity)
        ...
        positive_pssgs.append(doc.to_dict())
    ...
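To illustrate why an empty "window_labels" list leaves no positives, here is a simplified stand-in for that loop. It is not relik code: `collect_positive_titles` is a hypothetical helper, the knowledge base is reduced to a plain set of titles, and the (start, end, title) label layout is my assumption based on the `entity[2]` access above.

```python
def collect_positive_titles(window_labels, known_titles):
    """Keep the title of every label that is present in the knowledge base."""
    positives = []
    for label in window_labels:
        title = label[2]  # assumed layout: (start, end, title)
        if title in known_titles:
            positives.append(title)
    return positives


known_titles = {"Battle of Jena\u2013Auerstedt"}

# A window like the ones in my files: no labels at all.
window = {"window_labels": []}
positives = collect_positive_titles(window["window_labels"], known_titles)
print(positives)  # [] -> the `len(positive_pssgs) == 0` guard would skip this window
```

With no labels the loop body never runs, so every window is skipped and nothing is written, which would explain the empty output file.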

If we look at the jsonl file generated in the previous step by running the following command:

relik data create-windows \
  data/blink/processed/blink-dev-kilt-relik.jsonl \
  data/blink/processed/blink-dev-kilt-relik-windowed.jsonl

we find that the "window_labels" property is empty on every line (a sample from the generated file blink-dev-kilt-relik-windowed.jsonl for the document with id=0):

{"doc_id": 0, "window_id": 0, "text": "{"id": "blink-dev-0", "input": "On 8 October 1806, Napoleon's troops first entered Prussian territory and battles took", "tokens": ["{", """, "i", "d", """, ":", """, "blink", "-", "dev-0", """, ",", """, "input", """, ":", """, "On", "8", "October", "1806", ",", "Napoleon", "'s", "troops", "first", "entered", "Prussian", "territory", "and", "battles", "took"], "words": ["{", """, "i", "d", """, ":", """, "blink", "-", "dev-0", """, ",", """, "input", """, ":", """, "On", "8", "October", "1806", ",", "Napoleon", "'s", "troops", "first", "entered", "Prussian", "territory", "and", "battles", "took"], "doc_topic": "{", "offset": 0, "spans": [], "token2char_start": {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5, "6": 7, "7": 8, "8": 13, "9": 14, "10": 19, "11": 20, "12": 22, "13": 23, "14": 28, "15": 29, "16": 31, "17": 32, "18": 35, "19": 37, "20": 45, "21": 49, "22": 51, "23": 59, "24": 62, "25": 69, "26": 75, "27": 83, "28": 92, "29": 102, "30": 106, "31": 114}, "token2char_end": {"0": 1, "1": 2, "2": 3, "3": 4, "4": 5, "5": 6, "6": 8, "7": 13, "8": 14, "9": 19, "10": 20, "11": 21, "12": 23, "13": 28, "14": 29, "15": 30, "16": 32, "17": 34, "18": 36, "19": 44, "20": 49, "21": 50, "22": 59, "23": 61, "24": 68, "25": 74, "26": 82, "27": 91, "28": 101, "29": 105, "30": 113, "31": 118}, "char2token_start": {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5, "7": 6, "8": 7, "13": 8, "14": 9, "19": 10, "20": 11, "22": 12, "23": 13, "28": 14, "29": 15, "31": 16, "32": 17, "35": 18, "37": 19, "45": 20, "49": 21, "51": 22, "59": 23, "62": 24, "69": 25, "75": 26, "83": 27, "92": 28, "102": 29, "106": 30, "114": 31}, "char2token_end": {"1": 0, "2": 1, "3": 2, "4": 3, "5": 4, "6": 5, "8": 6, "13": 7, "14": 8, "19": 9, "20": 10, "21": 11, "23": 12, "28": 13, "29": 14, "30": 15, "32": 16, "34": 17, "36": 18, "44": 19, "49": 20, "50": 21, "59": 22, "61": 23, "68": 24, "74": 25, "82": 26, "91": 27, "101": 28, "105": 29, "113": 30, "118": 31}, 
"window_labels": [], "window_labels_tokens": []}
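To check the whole windowed file instead of a single sample, a small sketch like this could build a histogram of labels per window. The function name `window_label_histogram` is hypothetical; it only assumes one JSON object per line with a "window_labels" list, as shown above.

```python
import json
from collections import Counter


def window_label_histogram(path):
    """Map 'number of labels in a window' -> 'number of windows with that many'."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            window = json.loads(line)
            counts[len(window.get("window_labels", []))] += 1
    return counts


# Example call (path taken from the command above):
# window_label_histogram("data/blink/processed/blink-dev-kilt-relik-windowed.jsonl")
```

If the only key in the result is 0, no window in the file carries a label.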

Please check whether the data preparation needed for training is working as expected.

With the generated train/dev/test files for the BLINK and AIDA datasets, I moved on to the next step: training the retriever model for Entity Linking.

After running the following command:

relik retriever train relik/retriever/conf/pretrain_iterable_in_batch.yaml \
  model.language_model=intfloat/e5-base-v2 \
  data.train_dataset_path=data/blink/processed/blink-train-kilt-relik-windowed-dpr.jsonl \
  data.val_dataset_path=data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl \
  data.test_dataset_path=data/blink/processed/blink-test-kilt-relik-windowed-dpr.jsonl \
  data.shared_params.documents_path=data/kb/wikipedia/documents.jsonl

I get the following error:

Error executing job with overrides: ['model.language_model=intfloat/e5-base-v2', 'data.train_dataset_path=data/blink/processed/blink-train-kilt-relik-windowed-dpr.jsonl', 'data.val_dataset_path=data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl', 'data.test_dataset_path=data/blink/processed/blink-test-kilt-relik-windowed-dpr.jsonl', 'data.shared_params.documents_path=data/kb/wikipedia/documents.jsonl']
Error in call to target 'relik.retriever.data.datasets.InBatchNegativesDataset':
IndexError('list index out of range')
full_key: datasets.train

The same happens with the following command:

relik retriever train relik/retriever/conf/finetune_iterable_in_batch.yaml \
  model.language_model=intfloat/e5-base-v2 \
  data.train_dataset_path=data/aida/processed/aida-train-relik-windowed-dpr.jsonl \
  data.val_dataset_path=data/aida/processed/aida-dev-relik-windowed-dpr.jsonl \
  data.test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl \
  data.shared_params.documents_path=data/kb/wikipedia/documents.jsonl

which produces the following error:

Error executing job with overrides: ['model.language_model=intfloat/e5-base-v2', 'data.train_dataset_path=data/aida/processed/aida-train-relik-windowed-dpr.jsonl', 'data.val_dataset_path=data/aida/processed/aida-dev-relik-windowed-dpr.jsonl', 'data.test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl', 'data.shared_params.documents_path=data/kb/wikipedia/documents.jsonl']
Error in call to target 'relik.retriever.data.datasets.AidaInBatchNegativesDataset':
IndexError('list index out of range')
full_key: datasets.train
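Both runs point at files produced by convert-to-dpr, so a quick pre-flight check on the dataset paths may help rule out empty inputs before training. This is only a sketch: `find_empty_files` is a hypothetical helper, not a relik function, and I have not confirmed that an empty file is what triggers the IndexError.

```python
import os


def find_empty_files(paths):
    """Return the paths that do not exist or contain zero bytes."""
    return [p for p in paths if not os.path.isfile(p) or os.path.getsize(p) == 0]


# Example call (paths taken from the training command above):
# find_empty_files([
#     "data/aida/processed/aida-train-relik-windowed-dpr.jsonl",
#     "data/aida/processed/aida-dev-relik-windowed-dpr.jsonl",
#     "data/aida/processed/aida-test-relik-windowed-dpr.jsonl",
# ])
```

Any path returned here is a file the dataset class would have nothing to read from.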

I assume all of this is related, so please take a look and provide feedback.

Thanks and best regards.
