Data and code for the following paper:
Vanja M Karan, Stephen McQuistin, Ryo Yanagida, Colin Perkins, Gareth Tyson, Ignacio Castro, Patrick Healey, Matthew Purver, A Dataset for Expert Reviewer Recommendation with Large Language Models as Zero-shot Rankers, Proceedings of The 31st International Conference on Computational Linguistics, 2025
To reproduce results from the paper, call:
python eval.py results-sts-[variant]-ta-[dataset].pickle
Where:
[variant] determines which model variant you want to evaluate, and consists of two components:
- model family: "3" for llama3, or an empty string for llama2
- model size: "7b", "8b", or "70b". E.g., 7b llama2 corresponds to "7b" and 70b llama3 corresponds to "3-70b"
[dataset] determines which dataset to evaluate on; it can be "ietf", "nips", or "stelmakh"
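The naming scheme above can be sketched as a small helper. This is only an illustration: the function name results_filename is hypothetical and not part of the repository, and it assumes the two variant components are joined with "-" only when the family part is non-empty, as the "3-70b" example suggests.

```python
def results_filename(family: str, size: str, dataset: str) -> str:
    """Build a results file name following the naming scheme described above.

    Hypothetical helper, not part of the repository.
    family:  "llama2" or "llama3"
    size:    "7b", "8b" or "70b"
    dataset: "ietf", "nips" or "stelmakh"
    """
    if family not in ("llama2", "llama3"):
        raise ValueError(f"unknown model family: {family!r}")
    if size not in ("7b", "8b", "70b"):
        raise ValueError(f"unknown model size: {size!r}")
    if dataset not in ("ietf", "nips", "stelmakh"):
        raise ValueError(f"unknown dataset: {dataset!r}")

    # llama3 contributes "3" to the variant, llama2 contributes an empty string.
    prefix = "3" if family == "llama3" else ""
    # Assumption: a "-" separates the two components only when the prefix is non-empty.
    variant = f"{prefix}-{size}" if prefix else size
    return f"results-sts-{variant}-ta-{dataset}.pickle"
```

For example, results_filename("llama3", "70b", "ietf") gives the file name to pass to eval.py for the 70b llama3 model on the ietf dataset. The helper does not check whether a given family and size combination was actually evaluated in the paper.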
Prompts used in the paper are available in the prompts-sts-ta-[dataset].pickle files. Each file contains a list of tuples of the form (label_data, prompt_text). Each tuple represents a reviewer and a pair of papers that the LLM must rate, along with the true label. More details below:
label_data is a tuple of the form (reviewer_id, paperA_id, paperB_id, correct_label, gold_score_distance), where:
- ids refer to the original data files,
- correct_label is "first" or "second", and
- gold_score_distance is the difference between the scores used for calculating the evaluation metric.
prompt_text is a string generated using the following template:
prompt = """[INST] <<SYS>> You are an expert pairing reviewers with suitable papers to review. <</SYS>>
The description of the reviewer is as follows:\n {reviewer_description} \n\n\n
Description of paper A:\n {paperA_description} \n\n\n
Description of paper B: \n {paperB_description} \n\n\n
Which paper is more relevant for this reviewer (your answer must be either 'paper A' or 'paper B')?
[/INST] My answer is:"""
(For datasets where there is a single paper and a pair of reviewers, the prompt and label data are analogous to the above, but with papers and reviewers switched.)
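The (label_data, prompt_text) structure described above can be read roughly as follows. This is a minimal sketch rather than code from the repository: the function names load_prompts, unpack_entry and label_counts are hypothetical, and the dictionary keys simply mirror the field names used in this README.

```python
import pickle
from collections import Counter


def load_prompts(path):
    """Load a prompts file, expected to hold a list of (label_data, prompt_text) tuples."""
    with open(path, "rb") as f:
        return pickle.load(f)


def unpack_entry(entry):
    """Turn one (label_data, prompt_text) tuple into a dict with named fields."""
    label_data, prompt_text = entry
    # Field order follows the label_data description in this README.
    reviewer_id, paper_a_id, paper_b_id, correct_label, gold_score_distance = label_data
    return {
        "reviewer_id": reviewer_id,
        "paperA_id": paper_a_id,
        "paperB_id": paper_b_id,
        "correct_label": correct_label,  # "first" or "second"
        "gold_score_distance": gold_score_distance,
        "prompt_text": prompt_text,
    }


def label_counts(entries):
    """Count how often each correct_label value occurs in a list of entries."""
    return Counter(unpack_entry(e)["correct_label"] for e in entries)
```

A typical use would be entries = load_prompts("prompts-sts-ta-ietf.pickle") followed by unpack_entry(entries[0]) to inspect a single example. For the datasets with a single paper and a pair of reviewers, the id fields would presumably hold the switched roles, so the key names above should be read accordingly.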