Thanks for releasing the code and evaluation configs for OBLIVION. I have a question about the number of LongMemEval questions used in the evaluation.
README.md says "full_eval_20260202/ -- Full-scale evaluation configs (488 samples, various strategies)", but the official LongMemEval repo describes the cleaned benchmark files as containing 500 evaluation instances.
I also noticed that the code excludes some questions listed in a blacklist, but that list does not seem to be provided in this repo. Could you clarify whether the LongMemEval results in the paper were computed on the official 500-instance cleaned split or on a filtered 488-instance subset? And if a filtered 488-instance subset was used, is that a common practice?
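To make the question concrete, here is a minimal sketch of the filtering I am assuming happens. Everything in it is a placeholder on my side: the `filter_blacklisted` helper, the `question_id` key, the synthetic IDs, and the blacklist size of 12 (which I only inferred from 500 - 488) are assumptions, not taken from this repo.

```python
def filter_blacklisted(instances, blacklist):
    """Drop every instance whose question_id appears in the blacklist."""
    blocked = set(blacklist)
    return [item for item in instances if item["question_id"] not in blocked]


# Synthetic stand-in data, NOT the real benchmark:
# 500 placeholder instances and 12 placeholder blacklisted IDs.
instances = [{"question_id": f"q{i:03d}"} for i in range(500)]
blacklist = [f"q{i:03d}" for i in range(12)]

kept = filter_blacklisted(instances, blacklist)
print(len(instances), len(kept))
```

If the actual pipeline works roughly like this, having the blacklist file (or the list of excluded question IDs) would be enough to reproduce the same subset.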
I may have misunderstood the setup, so please correct me if I missed something.