I failed to reproduce the Llama2-7b-4k (w/o SFT) results in the paper #17

Hi, I failed to reproduce the Llama2-7b-4k (w/o SFT) results reported in the paper.

Here are our results next to the L-Eval numbers:

| Methods                         | Tokens | Coursera | GSM  | QuALITY | TOEFL | CodeU | SFiction | Avg   |
|---------------------------------|--------|----------|------|---------|-------|-------|----------|-------|
| (L-Eval) Llama2-7b-4k (w/o SFT) | 4k     | 20.05    | 2.0  | 28.71   | 24.53 | 0.00  | 40.62    | 19.31 |
| (Ours) Llama2-7b-4k (w/o SFT)   | 4k     | 15.26    | 19.0 | 30.69   | 13.01 | 3.33  | 35.93    | 19.54 |
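
The averages are close, but the per-task scores differ a lot. A minimal sketch to quantify the gaps, assuming the scores are copied by hand from the table above (the variable and function names are illustrative and not part of the L-Eval code):

```python
# Scores copied from the table above; names here are illustrative only.
tasks = ["Coursera", "GSM", "QuALITY", "TOEFL", "CodeU", "SFiction"]
paper = [20.05, 2.0, 28.71, 24.53, 0.00, 40.62]   # (L-Eval) row
ours = [15.26, 19.0, 30.69, 13.01, 3.33, 35.93]   # (Ours) row


def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


# Per-task difference: our score minus the reported score.
deltas = {t: round(o - p, 2) for t, p, o in zip(tasks, paper, ours)}

for task, delta in deltas.items():
    print(f"{task:<9} {delta:+.2f}")

print(f"paper avg: {mean(paper):.2f}")
print(f"ours avg:  {mean(ours):.2f}")
```

The largest gaps are on GSM (ours is higher) and TOEFL (ours is lower), which roughly cancel out in the average.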

Here is our experimental setting:
We modified the llama2-chat-test.py file to disable the NTK parameters and to use Llama2-7b for the evaluation.
We then ran it like this:
python3 Baselines/llama2-chat-test.py \
       --scale 7b \
       --max_length 4k \
       --metric exam_eval 

What could be the reason for this difference? Should I adjust the prompt or other parameters?

