Commit 680a3e8

update EVALUATION.md
1 parent 71ad282 commit 680a3e8

1 file changed, +24 -3 lines changed

eval/EVALUATION.md

````diff
@@ -8,7 +8,13 @@ mkdir data/ceval
 mv ceval-exam.zip data/ceval
 cd data/ceval; unzip ceval-exam.zip
 cd ../../
+
+# Qwen-7B
 python evaluate_ceval.py -d data/ceval/
+
+# Qwen-7B-Chat
+pip install thefuzz
+python evaluate_chat_ceval.py -d data/ceval/
 ```
 
 - MMLU
````
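The commit adds `pip install thefuzz` before the Qwen-7B-Chat runs of `evaluate_chat_ceval.py` and `evaluate_chat_mmlu.py` but does not say what the package is for. thefuzz is a fuzzy string-matching library (its `fuzz.ratio` scorer returns a 0-100 similarity score), and a plausible use is mapping a chat model's free-form reply back onto one of the lettered options. The sketch below illustrates that idea under this assumption only: `extract_choice`, `similarity` and the two-step rule are hypothetical, not taken from the Qwen scripts, and the standard library's `difflib.SequenceMatcher` stands in for thefuzz so the example runs without extra installs.

```python
import re
from difflib import SequenceMatcher

# Letter at the very start of the reply: "B", "B.", "B:", "(B)".
LEADING_LETTER = re.compile(r"\(?([A-D])(?:[).:]|$)")
# Letter after an explicit phrase: "The answer is C", "answer is: (C)".
STATED_LETTER = re.compile(r"[Aa]nswer is:?\s*\(?([A-D])\b")


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; a stand-in for a thefuzz scorer."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extract_choice(reply: str, options: dict[str, str]) -> str:
    """Hypothetical helper: map a free-form chat reply onto an option letter."""
    text = reply.strip()
    # Step 1: trust an explicitly stated letter if there is one.
    match = LEADING_LETTER.match(text) or STATED_LETTER.search(text)
    if match:
        return match.group(1)
    # Step 2: otherwise pick the option whose text is most similar to the reply.
    return max(options, key=lambda letter: similarity(text, options[letter]))


options = {"A": "Paris", "B": "Lyon", "C": "Marseille", "D": "Nice"}
print(extract_choice("B. Lyon", options))  # B
print(extract_choice("The answer is C", options))  # C
print(extract_choice("I believe it's Marseille.", options))  # C
```

As written, the fallback always returns the closest option; a real scorer would probably also need a minimum-similarity cutoff for replies that match no option well.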
````diff
@@ -19,27 +25,42 @@ mkdir data/mmlu
 mv data.tar data/mmlu
 cd data/mmlu; tar xf data.tar
 cd ../../
+
+# Qwen-7B
 python evaluate_mmlu.py -d data/mmlu/data/
+
+# Qwen-7B-Chat
+pip install thefuzz
+python evaluate_chat_mmlu.py -d data/mmlu/data/
 ```
 
 - HumanEval
 
 Get the HumanEval.jsonl file from [here](https://github.com/openai/human-eval/tree/master/data)
 
 ```Shell
-python evaluate_humaneval.py -f HumanEval.jsonl -o HumanEval_res.jsonl
 git clone https://github.com/openai/human-eval
 pip install -e human-eval
+
+# Qwen-7B
+python evaluate_humaneval.py -f HumanEval.jsonl -o HumanEval_res.jsonl
 evaluate_functional_correctness HumanEval_res.jsonl
+# Qwen-7B-Chat
+python evaluate_chat_mmlu.py -f HumanEval.jsonl -o HumanEval_res_chat.jsonl
+evaluate_functional_correctness HumanEval_res_chat.jsonl
 ```
 
 When installing package human-eval, please note its following disclaimer:
 
 This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.
-
````
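`evaluate_functional_correctness`, installed by `pip install -e human-eval`, runs each completion in the results file against the task's unit tests and reports pass@k. human-eval computes this with the unbiased estimator from the Codex paper: for a task with n samples of which c pass, pass@k = 1 - C(n-c, k) / C(n, k), averaged over tasks. The sketch below re-implements that formula with `math.comb` purely for illustration (human-eval itself evaluates the same ratio as a running product with NumPy rather than with large binomial coefficients); the function names and the toy counts are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples generated, c of them pass the tests."""
    if n - c < k:
        # Every subset of k samples contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Benchmark-level score: average the per-task estimates over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Toy counts (not real results): three tasks with 10 samples each.
toy = [(10, 0), (10, 3), (10, 10)]
print(round(mean_pass_at_k(toy, 1), 4))  # 0.4333, the mean of 0.0, 0.3 and 1.0
```

With a single completion per task (n = 1), the estimator reduces to the fraction of tasks whose completion passes, so pass@1 is plain accuracy in that case.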
````diff
 
 - GSM8K
 
 ```Shell
+# Qwen-7B
 python evaluate_gsm8k.py
-```
+
+# Qwen-7B-Chat
+python evaluate_chat_gsm8k.py # zeroshot
+python evaluate_chat_gsm8k.py --use-fewshot # fewshot
+```
````
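GSM8K reference solutions end with a line of the form `#### <answer>`, so scoring a run comes down to pulling a number out of the model's output and comparing it with that value. The diff does not show how `evaluate_gsm8k.py` or `evaluate_chat_gsm8k.py` extract the prediction, so the sketch below assumes a simple rule (take the last number in the completion); `reference_answer`, `predicted_answer` and `is_correct` are hypothetical helpers, not functions from those scripts.

```python
import re
from typing import Optional

# Integers or decimals, optionally negative, with optional thousands separators ("1,800").
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """Return the final answer from a GSM8K-style solution ending in '#### <answer>'."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(completion: str) -> Optional[str]:
    """Hypothetical extraction rule: the last number that appears in the completion."""
    numbers = NUMBER.findall(completion)
    return numbers[-1].replace(",", "") if numbers else None


def is_correct(completion: str, solution: str) -> bool:
    """Compare numerically, so that '18' and '18.0' count as the same answer."""
    predicted = predicted_answer(completion)
    if predicted is None:
        return False
    return abs(float(predicted) - float(reference_answer(solution))) < 1e-6


# Made-up example in the GSM8K answer format (not an item from the dataset).
solution = "Tom has 3 bags of 6 apples, so he has 3 * 6 = 18 apples.\n#### 18"
print(is_correct("3 * 6 = 18, so the answer is 18.", solution))  # True
print(is_correct("He has 20 apples.", solution))  # False
```

Taking the last number is a common heuristic for step-by-step answers, but it can be fooled, for example by a trailing unit conversion; the actual scripts may use a stricter rule.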
