
Commit df995d4

Add DeepSeek-R1 BKC (#3660) (#3661)

* Add DeepSeek-R1 BKC
* download ckpt first
* use git diff format to indicate change
* format correction

Co-authored-by: Xia Weiwen <weiwen.xia@intel.com>

1 parent 432c149 commit df995d4

File tree

1 file changed: +54 -2 lines changed

examples/cpu/llm/inference/README.md

+54 -2
@@ -2,7 +2,7 @@
 
 We have supported a long list of LLMs, including the most notable open-source models
 like Llama series, Qwen series, Phi-3/Phi-4 series,
-and the phenomenal high-quality reasoning model DeepSeek-R1.
+and the phenomenal high-quality reasoning model [DeepSeek-R1](#223-deepseek-r1-671b).
 
 ## 1.1 Verified for single instance mode
 
@@ -472,7 +472,59 @@ huggingface-cli download hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 --l
 deepspeed --bind_cores_to_rank run.py --benchmark -m ./Llama-3.1-8B-GPTQ --ipex --ipex-weight-only-quantization --weight-dtype INT4 --lowp-mode BF16 --quant-with-amp --autotp --output-dir "saved_results"
 ```
 
-### 2.2.3 Additional configuration for specific models
+### 2.2.3 DeepSeek-R1 671B
+
+IPEX applies dedicated optimizations to the full version of the `DeepSeek-R1` model,
+and it can now be showcased with the `run.py` script!
+
+- Currently, INT8 precision with weight-only quantization is supported.
+Please download the INT8 quantized version from [HuggingFace Models](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8).
+
+```bash
+huggingface-cli download --resume meituan/DeepSeek-R1-Channel-INT8 --local-dir <DEEPSEEK_INT8_CKPT_SAVE_PATH>
+```
+
+- A change is required in the `config.json` file of the downloaded checkpoint in order to apply the optimizations.
+Please add the `quantization_config` field at the end of the file as shown below.
+
+```diff
+"transformers_version": "4.46.3",
+"use_cache": true,
+"v_head_dim": 128,
+- "vocab_size": 129280
++ "vocab_size": 129280,
++ "quantization_config": {
++ "quant_method": "int8",
++ "bits": 8,
++ "group_size": -1
++ }
+}
+```
+
+- Use the following command to run the test.
+
+```bash
+# at examples/cpu/llm/inference
+deepspeed --bind_cores_to_rank run.py -m <DEEPSEEK_INT8_CKPT_SAVE_PATH> --benchmark --input-tokens 1024 --max-new-tokens 1024 --ipex-weight-only-quantization --weight-dtype INT8 --ipex --batch-size 1 --autotp --greedy --quant-with-amp --token-latency
+```
+
+- Notes
+
+(1) Due to the huge model size and the cache-based optimizations, it is recommended to use a server with 1.5 TB
+of memory or more. Memory consumption optimizations are in progress.
+
+(2) Please add the `--num_accelerators` and `--bind_core_list` arguments to the `deepspeed` command based on your SNC (sub-NUMA clustering) configuration.
+For example, for a server with 2 sockets and 128 physical cores per socket, split into 6 sub-NUMA clusters in total,
+it is recommended to set `--num_accelerators 6 --bind_core_list 0-41,43-84,86-127,128-169,171-212,214-255`.
+
+(3) The provided script is mainly for showcasing performance with the default input prompts.
+You can replace the prompts under the `deepseekr1` key in `prompt.json` with your own inputs.
+You can also modify the script to apply [the chat template](https://huggingface.co/docs/transformers/chat_templating)
+for higher-quality outputs.
+
+(4) You can increase the `--max-new-tokens` setting for longer outputs and add `--streaming` to stream outputs to the console.
+
+### 2.2.4 Additional configuration for specific models
 
 There are some model-specific requirements to be aware of, as follows:
 
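For the `config.json` edit shown in the diff above, the `quantization_config` block can also be added programmatically instead of by hand. A minimal sketch, assuming the checkpoint has already been downloaded to the placeholder path used in the README:

```python
import json
from pathlib import Path

# Placeholder: the directory passed to --local-dir when downloading the checkpoint.
config_path = Path("<DEEPSEEK_INT8_CKPT_SAVE_PATH>") / "config.json"

config = json.loads(config_path.read_text())

# Mirrors the quantization_config block from the README diff.
config["quantization_config"] = {
    "quant_method": "int8",
    "bits": 8,
    "group_size": -1,
}

config_path.write_text(json.dumps(config, indent=2) + "\n")
print(f"Updated {config_path}")
```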
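For note (2), the number of sub-NUMA clusters and their CPU ranges determine reasonable values for `--num_accelerators` and `--bind_core_list`. A small sketch for inspecting the layout, assuming `numactl` is installed on the host; the subprocess wrapper is only for illustration:

```python
import subprocess

# With SNC enabled, each sub-NUMA cluster is exposed as a separate NUMA node.
# Print the CPUs belonging to each node; these ranges can be used to build
# --bind_core_list, and the node count suggests --num_accelerators.
topology = subprocess.run(
    ["numactl", "--hardware"], capture_output=True, text=True, check=True
).stdout

for line in topology.splitlines():
    if line.startswith("node") and "cpus:" in line:
        print(line)
```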
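For note (3), applying the chat template means rendering the conversation through the tokenizer before generation. A minimal sketch of that idea using the `transformers` API; it is not taken from `run.py`, and the prompt text and checkpoint path are placeholders:

```python
from transformers import AutoTokenizer

# Placeholder: the downloaded DeepSeek-R1 INT8 checkpoint directory.
tokenizer = AutoTokenizer.from_pretrained(
    "<DEEPSEEK_INT8_CKPT_SAVE_PATH>", trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Explain the difference between INT8 and BF16 inference."},
]

# Render the conversation with the model's chat template; the resulting string
# can replace a raw prompt before it is tokenized and passed to generation.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```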
0 commit comments