 
 We support a long list of LLMs, including the most notable open-source models
 like the Llama series, Qwen series, Phi-3/Phi-4 series,
-and the phenomenal high-quality reasoning model DeepSeek-R1.
+and the phenomenal high-quality reasoning model [DeepSeek-R1](#223-deepseek-r1-671b).
 
 ## 1.1 Verified for single instance mode
 
@@ -472,7 +472,59 @@ huggingface-cli download hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 --l
 deepspeed --bind_cores_to_rank run.py --benchmark -m ./Llama-3.1-8B-GPTQ --ipex --ipex-weight-only-quantization --weight-dtype INT4 --lowp-mode BF16 --quant-with-amp --autotp --output-dir "saved_results"
 ```
 
-### 2.2.3 Additional configuration for specific models
+### 2.2.3 DeepSeek-R1 671B
+
+IPEX applies dedicated optimizations to the full version of the `DeepSeek-R1` model,
+which can now be showcased with the `run.py` script!
+
+- Currently, weight-only quantization with INT8 precision is supported.
+Please download the INT8 quantized version from [HuggingFace Models](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8).
+
+```bash
+huggingface-cli download --resume-download meituan/DeepSeek-R1-Channel-INT8 --local-dir <DEEPSEEK_INT8_CKPT_SAVE_PATH>
+```
+
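+If you prefer to script the download, here is a minimal sketch using the `huggingface_hub`
+Python API (the package that ships the CLI above; the target directory is the same
+placeholder as in the command):
+
+```python
+from huggingface_hub import snapshot_download
+
+# Download the INT8 checkpoint into a local directory; re-running the call
+# skips files that have already been fetched completely.
+snapshot_download(
+    repo_id="meituan/DeepSeek-R1-Channel-INT8",
+    local_dir="<DEEPSEEK_INT8_CKPT_SAVE_PATH>",
+)
+```
+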
+- A change is required in the `config.json` file in the downloaded checkpoint directory in order to apply the optimizations.
+Please add the `quantization_config` field at the end of the file as shown below.
+
+```diff
+ "transformers_version": "4.46.3",
+ "use_cache": true,
+ "v_head_dim": 128,
+- "vocab_size": 129280
++ "vocab_size": 129280,
++ "quantization_config": {
++ "quant_method": "int8",
++ "bits": 8,
++ "group_size": -1
++ }
+ }
+```
+
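+Alternatively, the same edit can be scripted. A minimal standard-library sketch
+(the checkpoint path placeholder is the one used above):
+
+```python
+import json
+import os
+
+config_path = os.path.join("<DEEPSEEK_INT8_CKPT_SAVE_PATH>", "config.json")
+
+with open(config_path) as f:
+    config = json.load(f)
+
+# Same field and values as in the diff above.
+config["quantization_config"] = {
+    "quant_method": "int8",
+    "bits": 8,
+    "group_size": -1,
+}
+
+with open(config_path, "w") as f:
+    json.dump(config, f, indent=2)
+```
+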
+- Use the following command to run the test.
+
+```bash
+# at examples/cpu/llm/inference
+deepspeed --bind_cores_to_rank run.py -m <DEEPSEEK_INT8_CKPT_SAVE_PATH> --benchmark --input-tokens 1024 --max-new-tokens 1024 --ipex-weight-only-quantization --weight-dtype INT8 --ipex --batch-size 1 --autotp --greedy --quant-with-amp --token-latency
+```
+
+- Notes
+
+(1) Given the huge model size and the cache-based optimizations, a server with 1.5TB
+of memory or more is recommended. Optimizations for memory consumption are in progress.
+
+(2) Please add the `--num_accelerators` and `--bind_core_list` arguments to the `deepspeed` command based on your sub-NUMA clustering (SNC) configuration.
+For example, for a server with 2 sockets and 128 physical cores per socket, split into 6 sub-NUMA clusters in total,
+it is recommended to set `--num_accelerators 6 --bind_core_list 0-41,43-84,86-127,128-169,171-212,214-255`,
+as derived in the sketch below.
+
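+The following sketch shows one way to derive such a binding: trim every cluster
+to the size of the smallest one so that each DeepSpeed rank is bound to an equal
+number of cores. The CPU ranges are assumptions matching the 6-cluster example
+above; check yours with `numactl -H`:
+
+```python
+# Sub-NUMA cluster CPU ranges as (first_core, last_core), e.g. from `numactl -H`.
+clusters = [(0, 42), (43, 85), (86, 127), (128, 170), (171, 213), (214, 255)]
+
+# Bind one rank per cluster, each to the same number of cores.
+cores_per_rank = min(last - first + 1 for first, last in clusters)
+bind_core_list = ",".join(
+    f"{first}-{first + cores_per_rank - 1}" for first, _ in clusters
+)
+print(f"--num_accelerators {len(clusters)} --bind_core_list {bind_core_list}")
+# --num_accelerators 6 --bind_core_list 0-41,43-84,86-127,128-169,171-212,214-255
+```
+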
+(3) The provided script is mainly for showcasing performance with the default input prompts.
+You can replace the prompts under the `deepseekr1` key in `prompt.json` with your own inputs.
+You can also modify the script to apply [the chat template](https://huggingface.co/docs/transformers/chat_templating),
+which yields higher-quality outputs; see the sketch below.
+
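+As an illustration, a minimal sketch that renders a conversation with the chat
+template via the standard `transformers` tokenizer API (the message content is
+just an example):
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("<DEEPSEEK_INT8_CKPT_SAVE_PATH>")
+
+messages = [
+    {"role": "user", "content": "Explain tensor parallelism in two sentences."},
+]
+
+# Render the conversation into a single prompt string that can be used
+# in place of a raw prompt.
+prompt = tokenizer.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+print(prompt)
+```
+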
+(4) You can enlarge the `--max-new-tokens` setting for longer outputs and add `--streaming` to stream outputs to the console.
+
+### 2.2.4 Additional configuration for specific models
 
 There are some model-specific requirements to be aware of, as follows:
 