
Commit 1309995

Update TensorRT-LLM backend (#201)
* Update TensorRT-LLM backend
1 parent 171ed05 commit 1309995

18 files changed (+829, -91 lines)

README.md (+8)

@@ -200,8 +200,16 @@ The following table shows the fields that need to be modified before deployment:
 | Name | Description
 | :----------------------: | :-----------------------------: |
 | `decoupled` | Controls streaming. Decoupled mode must be set to `True` if using the streaming option from the client. |
+| `max_beam_width` | The maximum beam width that any request may ask for when using beam search |
 | `gpt_model_type` | Set to `inflight_fused_batching` when enabling in-flight batching support. To disable in-flight batching, set to `V1` |
 | `gpt_model_path` | Path to the TensorRT-LLM engines for deployment. In this example, the path should be set to `/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1` as the tensorrtllm_backend directory will be mounted to `/tensorrtllm_backend` within the container |
+| `max_tokens_in_paged_kv_cache` | The maximum size of the KV cache in number of tokens |
+| `max_attention_window_size` | When using techniques like sliding window attention, the maximum number of tokens that are attended to when generating one token. Defaults to the maximum sequence length |
+| `batch_scheduler_policy` | Set to `max_utilization` to greedily pack as many requests as possible into each in-flight batching iteration. This maximizes throughput but may incur overhead from request pause/resume if KV cache limits are reached during execution. Set to `guaranteed_no_evict` to guarantee that a started request is never paused. |
+| `kv_cache_free_gpu_mem_fraction` | Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for the KV cache |
+| `max_num_sequences` | Maximum number of sequences that the in-flight batching scheme can maintain state for. Defaults to `max_batch_size` if `enable_trt_overlap` is `false` and to `2 * max_batch_size` if `enable_trt_overlap` is `true`, where `max_batch_size` is the TRT engine maximum batch size |
+| `enable_trt_overlap` | Set to `true` to partition available requests into 2 'microbatches' that can be run concurrently to hide exposed CPU runtime |
+| `exclude_input_in_output` | Set to `true` to only return completion tokens in a response. Set to `false` to return the prompt tokens concatenated with the generated tokens |
 
 *triton_model_repo/postprocessing/config.pbtxt*
 
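In practice these fields are filled with the repository's `tools/fill_template.py` helper rather than edited by hand. A minimal sketch, assuming the model repository has been copied to `triton_model_repo/` and the engines live under `/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1` (both paths and all values are illustrative; the keys must match the `${...}` placeholders in your `config.pbtxt`):

```bash
# Illustrative values only; size the KV-cache and batching knobs for your engine and GPU.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,max_beam_width:1,engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
```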

all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt (+15, -2)

@@ -55,6 +55,13 @@ input [
     data_type: TYPE_INT32
     dims: [ 1 ]
   },
+  {
+    name: "draft_input_ids"
+    data_type: TYPE_INT32
+    dims: [ -1 ]
+    optional: true
+    allow_ragged_batch: true
+  },
   {
     name: "end_id"
     data_type: TYPE_INT32
@@ -246,9 +253,9 @@ parameters: {
   }
 }
 parameters: {
-  key: "max_kv_cache_length"
+  key: "max_attention_window_size"
   value: {
-    string_value: "${max_kv_cache_length}"
+    string_value: "${max_attention_window_size}"
   }
 }
 parameters: {
@@ -281,3 +288,9 @@ parameters: {
     string_value: "${exclude_input_in_output}"
   }
 }
+parameters: {
+  key: "use_context_fmha_for_generation"
+  value: {
+    string_value: "${use_context_fmha_for_generation}"
+  }
+}

dockerfile/Dockerfile.trt_llm_backend (-1)

@@ -58,7 +58,6 @@ FROM trt_llm_backend_builder as final
 WORKDIR /app/
 COPY --from=trt_llm_builder /app/tensorrt_llm/build /app/tensorrt_llm/build
 RUN cd /app/tensorrt_llm/build && pip3 install *.whl
-RUN rm -rf /app/tensorrt_llm
 
 # Install TensorRT-LLM backend
 RUN mkdir /opt/tritonserver/backends/tensorrtllm
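With the `rm -rf /app/tensorrt_llm` step removed, the TensorRT-LLM build tree now stays in the final image. Rebuilding the image itself is unchanged; a sketch, run from the repository root (the image tag is just an example name):

```bash
# Build the Triton + TensorRT-LLM backend image; "triton_trt_llm" is an arbitrary example tag.
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
```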

docs/baichuan.md (+2, -2)

@@ -44,7 +44,7 @@ python3 tools/fill_template.py -i baichuan_ifb/preprocessing/config.pbtxt tokeni
 python3 tools/fill_template.py -i baichuan_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_BAICHUAN_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
 python3 tools/fill_template.py -i baichuan_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
 python3 tools/fill_template.py -i baichuan_ifb/ensemble/config.pbtxt triton_max_batch_size:64
-python3 tools/fill_template.py -i baichuan_ifb/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:/tmp/baichuan/13B/trt_engines/fp16/1-gpu/,max_tokens_in_paged_kv_cache:2560,max_kv_cache_length:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
+python3 tools/fill_template.py -i baichuan_ifb/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:/tmp/baichuan/13B/trt_engines/fp16/1-gpu/,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
 ```
 
 * Launch server
@@ -178,7 +178,7 @@ python3 tools/fill_template.py -i baichuan_ifb/preprocessing/config.pbtxt tokeni
 python3 tools/fill_template.py -i baichuan_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_BAICHUAN_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
 python3 tools/fill_template.py -i baichuan_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,bls_instance_count:1,accumulate_tokens:True
 python3 tools/fill_template.py -i baichuan_ifb/ensemble/config.pbtxt triton_max_batch_size:64
-python3 tools/fill_template.py -i baichuan_ifb/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,max_beam_width:1,engine_dir:/tmp/baichuan/13B/trt_engines/fp16/1-gpu/,max_tokens_in_paged_kv_cache:2560,max_kv_cache_length:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
+python3 tools/fill_template.py -i baichuan_ifb/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,max_beam_width:1,engine_dir:/tmp/baichuan/13B/trt_engines/fp16/1-gpu/,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
 
 pip install SentencePiece
 # Please add `trust_remote_code=True` to the tokenizer in preprocessing and postprocessing. For security reasons, we don't add it by default.

docs/llama.md (+2, -2)

@@ -25,7 +25,7 @@ python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt tokenizer
 python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:llama,triton_max_batch_size:64,postprocessing_instance_count:1
 python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
 python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt triton_max_batch_size:64
-python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:/tmp/llama/7B/trt_engines/fp16/1-gpu/,max_tokens_in_paged_kv_cache:2560,max_kv_cache_length:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
+python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:/tmp/llama/7B/trt_engines/fp16/1-gpu/,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
 ```
 
 * Launch server
@@ -114,7 +114,7 @@ python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt tokenizer
 python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:llama,triton_max_batch_size:64,postprocessing_instance_count:1
 python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,bls_instance_count:1,accumulate_tokens:True
 python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt triton_max_batch_size:64
-python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,max_beam_width:1,engine_dir:/tmp/llama/7B/trt_engines/fp16/1-gpu/,max_tokens_in_paged_kv_cache:2560,max_kv_cache_length:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
+python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:True,max_beam_width:1,engine_dir:/tmp/llama/7B/trt_engines/fp16/1-gpu/,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
 
 pip install SentencePiece
 python3 scripts/launch_triton_server.py --world_size 1 --model_repo=llama_ifb/
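The `max_kv_cache_length` → `max_attention_window_size` rename also affects any model repositories that were already filled with the old key. A one-off migration could look like the following sketch (the `llama_ifb/` path is illustrative; point it at whatever repository holds your filled `config.pbtxt` files):

```bash
# Rename the old parameter key to the new one in already-filled configs (GNU sed syntax).
grep -rl 'max_kv_cache_length' llama_ifb/ | xargs sed -i 's/max_kv_cache_length/max_attention_window_size/g'
```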

inflight_batcher_llm/client/__init__.py

Whitespace-only changes.
