| Name | Description |
| :--- | :--- |
| `decoupled` | Controls streaming. Decoupled mode must be set to `True` if the client uses the streaming option. |
| `max_beam_width` | The maximum beam width that any request may ask for when using beam search. |
| `gpt_model_type` | Set to `inflight_fused_batching` to enable in-flight batching support. Set to `V1` to disable in-flight batching. |
| `gpt_model_path` | Path to the TensorRT-LLM engines for deployment. In this example, the path should be set to `/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1`, as the `tensorrtllm_backend` directory will be mounted to `/tensorrtllm_backend` within the container. |
| `max_tokens_in_paged_kv_cache` | The maximum size of the KV cache, in number of tokens. |
| `max_attention_window_size` | When using techniques like sliding window attention, the maximum number of tokens attended to when generating one token. Defaults to the maximum sequence length. |
| `batch_scheduler_policy` | Set to `max_utilization` to greedily pack as many requests as possible into each in-flight batching iteration. This maximizes throughput but may incur overhead from request pause/resume if KV cache limits are reached during execution. Set to `guaranteed_no_evict` to guarantee that a started request is never paused. |
| `kv_cache_free_gpu_mem_fraction` | A number between 0 and 1 indicating the maximum fraction of GPU memory (after loading the model) that may be used for the KV cache. |
| `max_num_sequences` | The maximum number of sequences that the in-flight batching scheme can maintain state for. Defaults to `max_batch_size` if `enable_trt_overlap` is `false` and to `2 * max_batch_size` if `enable_trt_overlap` is `true`, where `max_batch_size` is the TRT engine maximum batch size. |
| `enable_trt_overlap` | Set to `true` to partition available requests into two 'microbatches' that can be run concurrently, hiding exposed CPU runtime. |
| `exclude_input_in_output` | Set to `true` to return only completion tokens in a response. Set to `false` to return the prompt tokens concatenated with the generated tokens. |
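
These settings correspond to entries in the `tensorrt_llm` model's `config.pbtxt`. Below is a minimal, hypothetical sketch of how a few of them might look once filled in: the parameter names come from the table above, while the specific values chosen here (`guaranteed_no_evict`, `0.9`, and so on) are illustrative assumptions, not the exact template shipped with the backend.

```
# Hypothetical excerpt from triton_model_repo/tensorrt_llm/config.pbtxt.
# Values mirror the example deployment described in the table above.

# `decoupled` must be true when clients use the streaming option.
model_transaction_policy {
  decoupled: true
}

# Backend-specific settings are passed as string-valued parameters.
parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}
parameters: {
  key: "gpt_model_path"
  value: { string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1" }
}
parameters: {
  key: "batch_scheduler_policy"
  value: { string_value: "guaranteed_no_evict" }  # never pause a started request
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: { string_value: "0.9" }  # up to 90% of free GPU memory for KV cache
}
parameters: {
  key: "exclude_input_in_output"
  value: { string_value: "true" }  # return only the generated tokens
}
```

In practice these values are usually substituted into the template programmatically (the repository ships a `fill_template.py` tool for this purpose) rather than edited by hand.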