In the train-infer iteration of a project, it is common practice to use OpenCV for data preprocessing on the CPU during training, while at deployment time it is desirable to move this work to the GPU to further improve the performance of the online service.

To align preprocessing between deployment and training, a natural idea is to incorporate torchpipe into the training pipeline. This is not easy, however, mainly because multi-GPU data handling is involved. In fact, there are currently few examples of integrating GPU preprocessing into a torchvision training pipeline. One reference is the [training API of Kornia](https://kornia.readthedocs.io/en/latest/get-started/training.html), but it is quite heavyweight and requires restructuring the entire workflow.
* For the performance gains this project brings, refer to the experiment below. Note that metrics such as QPS and RT (response time) depend heavily on the actual online environment and data, so these results should be taken as reference only.
* Test Configuration: ResNet50, input image size 1080x1080.
| machine | decode type | client | CPU utilization | QPS | avg RT | TP50 | TP99 | QPS improvement |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
* Because CPU and GPU data preprocessing are not perfectly aligned, model recognition accuracy may fluctuate, which has restricted the use of GPU decoding at inference time.
* For the negative impact of train-infer misalignment, and the improvement this project brings, refer to the consistency experiment below; the conclusions appear in the Analysis column of the table. The experiment uses an internal dataset and compares training with CPU decoding against training with GPU decoding, keeping all other hyperparameters identical.
| index | train type | model inference method | recall | Diff | precision | Diff | Analysis |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | cpu decode, all data train | cpu decode | 94.76% || 89.73% |||
| 2 | cpu decode, all data train | gpu decode | 95.09% | 0.33% | 90.35% | 0.62% | Comparing experiments 1 and 2 shows that decoding on the CPU during training but on the GPU during inference causes noticeable fluctuations in model performance, a risk to stability once the model is deployed. |
| 3 | cpu decode, 5% data train | cpu decode | 88.56% || 75.83% |||
| 4 | cpu decode, 5% data train | gpu decode | 91.14% | 2.58% | 79.86% | 4.03% | Experiments 3 and 4 further show that with little training data, where overfitting is likely, the CPU-train/GPU-infer pipeline causes even larger performance fluctuations, exacerbating instability. |
| 5 | gpu decode, all data train | gpu decode + torch model | 94.19% || 90.55% |||
| 6 | gpu decode, all data train | gpu decode + torchpipe fp32 | 94.19% || 90.59% |||
| 7 | gpu decode, all data train | gpu decode + torchpipe fp16 | 94.19% | 0 | 90.59% | 0.04% | Experiments 5, 6, and 7 show that using the GPU for both training and inference via torchpipe keeps results aligned (including under fp16) while further improving deployment performance. |
* This project uses the torchpipe acceleration framework to implement GPU preprocessing within the PyTorch training framework. In addition, torchpipe's multi-instance execution can effectively improve training efficiency.
* Embedding the torchpipe framework as a GPU preprocessing pipeline into a general PyTorch training framework for convenient use.
* Aligning GPU decoding between training and inference, further improving the concurrency and response time of online services.
* Implementing efficient training through thread pools, caching queues, and multi-GPU data distribution, providing a multi-process DataLoader and GPU load balancing.
* Supporting distributed training modes such as DP (DataParallel) and DDP (DistributedDataParallel).
* Supporting both CPU and GPU decoding, with the ratio controllable via a parameter, effectively enriching augmentation.
Reference code is provided in torchpipe's example/gpu_train directory: train_gpu.py, train_dp.py, and train_ddp.py.
- train_dp.py # refer to this file if you use DP for multi-GPU training
```
sh train_dp.sh
```
- train_ddp.py # refer to this file if you use DDP for multi-GPU training
```
sh train_ddp.sh
```
**To make it easier for everyone to use, you can follow these 8 steps to apply this method to your project:**
- TOML files are used to configure GPU decoding, resizing, and other preprocessing operations.
- Refer to `gpu_decode_train.toml` and `gpu_decode_val.toml` in the `toml` folder as examples; they configure data loading and preprocessing for training and validation respectively.
- If necessary, modify the corresponding operations in these files. A sketch of how the toml files might be wired into the wrapped loader follows.
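As a rough illustration of how the toml files plug in (the constructor arguments here are assumptions; check the `DataLoader` class in `gpu_train_tools.py`, mentioned at the end of this document, for the real signature):

```python
# hypothetical wiring; argument names are illustrative, not the actual API
from gpu_train_tools import DataLoader

wrap_train_loader = DataLoader(
    dataset=train_dataset,                   # your usual torch Dataset
    toml_path="toml/gpu_decode_train.toml",  # GPU decode/resize config for training
    batch_size=64,
)
wrap_val_loader = DataLoader(
    dataset=val_dataset,
    toml_path="toml/gpu_decode_val.toml",    # validation-time config
    batch_size=64,
)
```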
##### step 4: set gpu augment

Here we do not need to include the resize operation, since it is already set in the toml file for torchpipe. We only need to set the other operations; the only difference is that [ToTensor] is replaced with a custom operation called [TensorToTensor].
* `transforms.ColorJitter(0.05, 0.05, 0.05)`: only three of the four parameters are set, because the last one, `hue`, tends to slow down computation on a 1080Ti GPU; other GPUs should not have this issue. A fuller sketch of the GPU augment follows.
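A minimal sketch of the GPU augment, assuming `TensorToTensor` is importable from the example code (its actual import path and the operation order may differ) and using standard ImageNet normalization statistics purely for illustration:

```python
import torchvision.transforms as transforms

# TensorToTensor is the project's custom replacement for ToTensor;
# this import path is an assumption -- adjust to the example code's layout.
from gpu_train_tools import TensorToTensor

gpu_augment = transforms.Compose([
    TensorToTensor(),                          # replaces transforms.ToTensor()
    transforms.ColorJitter(0.05, 0.05, 0.05),  # hue omitted for speed (see note above)
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # illustrative ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])
```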
##### step 5: If you want to perform joint preprocessing on both GPU and CPU, you also need to set up the CPU augment. This follows the regular PyTorch approach for CPU preprocessing; a minimal sketch is shown below.
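For illustration only, an ordinary CPU-side augment (the specific operations are placeholders, not the project's required set):

```python
import torchvision.transforms as transforms

# a regular PyTorch-style CPU augment; the CPU path keeps the standard ToTensor
cpu_augment = transforms.Compose([
    transforms.Resize((224, 224)),      # assumed: the CPU path handles its own resize
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```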
##### step 7: Apply the same operations to the val dataset as to the train dataset.
##### step 8: Reset the iterator after each epoch.
```python
# at the end of each epoch, reset the wrapped loaders
# (only wrap_val_loader.reset() is visible in the original snippet;
#  the training loader is assumed to be reset the same way)
wrap_train_loader.reset()
wrap_val_loader.reset()
```
### How to perform local testing with a trained model?
Training and testing go hand in hand. We have already implemented GPU decoding with torchpipe and completed model training. Now, how can we test the trained model with GPU decoding? This mainly involves two questions:
1. How to implement GPU decoding and preprocessing during the testing phase.
2. How to perform model inference.
Here are two solutions for your reference. You can choose the specific approach based on your project requirements.
For detailed code examples, refer to `test_gpu.py` and `test_gpu.sh` in the example directory. First read the following explanation, then look at the code for a clearer understanding.
#### Solution 1 (Recommended): Use torchpipe for both decoding and model inference
This solution is suitable for relatively simple projects (such as those with 1 or 2 models) or for individuals who have a good understanding of TorchPipe and can utilize its capabilities to implement complex logic.
##### step 1: Convert the PyTorch model to an ONNX model. You can refer to the documentation here: [Converting PyTorch models to ONNX](../faq/onnx?_highlight=onnx).
* One important consideration in this step is whether to fold the "subtract mean, divide by variance" preprocessing into the model itself. If it is already incorporated, there is no need to perform it separately later. A minimal export sketch follows.
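A minimal export sketch (model name, input size, and opset are illustrative; see the linked documentation for the recommended settings):

```python
import torch

model.eval()
dummy = torch.randn(1, 3, 224, 224)  # adjust to your model's input size
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=11,  # assumed; pick per the linked docs
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # dynamic batch for serving
)
```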
##### step 2: toml example

* This version implements the basic decode-and-forward pipeline: it decodes the image, resizes it, converts the color space, and then passes it through the model to obtain the results.
```toml
batching_timeout = 1
# ... (remaining configuration omitted)
instance_num = 2
```
##### step 3: forward code example
```python
def init_decodeNode(self):
    config = torchpipe.parse_toml(self.toml_path)
    for key in config.keys():
        if key != 'global':
            # if no GPU is specified in the toml, specify it here
            # (the exact assignment below is an assumed reconstruction)
            config[key]["device_id"] = str(self.device_id)
    # assumed completion: build the torchpipe node from the parsed config
    return torchpipe.pipe(config)
```
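For context, a hedged sketch of how such a node is typically invoked (the key names follow torchpipe's public `TASK_DATA_KEY`/`TASK_RESULT_KEY` convention; the surrounding variable names are assumptions):

```python
import torchpipe

# feed raw JPEG bytes to the node; it decodes, preprocesses, and runs the model
with open("test.jpg", "rb") as f:
    raw = f.read()
inp = {torchpipe.TASK_DATA_KEY: raw}
decode_node(inp)                         # in-place call; results land in the dict
result = inp[torchpipe.TASK_RESULT_KEY]  # model output, since the toml includes the model
```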
#### Solution 2: Use torchpipe only for preprocessing, keeping the PyTorch model for inference

This method only requires modifying the preprocessing part of the original PyTorch code. No other changes are needed, and there is no need to convert the model to ONNX.
##### step 1: toml example (suggested to keep it consistent with the val toml)
* Performs GPU decoding, resize, and cvtColor.
* Returns a tensor of shape 1x3x224x224.
```toml
batching_timeout = 1
# ... (remaining configuration omitted)
instance_num = 8
```
##### step 2: infer code
```python
def init_decodeNode(self):
    config = torchpipe.parse_toml(self.toml_path)
    for key in config.keys():
        if key != 'global':
            # if no GPU is specified in the toml, specify it here
            # (the exact assignment below is an assumed reconstruction)
            config[key]["device_id"] = str(self.device_id)
    # assumed completion: build the torchpipe preprocessing node
    return torchpipe.pipe(config)
```
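A hedged sketch of the Solution 2 inference path, decoding with torchpipe and then running the regular PyTorch model (variable names and the dtype/normalization handling are assumptions to adapt):

```python
import torch
import torchpipe

# decode + preprocess on GPU via torchpipe, then forward with the PyTorch model
with open("test.jpg", "rb") as f:
    raw = f.read()
inp = {torchpipe.TASK_DATA_KEY: raw}
decode_node(inp)
tensor = inp[torchpipe.TASK_RESULT_KEY]   # 1x3x224x224 tensor, per the toml above
with torch.no_grad():
    out = model(tensor.float())           # dtype/normalization may need adjusting
```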
The core implementation code of this project is mainly the DataLoader class in gpu_train_tools.py, which is further encapsulated based on PyTorch. If you want to add functionality to your existing training framework or further explore, you can refer to this class for modifications.
**During implementation there may be cases we have not considered. If you encounter any bugs, please contact the authors (WangLichun, LinYuxing, ZhangShiyang) for help resolving them.**