In the train-infer iteration of a project, it is common practice to use OpenCV for data preprocessing on the CPU during training, while at deployment time it is desirable to move this work to the GPU to further improve the performance of the online service.

To align preprocessing between deployment and training, a natural idea is to incorporate torchpipe into the training pipeline. This is not easy, however, mainly because multi-GPU data handling is involved. In fact, there are currently few examples of integrating GPU preprocessing into a torchvision training pipeline. One reference is the [training API of Kornia](https://kornia.readthedocs.io/en/latest/get-started/training.html), but it is quite heavyweight and requires restructuring the entire workflow.
* For the performance gains this project brings, refer to the experiment below. Note that metrics such as QPS and RT (response time) depend heavily on the actual online environment and data, so these results should be taken as reference only.
* Test Configuration: ResNet50, input image size 1080x1080.
| machine | decode type | client | CPU utilization | QPS | avg RT | TP50 | TP99 | QPS improvement |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
* Because CPU and GPU data preprocessing are not perfectly aligned, model recognition accuracy may fluctuate, which has restricted the use of GPU decoding at inference time.
* For the negative impact of train-infer misalignment, and the improvement this project brings, refer to the consistency experiment below; the conclusions appear in the Analysis column of the table. The experiment uses an internal dataset and compares training with CPU decoding against training with GPU decoding, keeping all other hyperparameters identical.
| index | train type | model inference method | recall | Diff | precision | Diff | Analysis |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | cpu decode, all data train | cpu decode | 94.76% || 89.73% |||
| 2 | cpu decode, all data train | gpu decode | 95.09% | 0.33% | 90.35% | 0.62% | Comparing experiments 1 and 2 shows that decoding on the CPU during training but on the GPU during inference causes noticeable fluctuations in model performance, a risk to stability once the model is deployed. |
| 3 | cpu decode, 5% data train | cpu decode | 88.56% || 75.83% |||
| 4 | cpu decode, 5% data train | gpu decode | 91.14% | 2.58% | 79.86% | 4.03% | Experiments 3 and 4 further show that with little training data, where overfitting is likely, the CPU-train/GPU-infer pipeline causes even larger performance fluctuations, exacerbating instability. |
| 5 | gpu decode, all data train | gpu decode + torch model | 94.19% || 90.55% |||
| 6 | gpu decode, all data train | gpu decode + torchpipe fp32 | 94.19% || 90.59% |||
| 7 | gpu decode, all data train | gpu decode + torchpipe fp16 | 94.19% | 0 | 90.59% | 0.04% | Experiments 5, 6, and 7 show that using the GPU for both training and inference via torchpipe keeps results aligned (including under fp16) while further improving deployment performance. |
* This project uses the torchpipe acceleration framework to implement GPU preprocessing within the PyTorch training framework. In addition, torchpipe's multi-instance execution can effectively improve training efficiency.
* Embedding the torchpipe framework as a GPU preprocessing pipeline into a general PyTorch training framework for convenient use.
* Aligning GPU decoding between training and inference, further improving the concurrency and response time of online services.
* Implementing efficient training through thread pools, caching queues, and multi-GPU data distribution, providing a multi-process DataLoader and GPU load balancing.
* Supporting distributed training modes such as DP (DataParallel) and DDP (DistributedDataParallel).
* Supporting both CPU and GPU decoding, with the ratio controllable via a parameter, effectively enriching augmentation.
Reference code is provided in torchpipe's example/gpu_train directory: train_gpu.py, train_dp.py, and train_ddp.py.
- train_dp.py # refer to this file if you use DP for multi-GPU training
```
sh train_dp.sh
```
- train_ddp.py # refer to this file if you use DDP for multi-GPU training
```
sh train_ddp.sh
```
**To make it easier for everyone to use, you can follow these 8 steps to apply this method to your project:**
- TOML files are used to configure GPU decoding, resizing, and other preprocessing operations.
- Refer to `gpu_decode_train.toml` and `gpu_decode_val.toml` in the `toml` folder as examples; they configure data loading and preprocessing for training and validation respectively.
- If necessary, modify the corresponding operations in these files. A sketch of how the toml files might be wired into the wrapped loader follows.
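As a rough illustration of how the toml files plug in (the constructor arguments here are assumptions; check the `DataLoader` class in `gpu_train_tools.py`, mentioned at the end of this document, for the real signature):

```python
# hypothetical wiring; argument names are illustrative, not the actual API
from gpu_train_tools import DataLoader

wrap_train_loader = DataLoader(
    dataset=train_dataset,                   # your usual torch Dataset
    toml_path="toml/gpu_decode_train.toml",  # GPU decode/resize config for training
    batch_size=64,
)
wrap_val_loader = DataLoader(
    dataset=val_dataset,
    toml_path="toml/gpu_decode_val.toml",    # validation-time config
    batch_size=64,
)
```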
##### step 4: set gpu augment

Here we do not need to include the resize operation, since it is already set in the toml file for torchpipe. We only need to set the other operations; the only difference is that [ToTensor] is replaced with a custom operation called [TensorToTensor].
* `transforms.ColorJitter(0.05, 0.05, 0.05)`: only three of the four parameters are set, because the last one, `hue`, tends to slow down computation on a 1080Ti GPU; other GPUs should not have this issue. A fuller sketch of the GPU augment follows.
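A minimal sketch of the GPU augment, assuming `TensorToTensor` is importable from the example code (its actual import path and the operation order may differ) and using standard ImageNet normalization statistics purely for illustration:

```python
import torchvision.transforms as transforms

# TensorToTensor is the project's custom replacement for ToTensor;
# this import path is an assumption -- adjust to the example code's layout.
from gpu_train_tools import TensorToTensor

gpu_augment = transforms.Compose([
    TensorToTensor(),                          # replaces transforms.ToTensor()
    transforms.ColorJitter(0.05, 0.05, 0.05),  # hue omitted for speed (see note above)
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # illustrative ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])
```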
##### step 5: If you want to perform joint preprocessing on both GPU and CPU, you also need to set up the CPU augment. This follows the regular PyTorch approach for CPU preprocessing; a minimal sketch is shown below.
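For illustration only, an ordinary CPU-side augment (the specific operations are placeholders, not the project's required set):

```python
import torchvision.transforms as transforms

# a regular PyTorch-style CPU augment; the CPU path keeps the standard ToTensor
cpu_augment = transforms.Compose([
    transforms.Resize((224, 224)),      # assumed: the CPU path handles its own resize
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```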
##### step 7: Apply the same operations to the val dataset as to the train dataset.
##### step 8: Reset the iterator after each epoch.
```python
# at the end of each epoch, reset the wrapped loaders
# (only wrap_val_loader.reset() is visible in the original snippet;
#  the training loader is assumed to be reset the same way)
wrap_train_loader.reset()
wrap_val_loader.reset()
```
### How to perform local testing with a trained model?
Training and testing go hand in hand. We have already implemented GPU decoding with torchpipe and completed model training. Now, how can we test the trained model with GPU decoding? This mainly involves two questions:
1. How to implement GPU decoding and preprocessing during the testing phase.
2. How to perform model inference.
Here are two solutions for your reference. You can choose the specific approach based on your project requirements.
For detailed code examples, refer to `test_gpu.py` and `test_gpu.sh` in the example directory. First read the following explanation, then look at the code for a clearer understanding.
#### Solution 1 (Recommended): Use torchpipe for both decoding and model inference
This solution is suitable for relatively simple projects (such as those with 1 or 2 models) or for individuals who have a good understanding of TorchPipe and can utilize its capabilities to implement complex logic.
##### step 1: Convert the PyTorch model to an ONNX model. You can refer to the documentation here: [Converting PyTorch models to ONNX](../faq/onnx?_highlight=onnx).
* One important consideration in this step is whether to fold the "subtract mean, divide by variance" preprocessing into the model itself. If it is already incorporated, there is no need to perform it separately later. A minimal export sketch follows.
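A minimal export sketch (model name, input size, and opset are illustrative; see the linked documentation for the recommended settings):

```python
import torch

model.eval()
dummy = torch.randn(1, 3, 224, 224)  # adjust to your model's input size
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=11,  # assumed; pick per the linked docs
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # dynamic batch for serving
)
```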
##### step 2: toml example

* This version implements the basic decode-and-forward pipeline: it decodes the image, resizes it, converts the color space, and then passes it through the model to obtain the results.
```toml
batching_timeout = 1
# ... (remaining configuration omitted)
instance_num = 2
```
##### step 3: forward code example
```python
def init_decodeNode(self):
    config = torchpipe.parse_toml(self.toml_path)
    for key in config.keys():
        if key != 'global':
            # if no GPU is specified in the toml, specify it here
            # (the exact assignment below is an assumed reconstruction)
            config[key]["device_id"] = str(self.device_id)
    # assumed completion: build the torchpipe node from the parsed config
    return torchpipe.pipe(config)
```
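For context, a hedged sketch of how such a node is typically invoked (the key names follow torchpipe's public `TASK_DATA_KEY`/`TASK_RESULT_KEY` convention; the surrounding variable names are assumptions):

```python
import torchpipe

# feed raw JPEG bytes to the node; it decodes, preprocesses, and runs the model
with open("test.jpg", "rb") as f:
    raw = f.read()
inp = {torchpipe.TASK_DATA_KEY: raw}
decode_node(inp)                         # in-place call; results land in the dict
result = inp[torchpipe.TASK_RESULT_KEY]  # model output, since the toml includes the model
```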
#### Solution 2: Use torchpipe only for preprocessing, keeping the PyTorch model for inference

This method only requires modifying the preprocessing part of the original PyTorch code. No other changes are needed, and there is no need to convert the model to ONNX.
##### step 1: toml example (suggested to keep it consistent with the val toml)
* Performs GPU decoding, resize, and cvtColor.
* Returns a tensor of shape 1x3x224x224.
```toml
batching_timeout = 1
# ... (remaining configuration omitted)
instance_num = 8
```
##### step 2: infer code
```python
def init_decodeNode(self):
    config = torchpipe.parse_toml(self.toml_path)
    for key in config.keys():
        if key != 'global':
            # if no GPU is specified in the toml, specify it here
            # (the exact assignment below is an assumed reconstruction)
            config[key]["device_id"] = str(self.device_id)
    # assumed completion: build the torchpipe preprocessing node
    return torchpipe.pipe(config)
```
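A hedged sketch of the Solution 2 inference path, decoding with torchpipe and then running the regular PyTorch model (variable names and the dtype/normalization handling are assumptions to adapt):

```python
import torch
import torchpipe

# decode + preprocess on GPU via torchpipe, then forward with the PyTorch model
with open("test.jpg", "rb") as f:
    raw = f.read()
inp = {torchpipe.TASK_DATA_KEY: raw}
decode_node(inp)
tensor = inp[torchpipe.TASK_RESULT_KEY]   # 1x3x224x224 tensor, per the toml above
with torch.no_grad():
    out = model(tensor.float())           # dtype/normalization may need adjusting
```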
The core implementation code of this project is mainly the DataLoader class in gpu_train_tools.py, which is further encapsulated based on PyTorch. If you want to add functionality to your existing training framework or further explore, you can refer to this class for modifications.
**During implementation there may be cases we have not considered. If you encounter any bugs, please contact the authors (WangLichun, LinYuxing, ZhangShiyang) for help resolving them.**