Training YoloV4 for Inference on Nvidia DLAs

Mostafa Mohsen
6 min read · Sep 1, 2020

High-performing deep learning models seem to be coming out every day now. With AI-powered robotics becoming more accessible to hobbyists, thanks to companies like Nvidia and their Jetson lineup, it is becoming feasible to make use of the cutting-edge models published by top researchers from your own basement.

In this post, I’ll be taking you through the process of training a model (not the emphasis), exporting it, and generating an inference engine to run it on a Deep Learning Accelerator (DLA), perhaps to perform some task on your robot. DLAs are tuned specifically for inference on convolutional neural nets, so your CNN models can run on the DLAs while other vision-related or machine learning tasks run concurrently on the GPU. The DLAs I am using are the ones on the Nvidia Jetson Xavier; Nvidia includes DLAs on all of its Xavier products.

Training

This is not the emphasis of this post. Perhaps you already have a trained model you need to generate an inference engine from. But here is one way to go about this.

I am using Dat Tran’s raccoon dataset posted on roboflow.ai. In addition to datasets, Roboflow provides data preprocessing and augmentation services. They also provide Google Colab notebooks for training, all for free. I’ll be using their PyTorch YoloV4 Colab notebook; I’ve had to make some changes to get it working, which I’ll outline here.

Firstly, set up your environment by cloning the YoloV4 repo, installing the requirements, and downloading the pretrained weights.

Next, upload your dataset to Colab. Make sure the file names do not have spaces in them, or implement your own get_image_id method in dataset.py. Now you’ll need to change the cfg.py file according to your specifications (a sketch of these edits follows the list):

  • Change the use_darknet_cfg parameter to False if you’re using the PyTorch-converted weights
  • You probably want to change the batch size to 4 if you are training on Colab
  • Change the train_label and val_label parameters to point to the correct places
  • Change height and width according to your dataset
  • Finally, go to the train.py file on line 135 and change the image size there as well
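
For reference, here is roughly what the edited section of cfg.py might look like. I’m assuming the EasyDict-style Cfg object and attribute names from the pytorch-YOLOv4 repo the notebook clones (Cfg.batch in particular), and the paths and values below are only examples:

# cfg.py (excerpt); values are examples only, adjust to your dataset
Cfg.use_darknet_cfg = False   # use the PyTorch-converted weights
Cfg.batch = 4                 # smaller batch size for Colab
Cfg.train_label = '/content/pytorch-YOLOv4/data/train.txt'
Cfg.val_label = '/content/pytorch-YOLOv4/data/val.txt'
Cfg.width = 608               # match your dataset (and train.py line 135)
Cfg.height = 608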

To save some space, you can also edit the training loop to only save the weights every X epochs instead of every epoch. You can do this by adding an extra if-statement before the weights are saved (the first line of the following snippet).
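
The snippet below is only a sketch: the exact save call and variable names in train.py may differ, but the idea is to wrap the existing torch.save call in a modulo check (here, every 10 epochs).

# inside the training loop of train.py
if epoch % 10 == 0:  # the added line: only save every 10 epochs
    torch.save(model.state_dict(),
               os.path.join(config.checkpoints, f'Yolov4_epoch{epoch + 1}.pth'))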

Now run the train file; the Colab notebook has an explanation of the arguments you can use.

If you run into trouble with the train/validation paths, go to train.py on lines 247 and 248 and try replacing config.train_label and config.val_label with the full paths directly.
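
As a rough sketch, with the Yolo_dataset constructor name taken from the repo’s dataset.py (it may differ in your copy) and placeholder Colab paths:

# train.py, around lines 247-248
train_dataset = Yolo_dataset('/content/pytorch-YOLOv4/data/train.txt', config, train=True)
val_dataset = Yolo_dataset('/content/pytorch-YOLOv4/data/val.txt', config, train=False)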

Converting PyTorch Model to ONNX

This consists of two steps: building the model from the weights file, and then exporting it as ONNX. Loading the model is pretty straightforward in PyTorch. Make sure you also have the Python file containing the model classes; you need it to reconstruct the model (with your trained weights) from the weights (.pth) file.
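
A minimal sketch of the loading step, assuming the Yolov4 class from the repo’s models.py, a single raccoon class, and a placeholder checkpoint name; adjust the class name, constructor arguments, and path to your setup:

import torch
from models import Yolov4  # the model classes shipped with the repo

# rebuild the architecture, then load the trained weights into it
model = Yolov4(n_classes=1, inference=True)
state_dict = torch.load('Yolov4_epoch300.pth', map_location=torch.device('cpu'))
model.load_state_dict(state_dict)
model.eval()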

Now we actually perform the export. The important line to understand is the torch.onnx.export method and its parameters (a full call is sketched after the list):

  • model the trained model
  • inputs model input (or a tuple for multiple inputs)
  • onnx_name where to save the model
  • export_params store the trained parameter weights in the model file
  • opset_version the version of ONNX to export to
  • do_constant_folding whether to execute constant folding for optimization
  • input_names model’s input names
  • output_names model’s output names
  • dynamic_axes variable-length axes
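
Putting it together, the export call might look like the sketch below. The input size, file names, and output names are placeholders that depend on how you trained and defined your model:

# dummy input with the shape the engine will see: (batch, channels, height, width)
inputs = torch.randn(1, 3, 608, 608)
onnx_name = 'yolov4_raccoon.onnx'

torch.onnx.export(model,
                  inputs,
                  onnx_name,
                  export_params=True,
                  opset_version=11,
                  do_constant_folding=True,
                  input_names=['input'],
                  output_names=['boxes', 'confs'],
                  dynamic_axes=None)  # or e.g. {'input': {0: 'batch_size'}}

I export with a fixed batch here; if you use a dynamic batch axis, trtexec will need explicit shape arguments (--minShapes/--optShapes/--maxShapes) when building the engine.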

Create TensorRT Engine from ONNX Model

First you need to have TensorRT installed on your machine. If you are working on a Jetson, it comes pre-installed with the JetPack SDK. If not, follow the installation instructions here.

Building trtexec — Command Line Program

The trtexec tool is what we will use to convert the .onnx file into an inference engine, which will have a .trt file extension.

trtexec has two main functions: benchmarking networks on random data, and building engines for inference on DLAs and GPUs. The ability to run models on DLAs frees up your GPU, which can then be used for vision-related tasks that are not deep learning (e.g. optical flow). Beyond that, trtexec lets us load generated engines, test them with different parameters (like the number of streams), and benchmark them against other engines. We can also pull engines into programs that already run inference and, with multiple execution contexts, perform parallel inferencing.
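
As a sketch of that last point, this is roughly how an engine built by trtexec can be loaded from Python with the TensorRT API. The engine file name is the one generated later in this post, and input/output buffer allocation (e.g. with PyCUDA) is omitted:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# deserialize an engine previously built with trtexec
with open('Yolov4_DLA1_int8.trt', 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# each execution context keeps its own state, so several contexts
# can serve inference requests in parallel on the same engine
context_a = engine.create_execution_context()
context_b = engine.create_execution_context()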

To build it, go to the trtexec sample folder in TensorRT and build the tool

$ cd /usr/src/tensorrt/samples/trtexec 
$ sudo make

Compilation will prompt you to copy all the Python example files to TensorRT’s bin folder; go ahead and do that.

$ cp *.py ../../bin/

To make sure it’s installed

$ cd ../../bin
$ ./trtexec -h

You should see a list of optional arguments. Yay, done! Now for the fun part: let’s generate an inference engine.

Generating Inferencing Engines

Here’s an example

$ /usr/src/tensorrt/bin/trtexec --onnx=test1.onnx --explicitBatch --saveEngine=Yolov4_DLA1_int8.trt --useDLACore=1 --workspace=1000 --int8 --allowGPUFallback

  • --onnx= path of the ONNX file
  • --explicitBatch use explicit batch sizes when building the engine (default = implicit)
  • --saveEngine= path of the .trt file, i.e. where to save the engine
  • --useDLACore= which DLA core to use (0 or 1 in my case, since the Xavier has 2 DLAs)
  • --workspace= maximum amount of scratch memory (in MB) the builder may use; you may need more
  • --int8 enable int8 precision (FP16 is also available via --fp16)
  • --allowGPUFallback not every operation is supported on the DLAs; this flag lets unsupported layers fall back to the GPU
  • --device= Which GPU device to use (default is 0)

See this GitHub repo for more information about arguments and usage.

Running an Inference Engine

Now, generated engines can be benchmarked against each other to see which performs best. trtexec feeds the engines random data, so this is purely a means of comparing the engines against one another.

trtexec --loadEngine=eng1.trt --batch=1 --streams=1
trtexec --loadEngine=eng1.trt --batch=1 --streams=2
trtexec --loadEngine=eng1.trt --batch=1 --streams=3
trtexec --loadEngine=eng2.trt --batch=2 --streams=4

Note: you need to generate a new engine if you want to change the batch size, but the number of streams can be varied using the same engine.

Benchmarks

If we pipe the information that the trtexec command displays, we can compare the performance of different engines. Below are some of the benchmarks I've done myself. Everything here was generated with a batch size of 1 and a single stream.

It is worth noting that the DLAs actually perform a bit worse than the GPU. This may not be that surprising: the onboard GPU has 48 tensor cores and is better suited to parallel computation across all operations, and since not every required operation can run on the DLA, the overhead of falling back to the GPU may also be a factor. Also, remember that the purpose of the DLAs is to offload work from the GPU so that many models can run concurrently, not to replace it.

Nevertheless, improvements can always be made, and the Xavier NX should be able to do better than 20 to 60 QPS of throughput. I’ll be updating this table as I continue benchmarking and achieve better results.

Possible Improvements

  • Increase batch size
  • Increase # streams
  • Use compact version of model (Yolov4-tiny)
  • Allocate more memory to the engine
