What to Consider When Selecting YOLO for Real-Time Applications
Updated: Jun 9, 2020
Real-time object detection has taken a huge leap from where it was a decade ago. The rise of deep neural networks, combined with very smart algorithms and the power of GPUs, has made real-time object detection across multiple video sources possible, and with very high degrees of accuracy.
The need to process multiple video streams simultaneously has been gaining a lot of traction lately: self-driving cars must process visual feeds from installed sensors in fractions of a second to make decisions, organizations want real-time alerts to up their security game, and vendors are looking to leverage their camera setups to extract insights from their video streams that are impossible for a human to capture.
The YOLO (You Only Look Once) algorithm has topped the list in real-time object detection for its incredible speed and accuracy compared to competing algorithms, especially with YOLOv3, released in 2018, which allows for much higher detection accuracy than its predecessor.
So now let's get a bit technical!
The latest YOLO comes in 3 model flavors (YOLOv3, YOLO-SPP, and Tiny YOLO), each with its own precision and speed (in fps). To complicate things even further, each model can handle 3 input image sizes (320, 416, and 512). So which configuration of YOLO should you choose for your project? Let's try to answer this question by investigating the results of these combinations in depth.
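The nine combinations above can be enumerated in a small lookup table, which makes the rest of the comparison concrete. This is a minimal sketch; the class and names are mine for illustration, not part of any YOLO API.

```python
from dataclasses import dataclass
from itertools import product

# The three model flavors and three input sizes discussed above.
FLAVORS = ("yolov3", "yolo-spp", "tiny-yolo")
INPUT_SIZES = (320, 416, 512)

@dataclass(frozen=True)
class YoloConfig:
    flavor: str       # one of FLAVORS
    input_size: int   # square input resolution fed to the network

    @property
    def name(self) -> str:
        return f"{self.flavor}-{self.input_size}"

# All nine flavor x size combinations a project could choose from.
ALL_CONFIGS = [YoloConfig(f, s) for f, s in product(FLAVORS, INPUT_SIZES)]
```

Each of the sections below effectively narrows this list of nine candidates down by one criterion at a time.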
Accuracy (mAP) vs FPS
Accuracy correlates with the input image size: a larger input image always corresponds to higher accuracy at the expense of frames per second. For the same input size, YOLOv3 and YOLO-SPP have similar mAP, while Tiny YOLO (which is made to be lightweight for mobile phones, Raspberry Pis, etc.) has a much lower mAP.
On the other hand, the fps for Tiny YOLO hits the roof near 200 fps, while the other models usually fluctuate between 40 and 90 fps depending on the input image size.
If we plot fps vs mAP for all models and image sizes, we notice a nearly inverse linear relationship between the two. The sweet spot for selecting a model is around 0.6 mAP, which both YOLOv3 and YOLO-SPP reach at an input image size of 320.
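That trade-off can be captured in a few lines: given measured (mAP, fps) pairs, pick the fastest configuration that still clears an accuracy floor. The numbers below are illustrative placeholders consistent with the ranges quoted above, not benchmark results.

```python
# Illustrative (mAP, fps) measurements per configuration; placeholder
# values in the ranges discussed in the text, not real benchmarks.
measurements = {
    "yolov3-512":    (0.66, 40),
    "yolov3-320":    (0.60, 75),
    "yolo-spp-320":  (0.61, 70),
    "tiny-yolo-416": (0.33, 200),
}

def pick_config(measurements, min_map):
    """Return the fastest configuration whose mAP clears the floor."""
    eligible = {name: (m, fps) for name, (m, fps) in measurements.items()
                if m >= min_map}
    if not eligible:
        raise ValueError(f"no configuration reaches mAP >= {min_map}")
    return max(eligible, key=lambda name: eligible[name][1])
```

With a 0.6 mAP floor, the YOLOv3-style configurations at size 320 win on speed; relax the floor to 0.3 and Tiny YOLO takes over.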
But wait! Why is there variance in fps for the same mAP and input image size?
Leveraging the Power of GPUs (Batch Processing)
If GPUs can do one thing really well (besides giving us amazing gaming experiences), it is millions of floating-point operations per second, which allows images to be processed through YOLO's network in parallel. So clearly we should shove as many images into the GPU's VRAM as it can handle, right? Well, not exactly. An investigation of different batch sizes shows that for large architectures like YOLOv3 and YOLO-SPP, fps peaks around a batch of 16 and drops for larger batches (this result doesn't hold for smaller networks: for Tiny YOLO a batch of 32 has the same fps as a batch of 16). This can be explained by the limited number of available CUDA cores: once the GPU hits its processing capacity, loading more images into VRAM just means they get handled sequentially, and no further parallel speed-up is gained.
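In practice this means grouping the incoming frames into fixed-size batches before handing them to the network. A minimal chunking helper (the names are mine, not part of any YOLO API) might look like:

```python
def batched(frames, batch_size=16):
    """Yield successive batches of at most `batch_size` frames.

    A batch of 16 is roughly where the text observes the fps peak
    for the larger YOLO architectures on the test GPU.
    """
    for i in range(0, len(frames), batch_size):
        yield frames[i:i + batch_size]

# Example: 50 frames split into batches of 16, 16, 16, and 2.
sizes = [len(b) for b in batched(list(range(50)), 16)]
```

The last, partial batch is still yielded, so no frames are dropped at the end of a stream.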
Lets Talk VRAM Consumption
So how much VRAM does each model/batch size/image size combination consume? The chart below shows the VRAM consumption for each combination. It's fairly obvious that VRAM consumption increases with larger networks, larger image sizes, or larger batch sizes, so make sure the selected network, image size, and batch fit in your GPU. In most cases a batch size of 16 will fit in a 6 GB GPU.
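A rough back-of-envelope check for the input tensor alone can be sketched as below. This is my own simplification, and a deliberate underestimate: real consumption also includes the model weights and all intermediate activations, so treat it as a sanity check, not a capacity planner.

```python
def input_blob_bytes(batch, size, channels=3, bytes_per_value=4):
    """Lower bound on VRAM for the input batch alone.

    Assumes float32 inputs (4 bytes per value); weights and
    intermediate activations are NOT counted here.
    """
    return batch * channels * size * size * bytes_per_value

# A batch of 16 images at 512x512 needs ~48 MB just for the input blob.
mb = input_blob_bytes(16, 512) / 2**20
```

Even this crude bound makes the trend in the chart plausible: doubling either the batch size or the image area doubles the input footprint.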
Any CUDA-capable GPU can be used to speed up inference; just make sure that the selected network, image size, and batch fit in the GPU's VRAM. If no GPU is available, or the network needs to run on a mobile phone or a Raspberry Pi, then Tiny YOLO would be the wise choice, since it's light enough to run at a good fps.
Single Detection vs Consistency
It depends on your application. If the purpose is to detect that an object has passed within the camera frame during a specific time window, then Tiny YOLO might do the job, as it will detect most objects at least once, but the consistency of its detections is not great. If your aim is consistency or tracking, then using a more robust network is the way to go.
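One common way to cope with flickering detections, whatever the model, is to require an object to appear in several consecutive frames before confirming it. A minimal debounce sketch (names of my own choosing; this is not a real tracker — no IDs, no motion model):

```python
def confirmed_objects(frame_detections, min_streak=3):
    """Return labels detected in at least `min_streak` consecutive frames.

    `frame_detections` is a list of per-frame sets of class labels.
    A label's streak resets whenever it drops out for a frame.
    """
    streaks, confirmed = {}, set()
    for labels in frame_detections:
        for label in labels:
            streaks[label] = streaks.get(label, 0) + 1
            if streaks[label] >= min_streak:
                confirmed.add(label)
        # reset streaks for labels that dropped out this frame
        for label in list(streaks):
            if label not in labels:
                del streaks[label]
    return confirmed

# "cat" flickers out on frame 3, so only "person" builds a 3-frame streak.
frames = [{"person", "cat"}, {"person", "cat"}, {"person"}, {"person", "cat"}]
result = confirmed_objects(frames, min_streak=3)
```

With a consistent detector the streaks build up quickly; with a flickery one like Tiny YOLO, objects keep getting reset before confirmation, which is exactly the gap the text describes.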
Different classes have different mAPs
For example, classes like person, motorcycle, airplane, bus, train, truck, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe, and others all have higher mAPs than the average network mAP, which is the figure usually reported. This allows the use of smaller networks and smaller image sizes to achieve an acceptable mAP.

The results below are for YOLO-SPP at an image size of 320: you can see that these classes have a higher mAP than the average. Cats, for example, reach 0.92 mAP while the class average is only 0.63, so using a larger image size to detect cats would be overkill and a waste of resources, since a practically good mAP can be reached at a smaller image size, which means faster speed. The same logic extends to smaller networks: although Tiny YOLO has an average mAP of 0.3, some classes may reach an mAP of 0.6 or 0.7, which is practically good enough for most applications.

So consider which classes you target and find their mAPs for different models and image sizes before making a decision, because you can easily gain a boost in fps if a smaller network or image size gets the job done.
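Translating that advice into code, the decision can be made on the per-class APs of the classes you actually care about rather than on the network-wide average. Of the numbers below, only the cat figure (0.92) and the 0.63 average come from the text; the rest are placeholders for illustration.

```python
# Per-class AP for a hypothetical yolo-spp-320 run; only "cat" (0.92)
# and the 0.63 average come from the text, the rest are placeholders.
per_class_ap = {"cat": 0.92, "dog": 0.80, "person": 0.75, "toothbrush": 0.20}

def acceptable_for(target_classes, per_class_ap, min_ap=0.6):
    """True if every target class clears the per-class AP floor.

    Judging per target class (instead of by the network-wide average)
    can justify a smaller network or input size, and thus more fps.
    """
    return all(per_class_ap.get(c, 0.0) >= min_ap for c in target_classes)

# A cat detector is fine at this size; a toothbrush detector is not,
# even though both live under the same 0.63 network average.
```

A class missing from the table counts as 0.0, which conservatively rejects configurations you have no measurements for.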
** All tests were performed on an NVIDIA GTX 1660 Ti
Synapse's own AI & Machine Learning engineer Marco Rizk on "What to Consider When Selecting YOLO for Real-Time Applications". Synapse Analytics
Want to make your operations A.I. powered?