August 04, 2018
Let’s explore a very naive method of detecting objects in an image. We will attempt to use an image classifier to detect objects and draw bounding boxes around them. While we know this is possible, we aren’t sure how it’s done. How close can we get before we need to explore the literature?
For a given image, how can we find objects, and how well can we draw bounding boxes around them?
There are a few components to this problem:
We will start by trying to handle the simplest case first.
Given an image of a single object, which is well-cropped, how could we proceed?
First, we will classify the object in the image. I decided to use a pretrained Convolutional Neural Network, ResNet50. This allowed me to focus on the problem instead of building and training an image classifier. We can always swap out the classifier for better accuracy later.
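The classification step could look something like the sketch below. The post doesn't say which framework was used, so the Keras build of ResNet50 is an assumption here, and `classify_image` is a hypothetical helper name. It expects an RGB array that has already been resized to 224x224, the input size the Keras ResNet50 weights were trained on.

```python
def classify_image(image, model=None):
    """Return (label, confidence) for an RGB array of shape (224, 224, 3).

    Assumes the Keras ResNet50 with ImageNet weights; the original post
    does not say which implementation it used.
    """
    # Imported lazily because TensorFlow is slow to load.
    from tensorflow.keras.applications.resnet50 import (
        ResNet50,
        decode_predictions,
        preprocess_input,
    )

    if model is None:
        model = ResNet50(weights="imagenet")  # downloads weights on first use

    # Add a batch dimension and apply the preprocessing ResNet50 expects.
    batch = preprocess_input(image.astype("float32")[None, ...])
    predictions = model.predict(batch)

    # decode_predictions returns, per image, a list of (class_id, label, score).
    _, label, confidence = decode_predictions(predictions, top=1)[0][0]
    return label, float(confidence)
```

Passing in an already-built `model` avoids reloading the weights on every call, which matters once we start classifying many crops of the same image.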
Next, let’s learn to draw a bounding box. I used Matplotlib to draw the bounding box around the whole image. After that, I added the label to the image. We will abstract this into a function and use it later.
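Here is a minimal sketch of what that drawing function might look like. `draw_box` is a hypothetical name, the colours are arbitrary, and the blank array stands in for a real photo.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle


def draw_box(ax, x, y, width, height, label):
    """Draw one labelled rectangle on a Matplotlib axes (pixel coordinates)."""
    box = Rectangle((x, y), width, height,
                    fill=False, edgecolor="red", linewidth=2)
    ax.add_patch(box)
    # Put the label at the top-left corner of the box.
    ax.text(x, y, label, color="white", backgroundcolor="red", va="top")
    return box


# Placeholder image (40 rows x 60 columns); a real photo would go here.
image = np.zeros((40, 60, 3))
fig, ax = plt.subplots()
ax.imshow(image)

# Simplest case: the box covers the whole, well-cropped image.
box = draw_box(ax, 0, 0, image.shape[1], image.shape[0], "cat")
```

Taking the axes as an argument means the same function can draw several boxes on one image later on.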
Ok, that was the simplest case. How could we adapt this to multiple objects?
An image with multiple objects will not be well-cropped. What do we do in that case? Perhaps we could divide the image evenly using a grid, hoping that there is at most 1 object in each cell.
I started by splitting the image evenly into cells. Then I ran each cell through the classifier, and drew bounding boxes with their labels. Clearly this is ineffective. There is only 1 object, but there are many boxes and many labels.
This makes more sense if we choose an image with multiple objects.
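The grid approach can be sketched roughly as follows. Both function names are hypothetical, and `classify` is assumed to be any callable that takes an image crop and returns a `(label, confidence)` pair, handling whatever resizing the classifier needs.

```python
def grid_cells(height, width, rows, cols):
    """Yield (x, y, w, h) pixel boxes that tile a height x width image."""
    for r in range(rows):
        for c in range(cols):
            # Integer arithmetic keeps the cells gap-free even when the
            # image size is not an exact multiple of the grid size.
            y0, y1 = r * height // rows, (r + 1) * height // rows
            x0, x1 = c * width // cols, (c + 1) * width // cols
            yield x0, y0, x1 - x0, y1 - y0


def detect_on_grid(image, classify, rows=2, cols=2):
    """Classify every grid cell; return a list of (box, label, confidence)."""
    height, width = image.shape[:2]
    detections = []
    for x, y, w, h in grid_cells(height, width, rows, cols):
        cell = image[y:y + h, x:x + w]  # crop the cell (NumPy-style slicing)
        label, confidence = classify(cell)
        detections.append(((x, y, w, h), label, confidence))
    return detections
```

Every cell gets a label whether or not it contains anything interesting, which is exactly why a single-object image ends up covered in boxes.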
Next, let’s get rid of poor labels by filtering out classifications of low confidence.
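Filtering by confidence is a one-liner. This sketch assumes each detection is a `(box, label, confidence)` tuple; `filter_detections` is a hypothetical name, and the sample values are made up for illustration rather than taken from a real run.

```python
def filter_detections(detections, threshold=0.5):
    """Keep only detections whose confidence is at least `threshold`."""
    return [d for d in detections if d[2] >= threshold]


# Made-up detections for a 2x2 grid: (box, label, confidence).
detections = [
    ((0, 0, 100, 50), "tabby", 0.91),
    ((100, 0, 100, 50), "window_screen", 0.12),
    ((0, 50, 100, 50), "tabby", 0.48),
    ((100, 50, 100, 50), "doormat", 0.07),
]

confident = filter_detections(detections, threshold=0.5)
```

The threshold is a knob to tune by hand: set it too high and real objects disappear, too low and the noisy labels come back.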
We have a method to deal with multiple objects, but the boxes are too large and poorly localized.
We’ve been dividing the input image into 4 cells. Could we get better sized bounding boxes if we increase the number of cells?
Doing so results in much more accurate box placement and sizing. However, it seems that we need to play with the threshold to handle multiple objects.
Now that we are able to create better bounding boxes, how can we deal with different objects?
Given an image with many different objects, how well does a 2x2 grid perform?
It would seem that it can classify the different objects, but the boxes are too large.
Even changing the number of cells isn’t the most effective. It would seem our grid system can’t overcome some issues in localization and sizing.
While we managed a crude form of object detection with our very naive methods, the results were weak. Using a grid is a good idea, but I’m missing some key components. I’m unable to overcome the issue of bounding box sizing and position. I’m sure this has to do with the lack of a feedback loop to assess the bounding box size and location. We’ve been hand-coding the boxes and visually assessing the results.
I’ll need to study the literature to understand how the best results have been obtained. I found Andrew Ng’s deeplearning.ai Convolutional Neural Networks course enlightening on this subject. He builds up to YOLO, a sophisticated and highly effective real-time object detector.
I will return with a YOLO implementation. It will be lit 🔥.
Written by Peter Chau, a Canadian Software Engineer building AIs, APIs, UIs, and robots.