Now that I am too old to be solving it myself and too busy to spend hours on such things, I decided to build a system that uses deep learning to solve it automatically, and spent weeks building it.
Then I came across this paper on Mask R-CNN, which sounded promising for this use case. And it was indeed much better than my earlier approach:
In this post I would like to share how I got the data, tagged it, and trained a model to solve Where's Waldo. You can check out my code on GitHub here.
Mask R-CNN
The main idea here is to:
- Take the input image and pass it through a set of convolutional layers that generate a feature map for the image.
- Pass this feature map into a region proposal network (RPN), which generates rectangular regions that might contain an object of one of the final classes. The RPN has its own classification loss (rpn_class_loss) and bounding box loss (rpn_bbox_loss).
- Filter these proposals with something like non-max suppression and pass the survivors through an RoI pooling layer, which crops each region out of the feature map and resizes it to a fixed shape.
- The pooled regions are then passed through a set of layers that predict whether there is an object in them and refine the box. These again have their own classification and bounding box losses.
- Finally, a separate convolutional branch predicts, for every pixel in the predicted bounding box, whether it belongs to the given class or not. This essentially gives the required mask. Here, in addition to the bounding box and classification losses, we also have a mask loss.
- Train all of these networks together, backpropagating all the losses jointly.
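The proposal-filtering step above can be sketched as a greedy non-max suppression: keep the highest-scoring box, drop any box that overlaps it too much, repeat. A minimal pure-Python version (the box format and the 0.5 threshold are my illustrative choices, not the repo's actual code):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring boxes, dropping any box that
    overlaps an already-kept box by more than the IoU threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep
```

In the real network this runs over thousands of RPN proposals; only the surviving regions are pooled and sent to the classification and mask heads.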
*Mask R-CNN architecture from CS231n (in this case we have a 28x28 mask instead of 14x14)*
Data
As I mentioned earlier, I got 20 images from this repo and a few more from the Medium post that used Faster R-CNN, for a total of 29 images. I split these into 26 for training and 3 for validation. Then I used the VGG Image Annotator (VIA) tool to manually draw masks over Waldo in every image. You can find the images and annotations in my GitHub repo here.
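VIA exports the hand-drawn masks as a JSON file of polygon regions per image. A minimal sketch of turning those polygons into binary masks, using a simple per-pixel ray-casting test (the JSON layout below follows the common VIA export format; the repo's actual loading code may differ):

```python
import json

def point_in_polygon(x, y, xs, ys):
    """Ray-casting test: does point (x, y) fall inside the polygon?"""
    inside = False
    n = len(xs)
    for i in range(n):
        j = (i - 1) % n
        if (ys[i] > y) != (ys[j] > y):
            x_cross = xs[j] + (y - ys[j]) * (xs[i] - xs[j]) / (ys[i] - ys[j])
            if x < x_cross:
                inside = not inside
    return inside

def polygon_to_mask(xs, ys, height, width):
    """Rasterise one VIA polygon into a binary mask (lists of 0/1),
    testing each pixel centre against the polygon."""
    return [[1 if point_in_polygon(x + 0.5, y + 0.5, xs, ys) else 0
             for x in range(width)] for y in range(height)]

def load_via_masks(annotation_path, height, width):
    """Build one binary mask per polygon region in a VIA annotations file."""
    with open(annotation_path) as f:
        annotations = json.load(f)
    masks = {}
    for entry in annotations.values():
        regions = entry["regions"]
        if isinstance(regions, dict):  # older VIA versions export a dict
            regions = list(regions.values())
        polys = [r["shape_attributes"] for r in regions]
        masks[entry["filename"]] = [
            polygon_to_mask(p["all_points_x"], p["all_points_y"], height, width)
            for p in polys]
    return masks
```

With only one class (Waldo) and usually one region per image, each image ends up with a single small mask on a large canvas, which is exactly the situation the mask loss is trained on.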
Training
I trained the model for 30 epochs with 100 steps per epoch. The losses on TensorBoard:
If you want to train on your own dataset, first set your configuration in the waldo_config.py file or use the defaults.
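A hypothetical sketch of the kind of settings waldo_config.py holds, following the Matterport Mask R-CNN Config conventions (attribute names come from that library; the values here are illustrative, not the repo's actual defaults):

```python
# Illustrative config sketch -- attribute names follow the Matterport
# Mask R-CNN Config class; values are assumptions, not the repo's defaults.
class WaldoConfig:
    NAME = "waldo"
    NUM_CLASSES = 1 + 1          # background + waldo
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    STEPS_PER_EPOCH = 100        # matches the 100 steps/epoch used above
    DETECTION_MIN_CONFIDENCE = 0.9
    BATCH_SIZE = GPU_COUNT * IMAGES_PER_GPU
```

With a single foreground class and tiny batches, most of the heavy lifting comes from the COCO-pretrained weights rather than the configuration itself.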
Running the training script will download the Mask R-CNN model pre-trained on the COCO dataset to the MODEL_DIR folder and train a model on the data in the DATA_DIR folder.
```shell
python train.py
```

To run prediction on an image with a trained model:

```shell
python predict.py [MODEL PATH] [PATH TO IMAGE]
# for example
python predict.py models/logs/waldo20180612T1628/mask_rcnn_waldo_0030.h5 data/val/5.jpg
```
In conclusion, the Mask R-CNN algorithm works fairly well at finding Waldo in cases where it has already seen a similar kind of Waldo image. It also works much better when the image quality is good and Waldo is clearly visible. Still, I think the results are great, given the very tiny training set we had.