Wednesday, 13 June 2018

    Finding Where's Waldo using Mask R-CNN

    When I was a kid, I really loved solving Where's Waldo. There were a few books (it used to be called Where's Wally) in our school library, and I spent hours on them finding Waldo. For people who do not know what it is: Waldo, a unique character, is hidden among hundreds of other characters, and you have to find him in all the chaos in the image.

    Now that I am too old to be solving it and too busy to spend hours on such things, I decided to build a system that uses deep learning to solve it automatically, and spent weeks building it.

    I started off by treating this as a classification problem with two classes, Waldo and not Waldo, similar to Hot dog - not Hot dog. Once the classification problem is solved, we can apply a class activation mapping (CAM) layer to find Waldo's activations in the image and thus find Waldo. However, I couldn't find enough images of Waldo. I found this repo, which has about 20 images. And as there are only 20 Waldos versus thousands of not-Waldo characters, the classes are highly imbalanced. I still tried, but the results weren't that great.

    When I checked whether someone had already worked on this, I found a Medium post which used TensorFlow's Faster R-CNN model to do it. But I didn't want to just find bounding boxes; I wanted to actually mask out Waldo in the image. I did get more images of Where's Waldo from that post, though.

    Then I came across this paper on Mask R-CNN, which sounded promising for this use case. And it was indeed much better than my earlier approach:

    Waldo masked out in the image

    Original Image

    In this post I would like to share how I got the data, tagged it and trained a model to solve Where's Waldo. You can check out my code on GitHub here.

    Fork deepwaldo on Github

    Mask R-CNN 

    The main idea here is to:

    1. Take the input image and pass it into a set of convolutional layers that generates a feature map for the given image.
    2. Take this feature map and pass it into a region proposal network, which generates rectangular regions that might contain an object of one of the final classes. This has its own classification loss (rpn_class_loss) and bounding box loss (rpn_bbox_loss).
    3. Filter these regions using something like non-max suppression and pass them into an ROI pooling layer (RoIAlign in Mask R-CNN).
    4. The regions are then reshaped to a fixed size and passed on to a set of layers which predict the class of the object in each region, if any, and refine its bounding box. This again has its own classification and bounding box losses.
    5. A separate set of convolution layers predicts, for every pixel in the predicted bounding box, whether it belongs to the given class or not. This essentially gives the required mask. So in addition to the bounding box and classification losses, we also have a mask loss.
    6. Train all these networks together, backpropagating all the losses.
    If you want a clearer explanation, check out the lecture in CS231n.

    Mask R-CNN arch from CS231n (In this case we have a 28x28 mask instead of 14x14)
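    To make the non-max suppression step in the list above concrete, here is a minimal pure-Python sketch. It is not the Mask R-CNN implementation: the box format (x1, y1, x2, y2), the function names and the 0.5 IoU threshold are all assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with an already kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Made-up proposals: two heavily overlapping boxes and one separate box.
proposals = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(proposals, scores)
```

    Here the second proposal overlaps the first one heavily and has a lower score, so only the first and third proposals survive.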


    As I mentioned earlier, I got 20 images from this repo and a few more images from the Medium post that used Faster R-CNN, for a total of 29 images. I split these into 26 for training and 3 for validation. Then I used the VIA tool (VGG Image Annotator) to manually draw masks over Waldo in every image. You can find the images and annotations in my GitHub repo here.
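    A hand-drawn polygon has to be turned into a pixel mask before it can be used for training. The sketch below shows one simple way to do that, assuming each polygon is stored as two lists of vertex coordinates under the keys all_points_x and all_points_y (the layout VIA uses for polygon regions). The sample region and the helper names are made up for illustration and are not taken from my repo.

```python
def point_in_polygon(x, y, xs, ys):
    """Ray-casting test: is the point (x, y) inside the polygon?"""
    inside = False
    j = len(xs) - 1
    for i in range(len(xs)):
        # Does the edge (j -> i) cross the horizontal ray going right from (x, y)?
        if (ys[i] > y) != (ys[j] > y):
            x_cross = xs[i] + (y - ys[i]) * (xs[j] - xs[i]) / (ys[j] - ys[i])
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def polygon_to_mask(region, height, width):
    """Rasterize one polygon region into a height x width mask of 0s and 1s."""
    xs, ys = region["all_points_x"], region["all_points_y"]
    # Test the center of every pixel against the polygon.
    return [[1 if point_in_polygon(c + 0.5, r + 0.5, xs, ys) else 0
             for c in range(width)]
            for r in range(height)]


# Hypothetical annotation: a small rectangle drawn in a 6x4 image.
region = {"all_points_x": [1, 4, 4, 1], "all_points_y": [1, 1, 3, 3]}
mask = polygon_to_mask(region, height=4, width=6)
```

    Testing every pixel like this is slow for large images, so a real pipeline would normally use a library routine for polygon filling instead.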


    I trained the model for 30 epochs with 100 steps per epoch. The losses on TensorBoard:

    If you want to train on your own dataset, first set your configurations in the file or use the default.

    This downloads the Mask R-CNN model pretrained on the COCO dataset to the MODEL_DIR folder and trains a model on the data in the DATA_DIR folder.
    For prediction, you can run the following, which shows a popup with Waldo detected in the image.

    # for example
    python models/logs/waldo20180612T1628/mask_rcnn_waldo_0030.h5 data/val/5.jpg

    In conclusion, the Mask R-CNN algorithm works fairly well at finding Waldo when it has already seen a similar type of Waldo image. It also seems to work much better when the image quality is good and Waldo is clearly visible. But I think it is still great, given that we only had a tiny training set.

    Tuesday, 23 January 2018

    Higher level ops for building neural network layers with deeplearn.js

    I have been meddling with Google's deeplearn.js lately for fun. It is surprisingly good given how new the project is, and it seems to have a solid roadmap. However, it still lacks something like tf.layers and tf.contrib.layers, which have many higher-level functions that have made using TensorFlow so easy. It looks like they will be added to Graphlayers in the future, but the priority for now is to fix the lower-level APIs first, which totally makes sense.

    So, I quickly built versions of tf.layers.conv2d and tf.layers.flatten, which I will share in this post. I have kept them as close to the function definitions in TensorFlow as possible.

    1.  conv2d - Functional interface for the 2D convolution layer.

    • inputs Tensor input.
    • filters Integer, the dimensionality of the output space (i.e. the number of filters in the convolution).
    • kernel_size Number specifying the height and width of the 2D convolution window.
    • graph Graph object.
    • strides Number specifying the strides of the convolution.
    • padding One of "valid" or "same" (case-insensitive).
    • data_format "channels_last" or "channels_first".
    • activation Optional. Activation function applied to the output of the layer. The function should accept a Tensor and the graph as parameters.
    • kernel_initializer An initializer object for the convolution kernel.
    • bias_initializer An initializer object for the bias.
    • name String, the name of the layer.

    Returns: Tensor output.
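
    To illustrate how kernel_size, strides and padding determine the shape of the output tensor, here is a small Python sketch that follows TensorFlow's conventions for "valid" and "same" padding. The function name conv2d_output_shape is hypothetical, it assumes a channels_last input, and it is not the deeplearn.js code from this post.

```python
import math


def conv2d_output_shape(input_shape, filters, kernel_size, strides=1, padding="valid"):
    """Output (height, width, channels) for a channels_last input."""
    height, width, _ = input_shape
    padding = padding.lower()  # padding is case-insensitive
    if padding == "same":
        out_h = math.ceil(height / strides)
        out_w = math.ceil(width / strides)
    elif padding == "valid":
        out_h = math.ceil((height - kernel_size + 1) / strides)
        out_w = math.ceil((width - kernel_size + 1) / strides)
    else:
        raise ValueError("padding must be 'valid' or 'same'")
    return (out_h, out_w, filters)


# A 28x28 single-channel image through 32 filters of size 5.
valid_shape = conv2d_output_shape((28, 28, 1), filters=32, kernel_size=5)
same_shape = conv2d_output_shape((28, 28, 1), filters=32, kernel_size=5,
                                 strides=2, padding="SAME")
```

    With "valid" padding the 5x5 window shrinks the 28x28 input to 24x24, while "same" padding with a stride of 2 halves it to 14x14.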


    Add this to your code:

    2. flatten - Flattens an input tensor.
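
    As a rough illustration of what flattening does, the sketch below collapses a nested Python list into one flat list in row-major order. It is a plain-Python stand-in with a hypothetical helper name, not the deeplearn.js snippet.

```python
def flatten(values):
    """Recursively flatten nested lists into a single flat list (row-major)."""
    flat = []
    for item in values:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


# A 2x2x2 "tensor" becomes a vector of 8 values.
tensor = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
vector = flatten(tensor)

# tf.layers.flatten keeps the first (batch) axis, i.e. it flattens each example.
per_example = [flatten(example) for example in tensor]
```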

    I wrote these snippets while building a tool with deeplearn.js that handles things like loading datasets, batching and saving checkpoints, along with visualization. I will share more on that in future posts.

    Thursday, 11 January 2018

    Hacking FaceNet using Adversarial examples

    With the rise in popularity of face recognition systems based on deep learning and their application in security/authentication, it is important to make sure that it is not that easy to fool them. I recently finished the 4th course on where there is an assignment which asks us to build a face recognition system - FaceNet. While I was working on the assignment, I couldn't stop thinking about how easy it would be to fool it with adversarial examples. In this post I will tell you how I managed to do it.

    First off, some basics about FaceNet. Unlike image recognition systems, which map every image to a class, it is not possible to assign a class label to every face in face recognition. This is because, one, there are way too many faces that a system must handle in the real world to assign a class to each of them and, two, a fixed set of classes cannot handle new people added to the system. So instead we build a system that learns similarities and dissimilarities. Basically, there is a neural network similar to what we have in image recognition, but instead of applying softmax at the end, we take the logits as the embedding for the given input image and then minimize something called the triplet loss. Consider a face A, for which we have a positive match P and a negative match N. If f is the embedding function and L is the triplet loss, we have this:

    Triplet loss
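
    Written out, the triplet loss as it is commonly defined for FaceNet looks like the expression below. The margin \alpha is part of the standard definition but is not named elsewhere in this post, so treat it as an assumption here.

```latex
L(A, P, N) = \max\left( \lVert f(A) - f(P) \rVert_2^2 - \lVert f(A) - f(N) \rVert_2^2 + \alpha,\; 0 \right)
```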

    Basically, it incentivizes a small distance between A and P and a large distance between A and N. Also, I really recommend watching Ian Goodfellow's lecture from Stanford's CS231n course if you want to know about adversarial examples.

    Like I said earlier, this thought came to me while doing an assignment from the 4th course from which can be found here, and I have built on top of it. The main idea here is to find a small noise that, when added to someone's photo, causes virtually no visual change yet makes FaceNet identify them as the target.

    Benoit (attacker)
    Add noise
    Kian Actual (Target)

    First, let's load the images of the attacker Benoit and the target Kian.

    Now say that the original attacker image is A, the attacker image with noise added is A` and the target image is T. We want to define the triplet loss to achieve two things:

    1. Minimize the distance between A` and T
    2. Maximize the distance between A` and A (original)
    In other words, the triplet loss L is:

    L (anchor, positive, negative) = L (A`, T, A)

    Now, let's compute the gradient of this loss, which is computed from the logits, with respect to the input image.

    These gradients are used to obtain the adversarial noise as follows :

    noise = noise - step_size * gradients
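
    To see this update rule at work without the real network, here is a toy sketch in pure Python. Everything in it is an assumption for illustration: a fixed linear map stands in for FaceNet, the "images" are three numbers, the gradient is estimated numerically, and the margin of 0.2 and all names are hypothetical. The loss uses the noisy attacker image as the anchor, the target as the positive and the original attacker image as the negative.

```python
# Toy stand-in for FaceNet: a fixed linear map from a 3-value "image" to a 2-D embedding.
W = [[1.0, 1.0, 0.0],
     [0.0, 1.0, -1.0]]


def embed(x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def loss(noise, attacker, target, margin=0.2):
    """Triplet loss: pull the noisy attacker towards the target, away from the original."""
    e = embed([a + n for a, n in zip(attacker, noise)])
    return max(sq_dist(e, embed(target)) - sq_dist(e, embed(attacker)) + margin, 0.0)


def numerical_grad(fn, x, eps=1e-5):
    """Central finite differences, used here instead of a framework's autodiff."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((fn(hi) - fn(lo)) / (2 * eps))
    return grad


def dist_to_target(noise, attacker, target):
    e = embed([a + n for a, n in zip(attacker, noise)])
    return sq_dist(e, embed(target)) ** 0.5


attacker = [1.0, 0.0, 0.0]  # made-up "attacker image"
target = [0.0, 0.0, 1.0]    # made-up "target image"
noise = [0.0, 0.0, 0.0]
step_size = 0.01

before = dist_to_target(noise, attacker, target)
for _ in range(5):
    gradients = numerical_grad(lambda n: loss(n, attacker, target), noise)
    noise = [n - step_size * g for n, g in zip(noise, gradients)]
after = dist_to_target(noise, attacker, target)
```

    After a few steps the distance between the noisy attacker embedding and the target embedding shrinks, which is the same effect the real attack relies on.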

    According to the assignment, an L2 distance of less than 0.7 between the embeddings indicates that two faces belong to the same person. So let's do that.

    The distance decreases from 0.862257 to 0.485102 which is considered enough in this case.

    L2 distance between embeddings of attacker and target
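
    The check itself is just a threshold on the distance. A minimal sketch, using the two distances reported above and a hypothetical helper name:

```python
THRESHOLD = 0.7  # from the assignment, as quoted above


def same_person(l2_distance, threshold=THRESHOLD):
    """Two faces count as the same person when the embedding distance is below the threshold."""
    return l2_distance < threshold


before_attack = same_person(0.862257)  # attacker vs. target, no noise
after_attack = same_person(0.485102)   # attacker plus noise vs. target
```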
    This is impressive because all of this is done without visibly altering the image, just by adding a little calculated noise!

    Also note that the L2 scores indicate that the generated image is more Kian than Benoit, in spite of looking practically identical to Benoit. So there you go: adversarial example generation for FaceNet.