Deep Learning - Follow Me
-
In this project, we needed to build and trained a Fully Convolutional Network using Keras.
-
The goal of this project is to identify a target and then track it.
Network Architecture

The network architecture contains 3 encoder layers followed by an 1x1 convolution layer and 3 decoder layers.
In the encoder block code. It’s using separable conv2d_batchnorm, providing a normalized separable 2d convolution and batch normalization. Advantages of using these techniques are explained below.
Separable convolution layers performs a spatial convolution and then follow with a depthwise convolution. In the course an example is given comparing regular convolutions with separable convolution, which the latter greatly reduces the number of parameters. This reduction can improve the efficiency and runtime performance also added benefit of reducing overfitting. And it’s normally useful for mobile application.
Batch normalization has several advantages as described in the lessons. Such as, train the network faster, allowing higher learning rates, makes more activation functions viable, and simplifies the creation of deeper networks. The goal is to normalize the inputs of each layers in the network, not just the input, to have zero means equal variance.

Bilinear Upsampling consider the closest 2 by 2 neighborhood of known pixel values surrounding the unknown pixels computed location. And then takes a weighted average of those 4 pixels to have it’s final value. The below image is the example shown in the explanation of bilinear upsampling wiki page. To calculate value at (20.2, 14.5) it first derives the value at (20, 14.5) and (21, 14.5) and then interpolating linearly between those two value to get the desired value at the location.

Course Material
In this section, I will explain several concept I have learned from this course.
1 x 1 convolutional layer preserves spatial information. As the example in the lessons. This can solve the question of “Where is the hot dog” while it preserves the spatial information throughout the entire network. This then is good for the application for the follow me project to find the target. And replacing fully connected layers with convolutional layers also has an advantage of feeding images of any size into the trained network.
A Fully connected layer is perfect for the problem like “Is there a hot dog in the image?” Since it doesn’t really care the spatial information of the hot dog. This layer outputs an N dimensional vector (probabilities) where N is the number of classes that the program has to choose from.

Semantic Segmentation is understanding an image at pixel level. It assigns meanings to objects in the image.
The encoder gradually reduces the spatial dimension but increasing the depth (feature maps). Different filters pick up different qualities of a patch.
Decoder network is to map the low-resolution encoder feature maps to full input resolution feature maps. Details of upsampling was discussed above.
With the encoder-decoder architecture we can perform the semantic segmentation to classified objects in our image.