Model Architecture

The model architecture will be based on Faster R-CNN, a state-of-the-art CNN for object detection that uses a region proposal network to hypothesize object locations \citep{Ren_2017}. A CNN applies a series of pattern detectors (convolutional filters) that the model learns from training data; AlexNet, for example, uses them to classify images in the ImageNet training set into their classes \cite{Krizhevsky_2017}. To derive object location information, R-CNN (the predecessor of Fast R-CNN and Faster R-CNN) creates region proposals using a process called Selective Search \cite{Girshick_2014}, runs the image region inside each proposed bounding box through a pre-trained AlexNet, and finally uses SVMs to classify the objects. To speed up and simplify R-CNN, Ross Girshick proposed Fast R-CNN, which introduces RoI (Region of Interest) pooling so that a single forward pass of the CNN over an image is shared across all of its subregions \cite{Girshick_2015}. Building on this line of work, Faster R-CNN was chosen because it runs more quickly and efficiently by reusing the same convolutional features for both region proposal and classification \cite{Ren_2017}.
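
The feature sharing described above can be illustrated with a minimal sketch of RoI max pooling. It is a simplification for intuition only, not the implementation used in Fast R-CNN or Faster R-CNN: it assumes a single-channel feature map stored as a list of lists, boxes already expressed in feature-map cells, and a region at least as large as the output grid. The names \texttt{roi\_max\_pool}, \texttt{feature\_map}, and the two example boxes are hypothetical.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]
# Hypothetical box format: (row0, col0, row1, col1), end-exclusive,
# given in feature-map cells rather than image pixels.
Box = Tuple[int, int, int, int]


def _ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(features: FeatureMap, box: Box, out_size: int = 2) -> FeatureMap:
    """Max-pool the cells inside `box` into an out_size x out_size grid."""
    r0, c0, r1, c1 = box
    height, width = r1 - r0, c1 - c0
    if height < out_size or width < out_size:
        raise ValueError("region must span at least out_size cells per side")

    pooled = []
    for i in range(out_size):
        # Row range covered by output bin i (bins may overlap by one cell
        # when the region size is not divisible by out_size).
        row_start = r0 + (i * height) // out_size
        row_end = r0 + _ceil_div((i + 1) * height, out_size)
        row = []
        for j in range(out_size):
            col_start = c0 + (j * width) // out_size
            col_end = c0 + _ceil_div((j + 1) * width, out_size)
            row.append(max(
                features[r][c]
                for r in range(row_start, row_end)
                for c in range(col_start, col_end)
            ))
        pooled.append(row)
    return pooled


# Stand-in for the convolutional features of one image, computed once.
feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]

# Two proposals of different sizes reuse the same feature map
# and both yield a fixed 2 x 2 output.
pooled_a = roi_max_pool(feature_map, (0, 0, 4, 4))
pooled_b = roi_max_pool(feature_map, (1, 2, 4, 6))
print(pooled_a)
print(pooled_b)
```

The point of the sketch is that the feature map is built once and each proposal only indexes into it, so the per-region cost is a cheap pooling step rather than a full CNN forward pass. Because every region is reduced to the same fixed-size grid, proposals of different shapes can feed the same downstream classifier.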