An image can catch different people's eyes differently. Very often, however, we observe an image, photo, or painting as a whole, trying our best to take in its entirety before arriving at any conclusion. We do not know where the artist started on the canvas, or the process by which the painting was developed. Music, video, and speech, on the other hand, have a temporal dimension that makes their data sequential. We process and store information in order. There is a before and an after. As a result, traditional neural networks, despite their tremendous success in computer vision, cannot easily tackle problems involving sequence data [...]

VGG stands for Visual Geometry Group, a research group in the Department of Engineering Science at the University of Oxford. The name also refers to the group's deep convolutional network (ConvNet) models with either 16 layers (VGG-16) or 19 layers (VGG-19). [...]

AlexNet refers to an eight-layer convolutional neural network (CNN) that won the ILSVRC (ImageNet Large Scale Visual Recognition Challenge), the Blackpool for image classification, in 2012. It consists of 5 convolutional layers followed by 3 fully connected layers, ends in a 1000-way softmax, and has about 60 million parameters. [...]

The motivation arises from the fact that the number of weights in a fully connected network grows quickly with the size of the input image, which in turn requires an enormous dataset to avoid overfitting (on top of the prohibitive computational cost). [...]
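
To make the growth concrete, here is a rough back-of-the-envelope sketch in Python. The function names, the hidden-layer width of 1000 units, and the 3x3 kernel with 64 filters are illustrative assumptions, not figures from the text. It counts the weights and biases of one fully connected layer applied to a flattened image and compares them with one convolutional layer.

```python
def dense_params(height, width, channels, hidden_units):
    """Weights plus biases of one fully connected layer on a flattened image.

    hidden_units is a hypothetical layer width chosen for illustration.
    """
    inputs = height * width * channels
    return inputs * hidden_units + hidden_units


def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of one 2D convolutional layer.

    The count does not depend on the image height or width.
    """
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


if __name__ == "__main__":
    # Assumed example sizes: a small 32x32 RGB image and a 224x224 RGB image.
    for side in (32, 224):
        n = dense_params(side, side, 3, hidden_units=1000)
        print(f"{side}x{side}x3 image, dense layer of 1000 units: {n:,} parameters")

    # Assumed example: 3x3 kernels, 3 input channels, 64 filters.
    print(f"3x3 conv, 3 -> 64 channels: {conv_params(3, 3, 64):,} parameters")
```

Under these assumptions, the dense layer's parameter count scales with the number of pixels, while the convolutional layer's count stays fixed regardless of image size.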