A Comprehensive Study on Torchvision Pre-trained Models for Fine-grained Inter-species Classification
Abstract
This study explores the pre-trained models offered in the Torchvision package of the PyTorch library and investigates their effectiveness on fine-grained image classification. Transfer Learning is an effective method for achieving very good performance when training data is insufficient. In many real-world situations, people cannot collect enough data to train a deep neural network efficiently. Transfer Learning models are pre-trained on a large data set and can deliver good performance on smaller data sets with significantly lower training time. Since the Torchvision package offers many such models, researchers may need a guideline for selecting a good one. We investigate Torchvision pre-trained models on four data sets: 10 Monkey Species, 225 Bird Species, Fruits 360, and Oxford 102 Flowers. These data sets differ in image resolution, number of classes, and achievable accuracy. We train each model with both its usual fully-connected layer and the Spinal fully-connected layer to investigate the effectiveness of SpinalNet; the Spinal fully-connected layer brings better performance in most situations. For a fair comparison, we apply the same augmentation to every model on the same data set. This paper may help future Computer Vision researchers choose a proper Transfer Learning model.
Index Terms:
Deep Learning, Transfer Learning, PyTorch, Torchvision, Fine-grained, Inter-species, Image Classification.
I Introduction
Various Machine Learning (ML) libraries have made it easier to develop and implement Neural Network (NN) models [1, 2]. One of them is PyTorch [3], an open-source framework used for the research, development, and deployment of ML systems. The library includes various packages to process data and to configure models according to requirements. One of these packages is Torchvision, which consists of various data sets, pre-trained models, and image processing tools and techniques. Torchvision includes many pre-trained models for Transfer Learning (TL). TL is becoming more popular as it enables the development of high-performance NN models with low data volume.
Transfer Learning is an efficient technique for applying previously acquired knowledge and skills to novel problems. It is similar to educating humans with a much broader syllabus to achieve competencies for an unpredictable future. Deep Neural Networks (DNNs) require adequate training samples for proper training, and insufficient training samples may result in poor performance [4, 5, 6]. TL is an efficient DNN training technique in which the initial layers of the DNN are pre-trained on a large data set [7]; the target data set trains only a few final layers. As a result, the user gets a well-trained NN for a specific data set with a small number of samples, at lower computational overhead. TL is gaining huge popularity due to this exceptional performance, and many researchers expect TL to be the next driver of the commercial success of Machine Learning.
There are many pre-trained models in the Torchvision package, and no model is good for all problems [8]. A user should not choose a model solely on the basis of its performance on the ImageNet [9] data set. Moreover, a user with limited resources may prefer a simpler, lower-performance model over a computationally demanding one.
II PyTorch Pre-trained Models
Torchvision provides twelve main families of pre-trained models, which we use here to classify images from the fine-grained inter-species data sets: AlexNet [10], VGG [11], ResNet [12], SqueezeNet [13], DenseNet [14], Inception v3 [15], GoogLeNet [16], ShuffleNet v2 [17], MobileNet v2 [18], ResNeXt [19], Wide ResNet [20], and MNASNet [21]. Some families include multiple varieties, such as ResNet and VGG [22, 23, 24]. In this section, we explore these networks and describe what differentiates each from the others; a minimal loading sketch follows.
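As a hedged illustration, the sketch below instantiates a few of the compared architectures with ImageNet weights. It uses the `pretrained=True` keyword of older Torchvision releases; newer releases (0.13 and up) replace it with a `weights=` argument.

```python
import torchvision.models as models

# Instantiate a few of the compared architectures with ImageNet weights.
# (torchvision <= 0.12 style; in >= 0.13 pass e.g. weights="IMAGENET1K_V1".)
alexnet = models.alexnet(pretrained=True)
resnet50 = models.resnet50(pretrained=True)
mobilenet = models.mobilenet_v2(pretrained=True)
```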
II-A AlexNet
Alex Krizhevsky introduced the convolutional neural network (CNN) AlexNet, which competed in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [25] in 2012; the variant distributed in Torchvision follows his 2014 follow-up paper [10]. The challenge requires the classification of a subset of ImageNet with over one million images and a thousand classes. AlexNet performed very well: it was announced the winner for that year with a top-5 error of 15.3%, 10.8 percentage points lower than the runner-up. AlexNet is considered the first CNN to win the competition.
The AlexNet architecture consists of a total of 8 learned layers: 5 convolutional layers and 3 fully-connected layers, with max-pooling after some of the convolutional layers. Several characteristics made it perform very well. One is the use of the ReLU (Rectified Linear Unit) [26] activation function within the convolutional layers in place of the then-popular Tanh; ReLU offered better performance and lower computation time. AlexNet also benefited from training on multiple graphics processors, which allowed the work to be distributed across GPUs, enabling a bigger model and lowering the computation time. Like many models, AlexNet suffered from overfitting, which occurs when a model fits the training data too well and fails to generalize, so that it does not classify new images accurately. To mitigate this, AlexNet used two common remedies: data augmentation and dropout layers [27].
II-B VGG
VGG (Visual Geometry Group) is a very popular CNN introduced in 2014 by Karen Simonyan and Andrew Zisserman from the University of Oxford. Like AlexNet, VGG participated in the ILSVRC challenge, in 2014, and scored very well, placing second in the classification task with a top-5 accuracy of 92.7%. The architecture surpassed AlexNet by using a deeper design and by replacing AlexNet's large convolution kernels with multiple small 3x3 convolution kernels.
II-C ResNet
Residual Neural Network (ResNet) is a powerful network proposed in 2015 by Kaiming He et al. [12]. ResNet made a very big impact on deep learning, as its architectural design enabled much deeper networks with great performance. Previously it was difficult to get good performance from very deep models because of the vanishing gradient problem, which grows as more layers are added to the network: each layer has an activation function, and as the gradient is back-propagated through the earlier layers it repeatedly shrinks towards zero (vanishes), making the weight updates very small, so the model does not converge or train well. ResNet offered a solution: the authors introduced a neural network building block named the 'Residual Block'. These blocks use 'skip connections', shortcuts that jump over some layers, usually two or three, with ReLU activations and batch normalization in between. The main intuition behind this structure is to avoid vanishing gradients by re-using activations from a previous layer until the adjacent layer learns its weights. A minimal sketch of such a block follows.
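The block below is a hedged, minimal rendition of a ResNet-style 'basic block'; the fixed channel count and the absence of a projection shortcut are simplifying assumptions, not the exact Torchvision implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with batch norm, plus an identity skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The skip connection lets the gradient bypass the convolutions.
        return F.relu(out + x)
```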
II-D SqueezeNet
SqueezeNet has three main goals: first, more efficient distributed training; second, smaller exported models that are more convenient for professional and industrial clients; and third, a model better suited for deployment on embedded systems. SqueezeNet has two varieties, SqueezeNet_v1.0 and SqueezeNet_v1.1.
II-E DenseNet
DenseNet was introduced in 2017 by Gao Huang et al. It is a powerful CNN that, like ResNet, uses shortcut connections between layers to address the vanishing gradient problem. DenseNet uses 'Dense Connections' between layers by way of 'Dense Blocks', each holding some number of 'Dense Layers'. Each Dense Layer consists of a 1x1 convolution that reduces the number of channels (a bottleneck) followed by a 3x3 convolution for feature extraction. Within a Dense Block, each layer receives the feature maps of all previous layers, and its output is concatenated onto the input passed to all subsequent layers. This structure allows fewer layers and the re-use of features learned by the network, and a narrower model with fewer parameters is easier to train. A sketch of one dense layer is given below.
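The following is a hedged sketch of a single dense layer with concatenation; the 4x growth-rate bottleneck width follows the DenseNet paper, while the remaining sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """1x1 bottleneck conv to reduce channels, then a 3x3 conv; the layer's
    output is concatenated with its input, as in a DenseNet dense block."""
    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 4 * growth_rate, 1, bias=False),
            nn.BatchNorm2d(4 * growth_rate), nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth_rate, growth_rate, 3, padding=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate (not add) the new features onto everything seen so far.
        return torch.cat([x, self.net(x)], dim=1)
```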
II-F GoogLeNet And Inception v3
Inception [15][16][28] is a family of CNNs developed by Christian Szegedy et al. at Google, with four versions: Inception v1, which is GoogLeNet; Inception v2, a refinement of v1 that introduced batch normalization; Inception v3, a further refinement with additional convolution factorization; and Inception v4.
GoogLeNet (Inception v1) was the first member of the Inception family. The architecture was presented in 2015, after winning 1st place in the ILSVRC 2014 challenge and setting a new state of the art for ImageNet that year. GoogLeNet was proposed to address overfitting and the expensive computation of stacking too many layers. The authors' main idea is to place convolution filters of various sizes (1x1, 3x3, and 5x5) at the same level. These different-sized filters help extract features at various levels of detail, big and small, and adding more filters makes the network wider rather than deeper. The GoogLeNet architecture consists of 'Inception Modules', which carry the essence of the Inception family: they let the model, in effect, decide which filter to use rather than relying on the manual selection of a filter that might not perform well. Different filters are applied in parallel to the same input with appropriate padding, and the resulting feature maps are concatenated into one big feature map that is passed as input to the next Inception module, as in the sketch below.
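A hedged, minimal Inception-style module follows; the branch widths are illustrative, not GoogLeNet's actual configuration.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 convolutions plus a pooling branch; every
    branch keeps the spatial size so the outputs concatenate on channels."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),
                                nn.Conv2d(16, 24, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),
                                nn.Conv2d(16, 24, 5, padding=2))
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, 24, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)
```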
Inception v3 was proposed in 2016, after being the first runner-up in the ILSVRC 2015 challenge. Inception v3 has a structure very similar to GoogLeNet's, but with several differences that let it generally perform better. First, it uses special convolution techniques (factorized convolutions, smaller convolutions, and asymmetric convolutions); second, a modified version of the 'Auxiliary Classifiers'; third, 'Efficient Grid Size Reduction'; and lastly, 'Model Regularization via Label Smoothing'. The factorized convolutions reduce the number of parameters while keeping the network efficient.
II-G ShuffleNet v2
Ningning Ma et al. proposed ShuffleNet v2, a CNN based on the earlier ShuffleNet v1, which aimed to build an efficient and effective network with lower computational needs than other networks. With ShuffleNet v2, the authors designed a network guided by direct metrics, such as speed and memory access cost, rather than by an indirect metric such as FLOPs (Floating Point Operations) when measuring a network's computational complexity. FLOPs are the conventional measure of model cost: they count how many operations are needed to run a single instance of a model.
II-H MobileNet v2
MobileNet is a family of CNNs designed to work effectively and efficiently on mobile systems. MobileNet v2 was introduced in 2018 and builds on MobileNet v1 [29], introduced in 2017. MobileNet v1 is built from two main layer types: a 'Depth-wise Convolution' layer, which applies a single convolution filter per input channel for lightweight filtering, and a 'Point-wise Convolution' layer, a 1x1 convolution that computes linear combinations of the input channels to produce new features; each of the two is followed by a ReLU non-linearity. A sketch of this building block follows.
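Below is a hedged sketch of the depth-wise separable building block just described; the layer ordering follows MobileNet v1, while the channel counts are illustrative.

```python
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int) -> nn.Sequential:
    """Depth-wise 3x3 conv (one filter per input channel via groups=in_ch),
    then a point-wise 1x1 conv, each followed by batch norm and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )
```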
II-I ResNeXt
ResNeXt was proposed in a 2017 paper, a year after achieving 1st runner-up in the ILSVRC challenge. ResNeXt is based on ResNet and differs by adding a 'NeXt' structure: a 'Cardinality' dimension on top of the width and depth of the ResNet architecture, referring to the size of the set of transformations. Instead of the single transformation of a regular residual block, a 'ResNeXt Block' applies a set of parallel transformations to its input, the number of transformations being the cardinality, and aggregates the results. In practice this aggregation reduces to a grouped convolution, as sketched below.
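As a hedged illustration, in PyTorch the aggregated transformations of a ResNeXt block can be expressed with the `groups` argument of `nn.Conv2d`; the channel counts here are arbitrary.

```python
import torch.nn as nn

# `groups` plays the role of the cardinality C: the 3x3 convolution is split
# into C parallel transformations whose outputs are aggregated.
cardinality = 32
grouped_conv = nn.Conv2d(128, 128, kernel_size=3, padding=1,
                         groups=cardinality, bias=False)
```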
II-J Wide ResNet
Wide ResNet (WRN) is a CNN proposed in 2016 by Sergey Zagoruyko and Nikos Komodakis. It builds on the residual blocks of the ResNet architecture but aims to be shallower, with fewer layers and less training time. Wide ResNet addresses a limitation of ResNet: performance improves with a large number of layers, but adding more layers causes diminishing feature reuse, which makes the network slow to train. The authors of WRN proposed an architecture that is wider (a wider ResNet block) and lower in depth, with dropout layers between the layers of the residual blocks. Wider and lower in depth means increasing the number of channel dimensions while decreasing the number of layers.
II-K MNASNet
MNASNet is a novel architecture proposed in 2018 by Mingxing Tan et al. The authors aimed to introduce a network that performs well on mobile devices, which require models that are small in size yet effective and efficient. Their paper proposed three main ideas: first, a multi-objective, reinforcement-learning-based architecture search that automatically finds a suitable high-performance CNN; second, a novel 'Factorized Hierarchical Search Space' used as blocks between layers for more efficient use of computational resources; and lastly, a demonstration of the model's effectiveness relative to other mobile architectures such as MobileNet, ShuffleNet, and SqueezeNet.
Dataset | Training Images | Validation Images | Test Images | Image Size |
---|---|---|---|---|
10 Monkey Species | 1370 | 272 | - | 400×300 |
Fruits 360 | 67692 | 22688 | - | 100×100 |
225 Bird Species | 31316 | 1125 | 1125 | 224×224 |
Oxford 102 Flower | 1020 | 1020 | 6149 | Varying |
Model | No. of Layers | No. of Parameters | Year of Proposal | Top-1 Error on ImageNet |
---|---|---|---|---|
AlexNet | 8 | 61M | 2014 | 43.45% |
VGG-11 | 11 | 132M | 2014 | 30.98% |
VGG-13 | 13 | 133M | 2014 | 30.07% |
VGG-16 | 16 | 138M | 2014 | 28.41% |
VGG-19 | 19 | 143M | 2014 | 27.62% |
VGG-11 BN | 11 | 132M | 2014 | 29.62% |
VGG-13 BN | 13 | 133M | 2014 | 28.45% |
VGG-16 BN | 16 | 138M | 2014 | 26.63% |
VGG-19 BN | 19 | 143M | 2014 | 25.76% |
ResNet-18 | 18 | 11M | 2015 | 30.24% |
ResNet-34 | 34 | 21M | 2015 | 26.70% |
ResNet-50 | 50 | 25M | 2015 | 23.85% |
ResNet-101 | 101 | 44M | 2015 | 22.63% |
ResNet-152 | 152 | 50M | 2015 | 21.69% |
SqueezeNet 1.0 | - | 1M | 2016 | 41.90% |
SqueezeNet 1.1 | - | 1M | 2016 | 41.81% |
DenseNet-121 | 121 | 7M | 2016 | 25.35% |
DenseNet-169 | 169 | 14M | 2016 | 24.00% |
DenseNet-161 | 161 | 28M | 2016 | 22.35% |
DenseNet-201 | 201 | 20M | 2016 | 22.80% |
Inception v3 | 48 | 27M | 2015 | 22.55% |
GoogLeNet | 22 | 6M | 2014 | 30.22% |
ShuffleNet v2 | - | 2M | 2018 | 30.64% |
MobileNet v2 | 53 | 4M | 2018 | 28.12% |
ResNeXt-50 | 50 | 25M | 2016 | 22.38% |
ResNeXt-101 | 101 | 88M | 2016 | 20.69% |
Wide ResNet-50-2 | 50 | 68M | 2016 | 21.49% |
Wide ResNet-101-2 | 101 | 126M | 2016 | 21.16% |
MNASNet 1.0 | - | 4M | 2018 | 26.49% |
III Methodology
This study was performed in several systematic steps that generalize over the different data sets and models to ensure consistent results.
III-A Phase 1. Data Collection and Preprocessing
There are four different sets of fine-grained inter-species data. The first, named 10 Monkey Species [30], contains 1370 training and 272 validation images of size 400x300 distributed among 10 different species of monkeys: Alouatta palliata, Erythrocebus patas, Cacajao calvus, Macaca fuscata, Cebuella pygmea, Cebus capucinus, Mico argentatus, Saimiri sciureus, Aotus nigriceps, and Trachypithecus johnii. The images were collected with the help of search-engine web scrapers. 10 Monkey Species is available on the popular data platform Kaggle.

The second set is named Fruits 360 [31]. It contains images of different fruits and vegetables and is much bigger than 10 Monkey Species, with more than 90,000 images of size 100x100 spread over 103 classes, ranging from watermelons, apples, and oranges to potatoes, onions, and much more.

225 Bird Species [32] is the third data set. It includes more than 30,000 training, validation, and test images of size 224x224, distributed among 225 classes of birds.

Oxford 102 [33] is the fourth and last data set; it consists of 102 classes of different flowers. The data set was created by the Visual Geometry Group at Oxford, and the flowers it contains are species commonly found in the United Kingdom. Each class holds between 40 and 258 images, collected using web scrapers as well as by manually photographing flowers.

One way to increase the performance of models is to provide more training examples [34, 35]. This can be accomplished with data augmentation: random examples are taken from the data set and different image processing methods are applied to them, such as rotation, horizontal mirroring, and magnification, adding many examples to the training set. The main reason for using image augmentation is its proven effect on performance [36, 37]: a set with more images confronts the model with a larger number of samples, which can greatly improve performance, especially on low-volume data sets. A plausible pipeline of this kind is sketched below.
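The following is a hedged sketch of such an augmentation pipeline in Torchvision; the exact parameters used in this study are not reproduced here, and the values shown are illustrative.

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomRotation(15),                        # rotation
    T.RandomHorizontalFlip(),                    # horizontal mirroring
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),  # crop-and-zoom ("magnification")
    T.ToTensor(),
    # Standard ImageNet normalization expected by Torchvision's pre-trained models.
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```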
III-B Phase 2. Models Training
In this phase, we train on each data set independently with 30 different pre-trained models available in the Torchvision package: AlexNet, VGG (11, 13, 16, 19, 11-BN, 13-BN, 16-BN, 19-BN), ResNet (18, 34, 50, 101, 152), SqueezeNet (1.0, 1.1), DenseNet (121, 169, 161, 201), Inception v3, GoogLeNet, ShuffleNet v2 (0.5, 1.0), MobileNet v2, ResNeXt (50, 101), Wide ResNet (50, 101), and MNASNet (1.0). For the training and development environment, we used Kaggle Kernels, which are in essence Jupyter [38] Notebooks, as they offer easy importing of the required data sets. A minimal fine-tuning sketch follows.
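As a hedged illustration of the transfer-learning setup (the choice of ResNet-50 and the frozen backbone are assumptions made for the sketch; the attribute holding the final layer differs per architecture, e.g. `fc` for ResNet but `classifier` for VGG and AlexNet):

```python
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # e.g., 10 Monkey Species

model = models.resnet50(pretrained=True)
for p in model.parameters():
    p.requires_grad = False  # keep the pre-trained backbone fixed

# Replace the final fully-connected layer with a fresh head for our classes.
model.fc = nn.Linear(model.fc.in_features, num_classes)
```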
III-C Phase 3. Investigating Results
This phase includes extracting the results and performance of each model on each data set and assembling them into tables, in order to understand and compare how each model performs on fine-grained inter-species data. We investigate the results to find the best possible configurations for these data with Torchvision pre-trained models, and to suggest information that may lead to the implementation of more performant models.
III-D Phase 4. Investigating SpinalNet
This last phase is a derivative of the third. Here we run each data set through the pre-trained models with one small configuration change: the last fully-connected layer of each model is replaced with a SpinalNet [39] layer. SpinalNet is a novel network inspired by the human somatosensory system [40]. The network consists of an input row, an intermediate row, and an output row, where the intermediate row contains a small number of neurons; through input segmentation, each hidden layer receives a part of the input along with the outputs of the previous layer. A hedged sketch of such a head is given below.
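The following is a minimal sketch of a Spinal fully-connected head in the spirit of [39]; the split into halves and the alternation between them follow the SpinalNet idea, while the widths and the number of hidden layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpinalFC(nn.Module):
    """Input features are split in half; each narrow hidden layer sees one
    half of the input plus the previous layer's output, and the outputs of
    all hidden layers are concatenated before the classifier."""
    def __init__(self, in_features: int = 512, layer_width: int = 128,
                 num_classes: int = 10):
        super().__init__()
        self.half = in_features // 2
        self.fc1 = nn.Linear(self.half, layer_width)
        self.fc2 = nn.Linear(self.half + layer_width, layer_width)
        self.fc3 = nn.Linear(self.half + layer_width, layer_width)
        self.fc4 = nn.Linear(self.half + layer_width, layer_width)
        self.out = nn.Linear(4 * layer_width, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x[:, :self.half], x[:, self.half:]
        h1 = torch.relu(self.fc1(a))
        h2 = torch.relu(self.fc2(torch.cat([b, h1], dim=1)))
        h3 = torch.relu(self.fc3(torch.cat([a, h2], dim=1)))
        h4 = torch.relu(self.fc4(torch.cat([b, h3], dim=1)))
        return self.out(torch.cat([h1, h2, h3, h4], dim=1))
```

Combined with the fine-tuning sketch above, `model.fc = SpinalFC(model.fc.in_features, 128, num_classes)` would replace the default head.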
IV Results
Exploring the results on the first data set, 10 Monkey Species, the pre-trained models delivered very good performance given the low data volume. The training hyperparameters were a batch size of 40 images (20 for DenseNet-169, DenseNet-161, DenseNet-201, and ResNeXt-101, due to memory limitations), 10 epochs, the SGD optimizer, and a learning rate that varies with the model at hand. Most models reached above 90% accuracy, while others fell short: SqueezeNet 1.0 ended with an accuracy of 65.80% (Normal FC) and 69.11% (Spinal FC), although SqueezeNet 1.1 did better, at about 84.55% (Normal FC) and 82.35% (Spinal FC). Notably, the models that performed poorly, such as SqueezeNet and MNASNet, performed much better when trained with the Adam [41] optimizer instead of SGD, for both Normal and Spinal FC; however, we report the SGD results to stay consistent with the other models. At the end of this section (Figures 9 and 10), we illustrate accuracy and loss over the epochs for the Wide ResNet-101-2 model, selected at random, with both the Normal and the Spinal FC layer.
Model | Batch Size | Accuracy (Normal FC) | Accuracy (Spinal FC) |
---|---|---|---|
AlexNet | 40 | 90.80% | 90.44% |
VGG-11 | 40 | 96.69% | 96.69% |
VGG-13 | 40 | 97.43% | 97.43% |
VGG-16 | 40 | 96.69% | 97.43% |
VGG-19 | 40 | 97.43% | 97.79% |
VGG-11 BN | 40 | 95.59% | 97.43% |
VGG-13 BN | 40 | 98.16% | 98.53% |
VGG-16 BN | 40 | 98.53% | 98.53% |
VGG-19 BN | 40 | 98.90% | 98.90% |
ResNet-18 | 40 | 94.85% | 98.16% |
ResNet-34 | 40 | 97.05% | 98.89% |
ResNet-50 | 40 | 97.05% | 98.52% |
ResNet-101 | 40 | 96.32% | 99.26% |
ResNet-152 | 40 | 96.32% | 98.16% |
SqueezeNet 1.0 | 40 | 65.80% | 69.11% |
SqueezeNet 1.1 | 40 | 84.55% | 82.35% |
DenseNet-121 | 40 | 95.95% | 95.95% |
DenseNet-169 | 20 | 95.58% | 97.79% |
DenseNet-161 | 20 | 93.75% | 92.64% |
DenseNet-201 | 20 | 96.69% | 97.05% |
Inception v3 | 40 | 97.79% | 99.26% |
GoogLeNet | 40 | 97.05% | 98.16% |
ShuffleNet v2 0.5 | 40 | 91.91% | 93.01% |
ShuffleNet v2 1.0 | 40 | 96.32% | 95.95% |
MobileNet v2 | 40 | 94.85% | 94.11% |
ResNeXt-50 | 40 | 98.16% | 98.89% |
ResNeXt-101 | 20 | 97.79% | 98.52% |
Wide ResNet-50-2 | 40 | 98.16% | 98.53% |
Wide ResNet-101-2 | 40 | 98.53% | 99.26% |
MNASNet 1.0 | 40 | 79.05% | 93.01% |
On Fruits 360, training used a batch size of 224 (112 for DenseNet-169, DenseNet-161, DenseNet-201, and ResNeXt-101, due to memory limitations), the default 10 epochs, the SGD optimizer, and learning rates varying by model. Almost all models performed very well, reaching accuracies above 99% within only 10 epochs, largely because this data set has a high number of examples. The only model that performed poorly was MNASNet, with an accuracy of 58.51% (Normal FC) and 61.17% (Spinal FC). As with the previous data set, MNASNet could match the other models when using the Adam optimizer, but for consistency we report the SGD results. It is worth noting that the best performance came from the DenseNet-169 model with the SpinalNet fully-connected layer, at an accuracy of 100.0% (matched by Wide ResNet-101-2 with Spinal FC).
Model | Batch Size | Accuracy (Normal FC) | Accuracy (Spinal FC) |
---|---|---|---|
AlexNet | 224 | 96.64% | 96.09% |
VGG-11 | 224 | 99.78% | 99.82% |
VGG-13 | 224 | 99.85% | 99.90% |
VGG-16 | 224 | 99.84% | 99.90% |
VGG-19 | 224 | 99.86% | 99.90% |
VGG-11 BN | 224 | 99.78% | 99.85% |
VGG-13 BN | 224 | 99.79% | 99.88% |
VGG-16 BN | 224 | 99.88% | 99.93% |
VGG-19 BN | 224 | 99.90% | 99.96% |
ResNet-18 | 224 | 99.81% | 99.66% |
ResNet-34 | 224 | 99.84% | 99.71% |
ResNet-50 | 224 | 99.93% | 99.78% |
ResNet-101 | 224 | 99.97% | 99.87% |
ResNet-152 | 224 | 99.98% | 99.81% |
SqueezeNet 1.0 | 224 | 99.02% | 98.31% |
SqueezeNet 1.1 | 224 | 99.25% | 98.77% |
DenseNet-121 | 224 | 99.68% | 99.75% |
DenseNet-169 | 112 | 99.98% | 100% |
DenseNet-161 | 112 | 99.83% | 99.93% |
DenseNet-201 | 112 | 99.78% | 99.94% |
Inception v3 | 224 | 99.92% | 99.60% |
GoogLeNet | 224 | 99.87% | 98.94% |
ShuffleNet v2 0.5 | 224 | 98.39% | 98.68% |
ShuffleNet v2 1.0 | 224 | 99.36% | 99.78% |
MobileNet v2 | 224 | 99.51% | 99.91% |
ResNeXt-50 | 224 | 99.86% | 99.89% |
ResNeXt-101 | 112 | 99.98% | 99.90% |
Wide ResNet-50-2 | 224 | 99.96% | 99.96% |
Wide ResNet-101-2 | 224 | 99.96% | 100% |
MNASNet 1.0 | 224 | 58.51% | 61.17% |
225 Bird Species produced very good performance, similar to Fruits 360, with hyperparameters of a batch size of 64 (32 for DenseNet-169, DenseNet-161, DenseNet-201, and ResNeXt-101, due to memory limitations), 10 epochs, the SGD optimizer, and learning rates varying by model. Most models achieved an accuracy above 90%, such as VGG-16 at 99.11% (Normal FC) and 99.20% (Spinal FC), as well as the best performer, Wide ResNet-101-2, at 99.38% (Normal FC) and 99.56% (Spinal FC). The lowest performance came from SqueezeNet 1.0 and SqueezeNet 1.1, with accuracies of 82.95% (Normal FC) and 84.61% (Spinal FC), and 88.08% (Normal FC) and 84.17% (Spinal FC), respectively.
Model | Batch Size | Accuracy (Normal FC) | Accuracy (Spinal FC) |
---|---|---|---|
AlexNet | 64 | 93.33% | 93.24% |
VGG-11 | 64 | 98.87% | 99.02% |
VGG-13 | 64 | 99.06% | 99.11% |
VGG-16 | 64 | 99.11% | 99.20% |
VGG-19 | 64 | 99.16% | 99.20% |
VGG-11 BN | 64 | 99.02% | 99.02% |
VGG-13 BN | 64 | 99.11% | 99.19% |
VGG-16 BN | 64 | 99.08% | 99.29% |
VGG-19 BN | 64 | 99.02% | 99.11% |
ResNet-18 | 64 | 98.48% | 98.48% |
ResNet-34 | 64 | 98.13% | 98.66% |
ResNet-50 | 64 | 98.66% | 98.66% |
ResNet-101 | 64 | 98.75% | 98.93% |
ResNet-152 | 64 | 98.75% | 99.02% |
SqueezeNet 1.0 | 64 | 82.95% | 84.61% |
SqueezeNet 1.1 | 64 | 88.08% | 84.17% |
DenseNet-121 | 64 | 98.75% | 99.11% |
DenseNet-169 | 32 | 98.57% | 98.93% |
DenseNet-161 | 32 | 99.02% | 98.93% |
DenseNet-201 | 32 | 98.75% | 98.84% |
Inception v3 | 64 | 97.95% | 98.57% |
GoogLeNet | 64 | 98.31% | 98.04% |
ShuffleNet v2 0.5 | 64 | 83.82% | 84.48% |
ShuffleNet v2 1.0 | 64 | 94.43% | 93.30% |
MobileNet v2 | 64 | 98.48% | 98.75% |
ResNeXt-50 | 64 | 98.93% | 98.40% |
ResNeXt-101 | 32 | 98.75% | 98.84% |
Wide ResNet-50-2 | 64 | 99.29% | 99.38% |
Wide ResNet-101-2 | 64 | 99.38% | 99.56% |
MNASNet 1.0 | 64 | 96.95% | 96.34% |
The last data set in this study is the Oxford 102 Flower data set. Most models achieved an accuracy above 90%, with some variation among them. Models such as MobileNet v2 and AlexNet achieved 91.02% and 92.78%, respectively, while DenseNet-201 performed best with 98.29% for Normal FC and 98.36% for Spinal FC, along with Wide ResNet-101-2, which achieved 98.29% for both Normal and Spinal FC. Notably, the Spinal FC provided, on average, better performance than the Normal FC.
Model | Batch Size | Accuracy (Normal FC) | Accuracy (Spinal FC) |
---|---|---|---|
AlexNet | 64 | 92.78% | 92.66% |
VGG-11 | 64 | 94.56% | 94.67% |
VGG-13 | 64 | 94.56% | 95.01% |
VGG-16 | 64 | 94.67% | 95.01% |
VGG-19 | 64 | 95.32% | 95.17% |
VGG-11 BN | 64 | 94.62% | 94.67% |
VGG-13 BN | 64 | 94.76% | 95.01% |
VGG-16 BN | 64 | 94.97% | 95.17% |
VGG-19 BN | 64 | 95.11% | 95.38% |
ResNet-18 | 64 | 95.83% | 96.14% |
ResNet-34 | 64 | 96.72% | 96.72% |
ResNet-50 | 64 | 97.43% | 97.55% |
ResNet-101 | 64 | 97.92% | 97.92% |
ResNet-152 | 64 | 97.92% | 98.04% |
SqueezeNet 1.0 | 64 | 95.01% | 95.14% |
SqueezeNet 1.1 | 64 | 95.26% | 95.20% |
DenseNet-121 | 64 | 97.58% | 97.64% |
DenseNet-169 | 32 | 98.00% | 98.04% |
DenseNet-161 | 32 | 98.04% | 98.17% |
DenseNet-201 | 32 | 98.29% | 98.36% |
Inception v3 | 64 | 98.17% | 98.19% |
GoogLeNet | 64 | 95.93% | 96.01% |
ShuffleNet v2 0.5 | 64 | 92.78% | 92.66% |
ShuffleNet v2 1.0 | 64 | 93.22% | 93.45% |
MobileNet v2 | 64 | 91.02% | 90.94% |
ResNeXt-50 | 64 | 97.80% | 97.80% |
ResNeXt-101 | 32 | 98.00% | 98.04% |
Wide ResNet-50-2 | 64 | 98.04% | 98.17% |
Wide ResNet-101-2 | 64 | 98.29% | 98.29% |
MNASNet 1.0 | 64 | 95.67% | 95.59% |
SpinalNet is used to explore the performance of this novel network against the default head of each model. All models were trained identically, except that with SpinalNet we replace the default fully-connected layer with the Spinal fully-connected layer. SpinalNet performed very well, improving the accuracy of most models over the default FC layer. For example, on the 10 Monkey Species data set, MNASNet 1.0 reached 79.05% accuracy, while the same model with the Spinal fully-connected layer achieved 93.01%; on the Oxford 102 Flower data set, VGG-13 reached 94.56%, while VGG-13 with the Spinal fully-connected layer achieved 95.01%. In a number of cases the default FC layer did perform better. Overall, SpinalNet shows promising results for image classification and can provide very good performance if integrated well with pre-trained models.
At first, we aimed to train all models with one constant set of hyperparameters. However, several models performed much better when one specific parameter, the learning rate, was adjusted away from the default value. AlexNet, SqueezeNet, and others performed only moderately when trained with the default learning rate. We found that using Adam instead of SGD drastically improved their performance; however, that would contradict our goal of using a single optimizer, since mixing optimizers across models could create inconsistency. After several attempts, we found that adjusting the learning rate while keeping SGD gave similar, and in some cases better, performance than switching to Adam. This learning-rate adjustment carries over to all data sets; a sketch of the optimizer setup follows.
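Continuing the earlier fine-tuning sketch, the per-model adjustment described above might look as follows; the learning-rate values are placeholders, not the study's actual settings.

```python
import torch

# Only the (trainable) head parameters are passed to the optimizer.
params = [p for p in model.parameters() if p.requires_grad]

# Single optimizer family (SGD) with a per-model learning rate.
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)

# Fallback tried for the few models that trained poorly with SGD:
# optimizer = torch.optim.Adam(params, lr=1e-4)  # Adam [41]
```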
V Conclusions
This study investigated the performance of the various pre-trained models offered in the Torchvision package of PyTorch on fine-grained inter-species data sets. Each model has a unique architecture designed to perform well on visual data. We systematically fine-tuned each model with a configured set of hyperparameters to keep results consistent across models. On data sets with a high volume of examples, such as Fruits 360, most models reached 99% accuracy and above; on data sets with a small number of examples, such as 10 Monkey Species, most models still achieved very good results. We also explored the novel SpinalNet architecture to test whether it improves pre-trained model performance. With many models SpinalNet performed very well, in some cases reaching 100% accuracy in only 10 epochs on the Fruits 360 data set, and with others it still improved the accuracy; however, with many models the normal fully-connected layer performed similarly or better. SpinalNet shows very promising results for improving the performance of pre-trained models. No specific models consistently performed best with the novel network; the benefit may depend on the number of examples in the data set, as SpinalNet provided better accuracies on data sets with a large volume of examples.
Acknowledgment
I (Feras) would like to thank my brother Faris for his continuous support and encouragement.
References
- [1] M. R. C. Qazani, H. Asadi, S. Mohamed, and S. Nahavandi, “Prepositioning of a land vehicle simulation-based motion platform using fuzzy logic and neural network,” IEEE Transactions on Vehicular Technology, vol. 69, no. 10, pp. 10 446–10 456, 2020.
- [2] M. Hossain, S. Mekhilef, M. Danesh, L. Olatomiwa, and S. Shamshirband, “Application of extreme learning machine for short term output power forecasting of three grid-connected pv systems,” journal of Cleaner Production, vol. 167, pp. 395–405, 2017.
- [3] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in neural information processing systems, 2019, pp. 8026–8037.
- [4] H. D. Kabir, A. Khosravi, M. A. Hosen, and S. Nahavandi, “Neural network-based uncertainty quantification: A survey of methodologies and applications,” IEEE access, vol. 6, pp. 36 218–36 234, 2018.
- [5] M. R. C. Qazani, S. Pedrammehr, H. Abdi, and S. Nahavandi, “Performance evaluation and calibration of gantry-tau parallel mechanism,” Iranian Journal of Science and Technology, Transactions of Mechanical Engineering, vol. 44, no. 4, pp. 1013–1027, 2020.
- [6] H. D. Kabir, A. Khosravi, S. K. Mondal, M. Rahman, S. Nahavandi, and R. Buyya, “Uncertainty-aware decisions in cloud computing: Foundations and future directions,” ACM Computing Surveys (CSUR), vol. 54, no. 4, pp. 1–30, 2021.
- [7] M. R. C. Qazani, H. Asadi, and S. Nahavandi, “A model predictive control-based motion cueing algorithm with consideration of joints’ limitations for hexapod motion platform,” in 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC). IEEE, 2019, pp. 708–713.
- [8] H. Asadi, A. Mohammadi, S. Mohamed, M. R. C. Qazani, C. P. Lim, A. Khosravi, and S. Nahavandi, “A model predictive control-based motion cueing algorithm using an optimized nonlinear scaling for driving simulators,” in 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC). IEEE, 2019, pp. 1245–1250.
- [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
- [10] A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,” arXiv preprint arXiv:1404.5997, 2014.
- [11] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [12] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [13] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv preprint arXiv:1602.07360, 2016.
- [14] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
- [15] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
- [16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
- [17] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for efficient cnn architecture design,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 116–131.
- [18] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.
- [19] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.
- [20] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
- [21] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for mobile,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2820–2828.
- [22] H. D. Kabir, A. Khosravi, A. Kavousi-Fard, S. Nahavandi, and D. Srinivasan, “Optimal uncertainty-guided neural network training,” Applied Soft Computing, vol. 99, p. 106878, 2021.
- [23] M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu, M. Ghavamzadeh, P. Fieguth, X. Cao, A. Khosravi, U. R. Acharya et al., “A review of uncertainty quantification in deep learning: Techniques, applications and challenges,” Information Fusion, 2021.
- [24] H. D. Kabir, A. Khosravi, S. Nahavandi, and A. Kavousi-Fard, “Partial adversarial training for neural network-based uncertainty quantification,” IEEE Transactions on Emerging Topics in Computational Intelligence, 2019.
- [25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
- [26] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in ICML, 2010.
- [27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
- [28] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” arXiv preprint arXiv:1602.07261, 2016.
- [29] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
- [30] Mario, “10 monkey species,” Kaggle, 2018. [Online]. Available: https://www.kaggle.com/slothkong/10-monkey-species
- [31] H. Muresan and M. Oltean, “Fruit recognition from images using deep learning,” Acta Universitatis Sapientiae, Informatica, vol. 10, no. 1, pp. 26–42, 2018. [Online]. Available: https://www.kaggle.com/moltean/fruits
- [32] Gerry, “225 bird species,” Kaggle, 2020. [Online]. Available: https://www.kaggle.com/gpiosenka/100-bird-species
- [33] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
- [34] M. R. C. Qazani, H. Asadi, and S. Nahavandi, “A new gantry-tau-based mechanism using spherical wrist and model predictive control-based motion cueing algorithm,” Robotica, vol. 38, no. 8, pp. 1359–1380, 2020.
- [35] S. Pedrammehr, M. R. C. Qazani, and S. Nahavandi, “A novel axis symmetric parallel mechanism with coaxial actuated arms,” in 2018 4th International Conference on Control, Automation and Robotics (ICCAR). IEEE, 2018, pp. 476–480.
- [36] M. R. C. Qazani, H. Asadi, and S. Nahavandi, “High-fidelity hexarot simulation-based motion platform using fuzzy incremental controller and model predictive control-based motion cueing algorithm,” IEEE Systems Journal, vol. 14, no. 4, pp. 5073–5083, 2019.
- [37] M. R. C. Qazani, H. Asadi, S. Khoo, and S. Nahavandi, “A linear time-varying model predictive control-based motion cueing algorithm for hexapod simulation-based motion platform,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2019.
- [38] T. Kluyver, B. Ragan-Kelley, F. Perez, B. Granger, M. Bussonnier, J. Frederic, K. Kelley, J. Hamrick, J. Grout, S. Corlay, P. Ivanov, D. Avila, S. Abdalla, C. Willing, and the Jupyter development team, "Jupyter notebooks - a publishing format for reproducible computational workflows," in Positioning and Power in Academic Publishing: Players, Agents and Agendas, F. Loizides and B. Schmidt, Eds. IOS Press, 2016, pp. 87–90. [Online]. Available: https://eprints.soton.ac.uk/403913/
- [39] H. Kabir, M. Abdar, S. M. J. Jalali, A. Khosravi, A. F. Atiya, S. Nahavandi, and D. Srinivasan, “Spinalnet: Deep neural network with gradual input,” arXiv preprint arXiv:2007.03347, 2020.
- [40] Z. Abbas, H. Tayara, and K. T. Chong, "Spinenet-6ma: A novel deep learning tool for predicting dna n6-methyladenine sites in genomes," IEEE Access, vol. 8, pp. 201 450–201 457, 2020.
- [41] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.