\addauthor

Siddhant Gargsiddhantgarg@umass.edu1 \addauthorDebi Prasanna Mohantydebi.m@samsung.com2 \addauthorSiva Prasad Thotasiva.prasad@samsung.com2 \addauthorSukumar Moharanamsukumar@samsung.com2 \addinstitution University of Massachusetts
Amherst, USA \addinstitution On-Device AI
Samsung Research,
Bengaluru, India Self-Attention MobileNet for Image Tilt Correction

A Simple Approach to Image Tilt Correction with Self-Attention MobileNet for Smartphones

Abstract

Main contributions of our work are two-fold. First, we present a Self-Attention MobileNet, called SA-MobileNet Network that can model long-range dependencies between the image features instead of processing the local region as done by standard convolutional kernels. SA-MobileNet contains self-attention modules integrated with the inverted bottleneck blocks of the MobileNetV3 model which results in modeling of both channel-wise attention and spatial attention of the image features and at the same time introduce a novel self-attention architecture for low-resource devices. Secondly, we propose a novel training pipeline for the task of image tilt detection. We treat this problem in a multi-label scenario where we predict multiple angles for a tilted input image in a narrow interval of range $1\degree$ or $2\degree$ , depending on the dataset used. With the combination of our novel approach and the architecture, we present state-of-the-art results on detecting the image tilt angle on mobile devices as compared to the MobileNetV3 [Howard et al.(2019)Howard, Sandler, Chu, Chen, Chen, Tan, Wang, Zhu, Pang, Vasudevan, et al.] model. Finally, we establish that SA-MobileNet is more accurate than MobileNetV3 on SUN397 [Xiao et al.(2010)Xiao, Hays, Ehinger, Oliva, and Torralba], NYU-V1 [Silberman and Fergus(2011)] and ADE20K [Zhou et al.(2017)Zhou, Zhao, Puig, Fidler, Barriuso, and Torralba] datasets by 6.42%, 10.51%, and 9.09% points respectively and faster by at least 4 milliseconds on Snapdragon 750 Octa core.

Refer to caption — Figure 1: Results on Image Tilt Correction in Real-Time. Proposed model is able to detect large as well as finer tilt angles. The predicted tilt angles are shown in a red box at bottom-left area of the images. Tilt angle value inside the box implies that image is tilted by that value in anti-clockwise direction from the upright orientation.

\bmvaHangBox	\bmvaHangBox	\bmvaHangBox
(a) Image is rotated $2\degree$ anti-	(b) Image is rotated $8\degree$ clock-	(c) Image is rotated $2\degree$ anti-
clockwise to align verti-	wise to make it upright.	clockwise to make the
cal lines on the truck		horizon horizontal.

1 Introduction

Smartphones have become the most convenient way to capture high-quality photos and videos. Mobile phone cameras have evolved over the past years with both hardware as well as software improvements with AI-enabled technologies that allow the users to take extremely high-resolution images. But many of us are not professional photographers and usually take images that are slightly skewed from the exact upright orientation. This reduces the aesthetic quality of images and people want that their holiday snapshots are of the highest quality. Professional photographers use softwares like Lightroom or Photoshop to straighten the tilted vertical or horizontal lines.

We present an On-Device AI solution for automatic tilt angle detection of smartphone images and correct their orientations with a click to improve the overall picture quality. The proposed model is able to make inferences using mobile CPUs or GPUs with low latency values and at the same time respect the privacy of the user by removing the need to upload images to a server for processing. Currently, MobileNetV3 [Howard et al.(2019)Howard, Sandler, Chu, Chen, Chen, Tan, Wang, Zhu, Pang, Vasudevan, et al.] networks are the most popular lightweight models for mobile devices for many computer vision tasks like image classification or object detection.

In this paper, we are proposing Spatial Self-Attention Modules that can learn long-range dependencies and global context within the input images. Furthermore, to enable on-device inference on resource-limited devices, we integrated these modules with the Inverted Bottleneck blocks [Sandler et al.(2018)Sandler, Howard, Zhu, Zhmoginov, and Chen] of MobileNetV3 to give us a novel neural network architecture for mobile devices called Self-Attention MobileNet or SA-MobileNet. The proposed network is able to learn the spatial information and overcomes the limitations of traditional convolutional kernels that only looks for different features in an image and not their relative positioning.

We are also proposing a simple yet effective training approach, described in Section 3.2, to handle the image tilt detection problem that scales to variety of image datasets containing natural, or indoor/outdoor images. The combination of the Self-Attention MobileNet and proposed training approach gives us state-of-the-art results for detecting image tilt for mobile devices in real-time.

Therefore, our two main contributions are:

•

Self-Attention MobileNet for mobile/IoT devices for real-time inference.
•

A novel approach to tackle fine-grained Image Tilt Detection problem.

2 Related Works

Image Tilt Angle Detection is a long-standing problem. Before the advent of deep learning, low-level image features were used to detect the upright image orientation like Ciocca et al\bmvaOneDot [Ciocca et al.(2015)Ciocca, Cusano, and Schettini] who used LPB-based image features and logistic regression for this problem. But when the deep learning models for computer vision became popular, researchers started using high-level image features [Zhai et al.(2016)Zhai, Workman, and Jacobs, Workman et al.(2016)Workman, Zhai, and Jacobs, Hold-Geoffroy et al.(2018)Hold-Geoffroy, Sunkavalli, Eisenmann, Fisher, Gambaretto, Hadap, and Lalonde]. Fischer et al\bmvaOneDot [Fischer et al.(2015)Fischer, Dosovitskiy, and Brox] used AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] to regress the exact orientation angle of tilted images but there angle error was high( $\approx 20\degree$ ) for complete $360\degree$ range of image orientations. Applying CNNs for coarse-angle estimation and fuzzy logic for precise angle estimation on edge pixels [Reshmalakshmi and Sasikumar(2017), Prince et al.(2019)Prince, Alsuhibany, and Siddiqi] was recently employed to take into account the ambiguity and uncertainty in image orientations. Horizon lines and vanishing points are also used as cues for optimal image tilt detection but these methods are generally limited to outdoor images with a clear horizon line [Workman et al.(2016)Workman, Zhai, and Jacobs, Fefilatyev et al.(2006)Fefilatyev, Smarodzinava, Hall, and Goldgof] whereas our work addresses natural images in diverse environments.

Digital camera parameters from accelerometer data are also used to rectify the image orientations. Do et al\bmvaOneDot[Do et al.(2020)Do, Vuong, Roumeliotis, and Park] proposed spatial rectifier with ResNet-18 [He et al.(2016)He, Zhang, Ren, and Sun] backbone network for surface normal estimation of indoor images. G Olmschenk et al\bmvaOneDot [Olmschenk et al.(2017)Olmschenk, Tang, and Zhu] proposed an InceptionNet [Szegedy et al.(2016)Szegedy, Vanhoucke, Ioffe, Shlens, and Wojna] style architecture for estimating the pitch and roll of the camera from a single 2D image. Xian et al\bmvaOneDot[Xian et al.(2019)Xian, Li, Fisher, Eisenmann, Shechtman, and Snavely] used surface geometry to determine surface normals for estimating 2DoF [Son and Lee(2011)] camera orientations. But all of the above methods use neural networks that need high computational resources and it is not possible to deploy them in mobile/IoT devices.

Self-Attention has also gained a lot of popularity in recent years. It has quickly become state-of-the-art baselines for many NLP tasks [Waswani et al.(2017)Waswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin, Parikh et al.(2016)Parikh, Täckström, Das, and Uszkoreit, Cheng et al.(2016)Cheng, Dong, and Lapata] and after that it has also became popular in many computer vision tasks like Image Captioning [Xu et al.(2015)Xu, Ba, Kiros, Cho, Courville, Salakhudinov, Zemel, and Bengio], Image Question Answering [Yang et al.(2016)Yang, He, Gao, Deng, and Smola], and Object Detection [Carion et al.(2020)Carion, Massa, Synnaeve, Usunier, Kirillov, and Zagoruyko]. Self-Attention modules can model long-term dependencies and overcome the limitation of convolutional kernels which operate in a local neighbourhood [Zhang et al.(2019)Zhang, Goodfellow, Metaxas, and Odena]. BAM [Park et al.(2018)Park, Woo, Lee, and Kweon], CBAM [Woo et al.(2018)Woo, Park, Lee, and Kweon], and ULSAM [Saini et al.(2020)Saini, Jha, Das, Mittal, and Mohan] are light-weight attention modules for spatial attention but they use pooling operations that result in loss of information of the feature maps.

3 Method

We now present the architecture of our proposed Self-Attention MobileNet model and the novel training pipeline for Image Tilt Detection in detail in this section.

3.1 Self-Attention Convolutional Module

Squeeze-and-Excite [Hu et al.(2018)Hu, Shen, and Sun]: The squeeze operation captures channel-wise dependencies in a feature map by using global average pooling operation along the channel dimension to aggregate information of each 2D feature map through a scalar value that results in a channel descriptor vector. During the excite operation, we learn channel-wise dependencies by passing the channel descriptor vector through a 2-layer neural network which outputs channel-wise attention weights. This attention vector is then multiplied by the input feature map to adaptively weight each channel and improve the representational power of the feature maps.

Spatial Self-Attention: Standard convolutional kernels process a local neighborhood of an image at a time because of the smaller receptive field sizes for the input feature maps. Although, the kernels with large receptive fields can cover more region but that will incur high computational costs because of the increased number of parameters and training time. Self-attention modules can complement convolutional layers by learning global or long-range dependencies between different image regions without much computational overhead. To learn spatial self-attention, we take the feature map vectors along the channel dimension and calculate their key, query, and value representations in a low-dimensional subspace. The dot product of all key vectors with every other query vector with the softmax function gives us the attention weights between all the regions of the image. More specifically, after squeeze and excite operation, let the feature map be $\mathbf{F\in\mathbb{R}^{H\times W\times C}}$ and after expanding it along spatial dimensions we get the matrix $\mathbf{F\in\mathbb{R}^{N\times C}}$ , which contains $N(=H\cdot W)$ , $C$ -dimensional vectors representing various image regions. We use learnable weight matrices $\mathbf{W_{K}\in\mathbb{R}^{C\times C`}},\mathbf{W_{Q}\in\mathbb{R}^{C\times C`}}$ , and $\mathbf{W_{v`}\in\mathbb{R}^{C\times C`}}$ to calculate key( $\mathbf{K}$ ), query( $\mathbf{Q}$ ) and value( $\mathbf{v`}$ ) vectors respectively.

Here $C`=C/r$ , $r>1$ , $r$ is the reduction ratio which is used to decrease the dimensions of the vectors and calculate attention weights and values in a low-dimensional subspace. Let $\mathbf{F^{N\times C}}\equiv\mathbf{F}$

$\displaystyle\mathbf{K}=\mathbf{F\cdot W_{K}},\quad\mathbf{Q}$	$\displaystyle=\mathbf{F\cdot W_{Q}},\quad\mathbf{v`}=\mathbf{F\cdot W_{v`}}$	(1)
$\displaystyle\mathbf{\beta^{N\times N}}=\mathbf{Q\cdot K^{T}}\implies a_{ij}$	$\displaystyle=\frac{\exp(\beta_{ij})}{\sum_{j=1}^{N}\exp(\beta_{ij})},i,j=1,\dots,N$	(2)
$\displaystyle\mathbf{V^{N\times C`}}$	$\displaystyle=\mathbf{a}\cdot\mathbf{v`}$	(3)

Here, $\mathbf{a\in\mathbb{R}^{N\times N}}$ is the self-attention matrix, and $a_{ij}$ is the attention weight that region $i$ puts on region $j$ . $\mathbf{V\in\mathbb{R}^{N\times C`}}$ in Eq.3 is a value matrix in the low-dimensional subspace and we use $\mathbf{W_{V}\in\mathbb{R}^{C`\times C}}$ to project it into the original subspace to get the self-attention feature maps, $\mathbf{S\in\mathbb{R}^{N\times C}}$ . Finally, we will add a residual connection to get the final output $\mathbf{\tilde{F}^{N\times C}}$ .

	$\displaystyle\mathbf{S}$	$\displaystyle=\mathbf{V}\cdot\mathbf{W_{V}}$		(4)
	$\displaystyle\mathbf{\tilde{F}}$	$\displaystyle=\mathbf{F}+\alpha\mathbf{S}$		(5)

where $\alpha$ is a trainable scalar parameter initialized to $0$ . Eq.5 implies that the model first learns image features around the local neighbourhood and then gradually moves on to learn global dependencies [Zhang et al.(2019)Zhang, Goodfellow, Metaxas, and Odena]. The combination of squeeze and excite operation and spatial self-attention gives us the feature maps that are rich in content as well as contextual information.

The learning of global long-range dependencies and relative positioning of different image regions improved the model predictions on image tilt detection tasks. From the heatmaps, in Fig.4, we can see that the query (red) region is able to shift its attention with the image orientation. Specifically, in Fig.4-[1.a, 2.a, 3.a] (left column), the query (red) region on the horizon was able to focus on the other horizon points despite different orientations of the same image. In Fig.4-SET-B (right column), there is no clear horizon line but the query regions are able to attend to relevant regions for tilt detection. For example, query region on the ground in heatmaps-[4.b, 5.b] is attending to other points on the ground and query region in Fig.4-[4.a, 6.b] in sky is focusing on other points above the skyline. Also note that query regions in Fig.4-[1.c, 2.c, 3.c] (left column) are able to attend to far locations which indicates learning of long-range dependencies and spatial information. Therefore, for detecting image tilt angle, the neural network model needs to learn the relative positioning of the image pixels to differentiate between various distinct image orientations.

\bmvaHangBox	\bmvaHangBox
$1\hskip 39.83385pt1.a\hskip 34.1433pt1.b\hskip 39.83385pt1.c$	$4\hskip 39.83385pt4.a\hskip 34.1433pt4.b\hskip 39.83385pt4.c$
\bmvaHangBox	\bmvaHangBox
$2\hskip 39.83385pt2.a\hskip 34.1433pt2.b\hskip 39.83385pt2.c$	$5\hskip 39.83385pt5.a\hskip 34.1433pt5.b\hskip 39.83385pt5.c$
\bmvaHangBox	\bmvaHangBox
$3\hskip 39.83385pt3.a\hskip 34.1433pt3.b\hskip 39.83385pt3.c$	$6\hskip 39.83385pt6.a\hskip 34.1433pt6.b\hskip 39.83385pt6.c$
SET-A	SET-B

Figure 4: In the heatmaps, the red region indicates the query region and the white regions represent the points where the query region focuses its attention. Figures. 1.a, 2.a, and 3.a indicates that the query region on the horizon attends to other regions on the horizon. For the SET-B, the input images do not have straight lines but the model is able to understand the general upright orientation. In figures 4.c and 6.a, the query region on ground is attending to other ground regions. Similarly, in figures 4.a and 6.b the query region in sky is attending to another point in the sky. In figures 1.c, 2.c, and 3.c and figures 4.c, 5.c, and 6.c, the query region is attending to a far location indicating the learning of long-range dependencies.

3.2 Image Tilt Detection

3.2.1 Motivation and Problem Modeling

Although, the intuitive approach for tackling this problem seems to be regression where we use a Deep CNN to extract image features and minimize the angular distance between the ground truth angle and the predicted angle. But training deep neural networks for regression tasks is difficult [Lathuilière et al.(2019)Lathuilière, Mesejo, Alameda-Pineda, and Horaud] and when we enter the domain of light-weight models, like MobileNetV3 or SA-MobileNet, with few parameters as compared to ResNet-50 [He et al.(2016)He, Zhang, Ren, and Sun] or VGG-16 [Simonyan and Zisserman(2014)] networks, it becomes extremely difficult to achieve good results on regression tasks. In contrast to regression, deep neural networks achieve highly accurate results on classification tasks but we cannot model the image tilt detection as a single label classification problem due to a variety of reasons. If the prediction is $1\degree-2\degree$ off from the true label, the network will penalize it equally as it would if the prediction is off by $10\degree-20\degree$ or more. This might not be necessary because $1\degree$ or $2\degree$ variation in the image tilt angle might not be significant. Therefore, we train the model to predict multiple angles within a narrow interval of the ground truth tilt angle and penalize only those values that are outside this narrow range.

3.2.2 Training Pipeline

We model the problem of Image Tilt Detection in a multi-label scenario where we train our proposed neural network model with a tilted input image and the corresponding ground truth label vector with multiple labels of tilt angles in a narrow interval of either $\pm 1\degree$ or $\pm 2\degree$ depending on the dataset quality. Images in the training dataset were assumed to be upright and assigned $0\degree$ ground truth label by default. Every image is rotated by each angle from $0\degree,1\degree,\dots,359\degree$ and center-cropped before giving to the model as input.

To enable multi-label training for predictions of image tilt angles within the narrow interval, the ground truth label vectors are constructed as 360-dimensional vectors with value of $1$ at indices $G-I,\dots,G-1,G,G+1,\dots,G+I$ , and value of $0$ at rest of the indices. $G$ and $I$ are the ground truth image tilt angle and length of the narrow interval respectively. Note that the ground truth label vector is cyclic. If $G+x\geq 360$ , for some $0\leq x\leq I$ , then we set the value 1 at index $(G+x)\%360$ . Similarly, if $G-x<0$ , then we set 1 at index $G-x+360$ . Last layer of the model is a 360-dimensional layer, representing all the integer angles, with sigmoid activation function that holds the prediction scores of the tilt angles for a given input image. We use Binary Cross Entropy loss function to calculate the loss between given ground truth label vector and the predicted outputs.

\displaystyle\mathbf{\mathcal{L}(y,p)}=-\frac{1}{D}\Bigg{[}\sum_{i=1}^{D}y_{i}\log p_{i}+(1-y_{i})\log(1-p_{i})\Bigg{]}

(6)

Here $D=360$ and $\mathbf{y}$ and $\mathbf{p}$ are groundtruth and the predicted vectors respectively.

4 Experiments

4.0.1 Model Prediction

For the final prediction, we take the highest scoring label, in the last layer, and if more than two labels have the highest scores, we simply take their average value. From our experiments, we saw that most of the times, when more than two labels have the highest score, it was $1.0$ and the corresponding labels were consecutive and belonged to set of narrow interval around the ground truth tilt angle. This implies that there is an implicit correlation between the ground truth labeled angles within the narrow interval that helps the model to determine the image orientation over the complete $360\degree$ range. Another advantage of using this method, over single-label classification, is that the network only penalizes those output values that are outside the narrow interval around the ground truth tilt angle. Intuition is that if the ground truth image tilt angle is $45\degree$ , then we do not want to penalize $44\degree$ or $46\degree$ prediction as much as we want to penalize $42\degree$ or $46\degree$ predictions because though they are close to the ground truth value they will not make image look upright.

4.1 Model Architecture

For the MobileNetV3 model, the input image size is $224\times 224$ and the subsequent layers decrease the feature map size to $7\times 7$ in strides of $2$ . We integrated Spatial Self-Attention Modules within MobileNetV3 to get Self-Attention MobileNet. Spatial Self-Attention Modules were applied to the blocks of sizes $56\times 56$ , $28\times 28$ , $14\times 14$ , and $7\times 7$ . We selected these blocks because they are high-level image features which encodes meaningful image representations. Furthermore, the computational costs of calculating self-attention on these blocks was not very high because of the small feature map sizes. We also replaced the 1280-dimensional fully connected layer of MobileNetV3 with a 720-dimensional layer. This helped in reducing the MAdds that were added due to the introduction of spatial self-attention operations. As a result, the proposed SA-MobileNet model, when converted to Tflite, came out faster by an average of 4ms than the corresponding MobileNetV3 Tflite model when tested on Snapdragon 750 Octa core. All the convolutional kernels and fully connected layers were initialized from pretrained ImageNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] weights apart from the newly added self-attention blocks that were initialized randomly.

The network was trained end-to-end using RMSprop [Tieleman and Hinton(2012)] optimizer with momentum $0.9$ . The initial learning rate was $0.001$ and it was decayed using exponential learning rate schedule with $40k$ decay steps and $0.95$ decay rate.

Previous Works	Accuracy $(\%)\uparrow$	Angle Error $(\degree)\downarrow$
Ciocca et al\bmvaOneDot [Ciocca et al.(2015)Ciocca, Cusano, and Schettini] (LPB-based featutes)	71.87	-
CNN + Fuzzy Edge Detection	85.21	-
Fischer et al\bmvaOneDot [Fischer et al.(2015)Fischer, Dosovitskiy, and Brox] (AlexNet)	-	21.23
Maji et al\bmvaOneDot [Maji and Bose(2020)] (Xception)	-	7.89
MobileNetV3 (baseline)	85.97	5.06
ResNet-50 (baseline)	93.67	3.98
SA-MobileNet (proposed)	92.39	4.27

Table 1: Accuracies and angles errors of various baseline methods on SUN397 dataset.

\uparrow/\downarrow

indicates that higher/lower is better respectively.

4.2 Datasets

We used publicly available SUN397 [Xiao et al.(2010)Xiao, Hays, Ehinger, Oliva, and Torralba], ADE20K [Zhou et al.(2017)Zhou, Zhao, Puig, Fidler, Barriuso, and Torralba], and NYU-V1 [Silberman and Fergus(2011)] datasets. SUN397 is a scene understanding dataset that contains 397 well-sampled categories of diverse scenes with $108,754$ distinct images. There are 10 train-test partitions for the dataset, and for our evaluation dataset, we took the union of images from all the test partitions that resulted in $15,691$ images. The remaining $92,793$ images were used for training. ADE20K dataset is another scene parsing dataset with $20,210$ images in the training set and $5,000$ images in the evaluation set. SUN397 and ADE20K datasets contain wide variety of natural images that may or may not contain straight vertical or horizontal lines. That is why we set the interval length, $I=2\degree$ while training on these two datasets. We also used NYU-V1 Depth dataset that contains frames from video sequences of various indoor scenes recorded from both RGB and Depth camera of Microsoft Kinect. Before using this dataset to train our model, we had to make the images upright because all the images were skewed as seen from Fig.6. We straightened the set of $2,282$ images and used $2000$ images for our training and $282$ images for testing. We split the data in such a way that frames from the same indoor scene does not come in both the splits. We set the interval length $I=1\degree$ because this dataset was manually annotated and we saw highly accurate results on the evaluation data.

4.3 Results

We train MobileNetV3 model as a baseline for mobile devices on SUN397, ADE20K and NYU-V1 dataset using the training approach, described in section 3.2. The proposed SA-MobileNet consistently performs better in terms of detection accuracies and angle errors on all the evaluation datasets as seen from Fig.5 and Table.4. We also trained MobileNetV3 and SA-MobileNet for regressing the image tilt angle, by using angle loss function, given by Eq.7 and 8, and AdaDelta optimizer on ADE20k dataset. The SA-MobileNet model gives us lower angle error when compared to the MobileNetV3 model as shown in Fig 5.d and Table 3.

	$\displaystyle e$	$\displaystyle=\mathinner{\!\left\lvert a_{true}-a_{pred}\right\rvert}$		(7)
	$\displaystyle\mathbf{\mathcal{L}_{angle}}$	$\displaystyle=min\{e,360-e\}$		(8)

Here $a_{true}$ and $a_{pred}$ are groundtruth and predicted angles respectively, with values ranging from $0\degree-to-359\degree$ . We also use this loss function to calculate the angle errors of the model trained with the multi-label approach.

\bmvaHangBox	\bmvaHangBox	\bmvaHangBox	\bmvaHangBox
(a) ADE20K	(b) NYU-V1	(c) SUN397	(d) Regression loss

Model	NYU-V1		ADE20K		SUN397
Model	Acc $(\%)\uparrow$	AE $\degree\downarrow$	Acc $(\%)\uparrow$	AE $\degree\downarrow$	Acc $(\%)\uparrow$	AE $\degree\downarrow$
MobileNetV3	88.02	15.79	87.68	16.84	85.97	5.06
ResNet-50	94.59	4.67	97.84	3.09	93.67	3.98
SA-MobileNet	98.53	3.45	96.77	3.45	92.39	4.27

Table 2: Evaluation accuracies and angle errors of the MobileNetV3, ResNet-50, and SA-MobileNet models on various datasets with the proposed tilt angle detection approach. Acc: Accuracy

(\%)

and AE: Angle Errors(

\degree

\uparrow/\downarrow

indicates that higher/lower is better respectively.

Model	Latency $(\downarrow)$	Parameters $(\downarrow)$
	(milliseconds)	(millions)
MobileNetV3	$79$	4.2
SA-MobileNet	75	4.5

Table 3: Tflite models were tested on Snapdragon 750, Octa core (2x 2.2 GHz, 6x 1.8 GHz) for latency measurements.

Model	Angle
	Error $\degree$ $(\downarrow)$
MobileNetV3	$21.07$
SA-MobileNet	15.53

Table 4: Regresion loss on ADE20k dataset trained with angle loss function Eq.8.

From Table.4 we see that the proposed SA-MobileNet model resulted in very low-angle errors on NYU-V1, ADE20K, and SUN397 dataset when compared with MobileNetV3 model. From the evaluation accuracy plots in Fig.5.a, Fig.5.b, Fig.5.c, we can see that the accuracy curve of our model (blue) is above the curve of MobileNetV3 model (green) during training for all the datasets. In Fig.5.d, we plot angle errors for both the models which were trained for regressing the exact orientation angle on ADE20K dataset. The SA-MobileNet model produced low angle errors when compared to the MobileNetV3 model. This also justifies that the long-range dependencies learned by the self-attention modules are necessary for image tilt detection. We also trained the ResNet-50 model as a baseline to validate the effectiveness of our training pipeline. Though, the ResNet-50 model outperforms both the MobileNetV3 and SA-MobileNet, the improvement is marginal despite the huge difference in the number of parameters between the ResNet-50 and the light-weight models.

Comparison with previous works: We also present comparison of our training approach with previous works for this problem in Table 1. Over the past few years many different methods have been proposed to tackle this problem but they have high angle errors or low accuracies. ResNet-50 baseline model gives the lowest angle error on the test dataset to give state-of-the-art results. However, the ResNet-50 network cannot be deployed on low-resource devices because it has over 25 million parameters. But the proposed Self-Attention MobileNet has around 4 million parameters with low-latency values and state-of-the-art results for mobile devices that makes it ideal to be deployed in smartphones for real-time inferences.

5 Conclusion and Future Work

We present a novel neural network model for mobile/IoT devices powered with the Self-Attention Modules. We also present highly accurate results on the task of Image Tilt Detection and are able to correct image orientations on smartphone in real-time. Our proposed training approach is also very simple but effective for this task. In the future work, we can use a dynamic value of the narrow interval, $I$ , which can be unique for each image within limit.

\bmvaHangBox	\bmvaHangBox	\bmvaHangBox
(a) $3\degree$ anticlockwise	(b) $4\degree$ clockwise	(c) $5\degree$ clockwise

We will also validate the effectiveness of Self-Attention MobileNet on other downstream tasks like image classification, object detection and image segmentation applications for mobile devices.

References

[Carion et al.(2020)Carion, Massa, Synnaeve, Usunier, Kirillov, and Zagoruyko] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[Cheng et al.(2016)Cheng, Dong, and Lapata] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016.
[Ciocca et al.(2015)Ciocca, Cusano, and Schettini] Gianluigi Ciocca, Claudio Cusano, and Raimondo Schettini. Image orientation detection using lbp-based features and logistic regression. Multimedia Tools and Applications, 74(9):3013–3034, 2015.
[Do et al.(2020)Do, Vuong, Roumeliotis, and Park] Tien Do, Khiem Vuong, Stergios I Roumeliotis, and Hyun Soo Park. Surface normal estimation of tilted images via spatial rectifier. In European Conference on Computer Vision, pages 265–280. Springer, 2020.
[Fefilatyev et al.(2006)Fefilatyev, Smarodzinava, Hall, and Goldgof] Sergiy Fefilatyev, Volha Smarodzinava, Lawrence O Hall, and Dmitry B Goldgof. Horizon detection using machine learning techniques. In 2006 5th International Conference on Machine Learning and Applications (ICMLA’06), pages 17–21. IEEE, 2006.
[Fischer et al.(2015)Fischer, Dosovitskiy, and Brox] Philipp Fischer, Alexey Dosovitskiy, and Thomas Brox. Image orientation estimation with convolutional networks. In German Conference on Pattern Recognition, pages 368–378. Springer, 2015.
[He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[Hold-Geoffroy et al.(2018)Hold-Geoffroy, Sunkavalli, Eisenmann, Fisher, Gambaretto, Hadap, and Lalonde] Yannick Hold-Geoffroy, Kalyan Sunkavalli, Jonathan Eisenmann, Matthew Fisher, Emiliano Gambaretto, Sunil Hadap, and Jean-François Lalonde. A perceptual measure for deep single image camera calibration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2354–2363, 2018.
[Howard et al.(2019)Howard, Sandler, Chu, Chen, Chen, Tan, Wang, Zhu, Pang, Vasudevan, et al.] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1314–1324, 2019.
[Hu et al.(2018)Hu, Shen, and Sun] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
[Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
[Lathuilière et al.(2019)Lathuilière, Mesejo, Alameda-Pineda, and Horaud] Stéphane Lathuilière, Pablo Mesejo, Xavier Alameda-Pineda, and Radu Horaud. A comprehensive analysis of deep regression. IEEE transactions on pattern analysis and machine intelligence, 42(9):2065–2081, 2019.
[Maji and Bose(2020)] Subhadip Maji and Smarajit Bose. Deep image orientation angle detection. arXiv preprint arXiv:2007.06709, 2020.
[Olmschenk et al.(2017)Olmschenk, Tang, and Zhu] Greg Olmschenk, Hao Tang, and Zhigang Zhu. Pitch and roll camera orientation from a single 2d image using convolutional neural networks. In 2017 14th Conference on Computer and Robot Vision (CRV), pages 261–268. IEEE, 2017.
[Parikh et al.(2016)Parikh, Täckström, Das, and Uszkoreit] Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933, 2016.
[Park et al.(2018)Park, Woo, Lee, and Kweon] Jongchan Park, Sanghyun Woo, Joon-Young Lee, and In So Kweon. Bam: Bottleneck attention module. arXiv preprint arXiv:1807.06514, 2018.
[Prince et al.(2019)Prince, Alsuhibany, and Siddiqi] Master Prince, Suliman A Alsuhibany, and Nahid A Siddiqi. A step towards the optimal estimation of image orientation. IEEE Access, 7:185750–185759, 2019.
[Reshmalakshmi and Sasikumar(2017)] C Reshmalakshmi and M Sasikumar. Image edge orientation estimation via fuzzy logic. Materials Today: Proceedings, 4(2):4274–4282, 2017.
[Saini et al.(2020)Saini, Jha, Das, Mittal, and Mohan] Rajat Saini, Nandan Kumar Jha, Bedanta Das, Sparsh Mittal, and C Krishna Mohan. Ulsam: Ultra-lightweight subspace attention module for compact convolutional neural networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1627–1636, 2020.
[Sandler et al.(2018)Sandler, Howard, Zhu, Zhmoginov, and Chen] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
[Silberman and Fergus(2011)] Nathan Silberman and Rob Fergus. Indoor scene segmentation using a structured light sensor. In 2011 IEEE international conference on computer vision workshops (ICCV workshops), pages 601–608. IEEE, 2011.
[Simonyan and Zisserman(2014)] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[Son and Lee(2011)] Hungsun Son and Kok-Meng Lee. Two-dof magnetic orientation sensor using distributed multipole models for spherical wheel motor. Mechatronics, 21(1):156–165, 2011.
[Szegedy et al.(2016)Szegedy, Vanhoucke, Ioffe, Shlens, and Wojna] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
[Tieleman and Hinton(2012)] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
[Waswani et al.(2017)Waswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin] A Waswani, N Shazeer, N Parmar, J Uszkoreit, L Jones, AN Gomez, L Kaiser, and I Polosukhin. Attention is all you need. In NIPS, 2017.
[Woo et al.(2018)Woo, Park, Lee, and Kweon] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
[Workman et al.(2016)Workman, Zhai, and Jacobs] Scott Workman, Menghua Zhai, and Nathan Jacobs. Horizon lines in the wild. arXiv preprint arXiv:1604.02129, 2016.
[Xian et al.(2019)Xian, Li, Fisher, Eisenmann, Shechtman, and Snavely] Wenqi Xian, Zhengqi Li, Matthew Fisher, Jonathan Eisenmann, Eli Shechtman, and Noah Snavely. Uprightnet: Geometry-aware camera orientation estimation from single images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9974–9983, 2019.
[Xiao et al.(2010)Xiao, Hays, Ehinger, Oliva, and Torralba] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010.
[Xu et al.(2015)Xu, Ba, Kiros, Cho, Courville, Salakhudinov, Zemel, and Bengio] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057. PMLR, 2015.
[Yang et al.(2016)Yang, He, Gao, Deng, and Smola] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 21–29, 2016.
[Zhai et al.(2016)Zhai, Workman, and Jacobs] Menghua Zhai, Scott Workman, and Nathan Jacobs. Detecting vanishing points using global image context in a non-manhattan world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5657–5665, 2016.
[Zhang et al.(2019)Zhang, Goodfellow, Metaxas, and Odena] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In International conference on machine learning, pages 7354–7363. PMLR, 2019.
[Zhou et al.(2017)Zhou, Zhao, Puig, Fidler, Barriuso, and Torralba] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.