
Department of Computer Science and Engineering, Michigan State University
{huding,yemingqu}@msu.edu

A Novel Geometric Approach for Outlier Recognition in High Dimension

Hu Ding and Mingquan Ye
Abstract

Outlier recognition is a fundamental problem in data analysis and has attracted a great deal of attention in the past decades. However, most existing methods still suffer from several issues, such as high time and space complexities or unstable performance across different datasets. In this paper, we provide a novel algorithm for outlier recognition in high dimension based on the elegant geometric technique of the “core-set”. The algorithm requires only linear time and space and achieves a solid theoretical quality guarantee. Another advantage over existing methods is that our algorithm can be naturally extended to handle multi-class inliers. Experimental results show that our algorithm outperforms existing methods on both random and benchmark datasets.

1 Introduction

In the big data era, we are confronted with extremely large amounts of data, and it is important to develop efficient algorithmic techniques to handle the resulting practical issues. Owing to its rapid recent development, deep learning [20] has become a powerful tool for many emerging applications; meanwhile, the quality of the training dataset often plays a key role and seriously affects the final learning result. For example, we can collect tremendous amounts of data (e.g., texts or images) through the internet; however, the obtained dataset often contains a significant number of outliers. Since manually removing outliers is costly, it is necessary in many scenarios to develop efficient algorithms that recognize outliers automatically.

Outlier recognition is a typical unsupervised learning problem; its counterpart in supervised learning is usually called anomaly detection [33]. In anomaly detection, the given training data are all positive, and the task is to build a model that depicts the positive samples. Any new datum can then be classified as positive or negative (i.e., an anomaly) based on the obtained model. Several existing methods, especially for image data, include the autoencoder [30] and sparse coding [27].

Unlike anomaly detection, the data given for outlier recognition are unlabeled; thus we can only model it as an optimization problem based on some assumption that is reasonable in practice. For instance, it is natural to assume that the inliers (i.e., normal data) lie in some dense region while the outliers are scattered in the feature space. Indeed, many well-known outlier recognition methods are based on this assumption [7, 13]. However, most density-based methods are designed for low-dimensional spaces and are quite limited for large-scale high-dimensional data, which are very common in computer vision (note that several high-dimensional approaches are heuristic in nature and require strong assumptions [21, 23, 3]). Recently, [26] applied the one-class support vector machine (SVM) method [31] to high-dimensional outlier recognition. Further, [34] introduced a new unsupervised autoencoder model inspired by the observation that inliers usually have smaller reconstruction errors than outliers.

Our main contributions. Although the aforementioned methods can solve the outlier recognition problem to a certain extent, they still suffer from several issues such as high time and space complexities or unstable performance across different datasets. In this paper, we present a novel geometric approach for outlier recognition. Roughly speaking, we try to build an approximate minimum enclosing ball (MEB) that covers the inliers but excludes the outliers. This model seems very simple but involves a couple of computational challenges. For example, the existence of outliers makes the problem not only non-convex but also highly combinatorial; the high dimensionality makes it even more difficult. To tackle these challenges, we develop a randomized algorithmic framework using a popular geometric concept called the “core-set”. Compared with existing results for outlier recognition, we provide a thorough analysis of the complexities and the quality guarantee. Moreover, we propose a simple greedy peeling strategy to extend our method to multi-class inliers. Finally, we test our algorithm on both random and benchmark datasets, and the experimental results reveal the advantages of our approach over various existing methods.

1.1 Other Related Work

Besides the aforementioned existing results, many other methods for outlier recognition/anomaly detection were developed previously and the readers can refer to several excellent surveys [22, 8, 18].

In computational geometry, a core-set [1] is a small set of points that approximates the shape of a much larger point set, and thus can be used to significantly reduce the time complexities of many optimization problems (please refer to a recent survey [28]). In particular, a core-set can be applied to efficiently compute an approximate MEB for a set of points in high-dimensional space [5, 24]. Moreover, [4] showed that it is possible to find a core-set of size $\lceil 2/\epsilon\rceil$ that yields a $(1+\epsilon)$-approximate MEB, with the important advantage that this size is independent of both the cardinality and the dimensionality of the dataset. In fact, the algorithm for computing the core-set of MEB is a Frank-Wolfe style algorithm [15], which has been systematically studied by Clarkson [9].

The problem of MEB with outliers also falls under the umbrella of robust shape fitting [19, 2], but most of those approaches cannot be applied to high-dimensional data. [35] studied MEB with outliers in high dimension; however, the resulting approximation factor is a constant $2$, which is not tight enough for the applications considered in this paper.

Actually, our idea is inspired by a recent work on removing outliers for SVM [12], which proposed a novel combinatorial approach called the Random Gradient Descent (RGD) Tree. It is known that SVM is equivalent to finding the polytope distance from the origin to the Minkowski difference of the two given labeled point sets. The Gilbert algorithm [17, 16] is an efficient Frank-Wolfe algorithm for computing the polytope distance, but a significant drawback is that its performance is too sensitive to outliers. To remedy this issue, the RGD Tree incorporates randomization into the Gilbert algorithm: it selects a small random sample in each step by a carefully designed strategy to overcome the adverse effect of the outliers.

1.2 Preliminaries

As mentioned before, we model outlier recognition as a problem of MEB with outliers in high dimension. Here we first introduce several definitions that are used throughout the paper.

Definition 1 (Minimum Enclosing Ball (MEB))

Given a set $P$ of points in $\mathbb{R}^d$, the MEB is the ball covering all the points of $P$ with the smallest radius; it is denoted by $MEB(P)$.

Definition 2 (MEB with Outliers)

Given a set $P$ of $n$ points in $\mathbb{R}^d$ and a small parameter $\gamma\in(0,1)$, the problem of MEB with outliers is to find the smallest ball that covers at least $(1-\gamma)n$ points. Namely, the task is to find a subset of $P$ with at least $(1-\gamma)n$ points such that the resulting MEB is the smallest among all possible choices; the induced ball is denoted by $MEB(P,\gamma)$.

From Definition 2 we can see that the major challenge is to determine the subset of $P$, which makes the problem a challenging combinatorial optimization. Therefore we relax our goal to its approximation as follows. For the sake of convenience, we always use $P_{\text{opt}}$ to denote the optimal subset of $P$, that is, $P_{\text{opt}}=\arg\min_{P^{\prime}}\{\text{the radius of } MEB(P^{\prime})\mid P^{\prime}\subset P,\ |P^{\prime}|\geq(1-\gamma)n\}$, and $r_{\text{opt}}$ to denote the radius of $MEB(P_{\text{opt}})$.

Definition 3 (Bi-criteria Approximation)

Given an instance $(P,\gamma)$ of MEB with outliers and two small parameters $0<\epsilon,\delta<1$, an $(\epsilon,\delta)$-approximation is a ball that covers at least $(1-(1+\delta)\gamma)n$ points and has radius at most $(1+\epsilon)r_{\text{opt}}$.

When both $\epsilon$ and $\delta$ are small, the bi-criteria approximation is very close to the optimal solution, with only a slight violation on the number of covered points and the radius.
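For concreteness, a small numeric illustration of Definition 3 (the numbers are ours, chosen only for exposition): take $n=10{,}000$, $\gamma=0.1$, and $\epsilon=\delta=0.1$. Then an $(\epsilon,\delta)$-approximation is a ball of radius at most $1.1\,r_{\text{opt}}$ that covers at least $(1-(1+\delta)\gamma)n=(1-0.11)\cdot 10{,}000=8{,}900$ points, whereas the optimal ball covers $9{,}000$ points.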

The rest of the paper is organized as follows. We introduce our main algorithm and the theoretical analyses in Section 2. The experimental results are shown in Section 3. Finally, we extend our method to handle multi-class inliers in Section 4.

2 Our Algorithm and Analyses

In this section, we present our method in detail. For the sake of completeness, we first briefly introduce the core-set for MEB based on the idea of [4].

The algorithm is a simple iterative procedure with an elegant analysis: initially, it selects an arbitrary point and places it into a set $S$ that is empty at the beginning; in each of the following $\lceil 2/\epsilon\rceil$ steps, the algorithm updates the center of $MEB(S)$ and adds the farthest point (from that center) to $S$; finally, the center of $MEB(S)$ induces a $(1+\epsilon)$-approximation for the MEB of the whole input point set. The selected $\lceil 2/\epsilon\rceil$ points are called the core-set of the MEB. To ensure that at least a certain extent of improvement is achieved in each iteration, [4] showed that the following two inequalities hold if the algorithm always selects the point farthest from the current center of $MEB(S)$:

$r_{i+1}\geq(1+\epsilon)r_{\text{opt}}-L_{i}$,   (1)
$r_{i+1}\geq\sqrt{r^{2}_{i}+L^{2}_{i}}$,   (2)

where $r_i$ and $r_{i+1}$ are the radii in the $i$-th and the $(i+1)$-th iterations respectively, $r_{\text{opt}}$ is the optimal radius of the MEB, and $L_i$ is the distance by which the center of $MEB(S)$ shifts in the $i$-th iteration.

Algorithm 1 $(\epsilon,\delta)$-approximation Algorithm for the Outlier Recognition Problem
Input: A point set $P$ with $n$ points in $\mathbb{R}^d$, the fraction of outliers $\gamma\in(0,1)$, and four parameters $0<\epsilon,\delta,\mu<1$, $h\in\mathbb{Z}^{+}$.
Output: A tree in which the attached point of each node is a candidate for the $(\epsilon,\delta)$-approximation solution.
1: Each node $v$ in the tree is associated with a point (with a slight abuse of notation, we also use $v$ to denote the point). Initially, randomly pick a point from $P$ as the root node $r$.
2: Starting from the root, grow each node as follows:
  (1) Let $v$ be the current node.
  (2) If the height of $v$ is $h$, $v$ becomes a leaf node. Otherwise, perform the following steps:
    (a) Let $\mathcal{P}^{r}_{v}$ denote the set of points along the path from the root $r$ to the node $v$, and $c_v$ denote the center of $MEB(\mathcal{P}^{r}_{v})$. We say that $c_v$ is the attached point of $v$.
    (b) Let $k=(1+\delta)\gamma n$. Compute the point set $P_v$ containing the top $k$ points that have the largest distances to $c_v$.
    (c) Take a random sample $S_v$ of size $(1+\frac{1}{\delta})\ln\frac{h}{\mu}$ from $P_v$, and let each point $v^{\prime}\in S_v$ be a child node of $v$.

2.1 Algorithm for MEB with Outliers

We present our $(\epsilon,\delta)$-approximation algorithm for MEB with outliers in this section. Although the outlier recognition problem belongs to unsupervised learning, we can estimate the fraction of outliers in the given data before executing our algorithm: in practice, we can randomly collect a small set of samples from the given data, manually identify the outliers, and estimate the outlier ratio $\gamma$. Therefore, in this paper we assume that the outlier ratio is known.

To better understand our algorithm, we first illustrate the high-level idea. A more careful analysis of the aforementioned core-set construction algorithm [4] reveals that it is not necessary to select the point farthest from the center of $MEB(S)$ in each step. Instead, as long as the selected point has a distance larger than $(1+\epsilon)r_{\text{opt}}$ to the current center, the minimal extent of improvement is still guaranteed [10]. As a consequence, we investigate the following approach.

We denote the ball centered at a point $c$ with radius $r>0$ by $Ball(c,r)$. Recall that $P_{\text{opt}}$ is the subset of $P$ yielding the optimal MEB with outliers, and $r_{\text{opt}}$ is the radius of $MEB(P_{\text{opt}})$ (see Section 1.2). In the $i$-th step, we add an arbitrary point from $P_{\text{opt}}\setminus Ball(c_i,(1+\epsilon)r_{\text{opt}})$ to $S$, where $c_i$ is the current center of $MEB(S)$. Based on the above observation, we know that a $(1+\epsilon)$-approximation is obtained after at most $\lceil 2/\epsilon\rceil$ steps, that is, $|P\cap Ball(c_i,(1+\epsilon)r_{\text{opt}})|\geq(1-\gamma)n$ when $i\geq\lceil 2/\epsilon\rceil$.

However, in order to carry out the above approach, we need to resolve two key issues: how to determine the value of $r_{\text{opt}}$, and how to select a point belonging to $P_{\text{opt}}\setminus Ball(c_i,(1+\epsilon)r_{\text{opt}})$. Actually, we can implicitly avoid the first issue by replacing the radius $(1+\epsilon)r_{\text{opt}}$ with the $k$-th largest distance from the points of $P$ to $c_i$, where $k$ is an appropriate number that will be determined in the following analysis. For the second issue, we have to take a small random sample, rather than a single point, from $P_{\text{opt}}\setminus Ball(c_i,(1+\epsilon)r_{\text{opt}})$ and try each of the sampled points; this operation results in a tree structure that is similar to the RGD Tree introduced by [12] for SVM. We present the algorithm in Algorithm 1 and place the detailed parameter settings, proof of correctness, and complexity analyses in Sections 2.2–2.3.

Figure 1: The blue links represent the path from the root $r$ to the node $v$, and $\mathcal{P}^{r}_{v}$ contains the four points along the path. The point set $S_v$ corresponds to the child nodes of $v$.

We illustrate Step 2(2)(a-c) of Algorithm 1 in Fig. 1.
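To make the construction concrete, below is a minimal Python sketch of Algorithm 1 (the function names, the use of NumPy, and the fixed iteration count inside the approximate-MEB subroutine are our own illustrative choices; the subroutine follows the simple update scheme of [4] that we detail later as Algorithm 2):

import numpy as np
from collections import deque

def approx_meb_center(points, num_iter=100):
    # Badoiu-Clarkson style update: repeatedly move toward the farthest point.
    c = points[0].astype(float)
    for t in range(1, num_iter):
        q = points[np.argmax(np.linalg.norm(points - c, axis=1))]
        c = c + (q - c) / (t + 1.0)
    return c

def build_tree(P, gamma, eps, delta, mu):
    n = len(P)
    h = int(np.ceil(2.0 / eps)) + 1                        # tree height (Theorem 2.1)
    k = int(np.ceil((1 + delta) * gamma * n))              # size of P_v in Step 2(2)(b)
    s = int(np.ceil((1 + 1.0 / delta) * np.log(h / mu)))   # sample size in Step 2(2)(c)
    candidates = []                                        # attached points c_v of all nodes
    queue = deque([[np.random.randint(n)]])                # each entry: a root-to-node path (indices)
    while queue:                                           # breadth-first growth (cf. Section 2.3)
        path = queue.popleft()
        c_v = approx_meb_center(P[path])                   # attached point of the current node
        candidates.append(c_v)
        if len(path) >= h:                                 # height h reached: leaf node
            continue
        dists = np.linalg.norm(P - c_v, axis=1)
        P_v = np.argsort(dists)[-k:]                       # the k points farthest from c_v
        for child in np.random.choice(P_v, size=min(s, k), replace=False):
            queue.append(path + [int(child)])
    return candidates                                      # one candidate center per node

Note that the tree has at most $s^{h-1}$ leaves, so in practice the parameters are chosen to keep this number moderate (see Section 2.3).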

2.2 Parameter Settings and Quality Guarantee

We denote the tree constructed by Algorithm 1 by $\mathbb{H}$. The following theorem shows the success probability of Algorithm 1.

Theorem 2.1

If we set $h=\lceil\frac{2}{\epsilon}\rceil+1$, then with probability at least $(1-\mu)(1-\gamma)$ there exists at least one node of $\mathbb{H}$ yielding an $(\epsilon,\delta)$-approximation for the problem of MEB with outliers.

Before proving Theorem 2.1, we need to introduce several important lemmas.

Lemma 1

[11] Let $Q$ be a set of elements, and $Q^{\prime}$ be a subset of $Q$ with size $|Q^{\prime}|=\beta|Q|$ for some $\beta\in(0,1)$. If one randomly samples $\frac{1}{\beta}\ln\frac{1}{\eta}$ elements from $Q$, then with probability at least $1-\eta$, the sample contains at least one element of $Q^{\prime}$, for any $0<\eta<1$.

Lemma 2

For each node $v$, the set $S_v$ in Algorithm 1 contains at least one point from $P_{\text{opt}}$ with probability at least $1-\frac{\mu}{h}$.

Proof

Since $|P_v|=(1+\delta)\gamma n$ and $|P\setminus P_{\text{opt}}|=\gamma n$, we have

$\frac{|P_v\cap P_{\text{opt}}|}{|P_v|}=1-\frac{|P_v\setminus P_{\text{opt}}|}{|P_v|}\geq 1-\frac{|P\setminus P_{\text{opt}}|}{|P_v|}=\frac{\delta}{1+\delta}.$   (3)

Note that the size of $S_v$ is $(1+\frac{1}{\delta})\ln\frac{h}{\mu}$. Applying Lemma 1 with $\beta=\frac{\delta}{1+\delta}$ and $\eta=\frac{\mu}{h}$, we conclude that $S_v$ contains at least one point from $P_{\text{opt}}$ with probability at least $1-\frac{\mu}{h}$.
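As a quick sanity check on the sample size (the parameter values are ours, chosen only for illustration): with $\delta=0.2$, $\mu=0.1$, and $h=11$, each node draws $(1+\frac{1}{\delta})\ln\frac{h}{\mu}=6\ln 110\approx 28.2$, i.e., about $29$ points, independently of $n$ and $d$.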

Lemma 3

With probability at least $(1-\gamma)(1-\mu)$, there exists a leaf node $u\in\mathbb{H}$ such that the corresponding set $\mathcal{P}^{r}_{u}\subset P_{\text{opt}}$.

Proof

Lemma 2 indicates that each internal node $v$ has a child node corresponding to a point from $P_{\text{opt}}$ with probability at least $1-\frac{\mu}{h}$. In addition, the probability that the root $r$ belongs to $P_{\text{opt}}$ is $1-\gamma$ (recall that $\gamma$ is the fraction of outliers). Since the height of $\mathbb{H}$ is $h$, with probability at least

$(1-\gamma)\left(1-\frac{\mu}{h}\right)^{h}>(1-\gamma)(1-\mu),$   (4)

there exists one leaf node $u\in\mathbb{H}$ satisfying $\mathcal{P}^{r}_{u}\subset P_{\text{opt}}$.

In the remaining analyses, we always assume that such a root-to-leaf path $\mathcal{P}^{r}_{u}$ as described in Lemma 3 exists, and we focus only on the nodes along this path. We denote $\hat{R}=(1+\epsilon)r_{\text{opt}}$, where $r_{\text{opt}}$ is the optimal radius of the MEB with outliers. Let $Ball(c_v,r_v)$ be the smallest ball centered at $c_v$ that covers $P\setminus P_v$, and let the radii of $MEB(\mathcal{P}^{r}_{v})$ and $MEB(\mathcal{P}^{r}_{v^{\prime}})$ be $\tilde{r}_v$ and $\tilde{r}_{v^{\prime}}$ respectively. Readers can refer to Fig. 2. The following lemma is a key observation for MEB.

Lemma 4

[4] Given a set $P$ of points in $\mathbb{R}^d$, let $r_P$ and $c_P$ be the radius and center of $MEB(P)$ respectively. Then for any point $p\in\mathbb{R}^d$ with distance $K\geq 0$ to $c_P$, there exists a point $q\in P$ such that $\|p-q\|\geq\sqrt{r^{2}_{P}+K^{2}}$.

The following lemma is key to proving the quality guarantee of Algorithm 1. As mentioned in Section 2.1, the main idea follows the previous works [4, 10]. For the sake of completeness, we present the detailed proof here.

Lemma 5

For each node $v\in\mathcal{P}^{r}_{u}$, at least one of the following two events happens: (1) $c_v$ is an $(\epsilon,\delta)$-approximation; (2) its child $v^{\prime}$ on the path $\mathcal{P}^{r}_{u}$ satisfies

$\tilde{r}_{v^{\prime}}\geq\frac{\hat{R}}{2}+\frac{\tilde{r}_v^2}{2\hat{R}}.$   (5)
Proof

If $r_v\leq\hat{R}$, then we are done; that is, $Ball(c_v,r_v)$ covers $(1-(1+\delta)\gamma)n$ points and $r_v\leq(1+\epsilon)r_{\text{opt}}$. Otherwise, $r_v>\hat{R}$ and we consider the second case.

By the triangle inequality and the fact that $v^{\prime}$ (i.e., the point associated with the node “$v^{\prime}$”) lies outside $Ball(c_v,r_v)$, we have

$\|c_v-c_{v^{\prime}}\|+\|c_{v^{\prime}}-v^{\prime}\|\geq\|c_v-v^{\prime}\|>r_v>\hat{R}.$   (6)

Let $K_v=\|c_v-c_{v^{\prime}}\|$. Combining this with the fact that $\|c_{v^{\prime}}-v^{\prime}\|\leq\tilde{r}_{v^{\prime}}$, we have

$\tilde{r}_{v^{\prime}}>\hat{R}-K_v.$   (7)
Figure 2: The illustration of $MEB(\mathcal{P}^{r}_{v})$ and $MEB(\mathcal{P}^{r}_{v^{\prime}})$; the blue and red points represent the inliers and outliers, respectively.

By Lemma 4, we know that there exists a point $q\in\mathcal{P}^{r}_{v}$ (see Fig. 2) satisfying $\|q-c_{v^{\prime}}\|\geq\sqrt{\tilde{r}_v^2+K_v^2}$. Since $q$ is also inside $MEB(\mathcal{P}^{r}_{v^{\prime}})$, we have $\|q-c_{v^{\prime}}\|\leq\tilde{r}_{v^{\prime}}$. Then,

$\tilde{r}_{v^{\prime}}\geq\sqrt{\tilde{r}_v^2+K_v^2}.$   (8)

Combining (7) and (8), we obtain

$\tilde{r}_{v^{\prime}}\geq\max\left\{\hat{R}-K_v,\ \sqrt{\tilde{r}_v^2+K_v^2}\right\}.$   (9)

Because $\hat{R}-K_v$ is decreasing and $\sqrt{\tilde{r}_v^2+K_v^2}$ is increasing in $K_v$, the right-hand side of (9) is minimized when $\hat{R}-K_v=\sqrt{\tilde{r}_v^2+K_v^2}$, i.e., when $K_v=\frac{\hat{R}^2-\tilde{r}_v^2}{2\hat{R}}$. Substituting this value of $K_v$ into (9), we have $\tilde{r}_{v^{\prime}}\geq\frac{\hat{R}}{2}+\frac{\tilde{r}_v^2}{2\hat{R}}$. As a consequence, the second event happens and the proof is complete.

Now we prove Theorem 2.1 using the idea from [4]. Suppose that no node in $\mathcal{P}^{r}_{u}$ makes the first event of Lemma 5 occur. As a consequence, we obtain a series of inequalities relating each pair of radii $\tilde{r}_{v^{\prime}}$ and $\tilde{r}_v$ (see (5)). For ease of analysis, we write $\tilde{r}_v=\lambda_i\hat{R}$ if the height of $v$ in $\mathbb{H}$ is $i$. By Inequality (5), we have

$\lambda_{i+1}\geq\frac{1+\lambda_i^2}{2}.$   (10)

Combining the initial case $\lambda_1=0$ with (10), we obtain

$\lambda_i\geq 1-\frac{2}{i+1}$   (11)

by induction [4]. Note that the equality in (11) holds only when $i=1$; therefore,

$\lambda_h>1-\frac{2}{h+1}=1-\frac{2}{\lceil\frac{2}{\epsilon}\rceil+2}\geq 1-\frac{2}{\frac{2}{\epsilon}+2}=\frac{1}{1+\epsilon}.$   (12)

Then $\tilde{r}_u=\lambda_h\hat{R}>r_{\text{opt}}$ (recall that $u$ is the leaf node on the path $\mathcal{P}^{r}_{u}$), which contradicts our assumption that $\mathcal{P}^{r}_{u}\subset P_{\text{opt}}$. The success probability follows directly from Lemma 3. Overall, we obtain Theorem 2.1.
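For completeness, the induction step behind (11) can be verified in one line (this short derivation is ours, added for exposition): assuming $\lambda_i\geq 1-\frac{2}{i+1}$, Inequality (10) gives
$\lambda_{i+1}\geq\frac{1}{2}\left(1+\left(1-\frac{2}{i+1}\right)^2\right)=1-\frac{2}{i+1}+\frac{2}{(i+1)^2}\geq 1-\frac{2}{i+2},$
where the last step uses $(i+1)^2\geq i(i+2)$.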

2.3 Complexity Analyses

We analyze the time and space complexities of Algorithm 1 in this section.

Time Complexity. For each node $v$, we need to compute the corresponding approximate $MEB(\mathcal{P}^{r}_{v})$. To avoid the cost of computing the exact MEB, we apply the approximation algorithm proposed by [4]; see Algorithm 2 for details.

Algorithm 2 Approximation Algorithm for MEB
Input: A point set $Q$ in $\mathbb{R}^d$, and $N\in\mathbb{Z}^{+}$.
1: Start with an arbitrary point $c_1\in Q$, $t\leftarrow 1$.
2: while $t<N$ do
3:   Find the point $q\in Q$ farthest away from $c_t$.
4:   $c_{t+1}\leftarrow c_t+\frac{1}{t+1}(q-c_t)$.
5:   $t\leftarrow t+1$.
6: end while
7: return $c_t$.

For Algorithm 2, we have the following theorem.

Theorem 2.2

[4] Let the center and radius of $MEB(Q)$ be $c_Q$ and $r_Q$ respectively. Then for all $t$, $\|c_Q-c_t\|\leq\frac{r_Q}{\sqrt{t}}$.

From Theorem 2.2, we know that a $(1+\varepsilon)$-approximation for MEB can be obtained by setting $N=1/\varepsilon^2$, with time complexity $O\left(\frac{|Q|d}{\varepsilon^2}\right)$. Suppose the height of a node $v$ is $i$; then the complexity of computing the corresponding approximate $MEB(\mathcal{P}^{r}_{v})$ is $O\left(\frac{id}{\varepsilon^2}\right)$. Further, in order to obtain the point set $P_v$, we need to find the pivot point that has the $(n-k)$-th smallest distance to $c_v$. Here we apply the PICK algorithm [6], which finds the $l$-th smallest of $n$ numbers ($l\leq n$) in linear time. Consequently, the complexity for each node $v$ at the $i$-th layer is $O\left(\left(n+\frac{i}{\varepsilon^2}\right)d\right)$. Recall that there are $|S_v|^{i-1}$ nodes at the $i$-th layer of $\mathbb{H}$. In total, the time complexity of our algorithm is

$T=\sum_{i=1}^{h}\left(\left(1+\frac{1}{\delta}\right)\ln\frac{h}{\mu}\right)^{i-1}\left(n+\frac{i}{\varepsilon^{2}}\right)d.$   (13)

If we assume $1/\varepsilon$ is a constant, the complexity $T=O(Cnd)$ is linear in $n$ and $d$, where the hidden constant is $C=\left(\left(1+\frac{1}{\delta}\right)\ln\frac{h}{\mu}\right)^{h-1}$. In our experiments, we carefully choose the parameters $\delta,\epsilon,\mu$ so as to keep the value of $C$ from becoming too large.
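As a side remark on the selection step 2(2)(b) and the PICK algorithm mentioned above, a full sort is not required in practice either; the following minimal NumPy sketch (our illustration; np.argpartition runs in expected linear time) returns the indices of the $k$ points farthest from a center c:

import numpy as np

def top_k_farthest(P, c, k):
    dists = np.linalg.norm(P - c, axis=1)
    # indices of the k largest distances, in arbitrary order
    return np.argpartition(dists, len(P) - k)[-k:]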

Space Complexity. In our implementation, we use a queue $\mathcal{Q}$ to store the nodes of the tree. When the head of $\mathcal{Q}$ is popped, its $|S_v|$ child nodes are pushed into $\mathcal{Q}$. In other words, we simulate a breadth-first search on the tree $\mathbb{H}$. Therefore, the size of $\mathcal{Q}$ is always at most $C=\left(\left(1+\frac{1}{\delta}\right)\ln\frac{h}{\mu}\right)^{h-1}$. Note that each node $v$ needs to store $\mathcal{P}^{r}_{v}$ to compute its corresponding MEB, but in fact we only need to record pointers linking the points in $\mathcal{P}^{r}_{v}$. Therefore, the space complexity of $\mathcal{Q}$ is $O(Ch)$. Together with the space complexity of the input data, the total space complexity of our algorithm is $O(Ch+nd)$.

2.4 Boosting

By Theorem 2.1, we know that with probability at least $(1-\mu)(1-\gamma)$ there exists an $(\epsilon,\delta)$-approximation in the resulting tree. However, when the outlier ratio is high, say $\gamma=0.5$, the success probability $(1-\gamma)(1-\mu)$ becomes small. To further improve the performance of our algorithm, we introduce the following two boosting methods.

1. Constructing a forest. Instead of building a single tree, we randomly initialize several root nodes and grow each of them into a tree. Suppose the number of root nodes is $\kappa$. The probability that there exists an $(\epsilon,\delta)$-approximation in the forest is at least $1-(1-(1-\gamma)(1-\mu))^{\kappa}$, which is much larger than $(1-\gamma)(1-\mu)$ (a numeric illustration follows this list).

2. Sequentialization. First, initialize one root node and build a tree. Then select the best node in the tree and set it as the root node of the next tree. After iteratively performing this procedure for several rounds, we obtain a much more robust solution.
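As a quick numeric illustration of the forest boosting (the values are ours, for exposition only): with $\gamma=0.5$, $\mu=0.2$, and $\kappa=10$ trees, a single tree succeeds with probability $(1-\gamma)(1-\mu)=0.4$, while the forest succeeds with probability at least $1-(1-0.4)^{10}=1-0.6^{10}\approx 0.994$.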

3 Experiments

From the analysis in Section 2.2, we know that Algorithm 1 produces a tree $\mathbb{H}$ in which each node $v$ provides a candidate $c_v$ for the desired $(\epsilon,\delta)$-approximation of MEB with outliers. For each candidate, we identify the nearest $(1-(1+\delta)\gamma)n$ points to $c_v$ as the inliers. To determine the final solution, we select the candidate whose inliers have the smallest variance.
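A minimal sketch of this candidate-selection step (the function name and the use of NumPy are ours; candidates stands for the list of attached points returned by the tree construction, and we interpret “variance of the inliers” as the sum of the per-coordinate variances):

import numpy as np

def select_final_center(P, candidates, gamma, delta):
    m = int((1 - (1 + delta) * gamma) * len(P))    # number of points kept as inliers
    best_center, best_var = None, np.inf
    for c in candidates:
        dists = np.linalg.norm(P - c, axis=1)
        inliers = P[np.argsort(dists)[:m]]         # the m points nearest to the candidate
        var = inliers.var(axis=0).sum()            # total variance of these inliers
        if var < best_var:
            best_center, best_var = c, var
    return best_center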

3.1 Datasets and Methods to Be Compared

In our experiments, we test the algorithms on two random datasets and two benchmark image datasets. For the random datasets, we generate the data points from normal and uniform distributions, following the assumption that the inliers lie in dense regions while the outliers are scattered in the space. The benchmark image datasets are the popular MNIST [25] and Caltech [14].

To make our experiments more convincing, we compare our algorithm with three well-known methods for outlier recognition: angle-based outlier detection (ABOD) [23], one-class SVM (OCSVM) [31], and discriminative reconstructions in an autoencoder (DRAE) [34]. Specifically, ABOD distinguishes inliers from outliers by assessing the distribution of the angles determined by each 3-tuple of data points; OCSVM models outlier recognition as a soft-margin one-class SVM; DRAE applies an autoencoder to separate the inliers and outliers based on their reconstruction errors.

The performances of the algorithms are measured by the commonly used $F1$ score $=\frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$, where precision is the proportion of correctly identified positives relative to the total number of identified positives, and recall is the proportion of correctly identified positives relative to the total number of positives in the dataset.
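For example (the numbers are ours, for illustration only), a method that reports 1,000 inliers of which 900 are correct, on a dataset containing 1,125 true inliers, has precision $0.9$ and recall $0.8$, hence $F1=\frac{2\cdot 0.9\cdot 0.8}{0.9+0.8}=\frac{1.44}{1.7}\approx 0.847$.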

3.2 Random Datasets

We validate our algorithm on the following two random datasets.

A toy example in 2D. To better illustrate the intuition of our algorithm, we first run it on a random dataset in 2D. We generate an instance of $10{,}000$ points with outlier ratio $\gamma=0.4$. The inliers are generated by a normal distribution; the outliers consist of four groups, where the first three are generated by normal distributions and the last by a uniform distribution. The four groups of outliers contain $800$, $1200$, $800$, and $1200$ points, respectively. See Fig. 3. The red circle obtained by our algorithm is the boundary separating the inliers and outliers, and the resulting $F1$ score is $0.944$. From this case, we can see that our algorithm can efficiently recognize the densest region even if the outlier ratio is high and the outliers themselves form some dense regions in the space.

Figure 3: The illustration of our algorithm on a 2-dimensional point set.

High-Dimensional Points. We further test our algorithm and the other three methods on a high-dimensional dataset. Similar to the previous 2D case, we generate $20{,}000$ points with four groups of outliers in $\mathbb{R}^{100}$; the outlier ratio $\gamma$ varies from $0.1$ to $0.5$. The $F1$ scores are displayed in Table 1, from which we can see that our algorithm significantly outperforms the other three methods at all levels of outlier ratio.

Table 1: The $F1$ scores for the high-dimensional random dataset.
Methods \ $\gamma$ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5
ABOD | 0.907 | 0.815 | 0.705 | 0.586 | 0.419
OCSVM | 0.967 | 0.926 | 0.880 | 0.827 | 0.745
DRAE | 0.951 | 0.889 | 0.809 | 0.709 | 0.572
Ours | 0.984 | 0.965 | 0.939 | 0.938 | 0.898

3.3 Benchmark Image Datasets

In this section, we evaluate all four methods on two benchmark image datasets.

3.3.1 MNIST Dataset

MNIST contains $70{,}000$ handwritten digits ($0$ to $9$) composed of both training and test datasets. For each of the 10 digits, we add outliers by randomly selecting images from the other $9$ digits. For each outlier ratio $\gamma$, we compute the average $F1$ score over all 10 digits. To map the images to a feature (Euclidean) space, we use two kinds of image features: PCA-grayscale and autoencoder features.

Table 2: The $F1$ scores of the four methods on MNIST using PCA-grayscale features; the three values for each $\gamma$ correspond to PCA-0.95, PCA-0.5, and PCA-0.1, respectively (best results in bold).
Methods \ $\gamma$ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5
ABOD | 0.898, 0.895, 0.892 | 0.775, 0.774, 0.771 | 0.648, 0.617, 0.642 | 0.500, 0.470, 0.496 | 0.346, 0.329, 0.364
OCSVM | 0.937, \textbf{0.941}, 0.934 | 0.874, 0.883, 0.867 | 0.804, 0.817, 0.798 | 0.725, 0.740, 0.713 | \textbf{0.648}, 0.639, 0.605
DRAE | 0.913, 0.908, 0.911 | 0.822, 0.818, 0.816 | 0.726, 0.722, 0.711 | 0.620, 0.617, 0.602 | 0.531, 0.501, 0.488
Ours | \textbf{0.939}, \textbf{0.941}, \textbf{0.936} | \textbf{0.881}, \textbf{0.891}, \textbf{0.880} | \textbf{0.822}, \textbf{0.853}, \textbf{0.823} | \textbf{0.760}, \textbf{0.778}, \textbf{0.773} | 0.633, \textbf{0.658}, \textbf{0.651}
(1) PCA-grayscale features. Each image in MNIST is a $28\times 28$ grayscale image, represented by a $784$-dimensional vector. Note that the images of MNIST contain massive redundancy; for example, the digits are usually located in the middle of the images and all the background pixels have the value $0$. Therefore, we apply principal component analysis (PCA) to reduce the redundancy, trying multiple projection matrices that preserve $95\%$, $50\%$, and $10\%$ of the energy of the original grayscale features. These three features are denoted PCA-0.95, PCA-0.5, and PCA-0.1, respectively (a small code illustration of this energy-preserving projection is given after this list). The results are shown in Table 2. We notice that our $F1$ scores are always highest with PCA-0.5; this is because PCA-0.5 significantly reduces the redundancy while preserving the most useful information (compared with PCA-0.95 and PCA-0.1).

(2) Autoencoder features. An autoencoder [29] is often adopted to extract features of grayscale images. The autoencoder trained in our experiment has seven symmetric hidden layers (1000-500-250-60-250-500-1000), and the input layer is the $784$-dimensional grayscale vector. We use the middle hidden layer as the image feature. The results are shown in Table 3, and our method achieves the best scores in most cases.

Table 3: The $F1$ scores of the four methods on MNIST using the autoencoder feature (best results in bold).
Methods \ $\gamma$ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5
ABOD | 0.894 | 0.778 | 0.637 | 0.479 | 0.313
OCSVM | 0.906 | 0.807 | 0.706 | 0.598 | 0.496
DRAE | \textbf{0.933} | 0.883 | 0.819 | 0.737 | 0.625
Ours | 0.932 | \textbf{0.885} | \textbf{0.831} | \textbf{0.770} | \textbf{0.694}
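As referenced in item (1) above, here is a minimal sketch of the energy-preserving PCA projection (the use of scikit-learn and the placeholder data are our own; the retained explained variance serves as a stand-in for the “energy” mentioned above):

import numpy as np
from sklearn.decomposition import PCA

# X stands for the (n_images, 784) array of flattened grayscale images;
# random data is used here only so that the snippet runs standalone.
X = np.random.rand(1000, 784)

# A float in (0, 1) for n_components keeps the smallest number of principal
# components whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.50, svd_solver="full")
X_pca_050 = pca.fit_transform(X)   # analogous to the "PCA-0.5" feature above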

3.3.2 Caltech Dataset

The Caltech-256 dataset (http://www.vision.caltech.edu/Image_Datasets/Caltech256/) includes 256 image sets. We choose 11 concepts as the inliers in our experiment: airplane, binocular, bonsai, cup, face, ketch, laptop, motorbike, sneaker, t-shirt, and watch. We apply the VGG net [32] to extract the image features, namely the $4096$-dimensional output of the second fully-connected layer. The results are shown in Table 4.

Table 4: The $F1$ scores of the four methods on Caltech-256 using the VGG net feature (best results in bold).
Methods \ $\gamma$ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5
ABOD | 0.945 | 0.838 | 0.707 | 0.499 | 0.233
OCSVM | 0.930 | 0.885 | 0.839 | 0.783 | 0.739
DRAE | 0.955 | 0.937 | 0.930 | \textbf{0.927} | \textbf{0.912}
Ours | \textbf{0.964} | \textbf{0.948} | \textbf{0.932} | 0.924 | 0.906

Unlike the random data, the distribution of real data in the feature space is much more complicated. To alleviate this problem, we try to capture the essential parts of the original VGG net feature. Similar to Section 3.3.1, we apply PCA to reduce the redundancy of the VGG net feature while preserving its key parts; three projection matrices are obtained, preserving $95\%$, $50\%$, and $10\%$ of the energy, respectively. The results are shown in Table 5. We can see that our method achieves the best scores in all cases, especially when using PCA-0.5 (marked by underlines). More importantly, PCA-0.5 considerably improves upon the results obtained with the original VGG net feature (see Table 4), and its dimensionality is only $50$, which leads to a significant reduction in the complexities.

Table 5: The $F1$ scores of the four methods on Caltech-256 using the PCA-VGG feature; the three values for each $\gamma$ correspond to PCA-0.95, PCA-0.5, and PCA-0.1, respectively (best results in bold; the PCA-0.5 results of our method are underlined).
Methods \ $\gamma$ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5
ABOD | 0.944, 0.942, 0.941 | 0.837, 0.832, 0.869 | 0.707, 0.708, 0.715 | 0.497, 0.489, 0.525 | 0.223, 0.199, 0.288
OCSVM | 0.932, 0.914, 0.921 | 0.884, 0.894, 0.867 | 0.837, 0.869, 0.827 | 0.782, 0.830, 0.771 | 0.717, 0.790, 0.699
DRAE | 0.955, 0.947, 0.928 | 0.918, 0.924, 0.878 | 0.873, 0.914, 0.835 | 0.873, 0.902, 0.773 | 0.869, 0.887, 0.692
Ours | \textbf{0.966}, \underline{\textbf{0.986}}, \textbf{0.949} | \textbf{0.950}, \underline{\textbf{0.984}}, \textbf{0.923} | \textbf{0.934}, \underline{\textbf{0.978}}, \textbf{0.897} | \textbf{0.916}, \underline{\textbf{0.973}}, \textbf{0.871} | \textbf{0.899}, \underline{\textbf{0.958}}, \textbf{0.844}

3.4 Comparisons of Time Complexities

From Sections 3.2 and 3.3, we know that our method achieves robust and competitive performance in terms of accuracy. In this section, we compare the time complexities of the four algorithms.

ABOD has time complexity $O(n^3 d)$. In the experiment, we use its speed-up variant FastABOD, which has the reduced time complexity $O((n^2+nk^2)d)$, where $k$ is a specified parameter.

OCSVM is formulated as a quadratic program with time complexity $O(n^3)$.

DRAE alternately executes the following two steps: discriminative labeling and reconstruction learning. Suppose it runs for $N_1$ rounds; the two inner steps are themselves iterative procedures, each running $N_2$ iterations. Thus, the total time complexity of DRAE is $O(N_1 N_2 h d n)$, where $h$ is the number of hidden-layer nodes, which can generally be expressed as $d/m$ for a constant $m$; the total time complexity then becomes $O(\tilde{C} n d^2)$, where $\tilde{C}$ is a large constant depending on $N_1$, $N_2$, and $m$.

When the number of points $n$ is large, FastABOD, OCSVM, and DRAE become very time-consuming. In contrast, our algorithm takes only linear running time (see Section 2.3) and usually runs much faster in practice; for example, in our experiments it often takes less than half of the time consumed by each of the other three methods.

4 Extension for Multi-class Inliers

All three compared methods in Section 3 can only handle one class of inliers. However, in many real scenarios the data may contain multiple classes of inliers. For example, a given image dataset may contain the images of “dog” and “cat”, as well as a certain fraction of outliers, so it is necessary to recognize multiple dense regions in the feature space. Fortunately, our proposed algorithm for MEB with outliers can be naturally extended to multi-class inliers. Instead of building one ball, we perform the following greedy peeling strategy to extract multiple balls: first, we take a small random sample from the input to roughly estimate the fractions of the classes; then we iteratively run the algorithm for MEB with outliers and remove the covered points each time, until the desired number of balls is obtained (a sketch of this strategy is given below). Roughly speaking, we reduce the problem of multi-class inliers to a series of problems with one-class inliers. The extended algorithm for multi-class inliers is evaluated on two datasets, a random dataset in $\mathbb{R}^{100}$ and Caltech-256.
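The following is a minimal Python sketch of this greedy peeling strategy (the function name, the per-round outlier-ratio estimate, and meb_with_outliers, which stands for the single-ball procedure of Section 2 and is assumed to return a center and a radius, are our own illustrative choices):

import numpy as np

def greedy_peeling(P, class_fractions, meb_with_outliers):
    remaining = P.copy()
    balls = []
    for frac in class_fractions:                    # estimated fraction of each inlier class
        # every point outside the current class is treated as an "outlier" in this round
        gamma = max(0.0, 1.0 - frac * len(P) / len(remaining))
        center, radius = meb_with_outliers(remaining, gamma)
        covered = np.linalg.norm(remaining - center, axis=1) <= radius
        balls.append((center, radius))
        remaining = remaining[~covered]             # peel off the covered points
    return balls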

Random dataset. We generate three classes of inliers following different normal distributions, and the outliers following a uniform distribution, in $\mathbb{R}^{100}$. For each outlier ratio $\gamma$, we report the three $F1$ scores (with respect to the three classes of inliers) and their average in Table 6 (a).

Caltech-256. We randomly select three image sets from Caltech-256 as the three classes of inliers, and an extra set of mixed images from the remaining image sets as the outliers. Moreover, we point out that recognizing multi-class inliers in real image sets is much more challenging than the single-class case; we believe this is due to the following two reasons: (1) the multiple classes of inliers can mutually overlap in the feature space, and (2) the outlier ratio with respect to each class is usually large (for example, with 3 classes in total, the outlier ratio for class 1 must also account for the fractions of classes 2 and 3). We use the PCA-VGG-0.5 feature in our experiment, and the performance is very robust (see Table 6 (b)).

Table 6: The $F1$ scores of our extended algorithm for multi-class inliers.

(a) Random dataset
$\gamma$ | 0.1 | 0.2 | 0.3 | 0.4
$F1$ (per class) | 0.970, 0.995, 0.994 | 0.985, 0.995, 0.994 | 0.947, 0.995, 0.943 | 0.995, 0.995, 0.963
AVG | 0.986 | 0.991 | 0.962 | 0.984

(b) Caltech-256
$\gamma$ | 0.1 | 0.2 | 0.3 | 0.4
$F1$ (per class) | 0.995, 0.993, 0.960 | 0.995, 0.953, 0.951 | 0.995, 0.913, 0.968 | 0.994, 0.928, 0.870
AVG | 0.983 | 0.966 | 0.959 | 0.931

5 Conclusion

In this paper, we present a new approach for outlier recognition in high dimension. Most existing methods have high time and space complexities or cannot achieve a quality-guaranteed solution. In contrast, we show that our algorithm yields a nearly optimal solution with time and space complexities linear in the input size and dimensionality. More importantly, our algorithm can be extended to efficiently solve instances with multi-class inliers. Furthermore, our experimental results suggest that our approach outperforms several popular existing methods in terms of accuracy.

References

  • [1] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Geometric approximation via coresets. Combinatorial and Computational Geometry, 52:1–30, 2005.
  • [2] P. K. Agarwal, S. Har-Peled, and H. Yu. Robust shape fitting via peeling and grating coresets. Discrete & Computational Geometry, 39(1-3):38–58, 2008.
  • [3] C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. ACM Sigmod Record, 30(2):37–46, 2001.
  • [4] M. Badoiu and K. L. Clarkson. Smaller core-sets for balls. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 801–802, 2003.
  • [5] M. Badoiu, S. Har-Peled, and P. Indyk. Approximate clustering via core-sets. In Proceedings of the ACM Symposium on Theory of Computing (STOC), pages 250–257, 2002.
  • [6] M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448–461, 1973.
  • [7] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. Lof: identifying density-based local outliers. ACM Sigmod Record, 29(2):93–104, 2000.
  • [8] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
  • [9] K. L. Clarkson. Coresets, sparse greedy approximation, and the frank-wolfe algorithm. ACM Transactions on Algorithms, 6(4):63, 2010.
  • [10] H. Ding and J. Xu. Solving the chromatic cone clustering problem via minimum spanning sphere. In Proceedings of the International Colloquium on Automata, Languages, and Programming (ICALP), pages 773–784, 2011.
  • [11] H. Ding and J. Xu. Sub-linear time hybrid approximations for least trimmed squares estimator and related problems. In Proceedings of the International Symposium on Computational geometry (SoCG), page 110, 2014.
  • [12] H. Ding and J. Xu. Random gradient descent tree: A combinatorial approach for svm with outliers. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 2561–2567, 2015.
  • [13] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 226–231, 1996.
  • [14] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer vision and Image understanding, 106(1):59–70, 2007.
  • [15] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.
  • [16] B. Gärtner and M. Jaggi. Coresets for polytope distance. In Proceedings of the International Symposium on Computational geometry (SoCG), pages 33–42, 2009.
  • [17] E. G. Gilbert. An iterative procedure for computing the minimum of a quadratic form on a convex set. SIAM Journal on Control, 4(1):61–80, 1966.
  • [18] M. Gupta, J. Gao, C. Aggarwal, and J. Han. Outlier detection for temporal data. Synthesis Lectures on Data Mining and Knowledge Discovery, 5(1):1–129, 2014.
  • [19] S. Har-Peled and Y. Wang. Shape fitting with outliers. SIAM Journal on Computing, 33(2):269–285, 2004.
  • [20] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  • [21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Outlier detection in axis-parallel subspaces of high dimensional data. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pages 831–838, 2009.
  • [22] H.-P. Kriegel, P. Kröger, and A. Zimek. Outlier detection techniques. Tutorial at PAKDD, 2009.
  • [23] H.-p. Kriegel, M. Schubert, and A. Zimek. Angle-based outlier detection in high-dimensional data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 444–452, 2008.
  • [24] P. Kumar, J. S. B. Mitchell, and E. A. Yildirim. Approximate minimum enclosing balls in high dimensions using core-sets. ACM Journal of Experimental Algorithmics, 8, 2003.
  • [25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [26] W. Liu, G. Hua, and J. R. Smith. Unsupervised one-class learning for automatic outlier removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3826–3833, 2014.
  • [27] C. Lu, J. Shi, and J. Jia. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2720–2727, 2013.
  • [28] J. M. Phillips. Coresets and sketches. Computing Research Repository, 2016.
  • [29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
  • [30] M. Sakurada and T. Yairi. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the Workshop on Machine Learning for Sensory Data Analysis (MLSDA), page 4, 2014.
  • [31] B. Schölkopf, R. Williamson, A. Smola, J. Shawe-Taylor, and J. Platt. Support vector method for novelty detection. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), pages 582–588, 1999.
  • [32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [33] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. 2006.
  • [34] Y. Xia, X. Cao, F. Wen, G. Hua, and J. Sun. Learning discriminative reconstructions for unsupervised outlier removal. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1511–1519, 2015.
  • [35] H. Zarrabi-Zadeh and A. Mukhopadhyay. Streaming 1-center with outliers in high dimensions. In Proceedings of the Canadian Conference on Computational Geometry (CCCG), pages 83–86, 2009.