
Feature Selection: A Data Perspective

Jundong Li (Arizona State University), Kewei Cheng (Arizona State University), Suhang Wang (Arizona State University), Fred Morstatter (Arizona State University), Robert P. Trevino (Arizona State University), Jiliang Tang (Michigan State University), Huan Liu (Arizona State University)
Abstract

Feature selection, as a data preprocessing strategy, has been proven to be effective and efficient in preparing data (especially high-dimensional data) for various data mining and machine learning problems. The objectives of feature selection include: building simpler and more comprehensible models, improving data mining performance, and preparing clean, understandable data. The recent proliferation of big data has presented some substantial challenges and opportunities to feature selection. In this survey, we provide a comprehensive and structured overview of recent advances in feature selection research. Motivated by current challenges and opportunities in the era of big data, we revisit feature selection research from a data perspective and review representative feature selection algorithms for conventional data, structured data, heterogeneous data and streaming data. Methodologically, to emphasize the differences and similarities of most existing feature selection algorithms for conventional data, we categorize them into four main groups: similarity based, information theoretical based, sparse learning based and statistical based methods. To facilitate and promote the research in this community, we also present an open-source feature selection repository that consists of most of the popular feature selection algorithms (http://featureselection.asu.edu/). Also, we use it as an example to show how to evaluate feature selection algorithms. At the end of the survey, we present a discussion about some open problems and challenges that require more attention in future research.

keywords:
Feature Selection

CCS Concepts: Computing methodologies → Feature selection

ACM Reference Format:
Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P. Trevino, Jiliang Tang, and Huan Liu. 2017. Feature Selection: A Data Perspective.

This material is based upon work supported, in whole or in part, by NSF grants 1217466 and 1614576.

Author’s addresses: J. Li, K. Cheng, S. Wang, F. Morstatter, R.P. Trevino, H. Liu, Computer Science and Engineering, Arizona State University, Tempe, AZ, 85281; email: {jundongl, kcheng18, swang187, fmorstat, rptrevin, huan.liu}@asu.edu; J. Tang, Michigan State University, East Lansing, MI 48824; email: tangjili@msu.edu.

1 Introduction

We are now in the era of big data, where huge amounts of high-dimensional data have become ubiquitous in a variety of domains, such as social media, healthcare, bioinformatics and online education. The rapid growth of data presents challenges for effective and efficient data management. It is therefore desirable to apply data mining and machine learning techniques to automatically discover knowledge from data of various sorts.

When data mining and machine learning algorithms are applied to high-dimensional data, a critical issue is the curse of dimensionality: data becomes sparser in high-dimensional space, adversely affecting algorithms designed for low-dimensional space [Hastie et al. (2005)]. In addition, with a large number of features, learning models tend to overfit, which may cause performance degradation on unseen data. Data of high dimensionality also significantly increases memory storage requirements and computational costs for data analytics.

Dimensionality reduction is one of the most powerful tools to address the previously described issues. It falls into two main categories: feature extraction and feature selection. Feature extraction projects the original high-dimensional features onto a new feature space of low dimensionality. The newly constructed features are usually linear or nonlinear combinations of the original features. Feature selection, on the other hand, directly selects a subset of relevant features for model construction [Guyon and Elisseeff (2003), Liu and Motoda (2007)].

Both feature extraction and feature selection have the advantages of improving learning performance, increasing computational efficiency, decreasing memory storage, and building models that generalize better. Hence, both are regarded as effective dimensionality reduction techniques. On one hand, for many applications where the raw input data does not contain any features understandable to a given learning algorithm, feature extraction is preferred. On the other hand, as feature extraction creates a set of new features, further analysis is problematic because the physical meanings of these features are lost. In contrast, by keeping some of the original features, feature selection maintains the physical meanings of the original features and gives models better readability and interpretability. Therefore, feature selection is often preferred in many applications such as text mining and genetic analysis. It should be noted that even when feature dimensionality is not particularly high, feature extraction/selection still plays an essential role in improving learning performance, preventing overfitting, and reducing computational costs.

Figure 1: An illustrative example of relevant, redundant and irrelevant features: (a) relevant feature $f_1$; (b) redundant feature $f_2$; (c) irrelevant feature $f_3$.

Real-world data contains a lot of irrelevant, redundant and noisy features. Removing these features by feature selection reduces storage and computational cost while avoiding significant loss of information or degradation of learning performance. For example, in Fig. 1(a), feature $f_1$ is a relevant feature that is able to discriminate the two classes (clusters). However, given feature $f_1$, feature $f_2$ in Fig. 1(b) is redundant as $f_2$ is strongly correlated with $f_1$. In Fig. 1(c), feature $f_3$ is an irrelevant feature as it cannot separate the two classes (clusters) at all. Therefore, the removal of $f_2$ and $f_3$ will not negatively impact the learning performance.
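The situation in Fig. 1 can be mimicked with a small synthetic sketch; the distributions below are our own assumption and not the data behind the figure. Feature f1 separates two clusters, f2 is a noisy copy of f1 (redundant given f1), and f3 is pure noise (irrelevant):

import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two classes (clusters): f1 separates them, f2 is a noisy copy of f1, f3 is noise.
y = np.repeat([0, 1], n // 2)
f1 = np.where(y == 0, rng.normal(-2.0, 0.5, n), rng.normal(2.0, 0.5, n))
f2 = 0.9 * f1 + rng.normal(0.0, 0.2, n)
f3 = rng.normal(0.0, 1.0, n)

for name, f in [("f1", f1), ("f2", f2), ("f3", f3)]:
    print(name,
          "corr with label: %.2f" % abs(np.corrcoef(f, y)[0, 1]),
          "corr with f1: %.2f" % abs(np.corrcoef(f, f1)[0, 1]))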

1.1 Traditional Categorization of Feature Selection Algorithms

1.1.1 Supervision Perspective

According to the availability of supervision (such as class labels in classification problems), feature selection methods can be broadly classified as supervised, unsupervised and semi-supervised.

Supervised feature selection is generally designed for classification or regression problems. It aims to select a subset of features that are able to discriminate samples from different classes (classification) or to approximate the regression targets (regression). With supervision information, feature relevance is usually assessed via its correlation with the class labels or the regression target. The training phase highly depends on the selected features: after splitting the data into training and testing sets, classifiers or regression models are trained on the subset of features selected by supervised feature selection. Note that the feature selection phase can be independent of the learning algorithms (filter methods); it may iteratively take advantage of the learning performance of a classifier or a regression model to assess the quality of the features selected so far (wrapper methods); or it may make use of the intrinsic structure of a learning algorithm to embed feature selection into the underlying model (embedded methods). Finally, the trained classifier or regression model predicts the class labels or regression targets of unseen samples in the test set using the selected features. In the following, for supervised methods, we mainly focus on classification problems and use the terms label information and supervision information interchangeably.
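As a concrete illustration of this pipeline, the hedged sketch below uses scikit-learn: features are scored on the training split only (a filter-style criterion), and the trained classifier predicts the test split using the selected features. The dataset, the ANOVA F-score criterion and the logistic regression classifier are illustrative assumptions, not recommendations of this survey.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# (1) supervised feature selection on the training set only
selector = SelectKBest(score_func=f_classif, k=10).fit(X_train, y_train)
# (2) train a classifier on the selected features and predict the test set
clf = LogisticRegression(max_iter=5000).fit(selector.transform(X_train), y_train)
y_pred = clf.predict(selector.transform(X_test))
print("accuracy with 10 selected features:", accuracy_score(y_test, y_pred))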

Unsupervised feature selection is generally designed for clustering problems. As acquiring labeled data is particularly expensive in both time and effort, unsupervised feature selection has gained considerable attention recently. Without label information to evaluate the importance of features, unsupervised feature selection methods seek alternative criteria to define feature relevance. Different from supervised feature selection, unsupervised feature selection usually uses all available instances in the feature selection phase. The feature selection phase can be independent of the unsupervised learning algorithms (filter methods); it may rely on the learning algorithms to iteratively improve the quality of selected features (wrapper methods); or it may embed the feature selection phase into unsupervised learning algorithms (embedded methods). After the feature selection phase, the cluster structure of all data samples on the selected features is obtained by applying a standard clustering algorithm [Guyon and Elisseeff (2003), Liu and Motoda (2007), Tang et al. (2014a)].

Supervised feature selection works when sufficient label information is available, while unsupervised feature selection algorithms do not require any class labels. However, in many real-world applications we usually have only a limited amount of labeled data. Therefore, it is desirable to develop semi-supervised methods that exploit both labeled and unlabeled data samples.

1.1.2 Selection Strategy Perspective

Concerning different selection strategies, feature selection methods can be broadly categorized as wrapper, filter and embedded methods.

Wrapper methods rely on the predictive performance of a predefined learning algorithm to evaluate the quality of selected features. Given a specific learning algorithm, a typical wrapper method performs two steps: (1) search for a subset of features; and (2) evaluate the selected features. It repeats (1) and (2) until some stopping criterion is satisfied. The feature set search component first generates a subset of features; then the learning algorithm acts as a black box to evaluate the quality of these features based on the learning performance. The whole process works iteratively until, for example, the highest learning performance is achieved or the desired number of selected features is obtained. Then the feature subset that gives the highest learning performance is returned as the selected features. Unfortunately, a known issue of wrapper methods is that the search space for $d$ features is $2^d$, which is impractical when $d$ is very large. Therefore, different search strategies such as sequential search [Guyon and Elisseeff (2003)], hill-climbing search, best-first search [Kohavi and John (1997), Arai et al. (2016)], branch-and-bound search [Narendra and Fukunaga (1977)] and genetic algorithms [Golberg (1989)] have been proposed to yield a locally optimal learning performance. However, the search space is still extremely large for high-dimensional datasets. As a result, wrapper methods are seldom used in practice.

Filter methods are independent of any learning algorithms. They rely on characteristics of data to assess feature importance. Filter methods are typically more computationally efficient than wrapper methods. However, due to the lack of a specific learning algorithm guiding the feature selection phase, the selected features may not be optimal for the target learning algorithms. A typical filter method consists of two steps. In the first step, feature importance is ranked according to some feature evaluation criteria. The feature importance evaluation process can be either univariate or multivariate. In the univariate scheme, each feature is ranked individually regardless of other features, while the multivariate scheme ranks multiple features in a batch way. In the second step of a typical filter method, lowly ranked features are filtered out. In the past decades, different evaluation criteria for filter methods have been proposed. Some representative criteria include feature discriminative ability to separate samples [Kira and Rendell (1992), Robnik-Šikonja and Kononenko (2003), Yang et al. (2011), Du et al. (2013), Tang et al. (2014b)], feature correlation [Koller and Sahami (1995), Guyon and Elisseeff (2003)], mutual information [Yu and Liu (2003), Peng et al. (2005), Nguyen et al. (2014), Shishkin et al. (2016), Gao et al. (2016)], feature ability to preserve data manifold structure [He et al. (2005), Zhao and Liu (2007), Gu et al. (2011b), Jiang and Ren (2011)], and feature ability to reconstruct the original data [Masaeli et al. (2010), Farahat et al. (2011), Li et al. (2017b)].
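The two-step structure of a univariate filter method can be sketched generically as follows; the variance scorer is only a placeholder, and any criterion discussed later in this survey (e.g., Fisher Score or mutual information) could be plugged in instead:

import numpy as np

def univariate_filter(X, score_fn, k):
    # Step 1: score every feature individually with the chosen criterion.
    scores = np.array([score_fn(X[:, j]) for j in range(X.shape[1])])
    # Step 2: keep the k highest-ranked features and filter out the rest.
    return np.argsort(scores)[::-1][:k], scores

X = np.random.default_rng(0).normal(size=(100, 20))
selected, scores = univariate_filter(X, np.var, k=5)
print("indices of the selected features:", selected)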

Embedded methods provide a trade-off between filter and wrapper methods by embedding feature selection into model learning. Thus they inherit the merits of wrapper and filter methods: (1) they include the interactions with the learning algorithm; and (2) they are far more efficient than wrapper methods since they do not need to evaluate feature sets iteratively. The most widely used embedded methods are regularization models, which fit a learning model by minimizing the fitting errors while simultaneously forcing the feature coefficients to be small (or exactly zero). Afterwards, both the regularization model and the selected feature set are returned as the final results.

It should be noted that some literature classifies feature selection methods into four categories (from the selection strategy perspective) by including the hybrid feature selection methods [Saeys et al. (2007), Shen et al. (2012), Ang et al. (2016)]. Hybrid methods can be regarded as a combination of multiple feature selection algorithms (e.g., wrapper, filter, and embedded). The main target is to tackle the instability and perturbation issues of many existing feature selection algorithms. For example, for small-sized high-dimensional data, a small perturbation on the training data may result in totally different feature selection results. By aggregating multiple selected feature subsets from different methods together, the results are more robust and hence the credibility of the selected features is enhanced.

1.2 Feature Selection Algorithms from a Data Perspective

Figure 2: Feature selection algorithms from the data perspective.

The recent popularity of big data presents unique challenges for traditional feature selection [Li and Liu (2017)], and some characteristics of big data such as velocity and variety necessitate the development of novel feature selection algorithms. Here we briefly discuss some major concerns when applying feature selection algorithms.

Streaming data and features have become more and more prevalent in real-world applications. They pose challenges to traditional feature selection algorithms, which are designed for static datasets with fixed data samples and features. For example, in Twitter, new data such as posts and new features such as slang words are continuously generated by users. It is impractical to apply traditional batch-mode feature selection algorithms to find relevant features from scratch every time new data or new features arrive. Moreover, the volume of data may be too large to be loaded into memory. In many cases, a single scan of the data is desired, as further scans are either expensive or impractical. For these reasons, it is appealing to apply feature selection in a streaming fashion to dynamically maintain a set of relevant features.

Most existing feature selection algorithms are designed to handle tasks with a single data source and always assume that data is independent and identically distributed (i.i.d.). However, in many applications data can come from multiple sources. For example, in social media, data comes from heterogeneous sources such as text, images, tags and videos. In addition, linked data is ubiquitous and appears in various forms such as user-post relations and user-user relations. The availability of multiple data sources brings unprecedented opportunities, as we can leverage their shared intrinsic characteristics and correlations to find more relevant features. However, challenges are also unequivocally present. For instance, with link information, the widely adopted i.i.d. assumption in most learning algorithms does not hold. How to appropriately utilize link information for feature selection is still a challenging problem.

Features can also exhibit certain types of structures. Some well-known structures among features are group, tree, and graph structures. When performing feature selection, if the feature structure is not taken into consideration, the intrinsic dependencies may not be captured, thus the selected features may not be suitable for the target application. Incorporating prior knowledge of feature structures can help select relevant features to improve the learning performance greatly.

The aforementioned reasons motivate the investigation of feature selection algorithms from a different view. In this survey, we revisit feature selection algorithms from a data perspective; the categorization is illustrated in Fig. 2. Data consists of static data and streaming data, and static data can be further grouped into conventional data and heterogeneous data. In conventional data, features can either be flat or possess some inherent structures. Traditional feature selection algorithms are designed to deal with flat features, which are considered to be independent. The past few decades have witnessed hundreds of such feature selection algorithms. Based on their technical characteristics, we propose to classify them into four main groups, i.e., similarity based, information theoretical based, sparse learning based and statistical based methods. It should be noted that this categorization only involves filter methods and embedded methods; wrapper methods are excluded because they are computationally expensive and are usually used only in specific applications. More details about these four categories will be presented later. We also present other methods that do not fit into these four categories, such as hybrid methods, deep learning based methods and reconstruction based methods. When features exhibit certain structures, specific feature selection algorithms are more desirable. Data can also be heterogeneous, coming from multiple sources and possibly linked; hence, we also show how new feature selection algorithms cope with these situations. Finally, in the streaming setting, data arrives sequentially and the number of data instances is unknown in advance; feature selection algorithms that make only one pass over the data are proposed accordingly. Similarly, in an orthogonal setting, features can also be generated dynamically; streaming feature selection algorithms are designed to determine whether to accept newly added features and whether to remove existing but outdated features.

1.3 Differences with Existing Surveys

Currently, there exist some other surveys that summarize feature selection algorithms, such as [Guyon and Elisseeff (2003), Alelyani et al. (2013), Chandrashekar and Sahin (2014), Tang et al. (2014a)]. These studies either focus on traditional feature selection algorithms or on specific learning tasks such as classification and clustering. However, none of them provides a comprehensive and structured overview of traditional feature selection algorithms in conjunction with recent advances in feature selection from a data perspective. In this survey, we introduce representative feature selection algorithms covering all components mentioned in Fig. 2. We also release a feature selection repository in Python named scikit-feature, which is built upon the widely used machine learning package scikit-learn (http://scikit-learn.org/stable/) and two scientific computing packages, Numpy (http://www.numpy.org/) and Scipy (http://www.scipy.org/). It includes nearly 40 representative feature selection algorithms. The web page of the repository is available at http://featureselection.asu.edu/.

1.4 Organization of the Survey

We present this survey in seven parts, and the covered topics are listed as follows:

  1. Traditional Feature Selection for Conventional Data (Section 2)
     (a) Similarity based Feature Selection Methods
     (b) Information Theoretical based Feature Selection Methods
     (c) Sparse Learning based Feature Selection Methods
     (d) Statistical based Feature Selection Methods
     (e) Other Methods

  2. Feature Selection with Structured Features (Section 3)
     (a) Feature Selection Algorithms with Group Structure Features
     (b) Feature Selection Algorithms with Tree Structure Features
     (c) Feature Selection Algorithms with Graph Structure Features

  3. Feature Selection with Heterogeneous Data (Section 4)
     (a) Feature Selection Algorithms with Linked Data
     (b) Multi-Source Feature Selection
     (c) Multi-View Feature Selection

  4. Feature Selection with Streaming Data (Section 5)
     (a) Feature Selection Algorithms with Data Streams
     (b) Feature Selection Algorithms with Feature Streams

  5. Performance Evaluation (Section 6)

  6. Open Problems and Challenges (Section 7)

  7. Summary of the Survey (Section 8)

1.5 Notations

We summarize some symbols used throughout this survey in Table 1. We use bold uppercase characters for matrices (e.g., $\mathbf{A}$), bold lowercase characters for vectors (e.g., $\mathbf{a}$), and calligraphic fonts for sets (e.g., $\mathcal{F}$). Following Matlab conventions, the $i$-th row of matrix $\mathbf{A}$ is denoted as $\mathbf{A}(i,:)$, the $j$-th column of $\mathbf{A}$ as $\mathbf{A}(:,j)$, the $(i,j)$-th entry of $\mathbf{A}$ as $\mathbf{A}(i,j)$, the transpose of $\mathbf{A}$ as $\mathbf{A}'$, and the trace of $\mathbf{A}$ as $tr(\mathbf{A})$. For any matrix $\mathbf{A}\in\mathbb{R}^{n\times d}$, its Frobenius norm is defined as $\|\mathbf{A}\|_F=\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{d}\mathbf{A}(i,j)^2}$, and its $\ell_{2,1}$-norm is $\|\mathbf{A}\|_{2,1}=\sum_{i=1}^{n}\sqrt{\sum_{j=1}^{d}\mathbf{A}(i,j)^2}$. For any vector $\mathbf{a}=[a_1,a_2,...,a_n]'$, its $\ell_2$-norm is defined as $\|\mathbf{a}\|_2=\sqrt{\sum_{i=1}^{n}a_i^2}$, and its $\ell_1$-norm is $\|\mathbf{a}\|_1=\sum_{i=1}^{n}|a_i|$. $\mathbf{I}$ is an identity matrix and $\mathbf{1}$ is a vector whose elements are all 1's.

Notation   Definition or Description
$n$   number of instances in the data
$d$   number of features in the data
$k$   number of selected features
$c$   number of classes (if they exist)
$\mathcal{F}$   original feature set, which contains $d$ features
$\mathcal{S}$   selected feature set, which contains $k$ selected features
$\{i_1,i_2,...,i_k\}$   indices of the $k$ selected features in $\mathcal{S}$
$f_1,f_2,...,f_d$   the $d$ original features
$f_{i_1},f_{i_2},...,f_{i_k}$   the $k$ selected features
$x_1,x_2,...,x_n$   the $n$ data instances
$\mathbf{f}_1,\mathbf{f}_2,...,\mathbf{f}_d$   the $d$ feature vectors corresponding to $f_1,f_2,...,f_d$
$\mathbf{f}_{i_1},\mathbf{f}_{i_2},...,\mathbf{f}_{i_k}$   the $k$ feature vectors corresponding to $f_{i_1},f_{i_2},...,f_{i_k}$
$\mathbf{x}_1,\mathbf{x}_2,...,\mathbf{x}_n$   the $n$ data vectors corresponding to $x_1,x_2,...,x_n$
$y_1,y_2,...,y_n$   class labels of all $n$ instances (if they exist)
$\mathbf{X}\in\mathbb{R}^{n\times d}$   data matrix with $n$ instances and $d$ features
$\mathbf{X}_{\mathcal{S}}\in\mathbb{R}^{n\times k}$   data matrix on the $k$ selected features
$\mathbf{y}\in\mathbb{R}^{n}$   class label vector for all $n$ instances (if it exists)
Table 1: Symbols.
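As a quick numerical sanity check of the norm definitions in Section 1.5, the short snippet below computes the Frobenius norm and the $\ell_{2,1}$-norm of a matrix, and the $\ell_2$- and $\ell_1$-norms of a vector, with NumPy:

import numpy as np

A = np.array([[1.0, -2.0], [3.0, 4.0]])
a = np.array([1.0, -2.0, 2.0])

frobenius = np.sqrt((A ** 2).sum())              # ||A||_F
l21 = np.sqrt((A ** 2).sum(axis=1)).sum()        # ||A||_{2,1}: sum of row l2-norms
l2 = np.sqrt((a ** 2).sum())                     # ||a||_2
l1 = np.abs(a).sum()                             # ||a||_1

print(frobenius, np.linalg.norm(A, "fro"))       # the two values agree
print(l21, l2, l1)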

2 Feature Selection on Conventional Data

Over the past two decades, hundreds of feature selection algorithms have been proposed. In this section, according to the techniques used, we broadly group traditional feature selection algorithms for conventional data into similarity based, information theoretical based, sparse learning based, statistical based, and other methods.

2.1 Similarity based Methods

Different feature selection algorithms exploit various types of criteria to define the relevance of features. Among them, a family of methods assesses feature importance by the ability of features to preserve data similarity. We refer to them as similarity based methods. For supervised feature selection, data similarity can be derived from label information, while most unsupervised feature selection methods take advantage of different distance metrics to obtain data similarity.

Given a dataset $\mathbf{X}\in\mathbb{R}^{n\times d}$ with $n$ instances and $d$ features, pairwise similarity among instances can be encoded in an affinity matrix $\mathbf{S}\in\mathbb{R}^{n\times n}$. Suppose that we want to select the $k$ most relevant features $\mathcal{S}$; one way is to maximize their utility: $\max_{\mathcal{S}}U(\mathcal{S})$, where $U(\mathcal{S})$ denotes the utility of the feature subset $\mathcal{S}$. As algorithms in this family often evaluate features individually, the utility maximization over the feature subset $\mathcal{S}$ can be further decomposed into the following form:

\max_{\mathcal{S}}U(\mathcal{S})=\max_{\mathcal{S}}\sum_{f\in\mathcal{S}}U(f)=\max_{\mathcal{S}}\sum_{\mathbf{f}\in\mathcal{S}}\hat{\mathbf{f}}^{\prime}\hat{\mathbf{S}}\hat{\mathbf{f}},   (1)

where $U(f)$ is a utility function for feature $f$, $\hat{\mathbf{f}}$ denotes the transformation (e.g., scaling, normalization) of the original feature vector $\mathbf{f}$, and $\hat{\mathbf{S}}$ is a new affinity matrix obtained from the affinity matrix $\mathbf{S}$. The maximization problem in Eq. (1) shows that we would select a subset of features from $\mathcal{S}$ such that they can well preserve the data manifold structure encoded in $\hat{\mathbf{S}}$. This problem is usually solved by greedily selecting the top $k$ features that maximize their individual utility. Methods in this category vary in the way the affinity matrix $\hat{\mathbf{S}}$ is designed. We subsequently discuss some representative algorithms in this group that can be reformulated under this unified utility maximization framework.
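A hedged sketch of the utility maximization in Eq. (1) is given below: each (transformed) feature vector is scored by $\hat{\mathbf{f}}^{\prime}\hat{\mathbf{S}}\hat{\mathbf{f}}$ and the top-$k$ features are picked greedily. The RBF affinity matrix and the identity feature transformation are illustrative assumptions; the concrete methods in this section differ precisely in how $\hat{\mathbf{S}}$ and $\hat{\mathbf{f}}$ are defined.

import numpy as np

def similarity_based_selection(X, S_hat, k, transform=lambda f: f):
    # score each feature by f_hat' * S_hat * f_hat and keep the top-k
    scores = np.array([transform(X[:, j]) @ S_hat @ transform(X[:, j])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k], scores

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
S_hat = np.exp(-sq_dists / sq_dists.mean())      # one possible affinity matrix
selected, scores = similarity_based_selection(X, S_hat, k=3)
print("selected features:", selected)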

2.1.1 Laplacian Score

Laplacian Score [He et al. (2005)] is an unsupervised feature selection algorithm which selects features that can best preserve the data manifold structure. It consists of three phases. First, it constructs the affinity matrix such that $\mathbf{S}(i,j)=e^{-\frac{\|\mathbf{x}_i-\mathbf{x}_j\|_2^2}{t}}$ if $\mathbf{x}_i$ is among the $p$-nearest neighbors of $\mathbf{x}_j$, and $\mathbf{S}(i,j)=0$ otherwise. Then, the diagonal matrix $\mathbf{D}$ is defined as $\mathbf{D}(i,i)=\sum_{j=1}^{n}\mathbf{S}(i,j)$ and the Laplacian matrix is $\mathbf{L}=\mathbf{D}-\mathbf{S}$. Lastly, the Laplacian Score of each feature $f_i$ is computed as:

laplacian\_score(f_i)=\frac{\tilde{\mathbf{f}}_i^{\prime}\mathbf{L}\tilde{\mathbf{f}}_i}{\tilde{\mathbf{f}}_i^{\prime}\mathbf{D}\tilde{\mathbf{f}}_i}, \mbox{ where } \tilde{\mathbf{f}}_i=\mathbf{f}_i-\frac{\mathbf{f}_i^{\prime}\mathbf{D}\mathbf{1}}{\mathbf{1}^{\prime}\mathbf{D}\mathbf{1}}\mathbf{1}.   (2)

As Laplacian Score evaluates each feature individually, the task of selecting $k$ features can be solved by greedily picking the top $k$ features with the smallest Laplacian Scores. The Laplacian Score of each feature can be reformulated as:

laplacian\_score(f_i)=1-\left(\frac{\tilde{\mathbf{f}}_i}{\|\mathbf{D}^{\frac{1}{2}}\tilde{\mathbf{f}}_i\|_2}\right)^{\prime}\mathbf{S}\left(\frac{\tilde{\mathbf{f}}_i}{\|\mathbf{D}^{\frac{1}{2}}\tilde{\mathbf{f}}_i\|_2}\right),   (3)

where $\|\mathbf{D}^{\frac{1}{2}}\tilde{\mathbf{f}}_i\|_2$ is the standard data variance of feature $f_i$, and the term $\tilde{\mathbf{f}}_i/\|\mathbf{D}^{\frac{1}{2}}\tilde{\mathbf{f}}_i\|_2$ can be interpreted as a normalized feature vector of $\mathbf{f}_i$. Therefore, Laplacian Score is a special case of the utility maximization in Eq. (1).
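A compact sketch of Laplacian Score following Eqs. (2)-(3) is shown below: build a $p$-nearest-neighbor RBF affinity matrix $\mathbf{S}$, form $\mathbf{D}$ and $\mathbf{L}=\mathbf{D}-\mathbf{S}$, center each feature with respect to $\mathbf{D}$, and score it by $\tilde{\mathbf{f}}^{\prime}\mathbf{L}\tilde{\mathbf{f}}/\tilde{\mathbf{f}}^{\prime}\mathbf{D}\tilde{\mathbf{f}}$ (smaller is better). The parameter choices ($p$, $t$) are illustrative, and this is a sketch rather than the implementation in the scikit-feature repository.

import numpy as np

def laplacian_score(X, p=5, t=1.0):
    n, d = X.shape
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq_dists / t)
    # keep only (symmetrized) p-nearest-neighbor edges
    np.fill_diagonal(sq_dists, np.inf)
    knn = np.argsort(sq_dists, axis=1)[:, :p]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), p), knn.ravel()] = True
    S = np.where(mask | mask.T, S, 0.0)
    D = np.diag(S.sum(axis=1))
    L = D - S
    ones = np.ones(n)
    scores = np.empty(d)
    for j in range(d):
        f = X[:, j]
        f_tilde = f - (f @ D @ ones) / (ones @ D @ ones)   # D-weighted centering
        scores[j] = (f_tilde @ L @ f_tilde) / (f_tilde @ D @ f_tilde)
    return scores   # select the k features with the smallest scores

X = np.random.default_rng(0).normal(size=(60, 8))
print("3 best features:", np.argsort(laplacian_score(X))[:3])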

2.1.2 SPEC

SPEC [Zhao and Liu (2007)] is an extension of Laplacian Score that works in both supervised and unsupervised scenarios. In the unsupervised scenario, the data similarity is measured by an RBF kernel; in the supervised scenario, the data similarity can be defined as $\mathbf{S}(i,j)=\frac{1}{n_l}$ if $y_i=y_j=l$ and $\mathbf{S}(i,j)=0$ otherwise, where $n_l$ is the number of data samples in the $l$-th class. After obtaining the affinity matrix $\mathbf{S}$ and the diagonal matrix $\mathbf{D}$, the normalized Laplacian matrix is $\mathbf{L}_{norm}=\mathbf{D}^{-\frac{1}{2}}(\mathbf{D}-\mathbf{S})\mathbf{D}^{-\frac{1}{2}}$. The basic idea of SPEC is similar to Laplacian Score: a feature that is consistent with the data manifold structure should assign similar values to instances that are near each other. In SPEC, feature relevance is measured by three different criteria:

SPEC\_score1(f_i)=\hat{\mathbf{f}}_i^{\prime}\gamma(\mathbf{L}_{norm})\hat{\mathbf{f}}_i=\sum_{j=1}^{n}\alpha_j^2\gamma(\lambda_j)

SPEC\_score2(f_i)=\frac{\hat{\mathbf{f}}_i^{\prime}\gamma(\mathbf{L}_{norm})\hat{\mathbf{f}}_i}{1-(\hat{\mathbf{f}}_i^{\prime}\xi_1)^2}=\frac{\sum_{j=2}^{n}\alpha_j^2\gamma(\lambda_j)}{\sum_{j=2}^{n}\alpha_j^2}

SPEC\_score3(f_i)=\sum_{j=1}^{m}(\gamma(2)-\gamma(\lambda_j))\alpha_j^2.   (4)

In the above equations, $\hat{\mathbf{f}}_i=\mathbf{D}^{\frac{1}{2}}\mathbf{f}_i/\|\mathbf{D}^{\frac{1}{2}}\mathbf{f}_i\|_2$; $(\lambda_j,\xi_j)$ is the $j$-th eigenpair of the normalized Laplacian matrix $\mathbf{L}_{norm}$; $\alpha_j=\cos\theta_j$, where $\theta_j$ is the angle between $\xi_j$ and $\mathbf{f}_i$; and $\gamma(.)$ is an increasing function that penalizes high-frequency components of the eigensystem to reduce noise. If the data is noise free, the function $\gamma(.)$ can be removed, i.e., $\gamma(x)=x$. When the second evaluation criterion $SPEC\_score2(f_i)$ is used, SPEC is equivalent to Laplacian Score. $SPEC\_score3(f_i)$ uses the top $m$ eigenpairs to evaluate the importance of feature $f_i$.

All three criteria can be reduced to the unified similarity based feature selection framework in Eq. (1) by setting $\hat{\mathbf{f}}_i$ to $\mathbf{f}_i\|\mathbf{D}^{\frac{1}{2}}\mathbf{f}_i\|_2$, $(\mathbf{f}_i-\mu\mathbf{1})/\|\mathbf{D}^{\frac{1}{2}}\mathbf{f}_i\|_2$, and $\mathbf{f}_i\|\mathbf{D}^{\frac{1}{2}}\mathbf{f}_i\|_2$, and by setting $\hat{\mathbf{S}}$ to $\mathbf{D}^{\frac{1}{2}}\mathbf{U}(\mathbf{I}-\gamma(\mathbf{\Sigma}))\mathbf{U}^{\prime}\mathbf{D}^{\frac{1}{2}}$, $\mathbf{D}^{\frac{1}{2}}\mathbf{U}(\mathbf{I}-\gamma(\mathbf{\Sigma}))\mathbf{U}^{\prime}\mathbf{D}^{\frac{1}{2}}$, and $\mathbf{D}^{\frac{1}{2}}\mathbf{U}_m(\gamma(2\mathbf{I})-\gamma(\mathbf{\Sigma}_m))\mathbf{U}_m^{\prime}\mathbf{D}^{\frac{1}{2}}$ in $SPEC\_score1$, $SPEC\_score2$ and $SPEC\_score3$, respectively. $\mathbf{U}$ and $\mathbf{\Sigma}$ are the singular vectors and singular values of the normalized Laplacian matrix $\mathbf{L}_{norm}$.

2.1.3 Fisher Score

Fisher Score [Duda et al. (2012)] is a supervised feature selection algorithm. It selects features such that the feature values of samples within the same class are similar while the feature values of samples from different classes are dissimilar. The Fisher Score of each feature $f_i$ is evaluated as follows:

fisher\_score(f_i)=\frac{\sum_{j=1}^{c}n_j(\mu_{ij}-\mu_i)^2}{\sum_{j=1}^{c}n_j\sigma_{ij}^2},   (5)

where $n_j$, $\mu_i$, $\mu_{ij}$ and $\sigma_{ij}^2$ denote the number of samples in class $j$, the mean value of feature $f_i$, the mean value of feature $f_i$ for samples in class $j$, and the variance of feature $f_i$ for samples in class $j$, respectively. Similar to Laplacian Score, the top $k$ features can be obtained by greedily selecting the features with the largest Fisher Scores.

According to [He et al. (2005)], Fisher Score can be considered as a special case of Laplacian Score when the affinity matrix is $\mathbf{S}(i,j)=\frac{1}{n_l}$ if $y_i=y_j=l$ and $\mathbf{S}(i,j)=0$ otherwise. In this case, the relationship between Fisher Score and Laplacian Score is $fisher\_score(f_i)=1-\frac{1}{laplacian\_score(f_i)}$. Hence, the computation of Fisher Score can also be reduced to the unified utility maximization framework.
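A short sketch of Fisher Score following Eq. (5) is shown below: for each feature, the between-class scatter of the class means is compared with the within-class variance, and features with larger scores are preferred. The synthetic data is only for illustration.

import numpy as np

def fisher_score(X, y):
    classes, counts = np.unique(y, return_counts=True)
    overall_mean = X.mean(axis=0)
    numer = np.zeros(X.shape[1])
    denom = np.zeros(X.shape[1])
    for cls, n_j in zip(classes, counts):
        X_j = X[y == cls]
        numer += n_j * (X_j.mean(axis=0) - overall_mean) ** 2   # between-class scatter
        denom += n_j * X_j.var(axis=0)                          # within-class variance
    return numer / denom

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 5))
X[:, 0] += 3 * y                                # make feature 0 discriminative
print(np.argsort(fisher_score(X, y))[::-1])     # feature 0 should rank first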

2.1.4 Trace Ratio Criterion

The trace ratio criterion [Nie et al. (2008)] directly selects the globally optimal feature subset based on the corresponding score, which is computed by a trace ratio norm. It builds two affinity matrices $\mathbf{S}_w$ and $\mathbf{S}_b$ to characterize the within-class and between-class data similarity. Let $\mathbf{W}=[\mathbf{w}_{i_1},\mathbf{w}_{i_2},...,\mathbf{w}_{i_k}]\in\mathbb{R}^{d\times k}$ be the selection indicator matrix such that only the $i_j$-th entry of $\mathbf{w}_{i_j}$ is 1 and all other entries are 0. With these, the trace ratio score of the $k$ selected features in $\mathcal{S}$ is:

trace\_ratio(\mathcal{S})=\frac{tr(\mathbf{W}^{\prime}\mathbf{X}^{\prime}\mathbf{L}_b\mathbf{X}\mathbf{W})}{tr(\mathbf{W}^{\prime}\mathbf{X}^{\prime}\mathbf{L}_w\mathbf{X}\mathbf{W})},   (6)

where $\mathbf{L}_b$ and $\mathbf{L}_w$ are the Laplacian matrices of $\mathbf{S}_b$ and $\mathbf{S}_w$, respectively. The basic idea is to maximize the data similarity for instances from the same class while minimizing the data similarity for instances from different classes. However, the trace ratio problem is difficult to solve as it does not have a closed-form solution. Hence, it is often converted into a more tractable format, called the ratio trace problem, by maximizing $tr[(\mathbf{W}^{\prime}\mathbf{X}^{\prime}\mathbf{L}_w\mathbf{X}\mathbf{W})^{-1}(\mathbf{W}^{\prime}\mathbf{X}^{\prime}\mathbf{L}_b\mathbf{X}\mathbf{W})]$. As an alternative, [Wang et al. (2007)] propose an iterative algorithm called ITR to solve the trace ratio problem directly, which was later applied to trace ratio feature selection [Nie et al. (2008)].

Different choices of $\mathbf{S}_b$ and $\mathbf{S}_w$ lead to different feature selection algorithms such as batch-mode Laplacian Score and batch-mode Fisher Score. For example, in batch-mode Fisher Score, the within-class data similarity is $\mathbf{S}_w(i,j)=1/n_l$ if $y_i=y_j=l$ and 0 otherwise, while the between-class data similarity is $\mathbf{S}_b(i,j)=1/n-1/n_l$ if $y_i=y_j=l$ and $1/n$ otherwise. Therefore, maximizing the trace ratio criterion is equivalent to maximizing $\frac{\sum_{s=1}^{k}\mathbf{f}_{i_s}^{\prime}\mathbf{S}_w\mathbf{f}_{i_s}}{\sum_{s=1}^{k}\mathbf{f}_{i_s}^{\prime}\mathbf{f}_{i_s}}=\frac{\mathbf{X}_{\mathcal{S}}^{\prime}\mathbf{S}_w\mathbf{X}_{\mathcal{S}}}{\mathbf{X}_{\mathcal{S}}^{\prime}\mathbf{X}_{\mathcal{S}}}$. Since $\mathbf{X}_{\mathcal{S}}^{\prime}\mathbf{X}_{\mathcal{S}}$ is constant, it can be further reduced to the unified similarity based feature selection framework by setting $\hat{\mathbf{f}}=\mathbf{f}/\|\mathbf{f}\|_2$ and $\hat{\mathbf{S}}=\mathbf{S}_w$. In batch-mode Laplacian Score, on the other hand, the within-class data similarity is $\mathbf{S}_w(i,j)=e^{-\frac{\|\mathbf{x}_i-\mathbf{x}_j\|_2^2}{t}}$ if $\mathbf{x}_i\in\mathcal{N}_p(\mathbf{x}_j)$ or $\mathbf{x}_j\in\mathcal{N}_p(\mathbf{x}_i)$ and 0 otherwise, while the between-class data similarity is $\mathbf{S}_b=(\mathbf{1}^{\prime}\mathbf{D}_w\mathbf{1})^{-1}\mathbf{D}_w\mathbf{1}\mathbf{1}^{\prime}\mathbf{D}_w$. In this case, the trace ratio criterion score is $\frac{tr(\mathbf{W}^{\prime}\mathbf{X}^{\prime}\mathbf{L}_b\mathbf{X}\mathbf{W})}{tr(\mathbf{W}^{\prime}\mathbf{X}^{\prime}\mathbf{L}_w\mathbf{X}\mathbf{W})}=\frac{\sum_{s=1}^{k}\mathbf{f}_{i_s}^{\prime}\mathbf{D}_w\mathbf{f}_{i_s}}{\sum_{s=1}^{k}\mathbf{f}_{i_s}^{\prime}(\mathbf{D}_w-\mathbf{S}_w)\mathbf{f}_{i_s}}$. Therefore, maximizing the trace ratio criterion is also equivalent to solving the unified maximization problem in Eq. (1) with $\hat{\mathbf{f}}=\mathbf{f}/\|\mathbf{D}_w^{\frac{1}{2}}\mathbf{f}\|_2$ and $\hat{\mathbf{S}}=\mathbf{S}_w$.

2.1.5 ReliefF

ReliefF [Robnik-Šikonja and Kononenko (2003)] selects features to separate instances from different classes. Assume that $l$ data instances are randomly selected among all $n$ instances; then the feature score of $f_i$ in ReliefF is defined as follows:

ReliefF\_score(f_i)=\frac{1}{c}\sum_{j=1}^{l}\Big(-\frac{1}{m_j}\sum_{x_r\in\text{NH}(j)}d(\mathbf{X}(j,i)-\mathbf{X}(r,i))+\sum_{y\neq y_j}\frac{1}{h_{jy}}\frac{p(y)}{1-p(y)}\sum_{x_r\in\text{NM}(j,y)}d(\mathbf{X}(j,i)-\mathbf{X}(r,i))\Big),   (7)

where NH$(j)$ and NM$(j,y)$ are the nearest instances of $x_j$ in the same class and in class $y$, respectively; their sizes are $m_j$ and $h_{jy}$, respectively; and $p(y)$ is the ratio of instances in class $y$.

ReliefF is equivalent to selecting features that preserve a special form of the data similarity matrix. Assume that the dataset has the same number of instances in each of the $c$ classes and that there are $q$ instances in both NH$(j)$ and NM$(j,y)$. Then, according to [Zhao and Liu (2007)], ReliefF feature selection can be reduced to the utility maximization framework in Eq. (1).

Discussion: Similarity based feature selection algorithms have demonstrated excellent performance in both supervised and unsupervised learning problems. This category of methods is straightforward and simple, as the computation focuses on building an affinity matrix, after which the scores of features can be obtained directly. Also, these methods are independent of any learning algorithm, and the selected features are suitable for many subsequent learning tasks. However, one drawback of these methods is that most of them cannot handle feature redundancy; in other words, they may repeatedly select highly correlated features during the selection phase.

2.2 Information Theoretical based Methods

A large family of existing feature selection algorithms is information theoretical based methods. Algorithms in this family exploit different heuristic filter criteria to measure the importance of features. As indicated in [Duda et al. (2012)], many hand-designed information theoretic criteria have been proposed to maximize feature relevance and minimize feature redundancy. Since the relevance of a feature is usually measured by its correlation with class labels, most algorithms in this family operate in a supervised way. In addition, most information theoretic concepts can only be applied to discrete variables; therefore, feature selection algorithms in this family can only work with discrete data, and continuous feature values require data discretization beforehand. Two decades of research on information theoretic criteria can be unified in a conditional likelihood maximization framework [Brown et al. (2012)]. In this subsection, we introduce some representative algorithms in this family, beginning with a brief introduction of basic information theoretic concepts.

The concept of entropy measures the uncertainty of a discrete random variable. The entropy of a discrete random variable $X$ is defined as follows:

H(X)=-\sum_{x_i\in X}P(x_i)\log(P(x_i)),   (8)

where $x_i$ denotes a specific value of the random variable $X$ and $P(x_i)$ denotes the probability of $x_i$ over all possible values of $X$.

Second, the conditional entropy of $X$ given another discrete random variable $Y$ is:

H(X|Y)=-\sum_{y_j\in Y}P(y_j)\sum_{x_i\in X}P(x_i|y_j)\log(P(x_i|y_j)),   (9)

where $P(y_j)$ is the prior probability of $y_j$, and $P(x_i|y_j)$ is the conditional probability of $x_i$ given $y_j$. It quantifies the uncertainty of $X$ given $Y$.

Then, the information gain or mutual information between $X$ and $Y$ measures the amount of information shared by $X$ and $Y$:

I(X;Y)=H(X)-H(X|Y)=\sum_{x_i\in X}\sum_{y_j\in Y}P(x_i,y_j)\log\frac{P(x_i,y_j)}{P(x_i)P(y_j)},   (10)

where $P(x_i,y_j)$ is the joint probability of $x_i$ and $y_j$. Information gain is symmetric, i.e., $I(X;Y)=I(Y;X)$, and is zero if the discrete variables $X$ and $Y$ are independent.

Finally, the conditional information gain (or conditional mutual information) of discrete variables $X$ and $Y$ given a third discrete variable $Z$ is given as follows:

I(X;Y|Z)=H(X|Z)-H(X|Y,Z)=\sum_{z_k\in Z}P(z_k)\sum_{x_i\in X}\sum_{y_j\in Y}P(x_i,y_j|z_k)\log\frac{P(x_i,y_j|z_k)}{P(x_i|z_k)P(y_j|z_k)}.   (11)

It measures the amount of mutual information shared by $X$ and $Y$ given $Z$.
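The quantities in Eqs. (8)-(11) can be estimated from empirical frequencies for discrete variables; the sketch below is only illustrative and makes no claims about the sample efficiency of the estimators.

import numpy as np

def entropy(x):
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):                 # H(X|Y), Eq. (9)
    values, counts = np.unique(y, return_counts=True)
    p_y = counts / counts.sum()
    return sum(p * entropy(x[y == v]) for v, p in zip(values, p_y))

def mutual_information(x, y):                  # I(X;Y) = H(X) - H(X|Y), Eq. (10)
    return entropy(x) - conditional_entropy(x, y)

def conditional_mutual_information(x, y, z):   # I(X;Y|Z) = H(X|Z) - H(X|Y,Z), Eq. (11)
    values, counts = np.unique(z, return_counts=True)
    p_z = counts / counts.sum()
    return sum(p * mutual_information(x[z == v], y[z == v])
               for v, p in zip(values, p_z))

rng = np.random.default_rng(0)
x = rng.integers(0, 3, 1000)
y = (x + rng.integers(0, 2, 1000)) % 3         # y depends on x
z = rng.integers(0, 2, 1000)
print(mutual_information(x, y), conditional_mutual_information(x, y, z))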

Searching for the globally best set of features is NP-hard, thus most algorithms exploit heuristic sequential search approaches to add/remove features one by one. In this survey, we explain the feature selection problem by forward sequential search such that features are added into the selected feature set one by one. We denote by $\mathcal{S}$ the currently selected feature set, which is initially empty. $Y$ represents the class labels. $X_j\in\mathcal{S}$ is a specific feature in the current $\mathcal{S}$. $J(.)$ is a feature selection criterion (score) where, generally, the higher the value of $J(X_k)$, the more important the feature $X_k$ is. In the unified conditional likelihood maximization feature selection framework, the selection criterion (score) for a new unselected feature $X_k$ is given as follows:

J_{\text{CMI}}(X_k)=I(X_k;Y)+\sum_{X_j\in\mathcal{S}}g[I(X_j;X_k),I(X_j;X_k|Y)],   (12)

where $g(.)$ is a function of the two variables $I(X_j;X_k)$ and $I(X_j;X_k|Y)$. If $g(.)$ is a linear function of these two variables, the criterion is referred to as a linear combination of Shannon information terms:

J_{\text{CMI}}(X_k)=I(X_k;Y)-\beta\sum_{X_j\in\mathcal{S}}I(X_j;X_k)+\lambda\sum_{X_j\in\mathcal{S}}I(X_j;X_k|Y),   (13)

where $\beta$ and $\lambda$ are two nonnegative parameters between zero and one. On the other hand, if $g(.)$ is a nonlinear function of these two variables, the criterion is referred to as a nonlinear combination of Shannon information terms.
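A hedged sketch of greedy forward selection under the linear criterion of Eq. (13) is given below. Setting $(\beta,\lambda)$ at each iteration to $(0,0)$ gives MIM, to $(\beta,0)$ gives MIFS, to $(1/|\mathcal{S}|,0)$ gives MRMR, to $(1,1)$ gives CIFE, and to $(1/|\mathcal{S}|,1/|\mathcal{S}|)$ gives JMI (all introduced below). The empirical estimators assume discrete features; this is a sketch, not the scikit-feature implementation.

import numpy as np

def entropy(x):
    _, c = np.unique(x, return_counts=True)
    p = c / c.sum()
    return -np.sum(p * np.log2(p))

def mi(x, y):                                  # I(X;Y) = H(X) + H(Y) - H(X,Y)
    _, c = np.unique(np.stack([x, y], axis=1), axis=0, return_counts=True)
    p = c / c.sum()
    return entropy(x) + entropy(y) + np.sum(p * np.log2(p))

def cmi(x, y, z):                              # I(X;Y|Z)
    vals, counts = np.unique(z, return_counts=True)
    pz = counts / counts.sum()
    return sum(p * mi(x[z == v], y[z == v]) for v, p in zip(vals, pz))

def greedy_forward_selection(X, y, k, beta_fn, lam_fn):
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        beta, lam = beta_fn(selected), lam_fn(selected)
        scores = {}
        for j in remaining:
            score = mi(X[:, j], y)
            score -= beta * sum(mi(X[:, s], X[:, j]) for s in selected)
            score += lam * sum(cmi(X[:, s], X[:, j], y) for s in selected)
            scores[j] = score
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# MRMR-style setting: beta = 1/|S| (guarding the first, empty-set step), lambda = 0
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 6))
y = (X[:, 0] + X[:, 1]) % 3
print(greedy_forward_selection(X, y, 3,
                               beta_fn=lambda S: 1.0 / max(len(S), 1),
                               lam_fn=lambda S: 0.0))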

2.2.1 Mutual Information Maximization (Information Gain)

Mutual Information Maximization (MIM) (a.k.a. Information Gain) [Lewis (1992)] measures the importance of a feature by its correlation with class labels. It assumes that when a feature has a strong correlation with the class label, it can help achieve good classification performance. The Mutual Information score for feature $X_k$ is:

J_{\text{MIM}}(X_k)=I(X_k;Y).   (14)

It can be observed that in MIM, the scores of features are assessed individually; therefore, only the feature correlation with the class label is considered while feature redundancy is completely ignored. After the MIM scores of all features are obtained, we choose the features with the highest scores and add them to the selected feature set. The process repeats until the desired number of selected features is obtained.

It can also be observed that MIM is a special case of the linear combination of Shannon information terms in Eq. (13) where both $\beta$ and $\lambda$ are equal to zero.

2.2.2 Mutual Information Feature Selection

A limitation of the MIM criterion is that it assumes that features are independent of each other. In reality, good features should not only be strongly correlated with class labels but should also not be highly correlated with each other. In other words, the correlation between features should be minimized. Mutual Information Feature Selection (MIFS) [Battiti (1994)] considers both feature relevance and feature redundancy in the feature selection phase; the feature score for a new unselected feature $X_k$ is formulated as follows:

J_{\text{MIFS}}(X_k)=I(X_k;Y)-\beta\sum_{X_j\in\mathcal{S}}I(X_k;X_j).   (15)

In MIFS, feature relevance is evaluated by $I(X_k;Y)$, while the second term penalizes features that have high mutual information with the currently selected features, such that feature redundancy is minimized.

MIFS can also be reduced to a special case of the linear combination of Shannon information terms in Eq. (13), where $\beta$ is between zero and one and $\lambda$ is zero.

2.2.3 Minimum Redundancy Maximum Relevance

[Peng et al. (2005)] proposes the Minimum Redundancy Maximum Relevance (MRMR) criterion, which sets the value of $\beta$ to the reciprocal of the number of selected features:

J_{\text{MRMR}}(X_k)=I(X_k;Y)-\frac{1}{|\mathcal{S}|}\sum_{X_j\in\mathcal{S}}I(X_k;X_j).   (16)

Hence, with more selected features, the effect of feature redundancy is gradually reduced. The intuition is that with more non-redundant features selected, it becomes more difficult for new features to be redundant to the features that are already in $\mathcal{S}$. [Brown et al. (2012)] gives another interpretation: the pairwise independence between features becomes stronger as more features are added to $\mathcal{S}$, possibly because of noise information in the data.

MRMR is also strongly linked to the conditional likelihood maximization framework if we iteratively set the value of $\beta$ to $\frac{1}{|\mathcal{S}|}$ and set the other parameter $\lambda$ to zero.

2.2.4 Conditional Infomax Feature Extraction

Some studies [Lin and Tang (2006), El Akadi et al. (2008), Guo and Nixon (2009)] show that, in contrast to minimizing the feature redundancy, the conditional redundancy between unselected features and already selected features given class labels should also be maximized. In other words, if the feature redundancy given class labels is stronger than the intra-feature redundancy and is ignored, feature selection will be affected negatively. A typical feature selection method under this argument is Conditional Infomax Feature Extraction (CIFE) [Lin and Tang (2006)], in which the feature score for a new unselected feature $X_k$ is:

J_{\text{CIFE}}(X_k)=I(X_k;Y)-\sum_{X_j\in\mathcal{S}}I(X_j;X_k)+\sum_{X_j\in\mathcal{S}}I(X_j;X_k|Y).   (17)

Compared with MIFS, it adds a third term $\sum_{X_j\in\mathcal{S}}I(X_j;X_k|Y)$ to maximize the conditional redundancy. CIFE is also a special case of the linear combination of Shannon information terms, obtained by setting both $\beta$ and $\lambda$ to 1.

2.2.5 Joint Mutual Information

MIFS and MRMR reduce feature redundancy in the feature selection process. An alternative criterion, Joint Mutual Information (JMI) [Yang and Moody (1999), Meyer et al. (2008)], is proposed to increase the complementary information shared between unselected features and selected features given the class labels. The feature selection criterion is as follows:

J_{\text{JMI}}(X_k)=\sum_{X_j\in\mathcal{S}}I(X_k,X_j;Y).   (18)

The basic idea of JMI is that we should include new features that are complementary to the existing features given the class labels.

JMI cannot be directly reduced to the conditional likelihood maximization framework. In [Brown et al. (2012)], the authors demonstrate that, with simple manipulations, the JMI criterion can be rewritten as:

J_{\text{JMI}}(X_k)=I(X_k;Y)-\frac{1}{|\mathcal{S}|}\sum_{X_j\in\mathcal{S}}I(X_j;X_k)+\frac{1}{|\mathcal{S}|}\sum_{X_j\in\mathcal{S}}I(X_j;X_k|Y).   (19)

Therefore, it is also a special case of the linear combination of Shannon information terms, obtained by iteratively setting both $\beta$ and $\lambda$ to $\frac{1}{|\mathcal{S}|}$.

2.2.6 Conditional Mutual Information Maximization

The previously mentioned criteria can be reduced to a linear combination of Shannon information terms. Next, we show some other algorithms that can only be reduced to a nonlinear combination of Shannon information terms. Among them, Conditional Mutual Information Maximization (CMIM) [Vidal-Naquet and Ullman (2003), Fleuret (2004)] iteratively selects features that maximize the mutual information with the class labels given the features selected so far. Mathematically, during the selection phase, the feature score for each new unselected feature $X_k$ is formulated as follows:

J_{\text{CMIM}}(X_k)=\min_{X_j\in\mathcal{S}}[I(X_k;Y|X_j)].   (20)

Note that the value of $I(X_k;Y|X_j)$ is small if $X_k$ is not strongly correlated with the class label $Y$ or if $X_k$ is redundant given $\mathcal{S}$. By selecting the feature that maximizes this minimum value, CMIM guarantees that the selected feature has a strong predictive ability and reduces the redundancy w.r.t. the selected features.

The CMIM criterion is equivalent to the following form after some derivations:

J_{\text{CMIM}}(X_k)=I(X_k;Y)-\max_{X_j\in\mathcal{S}}[I(X_j;X_k)-I(X_j;X_k|Y)].   (21)

Therefore, CMIM is also a special case of the conditional likelihood maximization framework in Eq. (12).

2.2.7 Informative Fragments

In [Vidal-Naquet and Ullman (2003)], the authors propose a feature selection criterion called Informative Fragments (IF). The feature score of each new unselected feature is given as:

J_{\text{IF}}(X_k)=\min_{X_j\in\mathcal{S}}[I(X_jX_k;Y)-I(X_j;Y)].   (22)

The intuition behind Informative Fragments is that the addition of the new feature $X_k$ should maximize the gain of the joint mutual information $I(X_jX_k;Y)$ over the mutual information $I(X_j;Y)$ for the features $X_j$ already in $\mathcal{S}$. Interestingly, by the chain rule $I(X_kX_j;Y)=I(X_j;Y)+I(X_k;Y|X_j)$, IF has the same form as CMIM. Hence, it can also be reduced to the general framework in Eq. (12).

2.2.8 Interaction Capping

Interaction Capping [Jakulin (2005)] is a feature selection criterion similar to CMIM in Eq. (21); it restricts the term $I(X_j;X_k)-I(X_j;X_k|Y)$ to be nonnegative:

J_{\text{ICAP}}(X_k)=I(X_k;Y)-\sum_{X_j\in\mathcal{S}}\max[0,I(X_j;X_k)-I(X_j;X_k|Y)].   (23)

Clearly, it is a special case of the nonlinear combination of Shannon information terms, obtained by setting the function $g(.)$ to $-\max[0,I(X_j;X_k)-I(X_j;X_k|Y)]$.

2.2.9 Double Input Symmetrical Relevance

Another class of information theoretical based methods, such as Double Input Symmetrical Relevance (DISR) [Meyer and Bontempi (2006)], exploits normalization techniques to normalize the mutual information [Guyon et al. (2008)]:

J_{\text{DISR}}(X_k)=\sum_{X_j\in\mathcal{S}}\frac{I(X_jX_k;Y)}{H(X_jX_kY)}.   (24)

It is easy to validate that DISR is a non-linear combination of Shannon information terms and can be reduced to the conditional likelihood maximization framework.

2.2.10 Fast Correlation Based Filter

There are other information theoretical based feature selection methods that cannot be simply reduced to the unified conditional likelihood maximization framework. Fast Correlation Based Filter (FCBF) [Yu and Liu (2003)] is an example that exploits feature-class correlation and feature-feature correlation simultaneously. The algorithm works as follows: (1) given a predefined threshold $\delta$, it selects a subset of features $\mathcal{S}$ that are highly correlated with the class labels, i.e., with $SU\geq\delta$, where $SU$ is the symmetric uncertainty. The $SU$ between a set of features $X_{\mathcal{S}}$ and the class label $Y$ is given as follows:

SU(X_{\mathcal{S}},Y)=\frac{2\,I(X_{\mathcal{S}};Y)}{H(X_{\mathcal{S}})+H(Y)}.   (25)

A specific feature $X_k$ is called predominant iff $SU(X_k,Y)\geq\delta$ and there does not exist a feature $X_j\in\mathcal{S}$ $(j\neq k)$ such that $SU(X_j,X_k)\geq SU(X_k,Y)$. Feature $X_j$ is considered to be redundant to feature $X_k$ if $SU(X_j,X_k)\geq SU(X_k,Y)$; (2) the set of redundant features is denoted as $\mathcal{S}_{P_i}$, which is further split into $\mathcal{S}_{P_i}^{+}$ and $\mathcal{S}_{P_i}^{-}$, containing the features redundant to feature $X_k$ with $SU(X_j,Y)>SU(X_k,Y)$ and $SU(X_j,Y)<SU(X_k,Y)$, respectively; and (3) different heuristics are applied on $\mathcal{S}_{P}$, $\mathcal{S}_{P_i}^{+}$ and $\mathcal{S}_{P_i}^{-}$ to remove redundant features and keep the features that are most relevant to the class labels.
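A small sketch of the symmetric uncertainty in Eq. (25) for a single discrete feature is given below; in FCBF, this quantity is computed both between each feature and the class and between pairs of features in order to identify predominant and redundant features. The synthetic variables are illustrative assumptions.

import numpy as np

def entropy(x):
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetric_uncertainty(x, y):
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    _, counts = np.unique(np.stack([x, y], axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    i_xy = entropy(x) + entropy(y) + np.sum(p * np.log2(p))
    return 2.0 * i_xy / (entropy(x) + entropy(y))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
noise = (rng.random(500) < 0.1).astype(int)      # flip 10% of the labels
x_relevant = (y + noise) % 2                     # strongly correlated with the class
x_noise = rng.integers(0, 2, 500)                # independent of the class
print(symmetric_uncertainty(x_relevant, y), symmetric_uncertainty(x_noise, y))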

Discussion: Unlike similarity based feature selection algorithms, which fail to tackle feature redundancy, most of the aforementioned information theoretical based feature selection algorithms can be unified in a probabilistic framework that considers both feature relevance and feature redundancy. Meanwhile, similar to similarity based methods, this category of methods is independent of any learning algorithm and hence is generalizable. However, most existing information theoretical based feature selection methods can only work in a supervised scenario; without the guidance of class labels, it is still unclear how to assess the importance of features. In addition, these methods can only handle discrete data, and continuous numerical variables require discretization beforehand.

2.3 Sparse Learning based Methods

The third type of methods comprises sparse learning based methods, which aim to minimize the fitting errors along with some sparse regularization terms. The sparse regularizer forces many feature coefficients to be small, or exactly zero, and the corresponding features can then simply be eliminated. Sparse learning based methods have received considerable attention in recent years due to their good performance and interpretability. In the following parts, we review some representative sparse learning based feature selection methods from both supervised and unsupervised perspectives.

2.3.1 Feature Selection with p\ell_{p}-norm Regularizer

First, we consider the binary classification or univariate regression problem. To achieve feature selection, the p\ell_{p}-norm sparsity-induced penalty term is added on the classification or regression model, where 0p10\leq p\leq 1. Let 𝐰{\bf w} denotes the feature coefficient, then the objective function for feature selection is:

min𝐰loss(𝐰;𝐗,𝐲)+α𝐰p,\small\min_{{\bf w}}\,loss({\bf w};{\bf X},{\bf y})+\alpha\,\|{\bf w}\|_{p},\vskip-3.61371pt (26)

where loss(.)loss(.) is a loss function, and some widely used loss functions loss(.)loss(.) include least squares loss, hinge loss and logistic loss. 𝐰p=(i=1dwip)1p\|{\bf w}\|_{p}=(\sum_{i=1}^{d}\|w_{i}\|^{p})^{\frac{1}{p}} is a sparse regularization term, and α\alpha is a regularization parameter to balance the contribution of the loss function and the sparse regularization term for feature selection.

Typically when p=0p=0, the 0\ell_{0}-norm regularization term directly seeks the optimal set of nonzero entries (features) for the model learning. However, the resulting optimization problem is an integer programming problem and is difficult to solve. Therefore, it is often relaxed to a 1\ell_{1}-norm regularization problem, which is regarded as the tightest convex relaxation of the 0\ell_{0}-norm. One main advantage of 1\ell_{1}-norm regularization (LASSO) [Tibshirani (1996)] is that it forces many feature coefficients to become smaller and, in some cases, exactly zero. This property makes it suitable for feature selection, as we can select features whose corresponding feature weights are large, which motivates a surge of 1\ell_{1}-norm regularized feature selection methods [Zhu et al. (2004), Xu et al. (2014), Wei et al. (2016a), Wei and Yu (2016), Hara and Maehara (2017)]. Also, the sparse vector 𝐰{\bf w} enables the ranking of features: normally, the higher the value, the more important the corresponding feature is.
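As a simple illustration of 1\ell_{1}-norm regularized feature selection, the sketch below fits a Lasso model with scikit-learn and ranks features by the magnitude of their coefficients. The synthetic data and the choice of the regularization parameter alpha (playing the role of α\alpha in Eq. (26)) are arbitrary and only for demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 20)                     # 100 instances, 20 features
w_true = np.zeros(20)
w_true[[2, 5, 11]] = [1.5, -2.0, 1.0]      # only three features are relevant
y = X @ w_true + 0.1 * rng.randn(100)

# alpha corresponds to the regularization parameter in Eq. (26).
model = Lasso(alpha=0.1).fit(X, y)

# Features with nonzero (or large) coefficients are the selected ones.
ranking = np.argsort(-np.abs(model.coef_))
selected = np.flatnonzero(model.coef_)
print("selected features:", selected)
print("top-5 ranked features:", ranking[:5])
```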

2.3.2 Feature Selection with p,q\ell_{p,q}-norm Regularizer

Here, we discuss how to perform feature selection for the general multi-class classification or multivariate regression problems. The problem is more difficult because of the multiple classes and multivariate regression targets, and we would like the feature selection phase to be consistent over multiple targets. In other words, we want multiple predictive models for different targets to share the same parameter sparsity patterns – each feature either has small scores or large scores for all targets. This problem can be generally solved by the p,q\ell_{p,q}-norm sparsity-induced regularization term, where p>1p>1 (most existing work focuses on p=2p=2 or \infty) and 0q10\leq q\leq 1 (most existing work focuses on q=1q=1 or 0). Assume that 𝐗{\bf X} denotes the data matrix, and 𝐘{\bf Y} denotes the one-hot label indicator matrix. Then the model is formulated as follows:

min𝐖loss(𝐖;𝐗,𝐘)+α𝐖p,q,\small\min_{{\bf W}}\,loss({\bf W};{\bf X},{\bf Y})+\alpha\|{\bf W}\|_{p,q},\vskip-3.61371pt (27)

where 𝐖p,q=(j=1c(i=1d|𝐖(i,j)|p)qp)1q\|{\bf W}\|_{p,q}=(\sum_{j=1}^{c}(\sum_{i=1}^{d}|{\bf W}(i,j)|^{p})^{\frac{q}{p}})^{\frac{1}{q}}; and the parameter α\alpha is used to control the contribution of the loss function and the sparsity-induced regularization term. Then the features can be ranked according to the value of 𝐖(i,:)22(i=1,,d)\|{\bf W}(i,:)\|_{2}^{2}({i=1,...,d}); the higher the value, the more important the feature is.
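This ranking step is straightforward to realize. The snippet below assumes a learned coefficient matrix 𝐖 (here just a random placeholder) and scores each feature by the squared 2\ell_{2}-norm of its row.

```python
import numpy as np

# W is a learned d x c coefficient matrix (one column per class/target);
# here it is a random placeholder for illustration only.
rng = np.random.RandomState(0)
W = rng.randn(10, 3)

scores = np.sum(W ** 2, axis=1)        # ||W(i, :)||_2^2 for each feature i
ranking = np.argsort(-scores)          # higher score => more important feature
print(ranking)
```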

Case 1: p=2p=2, q=0q=0

To find relevant features across multiple targets, an intuitive way is to use discrete optimization through the 2,0\ell_{2,0}-norm regularization. The optimization problem with the 2,0\ell_{2,0}-norm regularization term can be reformulated as follows:

min𝐖loss(𝐖;𝐗,𝐘)s.t.𝐖2,0k.\small\min_{{\bf W}}\,loss({\bf W};{\bf X},{\bf Y})\,\,s.t.\|{\bf W}\|_{2,0}\leq k.\vskip-3.61371pt (28)

However, solving the above optimization problem has been proven to be NP-hard and, due to its discrete nature, the objective function is not convex. To solve it, a variation of the Alternating Direction Method can be leveraged to seek a local optimal solution [Cai et al. (2013), Gu et al. (2012)]. In [Zhang et al. (2014)], the authors provide two algorithms, a proximal gradient algorithm and a rank-one update algorithm, to solve this discrete selection problem.

Case 2: p=2p=2, 0<q<10<q<1

The above sparsity-induced regularization term is inherently discrete and hard to optimize. In [Peng and Fan (2016), Peng and Fan (2017)], the authors propose a more general framework to directly optimize the sparsity-induced regularization when 0<q<10<q<1 and provide an efficient iterative algorithm with a guaranteed convergence rate.

Case 3: p=2p=2, q=1q=1

Although the 2,0\ell_{2,0}-norm is more desirable for feature sparsity, it is inherently non-convex and non-smooth. Hence, the 2,1\ell_{2,1}-norm regularization is preferred and widely used in different scenarios such as multi-task learning [Obozinski et al. (2007), Zhang et al. (2008)], anomaly detection [Li et al. (2017a), Wu et al. (2017)] and crowdsourcing [Zhou and He (2017)]. Many 2,1\ell_{2,1}-norm regularization based feature selection methods have been proposed over the past decade [Zhao et al. (2010), Gu et al. (2011c), Yang et al. (2011), Hou et al. (2011), Li et al. (2012), Qian and Zhai (2013), Shi et al. (2014), Liu et al. (2014), Du and Shen (2015), Jian et al. (2016), Liu et al. (2016b), Nie et al. (2016), Zhu et al. (2016), Li et al. (2017c)]. Similar to 1\ell_{1}-norm regularization, 2,1\ell_{2,1}-norm regularization is convex and a global optimal solution can be achieved [Liu et al. (2009a)]; thus the following discussions about sparse learning based feature selection will center around the 2,1\ell_{2,1}-norm regularization term. The 2,1\ell_{2,1}-norm regularization also has strong connections with group lasso [Yuan and Lin (2006)], which will be explained later. By solving the related optimization problem, we obtain a sparse matrix 𝐖{\bf W} in which many rows are exactly zero or have small values, and the features corresponding to these rows can then be eliminated.

Case 4: p=p=\infty, q=1q=1

In addition to the 2,1\ell_{2,1}-norm regularization term, the ,1\ell_{\infty,1}-norm regularization is also widely used to achieve joint feature sparsity across multiple targets [Quattoni et al. (2009)]. In particular, it penalizes the sum of maximum absolute values of each row, such that many rows of the matrix will all be zero.

2.3.3 Efficient and Robust Feature Selection

The authors in [Nie et al. (2010)] propose an efficient and robust feature selection (REFS) method by employing a joint 2,1\ell_{2,1}-norm minimization on both the loss function and the regularization term. Their argument is that the 2\ell_{2}-norm based loss function is sensitive to noisy data while the 2,1\ell_{2,1}-norm based loss function is more robust to noise, since the 2,1\ell_{2,1}-norm loss function has a rotational invariant property [Ding et al. (2006)]. Consistent with the 2,1\ell_{2,1}-norm regularized feature selection model, a 2,1\ell_{2,1}-norm regularizer is added to the 2,1\ell_{2,1}-norm loss function to achieve group feature sparsity. The objective function of REFS is:

min𝐖𝐗𝐖𝐘2,1+α𝐖2,1,\small\min_{{\bf W}}\|{\bf XW}-{\bf Y}\|_{2,1}+\alpha\|{\bf W}\|_{2,1},\vskip-3.61371pt (29)

To solve the convex but non-smooth optimization problem, an efficient algorithm is proposed with strict convergence analysis.
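The sketch below illustrates one common iteratively reweighted least squares scheme for problems of the form of Eq. (29); it is written in the spirit of such solvers and is not necessarily the exact algorithm proposed in [Nie et al. (2010)]. The helper name, the fixed iteration count and the synthetic data are our own choices.

```python
import numpy as np

def l21_robust_fs(X, Y, alpha=1.0, n_iter=100, eps=1e-8):
    """Iteratively reweighted sketch for min_W ||XW - Y||_{2,1} + alpha ||W||_{2,1}."""
    n, d = X.shape
    # Ridge-style initialization of the coefficient matrix.
    W = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)
    for _ in range(n_iter):
        E = X @ W - Y
        d1 = 1.0 / (np.linalg.norm(E, axis=1) + eps)   # row weights of the loss
        d2 = 1.0 / (np.linalg.norm(W, axis=1) + eps)   # row weights of the regularizer
        XtD1 = X.T * d1                                # equals X' D1 without forming D1
        W = np.linalg.solve(XtD1 @ X + alpha * np.diag(d2), XtD1 @ Y)
    return W

rng = np.random.RandomState(0)
X = rng.randn(80, 15)
Y = np.eye(3)[rng.randint(0, 3, size=80)]              # one-hot labels
W = l21_robust_fs(X, Y, alpha=0.5)
print(np.argsort(-np.linalg.norm(W, axis=1))[:5])      # top-5 features by row norm
```

Each iteration solves a reweighted quadratic problem whose stationarity condition matches that of the original non-smooth objective as the weights stabilize.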

It should be noted that the aforementioned REFS is designed for multi-class classification problems where each instance has only one class label. However, data could be associated with multiple labels in many domains such as information retrieval and multimedia annotation. Recently, there has been a surge of research work studying multi-label feature selection problems by considering label correlations. Most of these studies, however, are also based on the 2,1\ell_{2,1}-norm sparse regularization framework [Gu et al. (2011a), Chang et al. (2014), Jian et al. (2016)].

2.3.4 Multi-Cluster Feature Selection

Most existing sparse learning based approaches build a learning model with the supervision of class labels, and feature selection is then derived from the sparse feature coefficients. However, since labeled data is costly and time-consuming to obtain, unsupervised sparse learning based feature selection has received increasing attention in recent years. Multi-Cluster Feature Selection (MCFS) [Cai et al. (2010)] is one of the first attempts. Without class labels to guide the feature selection process, MCFS proposes to select features that can cover the multi-cluster structure of the data, where spectral analysis is used to measure the correlation between different features.

MCFS consists of three steps. In the first step, it constructs a pp-nearest neighbor graph to capture the local geometric structure of the data and obtains the graph affinity matrix 𝐒{\bf S} and the Laplacian matrix 𝐋{\bf L}. Then a flat embedding that unfolds the data manifold can be obtained by spectral clustering techniques. In the second step, with the embedding of the data at hand, MCFS measures the importance of features by a regression model with a 1\ell_{1}-norm regularization. Specifically, given the ii-th embedding 𝐞i{\bf e}_{i}, MCFS regards it as a regression target and minimizes:

minwi𝐗𝐰i𝐞i22+α𝐰i1,\small\min_{w_{i}}\|{\bf X}{\bf w}_{i}-{\bf e}_{i}\|_{2}^{2}+\alpha\|{\bf w}_{i}\|_{1},\vskip-3.61371pt (30)

where 𝐰i{\bf w}_{i} denotes the feature coefficient vector for the ii-th embedding. By solving all KK sparse regression problems, MCFS obtains KK sparse feature coefficient vectors 𝐖=[𝐰1,,𝐰K]{\bf W}=[{\bf w}_{1},...,{\bf w}_{K}] and each vector corresponds to one embedding of 𝐗{\bf X}. In the third step, for each feature fjf_{j}, the MCFS score for that feature can be computed as MCFS(j)=maxi|𝐖(j,i)|\text{\emph{MCFS}}(j)=\max_{i}|{\bf W}(j,i)|. The higher the MCFS score, the more important the feature is.
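A compact sketch of the three MCFS steps is given below. It uses a nearest-neighbor graph, the eigenvectors of the normalized graph Laplacian as the spectral embedding, and scikit-learn's Lasso in place of the LARS-based solver used in the original paper; all parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import kneighbors_graph
from sklearn.linear_model import Lasso

def mcfs_scores(X, n_clusters=3, n_neighbors=5, alpha=0.01):
    """Sketch of MCFS: spectral embedding + one sparse regression per embedding."""
    # Step 1: neighborhood graph, affinity matrix and normalized Laplacian.
    S = kneighbors_graph(X, n_neighbors, mode='connectivity', include_self=False)
    S = 0.5 * (S + S.T)                              # symmetrize the affinity matrix
    L = laplacian(S, normed=True).toarray()
    # The smallest non-trivial eigenvectors give a flat embedding of the manifold.
    _, evecs = np.linalg.eigh(L)
    E = evecs[:, 1:n_clusters + 1]
    # Step 2: one L1-regularized regression (Eq. (30)) per embedding dimension.
    W = np.column_stack([Lasso(alpha=alpha).fit(X, E[:, k]).coef_
                         for k in range(E.shape[1])])
    # Step 3: the MCFS score of feature j is max_k |W(j, k)|.
    return np.abs(W).max(axis=1)

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
print(np.argsort(-mcfs_scores(X))[:5])               # indices of the top-5 features
```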

2.3.5 2,1\ell_{2,1}-norm Regularized Discriminative Feature Selection

In [Yang et al. (2011)], the authors propose a new unsupervised feature selection algorithm (UDFS) to select the most discriminative features by exploiting both the discriminative information and feature correlations. First, assume 𝐗~\tilde{{\bf X}} is the centered data matrix such that 𝐗~=𝐇n𝐗\tilde{{\bf X}}={\bf H}_{n}{\bf X} and 𝐆=[𝐆1,𝐆2,,𝐆n]=𝐘(𝐘𝐘)12{\bf G}=[{\bf G}_{1},{\bf G}_{2},...,{\bf G}_{n}]^{\prime}={\bf Y}({\bf Y}^{\prime}{\bf Y})^{-\frac{1}{2}} is the weighted label indicator matrix, where 𝐇n=𝐈n1n𝟏n𝟏n{\bf H}_{n}={\bf I}_{n}-\frac{1}{n}{\bf 1}_{n}{\bf 1}_{n}^{\prime}. Instead of using global discriminative information, they propose to utilize the local discriminative information to select discriminative features. The advantage of using local discriminative information is twofold. First, it has been demonstrated to be more important than global discriminative information in many classification and clustering tasks. Second, when the local discriminative information is considered, the data manifold structure is also well preserved. For each data instance xix_{i}, it constructs a pp-nearest neighbor set for that instance 𝒩p(xi)={xi1,xi2,,xip}\mathcal{N}_{p}(x_{i})=\{x_{i_{1}},x_{i_{2}},...,x_{i_{p}}\}. Let 𝐗𝒩p(i)=[𝐱i,𝐱i1,,𝐱ip]{\bf X}_{\mathcal{N}_{p}(i)}=[{\bf x}_{i},{\bf x}_{i_{1}},...,{\bf x}_{i_{p}}] denote the local data matrix around xix_{i}; then the local total scatter matrix 𝐒t(i){\bf S}_{t}^{(i)} and the local between class scatter matrix 𝐒b(i){\bf S}_{b}^{(i)} are 𝐗i~𝐗i~\tilde{{\bf X}_{i}}^{\prime}\tilde{{\bf X}_{i}} and 𝐗i~𝐆(i)𝐆(i)𝐗i~\tilde{{\bf X}_{i}}^{\prime}{\bf G}_{(i)}{\bf G}_{(i)}^{\prime}\tilde{{\bf X}_{i}} respectively, where 𝐗i~\tilde{{\bf X}_{i}} is the centered local data matrix and 𝐆(i)=[𝐆i,𝐆i1,,𝐆ik]{\bf G}_{(i)}=[{\bf G}_{i},{\bf G}_{i_{1}},...,{\bf G}_{i_{k}}]^{\prime}. Note that 𝐆(i){\bf G}_{(i)} is a submatrix of 𝐆{\bf G} and can be obtained by a selection matrix 𝐏i{0,1}n×(k+1){\bf P}_{i}\in\{0,1\}^{n\times(k+1)} such that 𝐆(i)=𝐏i𝐆{\bf G}_{(i)}={\bf P}_{i}^{\prime}{\bf G}. Without label information in unsupervised feature selection, UDFS assumes that there is a linear classifier 𝐖d×s{\bf W}\in\mathbb{R}^{d\times s} to map each data instance 𝐱id{\bf x}_{i}\in\mathbb{R}^{d} to a low dimensional space 𝐆is{\bf G}_{i}\in\mathbb{R}^{s}. Following the definition of global discriminative information [Yang et al. (2010), Fukunaga (2013)], the local discriminative score for each instance xix_{i} is:

DSi=tr[(𝐒t(i)+λ𝐈d)1𝐒b(i)]=tr[𝐖𝐗𝐏(i)𝐗i~(𝐗i~𝐗i~+λ𝐈d)1𝐗i~𝐏(i)𝐗𝐖],\small DS_{i}=tr[({\bf S}_{t}^{(i)}+\lambda{\bf I}_{d})^{-1}{\bf S}_{b}^{(i)}]=tr[{\bf W}^{\prime}{\bf X}^{\prime}{\bf P}_{(i)}\tilde{{\bf X}_{i}}^{\prime}(\tilde{{\bf X}_{i}}\tilde{{\bf X}_{i}}^{\prime}+\lambda{\bf I}_{d})^{-1}\tilde{{\bf X}_{i}}{\bf P}_{(i)}^{\prime}{\bf X}{\bf W}], (31)

A high local discriminative score indicates that the instance can be well discriminated by 𝐖{\bf W}. Therefore, UDFS seeks the 𝐖{\bf W} that achieves the highest local discriminative scores for all instances in 𝐗{\bf X}; it also incorporates a 2,1\ell_{2,1}-norm regularizer to achieve feature selection. The objective function is formulated as follows:

min𝐖𝐖=𝐈di=1n{tr[𝐆(i)𝐇k+1𝐆(i)DSi]}+α𝐖2,1,\small\min_{{\bf W}^{\prime}{\bf W}={\bf I}_{d}}\sum_{i=1}^{n}\{tr[{\bf G}_{(i)}^{\prime}{\bf H}_{k+1}{\bf G}_{(i)}-DS_{i}]\}+\alpha\|{\bf W}\|_{2,1},\vskip-3.61371pt (32)

where α\alpha is a regularization parameter to control the sparsity of the learned model.

2.3.6 Feature Selection Using Nonnegative Spectral Analysis

Nonnegative Discriminative Feature Selection (NDFS) [Li et al. (2012)] performs spectral clustering and feature selection simultaneously in a joint framework to select a subset of discriminative features. It assumes that pseudo class label indicators can be obtained by spectral clustering techniques. Different from most existing spectral clustering techniques, NDFS imposes nonnegative and orthogonal constraints during the spectral clustering phase. The argument is that with these constraints, the learned pseudo class labels are closer to real cluster results. These nonnegative pseudo class labels then act as regression constraints to guide the feature selection phase. Instead of performing these two tasks separately, NDFS incorporates these two phases into a joint framework.

Similar to UDFS, we use 𝐆=[𝐆1,𝐆2,,𝐆n]=𝐘(𝐘𝐘)12{\bf G}=[{\bf G}_{1},{\bf G}_{2},...,{\bf G}_{n}]^{\prime}={\bf Y}({\bf Y}^{\prime}{\bf Y})^{-\frac{1}{2}} to denote the weighted cluster indicator matrix. It is easy to show that 𝐆𝐆=𝐈c{\bf G}^{\prime}{\bf G}={\bf I}_{c}. NDFS adopts a strategy to learn the weighted cluster indicator matrix such that the local geometric structure of the data can be well preserved [Shi and Malik (2000), Yu and Shi (2003)]. The local geometric structure can be preserved by minimizing the normalized graph Laplacian term tr(𝐆𝐋𝐆)tr({\bf G}^{\prime}{\bf L}{\bf G}), where 𝐋{\bf L} is the Laplacian matrix derived from an RBF kernel. In addition, given the pseudo labels 𝐆{\bf G}, NDFS assumes that there exists a linear transformation matrix 𝐖d×s{\bf W}\in\mathbb{R}^{d\times s} between the data instances 𝐗{\bf X} and the pseudo labels 𝐆{\bf G}. These pseudo class labels are utilized as constraints to guide the feature selection process. The combination of these two components results in the following problem:

min𝐆,𝐖tr(𝐆𝐋𝐆)+β(𝐗𝐖𝐆F2+α𝐖2,1)s.t.𝐆𝐆=𝐈𝐜,𝐆0,\small\begin{split}\min_{{\bf G},{\bf W}}\,tr({\bf G}^{\prime}{\bf L}{\bf G})&+\beta(\|{\bf X}{\bf W}-{\bf G}\|_{F}^{2}+\alpha\|{\bf W}\|_{2,1})\\ &\mbox{s.t.}\quad{\bf{\bf G}^{\prime}{\bf G}={\bf I}_{c}},{\bf G}\geq 0,\end{split} (33)

where α\alpha is a parameter to control the sparsity of the model, and β\beta is introduced to balance the contribution of spectral clustering and discriminative feature selection.

Discussion: Sparse learning based feature selection methods have gained increasing popularity in recent years. A merit of this type of method is that it embeds feature selection into a typical learning algorithm (such as linear regression, SVM, etc.), and thus it often leads to very good performance for the underlying learning algorithm. Also, with sparse feature weights, the model offers good interpretability, as it enables us to explain why we make a certain prediction. Nonetheless, there are still some drawbacks: first, as this approach directly optimizes a particular learning algorithm through feature selection, the selected features do not necessarily achieve good performance in other learning tasks; second, these methods often involve solving a non-smooth optimization problem with complex matrix operations (e.g., multiplication, inversion) in most cases, so the expensive computational cost is another bottleneck.

2.4 Statistical based Methods

Another category of feature selection algorithms is based on different statistical measures. As they rely on various statistical measures instead of learning algorithms to assess feature relevance, most of them are filter based methods. In addition, most statistical based algorithms analyze features individually. Hence, feature redundancy is inevitably ignored during the selection phase. We introduce some representative feature selection algorithms in this category.

2.4.1 Low Variance

Low Variance eliminates features whose variance is below a predefined threshold. For example, features that take the same value for all instances have zero variance and should be removed since they cannot help discriminate instances from different classes. Suppose that the dataset consists of only boolean features, i.e., the feature values are either 0 or 1. As a boolean feature is a Bernoulli random variable, its variance can be computed as:

variance_score(fi)=p(1p),\small variance\_score(f_{i})=p(1-p),\vskip-3.61371pt (34)

where pp denotes the percentage of instances that take the feature value of 1. After the variance of features is obtained, the feature with a variance score below a predefined threshold can be directly pruned.
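A minimal implementation of this filter is shown below; the 80% cutoff used in the default threshold is only an example, and scikit-learn's VarianceThreshold transformer offers analogous functionality.

```python
import numpy as np

def low_variance_filter(X, threshold=0.8 * (1 - 0.8)):
    """Keep features whose empirical variance exceeds the threshold.
    The default removes boolean features that take the same value in more
    than 80% of the instances, i.e. features with p(1 - p) <= 0.8 * 0.2."""
    variances = X.var(axis=0)
    return np.flatnonzero(variances > threshold)

X = np.array([[0, 1, 0],
              [0, 1, 1],
              [0, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
print(low_variance_filter(X))   # only the third feature has enough variance
```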

2.4.2 T-score

TT-score [Davis and Sampson (1986)] is used for binary classification problems. For each feature fif_{i}, suppose that μ1\mu_{1} and μ2\mu_{2} are the mean feature values for the instances from two different classes, σ1\sigma_{1} and σ2\sigma_{2} are the corresponding standard deviations, n1n_{1} and n2n_{2} denote the number of instances from these two classes. Then the tt-score for the feature fif_{i} is:

t_score(fi)=|μ1μ2|/σ12n1+σ22n2.\small t\_score(f_{i})=|\mu_{1}-\mu_{2}|/\sqrt{\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}}.\vskip-3.61371pt (35)

The basic idea of the tt-score is to assess whether the feature makes the means of the two classes statistically different; it is computed as the ratio between the difference of the class means and the standard error of that difference. The higher the tt-score, the more important the feature is.
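A direct implementation of Eq. (35) is straightforward; in the sketch below the informative feature is synthetic, and the use of the unbiased (ddof=1) variance estimate is our own choice.

```python
import numpy as np

def t_scores(X, y):
    """t-score of each feature for a binary labeling y with values 0/1 (Eq. (35))."""
    X1, X2 = X[y == 0], X[y == 1]
    n1, n2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    var1, var2 = X1.var(axis=0, ddof=1), X2.var(axis=0, ddof=1)
    return np.abs(mu1 - mu2) / np.sqrt(var1 / n1 + var2 / n2)

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=60)
X = rng.randn(60, 8)
X[:, 0] += 2 * y                  # make the first feature informative
print(np.argsort(-t_scores(X, y))[:3])
```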

2.4.3 Chi-Square Score

Chi-square score [Liu and Setiono (1995)] utilizes the test of independence to assess whether the feature is independent of the class label. Given a particular feature fif_{i} with rr different feature values, the Chi-square score of that feature can be computed as:

Chi_square_score(fi)=j=1rs=1c(njsμjs)2μjs,\small Chi\_square\_score(f_{i})=\sum_{j=1}^{r}\sum_{s=1}^{c}\frac{(n_{js}-\mu_{js})^{2}}{\mu_{js}},\vskip-3.61371pt (36)

where njsn_{js} is the number of instances in class ss that take the jj-th value of feature fif_{i}. In addition, μjs=nsnjn\mu_{js}=\frac{n_{*s}n_{j*}}{n}, where njn_{j*} indicates the number of data instances taking the jj-th value of feature fif_{i}, and nsn_{*s} denotes the number of data instances in class ss. A higher Chi-square score indicates that the feature is relatively more important.
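The following sketch computes Eq. (36) directly from the contingency table of a discrete feature and the class labels; the synthetic feature/label pair is only for illustration.

```python
import numpy as np

def chi_square_score(x, y):
    """Chi-square statistic (Eq. (36)) between a discrete feature x and labels y."""
    values, classes = np.unique(x), np.unique(y)
    n = len(x)
    score = 0.0
    for v in values:
        for c in classes:
            observed = np.sum((x == v) & (y == c))          # n_js
            expected = np.sum(x == v) * np.sum(y == c) / n  # mu_js
            score += (observed - expected) ** 2 / expected
    return score

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)
x_relevant = (y + (rng.rand(200) < 0.2)) % 2    # mostly agrees with the label
x_noise = rng.randint(0, 3, size=200)           # independent of the label
print(chi_square_score(x_relevant, y), chi_square_score(x_noise, y))
```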

2.4.4 Gini Index

Gini index [Gini (1912)] is also a widely used statistical measure to quantify whether the feature is able to separate instances from different classes. Given a feature fif_{i} with rr different feature values, suppose 𝒲\mathcal{W} and 𝒲¯\overline{\mathcal{W}} denote the sets of instances with feature values smaller than or equal to, and larger than, the jj-th feature value, respectively. In other words, the jj-th feature value separates the dataset into 𝒲\mathcal{W} and 𝒲¯\overline{\mathcal{W}}. The Gini index score for the feature fif_{i} is then given as follows:

gini_index_score(fi)=min𝒲(p(𝒲)(1s=1cp(Cs|𝒲)2)+p(𝒲¯)(1s=1cp(Cs|𝒲¯)2)),\small gini\_index\_score(f_{i})=\min_{\mathcal{W}}\left(p(\mathcal{W})(1-\sum_{s=1}^{c}p(C_{s}|\mathcal{W})^{2})+p(\overline{\mathcal{W}})(1-\sum_{s=1}^{c}p(C_{s}|\overline{\mathcal{W}})^{2})\right),\vskip-3.61371pt (37)

where p(.)p(.) denotes the probability; for instance, p(Cs|𝒲)p(C_{s}|\mathcal{W}) is the conditional probability of class ss given 𝒲\mathcal{W}. For binary classification, the Gini index can take a maximum value of 0.5; it can also be used in multi-class classification problems. Unlike the previous statistical measures, the lower the Gini index value, the more relevant the feature is.

2.4.5 CFS

The basic idea of CFS [Hall and Smith (1999)] is to use a correlation based heuristic to evaluate the worth of a feature subset 𝒮\mathcal{S}:

CFS_score(𝒮)=krcf¯k+k(k1)rff¯,\small CFS\_score(\mathcal{S})=\frac{k\overline{r_{cf}}}{\sqrt{k+k(k-1)\overline{r_{ff}}}}, (38)

where the CFS score measures the heuristic “merit” of the feature subset 𝒮\mathcal{S} with kk features, rcf¯\overline{r_{cf}} is the mean feature-class correlation, and rff¯\overline{r_{ff}} is the average feature-feature correlation. In Eq. (38), the numerator indicates the predictive power of the feature set while the denominator reflects how much redundancy there is within the feature set. The basic idea is that a good feature subset should be strongly correlated with the class labels and weakly intercorrelated. To obtain the feature-class and feature-feature correlations, CFS uses symmetrical uncertainty [Vetterling et al. (1992)]. As finding the globally optimal subset is computationally prohibitive, CFS adopts a best-first search strategy to find a locally optimal feature subset. At the very beginning, it computes the utility of each feature by considering both feature-class and feature-feature correlations. It then starts with an empty set and expands the set with the feature of the highest utility until some stopping criterion is satisfied.
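The sketch below computes the CFS merit of Eq. (38) from precomputed feature-class and feature-feature correlations (filled in by hand here) and runs a simple greedy forward search; it is a toy version of the best-first strategy described above.

```python
import numpy as np

def cfs_merit(subset, r_cf, r_ff):
    """CFS merit (Eq. (38)) of a feature subset, given a vector of feature-class
    correlations r_cf and a matrix of feature-feature correlations r_ff."""
    k = len(subset)
    rcf_bar = np.mean(r_cf[subset])
    if k == 1:
        rff_bar = 0.0
    else:
        sub = r_ff[np.ix_(subset, subset)]
        rff_bar = (sub.sum() - np.trace(sub)) / (k * (k - 1))
    return k * rcf_bar / np.sqrt(k + k * (k - 1) * rff_bar)

# Toy correlations: features 0 and 1 are relevant but highly redundant.
r_cf = np.array([0.6, 0.55, 0.1, 0.5])
r_ff = np.array([[1.0, 0.9, 0.1, 0.2],
                 [0.9, 1.0, 0.1, 0.2],
                 [0.1, 0.1, 1.0, 0.1],
                 [0.2, 0.2, 0.1, 1.0]])

selected, remaining = [], list(range(4))
while remaining:                          # greedy forward search on the merit
    best = max(remaining, key=lambda f: cfs_merit(selected + [f], r_cf, r_ff))
    if selected and cfs_merit(selected + [best], r_cf, r_ff) <= cfs_merit(selected, r_cf, r_ff):
        break
    selected.append(best)
    remaining.remove(best)
print(selected)                           # picks feature 0, then the non-redundant 3
```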

Discussion: Most statistical based feature selection methods rely on predefined statistical measures to filter out unwanted features and are simple and straightforward in nature, with very low computational costs. For this reason, they are often used as a preprocessing step before applying more sophisticated feature selection algorithms. Also, like similarity based feature selection methods, they often evaluate the importance of features individually and hence cannot handle feature redundancy. Meanwhile, most algorithms in this family can only work on discrete data, so conventional data discretization techniques are required to preprocess numerical and continuous variables.

2.5 Other Methods

In this subsection, we present other feature selection methods that do not belong to the above four types of feature selection algorithms. In particular, we review hybrid feature selection methods, deep learning based methods, and reconstruction based methods.

Hybrid feature selection methods are a kind of ensemble-based method that aims to construct a group of feature subsets from different feature selection algorithms and then produce an aggregated result out of the group. In this way, the instability and perturbation issues of most single feature selection algorithms can be alleviated, and the subsequent learning tasks can be enhanced. Similar to conventional ensemble learning methods [Zhou (2012)], hybrid feature selection methods consist of two steps: (1) construct a set of different feature selection results; and (2) aggregate the different outputs into a consensus result. Different methods differ in how these two steps are performed. For the first step, existing methods either ensemble the selected feature subsets of a single method on different sample subsets or ensemble the selected feature subsets from multiple feature selection algorithms. In the first case, a sampling method to obtain different sample subsets is necessary; typical sampling methods include random sampling and bootstrap sampling. For example, [Saeys et al. (2008)] studied ensemble feature selection which aggregates a conventional feature selection algorithm such as RELIEF over multiple bootstrapped samples of the training data. In [Abeel et al. (2010)], the authors improved the stability of the SVM-RFE feature selection algorithm by applying multiple random samplings of the original data. The second step involves aggregating the rankings of the multiple selected feature subsets. Most of the existing methods employ a simple yet effective linear aggregation function [Saeys et al. (2008), Abeel et al. (2010), Yang and Mao (2011)]. Nonetheless, other rank aggregation functions such as the Markov chain-based method [Dutkowski and Gambin (2007)], the distance synthesis method [Yang et al. (2005)], and the stacking method [Netzer et al. (2009)] are also widely used. In addition to using an aggregation function, another way is to identify the consensus features directly from multiple sample subsets [Loscalzo et al. (2009)].

Nowadays, deep learning techniques are popular and successful in various real-world applications, especially in computer vision and natural language processing. Deep learning is distinct from feature selection as deep learning leverages deep neural network structures to learn new feature representations while feature selection directly finds relevant features from the original features. From this perspective, the results of feature selection are more human readable and interpretable. Even though deep learning is mainly used for feature learning, there are still some attempts that use deep learning techniques for feature selection, and we briefly review these deep learning based feature selection methods. For example, in [Li et al. (2015a)], a deep feature selection model (DFS) is proposed. DFS selects features at the input level of a deep neural network. Typically, it adds a sparse one-to-one linear layer between the input layer and the first hidden layer of a multilayer perceptron (MLP). To achieve feature selection, DFS imposes a sparse regularization term on this layer, and only the features corresponding to nonzero weights are selected. Similarly, in [Roy et al. (2015)], the authors also propose to select features at the input level of a deep neural network. The difference is that they propose a new concept, net positive contribution, to assess whether features are likely to make the neurons contribute to the classification phase. Since heterogeneous (multi-view) features are prevalent in machine learning and pattern recognition applications, [Zhao et al. (2015)] proposes to combine deep neural networks with sparse representation for grouped heterogeneous feature selection. It first extracts a new unified representation from each feature group using a multi-modal neural network. Then the importance of features is learned by a kind of sparse group lasso method. In [Wang et al. (2014a)], the authors propose an attentional neural network, which guides feature selection with cognitive bias. It consists of two modules, a segmentation module and a classification module. First, given a cognitive bias vector, the segmentation module segments out an object belonging to one of the classes in the input image. Then, in the classification module, a reconstruction function is applied to the segment to gate the raw image with a threshold for classification. When features are sensitive to a cognitive bias, the cognitive bias will activate the corresponding relevant features.

Recently, data reconstruction error has emerged as a new criterion for feature selection, especially for unsupervised feature selection. It defines feature relevance as the capability of features to approximate the original data via a reconstruction function. Among these methods, Convex Principal Feature Selection (CPFS) [Masaeli et al. (2010)] reformulates the feature selection problem as a convex continuous optimization problem that minimizes a mean-squared-reconstruction error with linear and sparsity constraints. GreedyFS [Farahat et al. (2011)] uses a projection matrix to project the original data onto the span of some representative feature vectors and derives an efficient greedy algorithm to obtain these representative features. Zhao et al. [Zhao et al. (2016)] formulate the problem of unsupervised feature selection as graph regularized data reconstruction; the basic idea is to make the selected features well preserve the data manifold structure of the original data and reconstruct each data sample via linear reconstruction. A pass-efficient unsupervised feature selection method is proposed in [Maung and Schweitzer (2013)]. It can be regarded as a modification of the classical pivoted QR algorithm, and the basic idea is still to select representative features that minimize the reconstruction error via a linear function. The aforementioned methods mostly use linear reconstruction functions; [Li et al. (2017b)] argues that the reconstruction function is not necessarily linear and proposes to learn the reconstruction function automatically from data. In particular, they define a scheme to embed the reconstruction function learning into feature selection.

3 Feature Selection with Structured Features

Existing feature selection methods for conventional data are based on a strong assumption that features are independent of each other (flat) while ignoring the inherent feature structures. However, in many real applications features can exhibit various kinds of structures, e.g., spatial or temporal smoothness, disjoint groups, overlapping groups, trees and graphs [Tibshirani et al. (2005), Jenatton et al. (2011), Yuan et al. (2011), Huang et al. (2011), Zhou et al. (2012), Wang and Ye (2015)]. If this is the case, feature selection algorithms incorporating knowledge about the structure information may help find more relevant features and can therefore improve subsequent learning tasks. One motivating example is from bioinformatics: in the study of array CGH, features have a natural spatial order, and incorporating such spatial structure can help select more important features and achieve higher classification accuracy. Therefore, in this section, we discuss some representative feature selection algorithms which explicitly consider feature structures. Specifically, we will focus on group structures, tree structures and graph structures.

A popular and successful approach to achieve feature selection with structured features is to minimize an empirical error penalized by a structural regularization term:

𝐰=argmin𝐰loss(𝐰;𝐗,𝐲)+αpenalty(𝐰,𝒢),\small{\bf w}=\arg\!\min_{{\bf w}}\,loss({\bf w};{\bf X},{\bf y})+\alpha\,penalty({\bf w},\mathcal{G}),\vskip-3.61371pt (39)

where 𝒢\mathcal{G} denotes the structures among features and α\alpha is a trade-off parameter between the loss function and the structural regularization term. To achieve feature selection, penalty(𝐰,𝒢)penalty({\bf w},\mathcal{G}) is usually set to be a sparse regularization term. Note that the above formulation is similar to that in Eq. (26), the only difference is that for feature selection with structured features, we explicitly consider the structural information 𝒢\mathcal{G} among features in the sparse regularization term.

3.1 Feature Selection with Group Feature Structures

Figure 3: Illustration of Lasso, Group Lasso, and Sparse Group Lasso. The feature set can be divided into four groups G1G_{1}, G2G_{2}, G3G_{3} and G4G_{4}. The column with dark color denotes selected features while the column with light color denotes unselected features.

First, features could exhibit group structures. One of the most common examples is that in multifactor analysis-of-variance (ANOVA), each factor is associated with several groups and can be expressed by a set of dummy features [Yuan and Lin (2006)]. Some other examples include different frequency bands represented as groups in signal processing [McAuley et al. (2005)] and genes with similar functionalities acting as groups in bioinformatics [Ma et al. (2007)]. Therefore, when performing feature selection, it is more appealing to model the group structure explicitly.

3.1.1 Group Lasso

Group Lasso [Yuan and Lin (2006), Bach (2008), Jacob et al. (2009), Meier et al. (2008)], which drives the feature coefficients of certain groups to be small or exactly zero, is a solution to this problem. In other words, it selects or ignores a group of features as a whole. The difference between Lasso and Group Lasso is shown by the illustrative example in Fig. 3. Suppose that these features come from 4 different groups and there is no overlap between these groups. Lasso completely ignores the group structures among features, and the selected features are from four different groups. On the contrary, Group Lasso tends to select or not select features from different groups as a whole. As shown in the figure, Group Lasso only selects the second and the fourth groups G2G_{2} and G4G_{4}; features in the other two groups G1G_{1} and G3G_{3} are not selected. Mathematically, Group Lasso first uses a 2\ell_{2}-norm regularization term for the feature coefficients 𝐰i{\bf w}_{i} in each group GiG_{i}, and then it performs a 1\ell_{1}-norm regularization over all previous 2\ell_{2}-norm terms. The objective function of Group Lasso is formulated as follows:

min𝐰loss(𝐰;𝐗,𝐲)+αi=1ghi𝐰Gi2,\small\min_{{\bf w}}\,loss({\bf w};{\bf X},{\bf y})+\alpha\,\sum_{i=1}^{g}h_{i}\|{\bf w}_{G_{i}}\|_{2},\vskip-3.61371pt (40)

where hih_{i} is a weight for the ii-th group 𝐰Gi{\bf w}_{G_{i}}, which can be considered as a prior measuring the contribution of the ii-th group in the feature selection process.
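A minimal proximal-gradient sketch of Eq. (40) with a least-squares loss and unit group weights hi=1h_{i}=1 is given below. The block soft-thresholding step is the proximal operator of the group penalty; the synthetic data, the group layout and all parameter values are illustrative assumptions.

```python
import numpy as np

def group_lasso(X, y, groups, alpha=1.0, n_iter=500):
    """Proximal-gradient sketch of Eq. (40) with a least-squares loss and h_i = 1.
    `groups` is a list of index arrays describing non-overlapping groups."""
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2            # 1 / Lipschitz constant of the loss
    for _ in range(n_iter):
        z = w - step * (X.T @ (X @ w - y))            # gradient step on 0.5 ||Xw - y||^2
        for g in groups:                              # prox: block soft-thresholding
            norm_g = np.linalg.norm(z[g])
            z[g] *= max(0.0, 1.0 - step * alpha / (norm_g + 1e-12))
        w = z
    return w

rng = np.random.RandomState(0)
X = rng.randn(100, 12)
groups = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9), np.arange(9, 12)]
w_true = np.zeros(12)
w_true[3:6] = [1.0, -1.5, 2.0]                        # only the second group matters
y = X @ w_true + 0.1 * rng.randn(100)
w = group_lasso(X, y, groups, alpha=5.0)
print([round(np.linalg.norm(w[g]), 3) for g in groups])   # per-group coefficient norms
```

In this toy run only the second group retains a nonzero block of coefficients, which is exactly the whole-group selection behavior described above.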

3.1.2 Sparse Group Lasso

Once Group Lasso selects a group, all the features in the selected group will be kept. However, in many cases, not all features in the selected group could be useful, and it is desirable to consider the intrinsic feature structures and select features from different selected groups simultaneously (as illustrated in Fig. 3). Sparse Group Lasso [Friedman et al. (2010), Peng et al. (2010)] takes advantage of both Lasso and Group Lasso, and it produces a solution with simultaneous intra-group and inter-group sparsity. The sparse regularization term of Sparse Group Lasso is a combination of the penalty term of Lasso and Group Lasso:

min𝐰loss(𝐰;𝐗,𝐲)+α𝐰1+(1α)i=1ghi𝐰Gi2,\small\min_{{\bf w}}\,loss({\bf w};{\bf X},{\bf y})+\alpha\|{\bf w}\|_{1}+(1-\alpha)\,\sum_{i=1}^{g}h_{i}\|{\bf w}_{G_{i}}\|_{2},\vskip-3.61371pt (41)

where α\alpha is a parameter between 0 and 1 that balances the contribution of inter-group sparsity and intra-group sparsity for feature selection. The differences among Lasso, Group Lasso and Sparse Group Lasso are shown in Fig. 3.

3.1.3 Overlapping Sparse Group Lasso

The above methods consider disjoint group structures among features. However, groups may also overlap with each other [Jacob et al. (2009), Jenatton et al. (2011), Zhao et al. (2009)]. One motivating example is the usage of biologically meaningful gene/protein groups mentioned in [Ye and Liu (2012)]: different groups of genes may overlap, i.e., one protein/gene may belong to multiple groups. A general Overlapping Sparse Group Lasso regularization is similar to the regularization term of Sparse Group Lasso; the difference is that different feature groups GiG_{i} can overlap, i.e., there exist at least two groups GiG_{i} and GjG_{j} such that GiGjG_{i}\bigcap G_{j}\neq\emptyset.

3.2 Feature Selection with Tree Feature Structures

In addition to group structures, features can also exhibit tree structures. For example, in face recognition, different pixels can be represented as a tree, where the root node indicates the whole face, its child nodes can be different organs, and each specific pixel is considered as a leaf node. Another motivating example is that genes/proteins may form certain hierarchical tree structures [Liu and Ye (2010)]. Recently, Tree-guided Group Lasso has been proposed to handle feature selection for features that can be represented in an index tree [Kim and Xing (2010), Liu and Ye (2010), Jenatton et al. (2010)].

Figure 4: Illustration of the tree structure among features. These eight features form a simple index tree with a depth of 3.

3.2.1 Tree-guided Group Lasso

In Tree-guided Group Lasso [Liu and Ye (2010)], the structure over the features can be represented as a tree with leaf nodes as features. Each internal node denotes a group of features such that the internal node is considered as a root of a subtree and the group of features is considered as leaf nodes. Each internal node in the tree is associated with a weight that represents the height of its subtree, or how tightly the features in this subtree are correlated.

In Tree-guided Group Lasso, for an index tree 𝒢\mathcal{G} with a depth of dd, 𝒢i={G1i,G2i,,Gnii}\mathcal{G}_{i}=\{G_{1}^{i},G_{2}^{i},...,G_{n_{i}}^{i}\} denotes the whole set of nodes (features) in the ii-th level (the root node is in level 0), and nin_{i} denotes the number of nodes in the level ii. Nodes in Tree-guided Group Lasso have to satisfy the following two conditions: (1) internal nodes from the same depth level have non-overlapping indices, i.e., GjiGki=G_{j}^{i}\bigcap G_{k}^{i}=\emptyset, i=1,2,,d\forall i=1,2,...,d, jkj\neq k, ij,knii\leq j,k\leq n_{i}; and (2) if Gmi1G_{m}^{i-1} is the parent node of GjiG_{j}^{i}, then GjiGmi1G_{j}^{i}\subseteq G_{m}^{i-1}.

We explain these conditions via an illustrative example in Fig. 4. In the figure, we can observe that the 8 features are organized in an index tree of depth 3. For the internal nodes in each level, we have G10={f1,f2,f3,f4,f5,f6,f7,f8}G_{1}^{0}=\{f_{1},f_{2},f_{3},f_{4},f_{5},f_{6},f_{7},f_{8}\}, G11={f1,f2},G21={f3,f4,f5,f6,f7},G31={f8}G_{1}^{1}=\{f_{1},f_{2}\},G_{2}^{1}=\{f_{3},f_{4},f_{5},f_{6},f_{7}\},G_{3}^{1}=\{f_{8}\}, G12={f1,f2},G22={f3,f4},G32={f5,f6,f7}G_{1}^{2}=\{f_{1},f_{2}\},G_{2}^{2}=\{f_{3},f_{4}\},G_{3}^{2}=\{f_{5},f_{6},f_{7}\}. G10G_{1}^{0} is the root node of the index tree. In addition, internal nodes from the same level do not overlap, while a parent node and its child node do overlap such that the features of the child node are a subset of those of the parent node. With these definitions, the objective function of Tree-guided Group Lasso is:

min𝐰loss(𝐰;𝐗,𝐲)+αi=0dj=1nihji𝐰Gji2,\small\min_{{\bf w}}\,loss({\bf w};{\bf X},{\bf y})+\alpha\,\sum_{i=0}^{d}\sum_{j=1}^{n_{i}}h_{j}^{i}\|{\bf w}_{G_{j}^{i}}\|_{2},\vskip-3.61371pt (42)

where α0\alpha\geq 0 is a regularization parameter and hji0h_{j}^{i}\geq 0 is a predefined parameter to measure the contribution of the internal node GjiG_{j}^{i}. Since a parent node is a superset of its child nodes, if a parent node is not selected, none of its child nodes will be selected. For example, as illustrated in Fig. 4, if the internal node G21G_{2}^{1} is not selected, neither of its child nodes G22G_{2}^{2} and G32G_{3}^{2} will be selected.

3.3 Feature Selection with Graph Feature Structures

In many cases, features may have strong pairwise interactions. For example, in natural language processing, if we take each word as a feature, we have synonym and antonym relationships between different words [Fellbaum (1998)]. Moreover, many biological studies show that there exist strong pairwise dependencies between genes. Since features show certain kinds of dependencies in these cases, we can model them by an undirected graph, where nodes represent features and edges among nodes show the pairwise dependencies between features [Sandler et al. (2009), Kim and Xing (2009), Yang et al. (2012)]. We use an undirected graph 𝒢(N,E)\mathcal{G}(N,E) to encode these dependencies. Assume that there are nn nodes N={N1,N2,,Nn}N=\{N_{1},N_{2},...,N_{n}\} and a set of ee edges {E1,E2,,Ee}\{E_{1},E_{2},...,E_{e}\} in 𝒢(N,E)\mathcal{G}(N,E). Then node NiN_{i} corresponds to the ii-th feature, and the pairwise feature dependencies can be represented by an adjacency matrix 𝐀n×n{\bf A}\in\mathbb{R}^{n\times n}.

3.3.1 Graph Lasso

Since features exhibit graph structures, when two nodes (features) NiN_{i} and NjN_{j} are connected by an edge in 𝒢(N,E)\mathcal{G}(N,E), the features fif_{i} and fjf_{j} are more likely to be selected together, and they should have similar feature coefficients. One way to achieve this target is via Graph Lasso – adding a graph regularizer for the feature graph on the basis of Lasso [Ye and Liu (2012)]. The formulation is:

min𝐰loss(𝐰;𝐗,𝐲)+α𝐰1+(1α)i,j𝐀(i,j)(𝐰i𝐰j)2,\small\min_{{\bf w}}\,loss({\bf w};{\bf X},{\bf y})+\alpha\|{\bf w}\|_{1}+(1-\alpha)\,\sum_{i,j}{\bf A}(i,j)({\bf w}_{i}-{\bf w}_{j})^{2},\vskip-3.61371pt (43)

where the first regularization term α𝐰1\alpha\|{\bf w}\|_{1} is from Lasso while the second term ensures that if a pair of features show strong dependency, i.e., large 𝐀(i,j){\bf A}(i,j), their feature coefficients should also be similar to each other.
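Because the graph penalty is smooth (it equals 2𝐰𝐋𝐰2{\bf w}^{\prime}{\bf L}{\bf w} for the graph Laplacian 𝐋{\bf L}), Eq. (43) can be handled by proximal gradient descent with a soft-thresholding step for the 1\ell_{1} term. The sketch below follows this route; the overall regularization weight lam and the toy feature graph are our own choices, not part of the original formulation.

```python
import numpy as np

def graph_lasso_fs(X, y, A, alpha=0.5, lam=1.0, n_iter=1000):
    """Proximal-gradient sketch of Eq. (43):
    0.5 ||Xw - y||^2 + lam * [ alpha ||w||_1 + (1 - alpha) sum_ij A(i,j)(w_i - w_j)^2 ]."""
    L = np.diag(A.sum(axis=1)) - A       # graph Laplacian; the quadratic penalty
    # equals 2 w' L w, so its gradient is 4 lam (1 - alpha) L w.
    smooth_hess = X.T @ X + 4.0 * lam * (1.0 - alpha) * L
    step = 1.0 / np.linalg.norm(smooth_hess, 2)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) + 4.0 * lam * (1.0 - alpha) * (L @ w)
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam * alpha, 0.0)   # L1 prox
    return w

rng = np.random.RandomState(0)
X = rng.randn(80, 6)
A = np.zeros((6, 6))
A[0, 1] = A[1, 0] = 1.0              # features 0 and 1 are connected in the graph
w_true = np.array([1.0, 1.0, 0.0, 0.0, -1.5, 0.0])
y = X @ w_true + 0.1 * rng.randn(80)
print(np.round(graph_lasso_fs(X, y, A, alpha=0.5, lam=4.0), 2))
```

The connected pair of features ends up with similar nonzero coefficients while the irrelevant features are driven to zero, which is the qualitative behavior the regularizer is designed to produce.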

3.3.2 GFLasso

In Eq. (43), Graph Lasso encourages connected features to have similar feature coefficients. However, features can also be negatively correlated. In this case, the feature graph 𝒢(N,E)\mathcal{G}(N,E) is represented by a signed graph, with both positive and negative edges. GFLasso [Kim and Xing (2009)] is proposed to model both positive and negative feature correlations; its objective function is:

min𝐰loss(𝐰;𝐗,𝐲)+α𝐰1+(1α)i,j𝐀(i,j)|𝐰isign(ri,j)𝐰j|,\small\min_{{\bf w}}\,loss({\bf w};{\bf X},{\bf y})+\alpha\|{\bf w}\|_{1}+(1-\alpha)\,\sum_{i,j}{\bf A}(i,j)|{\bf w}_{i}-\mbox{sign}(r_{i,j}){\bf w}_{j}|,\vskip-3.61371pt (44)

where ri,jr_{i,j} indicates the correlation between two features fif_{i} and fjf_{j}. When two features are positively correlated, we have 𝐀(i,j)=1{\bf A}(i,j)=1 and ri,j>0r_{i,j}>0, and the penalty term forces the feature coefficients 𝐰i{\bf w}_{i} and 𝐰j{\bf w}_{j} to be similar; on the other hand, if two features are negatively correlated, we have 𝐀(i,j)=1{\bf A}(i,j)=1 and ri,j<0r_{i,j}<0, and the penalty term forces the feature coefficients 𝐰i{\bf w}_{i} and 𝐰j{\bf w}_{j} to be dissimilar. A major limitation of GFLasso is that it uses pairwise sample correlations to measure feature dependencies, which may lead to additional estimation bias; the feature dependencies cannot be correctly estimated when the sample size is small.

3.3.3 GOSCAR

To address the limitations of GFLasso, [Yang et al. (2012)] propose GOSCAR, which uses a pairwise \ell_{\infty}-norm regularization to encourage the coefficients of two features to be of equal magnitude if the features are connected in the feature graph. The formulation is:

min𝐰loss(𝐰;𝐗,𝐲)+α𝐰1+(1α)i,j𝐀(i,j)max(|𝐰i|,|𝐰j|).\small\min_{{\bf w}}\,loss({\bf w};{\bf X},{\bf y})+\alpha\|{\bf w}\|_{1}+(1-\alpha)\,\sum_{i,j}{\bf A}(i,j)\mbox{max}(|{\bf w}_{i}|,|{\bf w}_{j}|). (45)

In the above formulation, the 1\ell_{1}-norm regularization is used for feature selection while the pairwise \ell_{\infty}-norm term penalizes large coefficients. The pairwise \ell_{\infty}-norm term can be decomposed as max(|𝐰i|,|𝐰j|)=12(|𝐰i+𝐰j|+|𝐰i𝐰j|)=|𝐮𝐰|+|𝐯𝐰|\mbox{max}(|{\bf w}_{i}|,|{\bf w}_{j}|)=\frac{1}{2}(|{\bf w}_{i}+{\bf w}_{j}|+|{\bf w}_{i}-{\bf w}_{j}|)=|{\bf u}^{\prime}{\bf w}|+|{\bf v}^{\prime}{\bf w}|, where 𝐮{\bf u} and 𝐯{\bf v} are sparse vectors with only two nonzero entries such that 𝐮i=𝐮j=12{\bf u}_{i}={\bf u}_{j}=\frac{1}{2}, 𝐯i=𝐯j=12{\bf v}_{i}=-{\bf v}_{j}=\frac{1}{2}.

Discussion: This family of algorithms explicitly takes the structures among features as prior knowledge and feeds them into feature selection. Therefore, the selected features can enhance subsequent learning tasks. However, most of these methods are based on the sparse learning framework and often involve solving complex optimization problems, so the computational costs can be relatively high. Moreover, the feature structures are often given a priori; it is still a challenging problem to automatically infer the structures from data for feature selection.

4 Feature Selection with Heterogeneous Data

Traditional feature selection algorithms rely heavily on the i.i.d. assumption of data. However, heterogeneous data from different sources are becoming more and more prevalent in the era of big data. For example, in the medical domain, genes are often associated with different types of clinical features. Since data from each source can be noisy, partial, or redundant, how to find relevant sources and how to fuse them together for effective feature selection is a challenging problem. Another example arises in social media platforms, where instances of high dimensionality are often linked together; how to integrate link information to guide feature selection is another difficult problem. In this section, we review current feature selection algorithms for heterogeneous data from three aspects: (1) feature selection for linked data; (2) multi-source feature selection; and (3) multi-view feature selection. Note that multi-source and multi-view feature selection are different in two ways. First, multi-source feature selection aims to select features from the original feature space by integrating multiple sources, while multi-view feature selection selects features from different feature spaces for all views simultaneously. Second, multi-source feature selection normally ignores the correlations among sources, while multi-view feature selection exploits relations among features from different sources.

4.1 Feature Selection Algorithms with Linked Data

Linked data is ubiquitous in real-world applications such as Twitter (tweets linked by hyperlinks), Facebook (users connected by friendships) and biological systems (protein interactions). Due to different types of links, they are distinct from traditional attribute-value data (or so-called “flat” data).

Fig. 5 illustrates an example of linked data and its representation. Fig. 5(a) shows 8 linked instances; their feature information is illustrated in the left part of Fig. 5(b). Linked data provide an extra source of information, which can be represented by an adjacency matrix, illustrated in the right part of Fig. 5(b). Many learning tasks for linked data have been proposed, such as collective classification [Macskassy and Provost (2007), Sen et al. (2008)], relational learning [Long et al. (2006), Long et al. (2007), Li et al. (2017c)], link prediction [Liben-Nowell and Kleinberg (2007), Backstrom and Leskovec (2011), Chen et al. (2016)], and active learning [Bilgic et al. (2010), Hu et al. (2013)], but the task of feature selection is not well studied due to some unique challenges: (1) how to exploit relations among data instances; (2) how to take advantage of these relations for feature selection; and (3) since linked data are often unlabeled, how to evaluate the relevance of features without labels. Recent years have witnessed a surge of research interest in performing feature selection on linked data [Gu and Han (2011), Tang and Liu (2012a), Tang and Liu (2012b), Tang and Liu (2013), Wei et al. (2015), Wei et al. (2016b), Li et al. (2015b), Li et al. (2016b), Li et al. (2016a), Cheng et al. (2017)]. Next, we introduce some representative algorithms in this family.

(a) Linked Data
(b) Linked Data Representation
Figure 5: An illustrative example of linked data.

4.1.1 Feature Selection on Networks

In [Gu and Han (2011)], the authors propose a supervised feature selection algorithm (FSNet) based on Laplacian Regularized Least Squares (LapRLS). In detail, they propose to use a linear classifier to capture the relationship between content information and class labels, and incorporate link information by graph regularization. Suppose that 𝐗n×d{\bf X}\in\mathbb{R}^{n\times d} denotes the content matrix and 𝐘n×c{\bf Y}\in\mathbb{R}^{n\times c} denotes the one-hot label matrix, 𝐀{\bf A} denotes the adjacency matrix for all nn linked instances. FSNet first attempts to learn a linear classifier 𝐖d×c{\bf W}\in\mathbb{R}^{d\times c} to map 𝐗{\bf X} to 𝐘{\bf Y}:

min𝐖𝐗𝐖𝐘F2+α𝐖2,1+β𝐖F2.\small\min_{{\bf W}}\|{\bf XW}-{\bf Y}\|_{F}^{2}+\alpha\|{\bf W}\|_{2,1}+\beta\|{\bf W}\|_{F}^{2}.\vskip-3.61371pt (46)

The term 𝐖2,1\|{\bf W}\|_{2,1} is included to achieve joint feature sparsity across different classes. 𝐖F2\|{\bf W}\|_{F}^{2} prevents the overfitting of the model. To capture the correlation between link information and content information to select more relevant features, FSNet uses the graph regularization and the basic assumption is that if two instances are linked, their class labels are likely to be similar, which results in the following objective function:

min𝐖𝐗𝐖𝐘F2+α𝐖2,1+β𝐖F2+γtr(𝐖𝐗𝐋𝐗𝐖),\small\min_{{\bf W}}\|{\bf XW}-{\bf Y}\|_{F}^{2}+\alpha\|{\bf W}\|_{2,1}+\beta\|{\bf W}\|_{F}^{2}+\gamma tr({\bf W}^{\prime}{\bf X}^{\prime}{\bf L}{\bf XW}),\vskip-3.61371pt (47)

where tr(𝐖𝐗𝐋𝐗𝐖)tr({\bf W}^{\prime}{\bf X}^{\prime}{\bf L}{\bf XW}) is the graph regularization, and γ\gamma balances the contribution of content information and link information for feature selection.

4.1.2 Feature Selection for Social Media Data (LinkedFS)

[Tang and Liu (2012a)] investigate the feature selection problem on social media data by evaluating various social relations such as CoPost, CoFollowing, CoFollowed, and Following. These four types of relations are supported by social correlation theories such as homophily [McPherson et al. (2001)] and social influence [Marsden and Friedkin (1993)]. We use the CoPost relation as an example to illustrate how these relations can be integrated into feature selection. Let 𝐩={p1,p2,,pN}{\bf p}=\{p_{1},p_{2},...,p_{N}\} be the post set and 𝐗N×d{\bf X}\in\mathbb{R}^{N\times d} be the matrix representation of these posts; 𝐘N×c{\bf Y}\in\mathbb{R}^{N\times c} denotes the label matrix of the posts; 𝐮={u1,u2,,un}{\bf u}=\{u_{1},u_{2},...,u_{n}\} denotes the set of nn users and their link information is encoded in an adjacency matrix 𝐀{\bf A}; 𝐏n×N{\bf P}\in\mathbb{R}^{n\times N} denotes the user-post relationships such that 𝐏(i,j)=1{\bf P}(i,j)=1 if uiu_{i} posts pjp_{j}, and 0 otherwise. To integrate the CoPost relations among users into the feature selection framework, the authors propose to add a regularization term enforcing the hypothesis that the class labels (i.e., topics) of posts by the same user are similar, resulting in the following objective function:

min𝐖𝐗𝐖𝐘F2+α𝐖2,1+βu𝐮{pi,pj}𝐩u𝐗(i,:)𝐖𝐗(j,:)𝐖22,\small\min_{{\bf W}}\|{\bf XW}-{\bf Y}\|_{F}^{2}+\alpha\|{\bf W}\|_{2,1}+\beta\sum_{u\in{\bf u}}\sum_{\{p_{i},p_{j}\}\in{\bf p}_{u}}\|{\bf X}(i,:){\bf W}-{\bf X}(j,:){\bf W}\|_{2}^{2},\vskip-3.61371pt (48)

where 𝐩u{\bf p}_{u} denotes the set of posts by user uu. The parameter α\alpha controls the sparsity of 𝐖{\bf W} in rows across all class labels and β\beta controls the contribution of the CoPost relations.

4.1.3 Unsupervised Feature Selection for Linked Data

Linked Unsupervised Feature Selection (LUFS) [Tang and Liu (2012b)] is an unsupervised feature selection framework for linked data. Without label information to assess feature relevance, LUFS assumes the existence of pseudo labels, and uses 𝐘n×c{\bf Y}\in\mathbb{R}^{n\times c} to denote the pseudo label matrix such that each row of 𝐘{\bf Y} has only one nonzero entry. Also, LUFS assumes a linear mapping matrix 𝐖d×c{\bf W}\in\mathbb{R}^{d\times c} between the features 𝐗{\bf X} and 𝐘{\bf Y}. First, to consider the constraints from link information, LUFS employs the social dimension approach [Tang and Liu (2009)] to obtain the hidden factors 𝐇{\bf H} that incur the interdependency among instances. Then, following Linear Discriminant Analysis, the within, between and total hidden factor scatter matrices 𝐒w{\bf S}_{w}, 𝐒b{\bf S}_{b} and 𝐒t{\bf S}_{t} are defined as 𝐒w=𝐘𝐘𝐘𝐅𝐅𝐘{\bf S}_{w}={\bf Y}^{\prime}{\bf Y}-{\bf Y}^{\prime}{\bf F}{\bf F}^{\prime}{\bf Y}, 𝐒b=𝐘𝐅𝐅𝐘{\bf S}_{b}={\bf Y}^{\prime}{\bf F}{\bf F}^{\prime}{\bf Y}, 𝐒t=𝐘𝐘{\bf S}_{t}={\bf Y}^{\prime}{\bf Y} respectively, where 𝐅=𝐇(𝐇𝐇)12{\bf F}={\bf H}({\bf H}^{\prime}{\bf H})^{-\frac{1}{2}} is the weighted hidden factor matrix. Considering the fact that instances with similar hidden factors are similar and instances with different hidden factors are dissimilar, the constraint from link information can be incorporated by maximizing tr((𝐒t)1𝐒b)tr(({\bf S}_{t})^{-1}{\bf S}_{b}). Second, to take advantage of feature information, LUFS obtains the constraint from spectral analysis by minimizing tr(𝐘𝐋𝐘)tr({\bf Y}^{\prime}{\bf L}{\bf Y}), where 𝐋{\bf L} is the Laplacian matrix derived from the feature affinity matrix 𝐒{\bf S}. With these, the objective function of LUFS is formulated as follows:

minWtr(𝐘𝐋𝐘)αtr((𝐒t)1𝐒b),\small\min_{W}tr({\bf Y}^{\prime}{\bf L}{\bf Y})-\alpha tr(({\bf S}_{t})^{-1}{\bf S}_{b}),\vskip-3.61371pt (49)

where α\alpha is a regularization parameter to balance the contribution from these two constraints. To achieve feature selection, LUFS further adds a 2,1\ell_{2,1}-norm regularization term on 𝐖{\bf W}, and with spectral relaxation of the pseudo-class label matrix, the objective function in Eq. (49) can be eventually represented as:

min𝐖tr(𝐖(𝐗𝐋𝐗+α𝐗(𝐈n𝐅𝐅))𝐖)+β𝐖2,1s.t.𝐖(𝐗𝐗+λ𝐈d)𝐖=𝐈c,\small\begin{split}\min_{{\bf W}}tr({\bf W}^{\prime}({\bf X}^{\prime}{\bf LX}&+\alpha{\bf X}^{\prime}({\bf I}_{n}-{\bf F}{\bf F}^{\prime})){\bf W})+\beta\|{\bf W}\|_{2,1}\\ \mbox{s.t.}&\quad{\bf W}^{\prime}({\bf X}^{\prime}{\bf X}+\lambda{\bf I}_{d}){\bf W}={\bf I}_{c},\end{split}\vskip-3.61371pt (50)

where β\beta controls the sparsity of 𝐖{\bf W} in rows and λ𝐈d\lambda{\bf I}_{d} makes 𝐗𝐗+λ𝐈d{\bf X}^{\prime}{\bf X}+\lambda{\bf I}_{d} invertible.

4.1.4 Robust Unsupervised Feature Selection for Networked Data

LUFS performs network structure modeling and feature selection separately, and the feature selection heavily depends on the quality of extracted latent representations. In other words, the performance of LUFS will be jeopardized when there are a lot of noisy links in the network.  [Li et al. (2016b)] propose a robust unsupervised feature selection framework (NetFS) to embed latent representation learning into feature selection. Specifically, let 𝐗n×d{\bf X}\in\mathbb{R}^{n\times d} and 𝐀n×n{\bf A}\in\mathbb{R}^{n\times n} denote the feature matrix and adjacency matrix respectively. NetFS first uncovers a low-rank latent representation 𝐔{\bf U} by a symmetric NMF model. The latent representation describes a set of diverse affiliation factors hidden in a network, and instances with similar latent representations are more likely to be connected to each other than the instances with dissimilar latent representations. As latent factors encode some hidden attributes of instances, they should be related to some features. Thus, NetFS takes 𝐔{\bf U} as a constraint to perform feature selection via:

min𝐔0,𝐖𝐗𝐖𝐔F2+α𝐖2,1+β2𝐀𝐔𝐔F2,\small\min_{{\bf U}\geq 0,{\bf W}}\|{\bf XW}-{\bf U}\|_{F}^{2}+\alpha\|{\bf W}\|_{2,1}+\frac{\beta}{2}\|{\bf A}-{\bf U}{\bf U}^{\prime}\|_{F}^{2}, (51)

where α\alpha and β\beta are two balance parameters. By embedding latent representation learning into feature selection, these two phases can help and boost each other: feature information helps learn better latent representations which are robust to noisy links, and better latent representations can fill the gap of limited label information and rich link information to guide feature selection. The authors further extend the NetFS model to the dynamic case to obtain a subset of relevant features continuously when both the feature information and the network structure evolve over time [Li et al. (2016a)]. In addition to positive links, many real-world networks also contain negative links, such as distrust relations in Epinions and foes in Slashdot. Based on NetFS, the authors in [Cheng et al. (2017)] further study whether negative links have added value over positive links in finding more relevant features.

4.2 Multi-Source Feature Selection

For many learning tasks, we often have multiple data sources for the same set of data instances. For example, recent advancements in bioinformatics reveal that non-coding RNA species function across a variety of biological processes. The task of multi-source feature selection in this case is formulated as follows: given mm sources of data depicting the same set of nn instances, with matrix representations 𝐗1n×d1,𝐗2n×d2,,𝐗mn×dm{\bf X}_{1}\in\mathbb{R}^{n\times d_{1}},{\bf X}_{2}\in\mathbb{R}^{n\times d_{2}},...,{\bf X}_{m}\in\mathbb{R}^{n\times d_{m}} (where d1,,dmd_{1},...,d_{m} denote the feature dimensions), select a subset of relevant features from a target source (e.g., 𝐗i{\bf X}_{i}) by taking advantage of all information from the mm sources.

4.2.1 Multi-Source Feature Selection via Geometry-Dependent Covariance Analysis (GDCOV)

To integrate information from multiple sources, [Zhao and Liu (2008)] propose an intuitive way to learn a global geometric pattern from all sources that reflects the intrinsic relationships among instances [Lanckriet et al. (2004)]. They introduce the concept of a geometry-dependent covariance that enables the use of the global geometric pattern in covariance analysis for feature selection. Given multiple local geometric patterns encoded in affinity matrices ${\bf S}_{i}$, where $i$ denotes the $i$-th data source, a global pattern can be obtained by linearly combining all affinity matrices as ${\bf S}=\sum_{i=1}^{m}\alpha_{i}{\bf S}_{i}$, where $\alpha_{i}$ controls the contribution of the $i$-th source. With the global geometric pattern obtained from multiple data sources, one can build a geometry-dependent sample covariance matrix for the target source ${\bf X}_{i}$ as ${\bf C}=\frac{1}{n-1}{\bf \Pi}{\bf X}_{i}^{\prime}({\bf S}-\frac{{\bf S}{\bf 1}{\bf 1}^{\prime}{\bf S}}{{\bf 1}^{\prime}{\bf S}{\bf 1}}){\bf X}_{i}{\bf \Pi}$, where ${\bf \Pi}$ is a diagonal matrix with ${\bf \Pi}(j,j)=\|{\bf D}^{\frac{1}{2}}{\bf X}_{i}(:,j)\|^{-1}$, and ${\bf D}$ is the diagonal degree matrix of ${\bf S}$ with ${\bf D}(k,k)=\sum_{j=1}^{n}{\bf S}(k,j)$.

After obtaining the geometry-dependent sample covariance matrix, a subsequent question is how to use it effectively for feature selection. Two methods are proposed. The first, GPCOVvar, sorts the diagonal of the covariance matrix and selects the features with the highest variances; selecting features in this way is equivalent to choosing the features that are most consistent with the global geometry pattern. The second, GPCOVspca, applies Sparse Principal Component Analysis (SPCA) [d’Aspremont et al. (2007)] to select features that maximally retain the total variance; it thus considers interactions among features and can select features with less redundancy.
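As a concrete illustration of the two formulas above, the following Python sketch builds the geometry-dependent sample covariance matrix from a linear combination of affinity matrices and ranks features by its diagonal (the GPCOVvar strategy). The function and argument names are our own; the combination weights are assumed to be given.

```python
import numpy as np

def gdcov_variance_scores(X_target, affinity_matrices, alphas):
    """Sketch of GPCOVvar: rank features of the target source by the diagonal
    of the geometry-dependent covariance matrix defined above.
    X_target: (n, d) target source; affinity_matrices: list of (n, n) S_i;
    alphas: combination weights of the sources."""
    n, d = X_target.shape
    # global geometric pattern S = sum_i alpha_i S_i
    S = sum(a * S_i for a, S_i in zip(alphas, affinity_matrices))
    # centering term S - S 1 1' S / (1' S 1)
    S_centered = S - np.outer(S.sum(axis=1), S.sum(axis=0)) / S.sum()
    # Pi(j, j) = 1 / ||D^{1/2} X(:, j)||, with D the degree matrix of S
    D_sqrt = np.diag(np.sqrt(S.sum(axis=1)))
    Pi = np.diag(1.0 / np.linalg.norm(D_sqrt @ X_target, axis=0))
    C = (Pi @ X_target.T @ S_centered @ X_target @ Pi) / (n - 1)
    return np.diag(C)   # larger diagonal entries agree more with the global geometry

# usage sketch: scores = gdcov_variance_scores(X1, [S1, S2], [0.6, 0.4])
#               selected = np.argsort(scores)[::-1][:num_features]
```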

4.3 Feature Selection Algorithms with Multi-View Data

Multi-view data represent different facets of data instances in different feature spaces. These feature spaces are naturally dependent and high-dimensional, which gives rise to the task of multi-view feature selection [Feng et al. (2013), Tang et al. (2013), Wang et al. (2013), Liu et al. (2016a)]: selecting features from different feature spaces simultaneously by exploiting their relations. One motivating example is to select relevant features from the pixels, tags, and terms associated with images simultaneously. Since multi-view feature selection selects features across multiple views by using their relations, it is naturally different from multi-source feature selection, which only selects features from the target source. The difference between multi-source and multi-view feature selection is illustrated in Fig. 6.

Figure 6: Differences between multi-source and multi-view feature selection. (a) Multi-Source Feature Selection; (b) Multi-View Feature Selection.

For supervised multi-view feature selection, the most common approach is Sparse Group Lasso [Friedman et al. (2010), Peng et al. (2010)]. In this subsection, we review some representative algorithms for unsupervised multi-view feature selection.

4.3.1 Adaptive Multi-View Feature Selection

Adaptive unsupervised multi-view feature selection (AUMFS) [Feng et al. (2013)] takes advantage of the data cluster structure, the data similarity, and the correlations among views simultaneously. Specifically, let ${\bf X}_{1}\in\mathbb{R}^{n\times d_{1}},{\bf X}_{2}\in\mathbb{R}^{n\times d_{2}},...,{\bf X}_{m}\in\mathbb{R}^{n\times d_{m}}$ denote the descriptions of the $n$ instances from the $m$ different views, and let ${\bf X}=[{\bf X}_{1},{\bf X}_{2},...,{\bf X}_{m}]\in\mathbb{R}^{n\times d}$ denote the concatenated data, where $d=d_{1}+d_{2}+...+d_{m}$. AUMFS first builds a feature selection model by combining an $\ell_{2,1}$-norm loss function with $\ell_{2,1}$-norm regularization:

\min_{{\bf W},{\bf F}}\|{\bf X}{\bf W}-{\bf F}\|_{2,1}+\alpha\|{\bf W}\|_{2,1}, \qquad (52)

where ${\bf F}\in\mathbb{R}^{n\times c}$ is the pseudo class label matrix. The $\ell_{2,1}$-norm loss function is imposed because it is robust to outliers, and the $\ell_{2,1}$-norm regularization selects features across all $c$ pseudo class labels with joint sparsity. AUMFS then uses spectral clustering on affinity matrices from different views to learn the shared pseudo class labels. For the data matrix ${\bf X}_{i}$ of each view, it first builds an affinity matrix ${\bf S}_{i}$ based on the data similarity in that view and obtains the corresponding Laplacian matrix ${\bf L}_{i}$. It then learns the pseudo class label matrix by joint spectral clustering over all views. Integrating this with Eq. (52), we have the following objective function:

\min \; tr\left({\bf F}^{\prime}\sum_{i=1}^{m}\lambda_{i}{\bf L}_{i}{\bf F}\right)+\beta\left(\|{\bf X}{\bf W}-{\bf F}\|_{2,1}+\alpha\|{\bf W}\|_{2,1}\right) \quad \mbox{s.t.}\quad {\bf F}^{\prime}{\bf F}={\bf I}_{c},\; {\bf F}\geq 0,\; \sum_{i=1}^{m}\lambda_{i}=1,\; \lambda_{i}\geq 0, \qquad (53)

where the contribution of each view to the joint spectral clustering is balanced by a nonnegative weight $\lambda_{i}$ with $\sum_{i=1}^{m}\lambda_{i}=1$, and $\beta$ balances the contributions of spectral clustering and feature selection.

4.3.2 Unsupervised Feature Selection for Multi-View Data

AUMFS [Feng et al. (2013)] learns a single feature weight matrix for all features from different views to approximate the pseudo class labels. [Tang et al. (2013)] propose a novel unsupervised feature selection method called Multi-View Feature Selection (MVFS). Similar to AUMFS, MVFS uses spectral clustering with the affinity matrices from different views to learn the pseudo class labels. It differs from AUMFS in that it learns one feature weight matrix for each view to fit the pseudo class labels, using a joint least squares loss and $\ell_{2,1}$-norm regularization. The optimization problem of MVFS is formulated as follows:

\min \; tr\left({\bf F}^{\prime}\sum_{i=1}^{m}\lambda_{i}{\bf L}_{i}{\bf F}\right)+\sum_{i=1}^{m}\beta\left(\|{\bf X}_{i}{\bf W}_{i}-{\bf F}\|_{2,1}+\alpha\|{\bf W}_{i}\|_{2,1}\right) \quad \mbox{s.t.}\quad {\bf F}^{\prime}{\bf F}={\bf I}_{c},\; {\bf F}\geq 0,\; \sum_{i=1}^{m}\lambda_{i}=1,\; \lambda_{i}\geq 0. \qquad (54)

The parameter $\lambda_{i}$ controls the contribution of each view, with $\sum_{i=1}^{m}\lambda_{i}=1$.

4.3.3 Multi-View Clustering and Feature Learning via Structured Sparsity

In some cases, features from a certain view contain more discriminative information than features from the other views. For example, in image processing, color features are more useful than other types of features in identifying stop signs. To address this issue in multi-view feature selection, a novel feature selection algorithm is proposed in [Wang et al. (2013)] with a joint group $\ell_{1}$-norm and $\ell_{2,1}$-norm regularization.

For the feature weight matrices ${\bf W}_{1},...,{\bf W}_{m}$ of the $m$ different views, the group $\ell_{1}$-norm is defined as $\|{\bf W}\|_{G_{1}}=\sum_{j=1}^{c}\sum_{i=1}^{m}\|{\bf W}_{i}(:,j)\|$. Crucially, the group $\ell_{1}$-norm regularization term captures the global relations among different views and achieves view-wise sparsity, such that only a few views are selected. In addition to the group $\ell_{1}$-norm, an $\ell_{2,1}$-norm regularizer on ${\bf W}$ is included to achieve feature sparsity within the selected views. Hence, the objective function of the proposed method is formulated as follows:

\min_{{\bf W},{\bf F}}\|{\bf X}{\bf W}-{\bf F}\|_{F}^{2}+\alpha\|{\bf W}\|_{2,1}+\beta\|{\bf W}\|_{G_{1}} \quad \mbox{s.t.}\quad {\bf F}^{\prime}{\bf F}={\bf I}_{c},\; {\bf F}\geq 0, \qquad (55)

where $\alpha$ and $\beta$ control the intra-view (feature-level) sparsity and the inter-view (view-level) sparsity, respectively.
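To make the two regularizers in Eq. (55) concrete, the short Python sketch below evaluates the $\ell_{2,1}$-norm and the group $\ell_{1}$-norm on a stacked weight matrix; the per-view dimensions passed in are an assumption about how the views are concatenated.

```python
import numpy as np

def l21_norm(W):
    # sum of the l2 norms of the rows of W: encourages feature-wise (row) sparsity
    return np.sum(np.linalg.norm(W, axis=1))

def group_l1_norm(W, view_dims):
    # ||W||_{G1} = sum_j sum_i ||W_i(:, j)||, where W_i is the block of rows of W
    # that belongs to view i: encourages view-wise sparsity per pseudo class
    total, start = 0.0, 0
    for d_i in view_dims:
        W_i = W[start:start + d_i, :]
        total += np.sum(np.linalg.norm(W_i, axis=0))   # one l2 norm per column of W_i
        start += d_i
    return total

# usage sketch, assuming three views of sizes d1, d2, d3 stacked in W:
# penalty = alpha * l21_norm(W) + beta * group_l1_norm(W, view_dims=[d1, d2, d3])
```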

Discussion: Feature selection algorithms for heterogeneous data can handle various types of data simultaneously. By fusing multiple data sources together, the selected features capture the inherent characteristics of data and can better serve other learning tasks on such data. However, most algorithms in this family use matrices to represent the data and convert the feature selection problem into an optimization problem. The resulting optimization problem often requires complex matrix operations, which are computationally expensive and limit the scalability of these algorithms on large-scale data. How to design efficient and distributed algorithms to speed up the computation remains a fertile area that needs deeper investigation.

5 Feature Selection with Streaming Data

Previous methods assume that all data instances and features are known in advance. However, in many real-world applications we are more likely to be faced with data streams and feature streams. In the worst case, the size of the data or the feature set is unknown or even infinite, so it is not practical to wait until all data instances or features are available before performing feature selection. For streaming data, one motivating example is online spam email detection, where new emails continuously arrive; it is not easy to employ batch-mode feature selection methods to select relevant features in a timely manner. In an orthogonal setting, feature selection for streaming features also has practical significance. For example, Twitter produces more than 500 million tweets every day, and a large number of slang words (features) are continuously generated. These slang words promptly grab users’ attention and become popular in a short time. Therefore, it is preferable to perform streaming feature selection to adapt to such changes on the fly. There are also some attempts to study these two dual problems together, which is referred to as feature selection on trapezoidal data streams [Zhang et al. (2015)]. We review some representative algorithms for these two orthogonal problems below.

5.1 Feature Selection Algorithms with Feature Streams

In feature selection with streaming features, the number of instances is fixed while candidate features arrive one at a time; the task is to select a subset of relevant features from the features seen so far in a timely manner [Perkins and Theiler (2003), Zhou et al. (2005), Wu et al. (2010), Yu et al. (2014), Li et al. (2015b)]. At each time step, a typical streaming feature selection algorithm first determines whether to accept the most recently arrived feature; if the feature is added to the selected set, it then determines whether to discard some existing features. The process repeats until no new features arrive. Algorithms differ in how they implement the first step; the second step, which re-checks the existing features, is optional.
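The following Python skeleton captures this two-step template; `accept` and `prune` are placeholders for algorithm-specific tests (e.g., the gradient test of Grafting or the redundancy analysis of OSFS described below), not part of any particular method.

```python
def streaming_feature_selection(feature_stream, accept, prune=None):
    """Generic two-step template for streaming feature selection (a sketch).
    feature_stream yields one candidate feature at a time;
    accept(f, selected) decides whether to keep the newly arrived feature;
    prune(selected) optionally discards outdated features."""
    selected = []
    for f in feature_stream:
        if accept(f, selected):            # step 1: test the newly arrived feature
            selected.append(f)
            if prune is not None:          # step 2 (optional): re-check existing features
                selected = prune(selected)
    return selected
```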

5.1.1 Grafting Algorithm

The first attempt to perform streaming feature selection is credited to [Perkins and Theiler (2003)]. Their method is based on a stagewise gradient descent regularized risk framework [Perkins et al. (2003)]. Grafting is a general technique that can deal with a variety of models parameterized by a feature weight vector ${\bf w}$ subject to $\ell_{1}$-norm regularization, such as Lasso.

The basic idea of Grafting rests on the observation that incorporating a new feature into the Lasso model adds a new penalty term to the objective. For example, at time step $j$, when a new feature $f_{j}$ arrives, it incurs a regularization penalty of $\alpha|{\bf w}_{j}|$. Therefore, adding the new feature $f_{j}$ reduces the Lasso objective only when the reduction of the loss term $loss({\bf w};{\bf X},{\bf y})$ outweighs the increase of the $\ell_{1}$-norm regularization. Hence, the condition for accepting the new feature $f_{j}$ is $\left|\frac{\partial loss({\bf w};{\bf X},{\bf y})}{\partial{\bf w}_{j}}\right|>\alpha$; otherwise, Grafting sets the coefficient ${\bf w}_{j}$ of the new feature to zero. In the second step, when new features are accepted and included in the model, Grafting applies a conjugate gradient (CG) procedure to re-optimize the model with respect to all current parameters, which may exclude some outdated features.
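A minimal sketch of the acceptance test is given below for a squared loss $loss({\bf w};{\bf X},{\bf y})=\frac{1}{2n}\|{\bf X}{\bf w}-{\bf y}\|^{2}$; Grafting itself applies to any $\ell_{1}$-regularized differentiable loss, so the specific loss and names here are assumptions.

```python
import numpy as np

def grafting_accept(X_selected, w, y, f_new, alpha):
    """Gradient test of Grafting for a squared loss (illustrative sketch).
    X_selected: (n, k) currently selected features, w: (k,) current weights,
    y: (n,) targets, f_new: (n,) newly arrived feature column."""
    residual = X_selected @ w - y            # also works with k = 0 selected features
    grad_j = f_new @ residual / len(y)       # d loss / d w_j evaluated at w_j = 0
    return abs(grad_j) > alpha               # accept only if it outweighs the l1 penalty
```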

5.1.2 Alpha-investing Algorithm

Alpha-investing [Zhou et al. (2005)] is an adaptive complexity-penalty method that dynamically changes the threshold of error reduction required to accept a new feature. It is motivated by the desire to control the false discovery rate (FDR) of newly arrived features, such that a small portion of spurious features does not significantly affect the model’s accuracy. The algorithm works as follows: (1) it initializes the wealth $w_{0}$ (the acceptable probability of false positives), the feature index $i=0$, and an empty set of selected features; (2) when a new feature arrives, it sets the threshold $\alpha_{i}=w_{i}/2i$; (3) it sets $w_{i+1}=w_{i}-\alpha_{i}$ if $p\_value(f_{i},SF)\geq\alpha_{i}$, or sets $w_{i+1}=w_{i}+\alpha_{\Delta}-\alpha_{i}$ and $SF=SF\cup f_{i}$ if $p\_value(f_{i},SF)<\alpha_{i}$. The threshold $\alpha_{i}$ corresponds to the probability of selecting a spurious feature at time step $i$. It is adjusted by the wealth $w_{i}$, which denotes the acceptable number of false positively detected features at the current moment. The wealth $w_{i}$ increases when a feature is added to the model; otherwise it decreases, saving wealth for future features. More precisely, at each time step the method computes the $p$-value by using the fact that the $\Delta$log-likelihood is equivalent to a t-statistic. The $p$-value is the probability that a feature coefficient is set to nonzero when it should not be (a false positive). The basic idea of alpha-investing is to adaptively adjust the threshold so that, when new features are selected and included in the model, a higher chance of including incorrect features is allowed in the future; conversely, each time a new feature is not included, the wealth is spent, lowering the chance of admitting spurious features later.
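The wealth updates described above can be summarized in the following Python sketch; `p_value(f, selected)` stands in for the t-statistic-based $p$-value computation, and the initial wealth and pay-off values are illustrative assumptions rather than prescribed constants.

```python
def alpha_investing(feature_stream, p_value, w0=0.5, alpha_delta=0.5):
    """Sketch of the alpha-investing updates described above.
    p_value(f, selected) returns the p-value of feature f given the current
    selection; w0 is the initial wealth and alpha_delta the pay-off per
    accepted feature (both illustrative)."""
    wealth, selected = w0, []
    for i, f in enumerate(feature_stream, start=1):
        alpha_i = wealth / (2 * i)                 # threshold at time step i
        if p_value(f, selected) < alpha_i:         # feature reduces error enough
            selected.append(f)
            wealth += alpha_delta - alpha_i        # earn wealth back
        else:
            wealth -= alpha_i                      # wealth spent on the test
    return selected
```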

5.1.3 Online Streaming Feature Selection Algorithm

Other researchers study the streaming feature selection problem from an information-theoretic perspective [Wu et al. (2010)]. In their formulation, the whole feature set consists of four types of features: irrelevant, redundant, weakly relevant but non-redundant, and strongly relevant features. An optimal feature selection should keep the non-redundant and strongly relevant features. But as features continuously arrive in a streaming fashion, it is difficult to find all strongly relevant and non-redundant features at once. The proposed method, OSFS, captures these features via two steps: (1) online relevance analysis and (2) online redundancy analysis. In the online relevance analysis step, OSFS discovers weakly relevant and strongly relevant features and adds them to the best candidate feature set (BCF); if the newly arrived feature is not relevant to the class label, it is discarded and not considered in future steps. In the online redundancy analysis step, OSFS dynamically eliminates redundant features in the selected subset using Markov blankets: for each feature $f_{j}$ in $BCF$, if there exists a subset of $BCF$ that makes $f_{j}$ and the class label conditionally independent, then $f_{j}$ is removed from $BCF$.

5.1.4 Unsupervised Streaming Feature Selection in Social Media

The vast majority of streaming feature selection methods are supervised and utilize label information to guide feature selection. However, in social media it is easy to amass vast quantities of unlabeled data, while obtaining labels is time- and labor-consuming. To deal with large-scale unlabeled data in social media, the authors of [Li et al. (2015b)] propose the USFS algorithm to tackle unsupervised streaming feature selection. The key idea of USFS is to utilize source information such as link information. USFS first uncovers hidden social factors from link information with a mixed membership stochastic blockmodel [Airoldi et al. (2009)]. After obtaining the social latent factors ${\bf \Pi}\in\mathbb{R}^{n\times k}$ for each linked instance, USFS uses them as a constraint to perform feature selection. At a specific time step $t$, let ${\bf X}^{(t)}$ and ${\bf W}^{(t)}$ denote the corresponding feature matrix and feature coefficients, respectively. To model feature information, USFS constructs a graph $\mathcal{G}$ to represent feature similarity; ${\bf A}^{(t)}$ denotes the adjacency matrix of the graph and ${\bf L}^{(t)}$ is the corresponding Laplacian matrix derived from ${\bf X}^{(t)}$. The objective function for feature selection at time step $t$ is given as follows:

\min_{{\bf W}^{(t)}}\frac{1}{2}\|{\bf X}^{(t)}{\bf W}^{(t)}-{\bf \Pi}\|_{F}^{2}+\alpha\sum_{i=1}^{k}\|({\bf w}^{(t)})^{i}\|_{1}+\frac{\beta}{2}\|{\bf W}^{(t)}\|_{F}^{2}+\frac{\gamma}{2}\|({\bf X}^{(t)}{\bf W}^{(t)})^{\prime}({\bf L}^{(t)})^{\frac{1}{2}}\|_{F}^{2}, \qquad (56)

where $\alpha$ is a sparse regularization parameter, $\beta$ controls the robustness of the model, and $\gamma$ balances link information and feature information. Assume that at the next time step $t+1$ a new feature arrives. To test the new feature, USFS adopts a strategy similar to Grafting and performs a gradient test: if including the new feature reduces the objective function in Eq. (56), the feature is accepted; otherwise it is discarded. As new features are continuously generated, some existing features may become outdated; therefore, USFS also investigates whether it is necessary to remove any existing features by re-optimizing the model with a BFGS method [Boyd and Vandenberghe (2004)].

5.2 Feature Selection Algorithms with Data Streams

In this subsection, we review the problem of feature selection with data streams, which can be considered a dual problem of streaming feature selection.

5.2.1 Online Feature Selection

In [Wang et al. (2014b)], an online feature selection algorithm (OFS) for binary classification is proposed. Let $\{{\bf x}_{1},{\bf x}_{2},...,{\bf x}_{t},...\}$ and $\{y_{1},y_{2},...,y_{t},...\}$ denote a sequence of input data instances and their class labels, respectively, where each data instance ${\bf x}_{i}\in\mathbb{R}^{d}$ lies in a $d$-dimensional space and each class label $y_{i}\in\{-1,+1\}$. The task of OFS is to learn a linear classifier ${\bf w}^{(t)}\in\mathbb{R}^{d}$ that classifies each instance ${\bf x}_{i}$ by the linear function $sign({{\bf w}^{(t)}}^{\prime}{\bf x}_{i})$. To achieve feature selection, the linear classifier ${\bf w}^{(t)}$ is required to have at most $B$ nonzero elements, i.e., $\|{\bf w}^{(t)}\|_{0}\leq B$, so that at most $B$ features are used for classification. With a regularization parameter $\lambda$ and a step size $\eta$, OFS works as follows: (1) receive a new data instance ${\bf x}_{t}$ and its class label $y_{t}$; (2) predict the class label $sign({{\bf w}^{(t)}}^{\prime}{\bf x}_{t})$ for the new instance; (3) if ${\bf x}_{t}$ is misclassified, i.e., $y_{t}{{\bf w}^{(t)}}^{\prime}{\bf x}_{t}<0$, then $\tilde{{\bf w}}_{t+1}=(1-\lambda\eta){\bf w}_{t}+\eta y_{t}{\bf x}_{t}$, $\hat{{\bf w}}_{t+1}=\min\{1,\frac{1}{\sqrt{\lambda}\|\tilde{{\bf w}}_{t+1}\|_{2}}\}\tilde{{\bf w}}_{t+1}$, and ${\bf w}_{t+1}=Truncate(\hat{{\bf w}}_{t+1},B)$; (4) otherwise, ${\bf w}_{t+1}=(1-\lambda\eta){\bf w}_{t}$. In particular, each time a training instance ${\bf x}_{t}$ is misclassified, ${\bf w}_{t}$ is first updated by online gradient descent and then projected onto an $\ell_{2}$-norm ball to ensure that the classifier is bounded. After that, the new classifier $\hat{{\bf w}}_{t+1}$ is truncated by keeping its $B$ most important features, and this subset of $B$ features is returned at each time step. The process repeats until no new data instances arrive.
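The per-instance update of OFS translates directly into code; the sketch below follows the four steps listed above, with the regularization parameter and step size chosen arbitrarily for illustration.

```python
import numpy as np

def ofs_update(w, x_t, y_t, B, lam=0.01, eta=0.2):
    """One OFS step (sketch). w: current weights, (x_t, y_t): new instance and
    its label in {-1, +1}, B: maximum number of nonzero weights."""
    if y_t * (w @ x_t) < 0:                                   # misclassified
        w_tilde = (1 - lam * eta) * w + eta * y_t * x_t       # online gradient step
        # project onto the l2 ball of radius 1 / sqrt(lam)
        w_hat = min(1.0, 1.0 / (np.sqrt(lam) * np.linalg.norm(w_tilde))) * w_tilde
        # truncate: keep only the B entries with the largest magnitude
        w_new = np.zeros_like(w_hat)
        keep = np.argsort(np.abs(w_hat))[-B:]
        w_new[keep] = w_hat[keep]
        return w_new
    return (1 - lam * eta) * w                                # correctly classified: shrink only
```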

5.2.2 Unsupervised Feature Selection on Data Streams

To timely select a subset of relevant features when unlabeled data are continuously generated, [Huang et al. (2015)] propose an unsupervised feature selection method (FSDS) that requires only one pass over the data and limited storage. The basic idea of FSDS is to use matrix sketching to efficiently maintain a low-rank approximation of the data observed so far, and then apply regularized regression to obtain the feature coefficients, which in turn yield the importance of features. The authors empirically show that when certain orthogonality conditions are satisfied, ridge regression can replace Lasso for feature selection, which is more computationally efficient. Assume that at a specific time step $t$, ${\bf X}^{(t)}\in\mathbb{R}^{n_{t}\times d}$ denotes the data matrix seen so far; the feature coefficients can be obtained by minimizing the following:

\min_{{\bf W}^{(t)}}\|{\bf B}^{(t)}{\bf W}^{(t)}-[{\bf e}_{1},...,{\bf e}_{k}]\|_{F}^{2}+\alpha\|{\bf W}^{(t)}\|_{F}^{2}, \qquad (57)

where ${\bf B}^{(t)}\in\mathbb{R}^{\ell\times d}$ denotes the sketching matrix of ${\bf X}^{(t)}$ ($\ell\ll n_{t}$), and ${\bf e}_{i}\in\mathbb{R}^{\ell}$ is a vector whose $i$-th entry is 1 and all other entries are 0. After solving the optimization problem in Eq. (57), the importance of each feature is measured by $score(j)=\max_{i}|{\bf W}^{(t)}(j,i)|$: the higher the score, the more important the feature.
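The scoring step of Eq. (57) admits a closed-form ridge solution; the sketch below assumes the sketching matrix ${\bf B}^{(t)}$ is maintained elsewhere (e.g., by a frequent-directions style routine) and only shows how the feature scores are derived from it.

```python
import numpy as np

def fsds_feature_scores(B, k, alpha=1.0):
    """Sketch of the FSDS scoring step in Eq. (57).
    B: (l, d) sketch of the data seen so far, k: number of targets e_1..e_k,
    alpha: ridge regularization parameter."""
    l, d = B.shape
    E = np.eye(l)[:, :k]                              # targets [e_1, ..., e_k]
    # ridge regression: W = (B'B + alpha I)^{-1} B'E
    W = np.linalg.solve(B.T @ B + alpha * np.eye(d), B.T @ E)
    return np.max(np.abs(W), axis=1)                  # score(j) = max_i |W(j, i)|
```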

Discussion: As data are often not static but generated in a streaming fashion, feature selection algorithms for both feature and data streams are often more desirable in practice. Most existing algorithms in this family employ various strategies to speed up the selection process so that they can handle new data samples or new features upon arrival. However, it should be noted that many of these algorithms require multiple passes over the data, and some even need to store all historically generated data, which limits their usage when only limited memory or disk storage is available. Further effort is required to design streaming algorithms that are both effective and efficient under limited storage budgets.

6 Performance Evaluation

We first introduce our efforts in developing an open-source feature selection repository. Then we use algorithms included in the repository as an example to show how to evaluate different feature selection algorithms.

6.1 Feature Selection Repository

First, we introduce our attempt at developing a feature selection repository – scikit-feature. The purpose of this repository is to collect widely used feature selection algorithms developed in feature selection research, and to serve as a platform that facilitates their application, comparison, and joint study. The repository also helps researchers achieve more reliable evaluation when developing new feature selection algorithms.

We developed the open-source feature selection repository scikit-feature in one of the most popular programming languages – Python. It contains around 40 popular feature selection algorithms and is built upon the widely used machine learning package scikit-learn and two scientific computing packages, NumPy and SciPy. We also maintain a website (http://featureselection.asu.edu/) for this project, which offers several resources such as publicly available benchmark datasets, performance evaluations of algorithms, and test cases to run each algorithm. The source code of this repository is available on GitHub (https://github.com/jundongl/scikit-feature). An interactive tool for the repository is also available [Cheng et al. (2016)]. We welcome researchers in this community to contribute algorithms and datasets to the repository.

6.2 Evaluation Methods and Metrics

As an example, we empirically show how to evaluate the performance of the feature selection algorithms in the repository. The experimental results can be obtained from our repository project website (http://featureselection.asu.edu/datasets.php), which, for each dataset, lists all applicable feature selection algorithms along with their evaluation on either classification or clustering. Next, we provide detailed information on how these algorithms are evaluated, including the evaluation criteria and the experimental setup. Feature selection algorithms can be categorized by two criteria: (1) labels: supervised or unsupervised; (2) output: feature weighting or subset selection. The first criterion determines whether label information is needed to perform feature selection. The second criterion categorizes algorithms based on their output: feature weighting algorithms give each feature a score for ranking, whereas feature subset selection algorithms only indicate which features are selected.

Next, we introduce the widely adopted way to evaluate the performance of feature selection algorithms. We use different evaluation metrics for supervised and unsupervised methods, and different strategies for different output types: (1) for a feature weighting method that outputs feature scores, the quality of the top $\{5,10,15,...,295,300\}$ ranked features is evaluated, respectively; (2) for a feature subset selection method that only outputs which features are selected, all selected features are used in the evaluation.

Supervised Methods

To test the performance of supervised feature selection algorithms, we divide the whole dataset into two parts: the training set $\mathcal{T}$ and the test set $\mathcal{U}$. A feature selection algorithm is first applied to the training set $\mathcal{T}$ to obtain a subset of relevant features $\mathcal{S}$. Then the test set, restricted to the selected features, is fed to a classification model for testing. In the experiments, we use classification accuracy to evaluate classification performance with three classification models: Linear SVM, Decision Tree, and Naïve Bayes. To obtain more reliable results, 10-fold cross validation is used. Normally, the higher the classification performance, the better the selected features.
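A simplified sketch of this protocol with scikit-learn is shown below for a feature weighting method and a linear SVM; for brevity the ranking is computed once on the full data, whereas a stricter protocol would rank features inside each training fold. The `feature_scores` argument is a placeholder for the output of any feature weighting algorithm.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def evaluate_topk(X, y, feature_scores, k):
    """Keep the k top-ranked features and report 10-fold cross-validated
    accuracy of a linear SVM (illustrative sketch of the protocol above)."""
    top_k = np.argsort(feature_scores)[::-1][:k]
    return cross_val_score(LinearSVC(), X[:, top_k], y, cv=10).mean()

# usage sketch, sweeping the number of selected features as described above:
# accuracies = {k: evaluate_topk(X, y, scores, k) for k in range(5, 305, 5)}
```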

Unsupervised Methods

Following the standard way to assess unsupervised feature selection, we evaluate unsupervised algorithms in terms of clustering performance. Two commonly used clustering metrics [Cai et al. (2010)], normalized mutual information (NMI) and accuracy (ACC), are adopted. Each feature selection algorithm is first applied to select features; then K-means clustering is performed on the selected features. We repeat the K-means algorithm 20 times and report the average clustering results, since K-means may converge to a local optimum. The higher the clustering performance, the better the selected features.
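The corresponding clustering-based evaluation can be sketched as follows with scikit-learn; the NMI variant is shown, and the accuracy (ACC) metric, which requires an extra label alignment step, is omitted for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_quality(X_selected, true_labels, n_clusters, n_runs=20):
    """Run K-means on the selected features several times (K-means may reach
    a local optimum) and report the average NMI against ground-truth labels."""
    scores = []
    for seed in range(n_runs):
        pred = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X_selected)
        scores.append(normalized_mutual_info_score(true_labels, pred))
    return float(np.mean(scores))
```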

We also list the following information about the main algorithms reviewed in this paper in Table 2: (1) the type of data: conventional data or other types of data; (2) usage of labels: supervised or unsupervised (feature selection for regression can also be regarded as supervised; here we focus on feature selection for classification problems); (3) output: feature weighting or subset selection; (4) feature type: numerical or categorical variables (numerical variables can further be divided into continuous and discrete variables). For supervised feature selection methods, we also indicate whether they are designed for binary-class or multi-class classification problems. Based on this information, practitioners can get a more intuitive sense of the applicable scenarios of different methods.

Table 2: Detailed information of main feature selection algorithms reviewed in the paper.
Data | Methods | Supervision: Binary / Multi-class / Unsupervised | Output of Features: Ranking / Subset | Feature Type: Numerical (Continuous / Discrete) / Categorical
Conventional–Flat Features Fisher Score [Duda et al. (2012)]
ReliefF [Robnik-Šikonja and Kononenko (2003)]
Trace Ratio [Nie et al. (2008)]
Laplacian Score [He et al. (2005)]
SPEC [Zhao and Liu (2007)]
MIM [Lewis (1992)]
MIFS [Battiti (1994)]
MRMR [Peng et al. (2005)]
CIFE [Lin and Tang (2006)]
JMI [Meyer et al. (2008)]
CMIM [Fleuret (2004)]
IF [Vidal-Naquet and Ullman (2003)]
ICAP [Jakulin (2005)]
DISR [Meyer and Bontempi (2006)]
FCBF [Yu and Liu (2003)]
$\ell_{p}$-regularized [Liu et al. (2009b)]
$\ell_{p,q}$-regularized [Liu et al. (2009b)]
REFS [Nie et al. (2010)]
MCFS [Cai et al. (2010)]
UDFS [Yang et al. (2011)]
NDFS [Li et al. (2012)]
Low Variance [Pedregosa et al. (2011)]
T-score [Davis and Sampson (1986)]
Chi-square [Liu and Setiono (1995)]
Gini [Gini (1912)]
CFS [Hall and Smith (1999)]
Conventional – Structured Feature Group Lasso 
Sparse Group Lasso [Friedman et al. (2010)]
Tree Lasso [Liu and Ye (2010)]
Graph Lasso [Ye and Liu (2012)]
GFLasso [Kim and Xing (2009)]
GOSCAR [Yang et al. (2012)]
Linked Data FSNet [Gu and Han (2011)]
LinkedFS [Tang and Liu (2012a)]
LUFS [Tang and Liu (2012b)]
NetFS [Li et al. (2016b)]
Multi-Source GDCOV [Zhao and Liu (2008)]
Multi-View AUMFS [Feng et al. (2013)]
MVFS [Tang et al. (2013)]
Streaming Feature Grafting [Perkins and Theiler (2003)]
Alpha-Investing [Zhou et al. (2005)]
OSFS [Wu et al. (2010)]
USFS [Li et al. (2015b)]
Streaming Data OFS [Wang et al. (2014b)]
FSDS [Huang et al. (2015)]

7 Open Problems and Challenges

Over the past two decades, there have been a significant number of attempts at developing feature selection algorithms for both theoretical analysis and real-world applications. However, we believe there is still more work to be done in this field. Below we discuss several challenges and concerns that deserve attention.

7.1 Scalability

With the tremendous growth in the size of data, the scalability of most current feature selection algorithms may be jeopardized. In many scientific and business applications, data are usually measured in terabytes (1TB = $10^{12}$ bytes). Normally, datasets at the terabyte scale cannot be loaded into memory directly, which limits the usability of most feature selection algorithms. Currently, there are some attempts to use distributed programming frameworks to perform parallel feature selection for large-scale datasets [Singh et al. (2009), Zhao et al. (2013), Yamada et al. (2014), Zadeh et al. (2017)]. In addition, most existing feature selection methods have a time complexity proportional to $O(d^{2})$ or even $O(d^{3})$, where $d$ is the feature dimension. Recently, big data of ultrahigh dimensionality has emerged in many real-world applications such as text mining and information retrieval. Most feature selection algorithms do not scale well to ultrahigh-dimensional data; their efficiency deteriorates quickly or they become computationally infeasible. In this case, well-designed feature selection algorithms with linear or sublinear running time are preferred [Fan et al. (2009), Tan et al. (2014)]. Moreover, in online classification or online clustering tasks, the scalability of feature selection algorithms is also a big issue. For example, data streams or feature streams may be infinite and cannot be loaded into memory, so we can only make one pass over the data, where a second pass is either unavailable or computationally expensive. Even though feature selection algorithms can reduce the scalability burden of online classification or clustering, these methods either require keeping all features in memory or require iterative processes that visit data instances more than once, which limits their practical usage. In conclusion, even though there is some preliminary work on increasing the scalability of feature selection algorithms, we believe more focus should be given to the scalability problem to keep pace with the rapid growth of very large-scale and streaming data.

7.2 Stability

For supervised feature selection algorithms, performance is usually evaluated by classification accuracy. In addition to accuracy, the stability of these algorithms is also an important consideration when developing new feature selection algorithms. It is defined as the sensitivity of a feature selection algorithm to perturbations in the training data [Kalousis et al. (2007), He and Yu (2010), Saeys et al. (2008), Loscalzo et al. (2009), Yang and Mao (2011)]. The perturbation of data can take various forms, such as the addition or deletion of data samples and the inclusion of noisy or outlier samples. A more rigorous definition of the stability of feature selection algorithms can be found in [Kalousis et al. (2007)]. The stability of feature selection algorithms has significant practical implications, as it helps domain experts gain confidence in the selected features. A motivating example in bioinformatics is that domain experts would like to see the same, or a similar, set of genes (features) selected each time new data samples are obtained; otherwise, they will not trust these algorithms and may never use them again. It has been observed that many well-known feature selection algorithms suffer from low stability after small perturbations are introduced in the training set. It is also found in [Alelyani et al. (2011)] that the underlying characteristics of the data, such as the feature dimensionality and the number of data instances, may greatly affect the stability of feature selection algorithms, so the stability issue may also be data dependent. In contrast to supervised feature selection, the stability of unsupervised feature selection algorithms has not been well studied yet. Studying stability for unsupervised feature selection is much more difficult than for supervised methods because, without prior knowledge of the cluster structure of the data, we cannot tell whether a new data instance (the perturbation) belongs to an existing cluster or introduces a new one. In supervised feature selection, by contrast, we have prior knowledge of the label of each data instance, and a new sample that does not belong to any existing class is treated as an outlier, so the selected feature set does not need to be modified to adapt to it. In other words, unsupervised feature selection is more sensitive to noise, and the noise affects the stability of these algorithms.
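One simple way to quantify this notion empirically, among several measures used in the stability literature cited above, is the average pairwise Jaccard similarity of the feature subsets selected on perturbed training sets; the sketch below uses bootstrap resampling as the perturbation and treats `select_features` as a placeholder for any selection algorithm.

```python
import numpy as np
from itertools import combinations

def selection_stability(X, y, select_features, n_rounds=10, seed=0):
    """Sketch: average pairwise Jaccard similarity between the feature subsets
    selected on bootstrap resamples of the training data.
    select_features(X, y) is assumed to return a collection of feature indices."""
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X), size=len(X), replace=True)   # perturbed training set
        subsets.append(set(select_features(X[idx], y[idx])))
    jaccard = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
    return float(np.mean(jaccard))
```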

7.3 Model Selection

For most feature selection algorithms, especially feature weighting methods, we have to specify the number of selected features, yet the optimal number is often unknown. Selecting too many features increases the risk of including noisy, redundant and irrelevant features, which may jeopardize the learning performance; selecting too few may eliminate relevant features. In practice, we usually adopt a heuristic: grid search over the number of selected features and pick the number that gives the best classification or clustering performance, but this process is computationally expensive. Determining the optimal number of selected features is still an open and challenging problem. In addition, for unsupervised feature selection algorithms we also need to specify the number of clusters or pseudo classes. In real-world problems we usually have limited knowledge about the clustering structure of the data, and choosing different numbers of clusters may merge totally different small clusters into one big cluster or split one big cluster into smaller ones; as a consequence, totally different subsets of features may be selected. Some work has been done to estimate these parameters; for instance, [Tibshirani et al. (2001)] propose a principled way to estimate the suitable number of clusters in a dataset. However, it is still not clear how to find the best number of clusters directly for unsupervised feature selection. All in all, we believe that model selection is an important issue that needs deeper investigation.
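The grid-search heuristic mentioned above can be sketched as follows for a supervised feature weighting method; the classifier, grid, and cross-validation setting are illustrative choices rather than a recommended configuration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def pick_num_features(X, y, feature_scores, grid=range(5, 305, 5)):
    """Sketch: try several numbers of top-ranked features and keep the one
    with the best cross-validated accuracy (computationally expensive)."""
    ranked = np.argsort(feature_scores)[::-1]
    best_k, best_acc = None, -np.inf
    for k in grid:
        acc = cross_val_score(LinearSVC(), X[:, ranked[:k]], y, cv=5).mean()
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```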

8 Conclusion

Feature selection is effective in preprocessing data and reducing data dimensionality, and it is essential to successful data mining and machine learning applications. It has been a challenging research topic with practical significance in many areas such as statistics, pattern recognition, machine learning, and data mining (including web, text, image, and microarray data). The objectives of feature selection include building simpler and more comprehensible models, improving data mining performance, and helping prepare clean and understandable data. The past few years have witnessed the development of many new feature selection methods. This survey aims to provide a comprehensive review of recent advances in feature selection. We first introduce basic concepts of feature selection and emphasize the importance of applying feature selection algorithms to practical problems. We then classify conventional feature selection methods from the label perspective and the selection strategy perspective. As the existing categorizations cannot keep pace with the rapid development of feature selection research, especially in the era of big data, we propose to review recent advances in feature selection algorithms from a data perspective. In particular, we survey feature selection algorithms in four parts: (1) feature selection for conventional data with flat features; (2) feature selection with structured features; (3) feature selection with heterogeneous data; and (4) feature selection with streaming data. Specifically, we further classify feature selection algorithms for conventional data (flat features) into similarity based, information theoretical based, sparse learning based, statistical based, and other types of methods according to the techniques used. For feature selection with structured features, we consider three types of structured features, namely group, tree, and graph features. The third part studies feature selection with heterogeneous data, including feature selection with linked data and multi-source and multi-view feature selection. The last part covers feature selection algorithms for streaming data and streaming features. We analyze the advantages and shortcomings of these different types of feature selection algorithms. To facilitate research on feature selection, this survey is accompanied by a feature selection repository, scikit-feature, which includes some of the most popular feature selection algorithms developed in the past few decades. We also give suggestions on how to evaluate these feature selection algorithms, whether supervised or unsupervised. At the end of the survey, we present some open problems that require future research. It should also be mentioned that the aim of this survey is not to claim the superiority of some feature selection algorithms over others, but to provide a comprehensively structured list of recent advances in feature selection from a data perspective, together with a feature selection repository, to promote research in this community.

References

  • Abeel et al. (2010) Thomas Abeel, Thibault Helleputte, Yves Van de Peer, Pierre Dupont, and Yvan Saeys. 2010. Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics 26, 3 (2010), 392–398.
  • Airoldi et al. (2009) Edoardo M Airoldi, David M Blei, Stephen E Fienberg, and Eric P Xing. 2009. Mixed membership stochastic blockmodels. In NIPS. 33–40.
  • Alelyani et al. (2011) Salem Alelyani, Huan Liu, and Lei Wang. 2011. The effect of the characteristics of the dataset on the selection stability. In ICTAI. 970–977.
  • Alelyani et al. (2013) Salem Alelyani, Jiliang Tang, and Huan Liu. 2013. Feature selection for clustering: a review. Data Clustering: Algorithms and Applications 29 (2013).
  • Ang et al. (2016) Jun Chin Ang, Andri Mirzal, Habibollah Haron, and Haza Nuzly Abdull Hamed. 2016. Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection. IEEE/ACM TCBB 13, 5 (2016), 971–989.
  • Arai et al. (2016) Hiromasa Arai, Crystal Maung, Ke Xu, and Haim Schweitzer. 2016. Unsupervised feature selection by heuristic search with provable bounds on suboptimality. In AAAI. 666–672.
  • Bach (2008) Francis R Bach. 2008. Consistency of the group lasso and multiple kernel learning. JMLR 9 (2008), 1179–1225.
  • Backstrom and Leskovec (2011) Lars Backstrom and Jure Leskovec. 2011. Supervised random walks: predicting and recommending links in social networks. In WSDM. 635–644.
  • Battiti (1994) Roberto Battiti. 1994. Using mutual information for selecting features in supervised neural net learning. IEEE TNN 5, 4 (1994), 537–550.
  • Bilgic et al. (2010) Mustafa Bilgic, Lilyana Mihalkova, and Lise Getoor. 2010. Active learning for networked data. In ICML. 79–86.
  • Boyd and Vandenberghe (2004) Stephen Boyd and Lieven Vandenberghe. 2004. Convex optimization. Cambridge university press.
  • Brown et al. (2012) Gavin Brown, Adam Pocock, Ming-Jie Zhao, and Mikel Luján. 2012. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. JMLR 13, 1 (2012), 27–66.
  • Cai et al. (2010) Deng Cai, Chiyuan Zhang, and Xiaofei He. 2010. Unsupervised feature selection for multi-cluster data. In KDD. 333–342.
  • Cai et al. (2013) Xiao Cai, Feiping Nie, and Heng Huang. 2013. Exact top-k feature selection via $\ell_{2,0}$-norm constraint. In IJCAI. 1240–1246.
  • Chandrashekar and Sahin (2014) Girish Chandrashekar and Ferat Sahin. 2014. A survey on feature selection methods. Computers & Electrical Engineering 40, 1 (2014), 16–28.
  • Chang et al. (2014) Xiaojun Chang, Feiping Nie, Yi Yang, and Heng Huang. 2014. A convex formulation for semi-supervised multi-label feature selection. In AAAI. 1171–1177.
  • Chen et al. (2016) Chen Chen, Hanghang Tong, Lei Xie, Lei Ying, and Qing He. 2016. FASCINATE: fast cross-layer dependency inference on multi-layered networks. In KDD. 765–774.
  • Cheng et al. (2016) Kewei Cheng, Jundong Li, and Huan Liu. 2016. FeatureMiner: a tool for interactive feature selection. In CIKM. 2445–2448.
  • Cheng et al. (2017) Kewei Cheng, Jundong Li, and Huan Liu. 2017. Unsupervised feature selection in signed social networks. In KDD. 777–786.
  • d’Aspremont et al. (2007) Alexandre d’Aspremont, Laurent El Ghaoui, Michael I Jordan, and Gert RG Lanckriet. 2007. A direct formulation for sparse PCA using semidefinite programming. SIAM Rev. 49, 3 (2007), 434–448.
  • Davis and Sampson (1986) John C Davis and Robert J Sampson. 1986. Statistics and data analysis in geology. Vol. 646. Wiley New York et al.
  • Ding et al. (2006) Chris Ding, Ding Zhou, Xiaofeng He, and Hongyuan Zha. 2006. R1-PCA: rotational invariant $\ell_{1}$-norm principal component analysis for robust subspace factorization. In ICML. 281–288.
  • Du and Shen (2015) Liang Du and Yi-Dong Shen. 2015. Unsupervised feature selection with adaptive structure learning. In KDD. 209–218.
  • Du et al. (2013) Liang Du, Zhiyong Shen, Xuan Li, Peng Zhou, and Yi-Dong Shen. 2013. Local and global discriminative learning for unsupervised feature selection. In ICDM. 131–140.
  • Duda et al. (2012) Richard O Duda, Peter E Hart, and David G Stork. 2012. Pattern classification. John Wiley & Sons.
  • Dutkowski and Gambin (2007) Janusz Dutkowski and Anna Gambin. 2007. On consensus biomarker selection. BMC bioinformatics 8, 5 (2007), S5.
  • El Akadi et al. (2008) Ali El Akadi, Abdeljalil El Ouardighi, and Driss Aboutajdine. 2008. A powerful feature selection approach based on mutual information. International Journal of Computer Science and Network Security 8, 4 (2008), 116.
  • Fan et al. (2009) Jianqing Fan, Richard Samworth, and Yichao Wu. 2009. Ultrahigh dimensional feature selection: beyond the linear model. JMLR 10 (2009), 2013–2038.
  • Farahat et al. (2011) Ahmed K Farahat, Ali Ghodsi, and Mohamed S Kamel. 2011. An efficient greedy method for unsupervised feature selection. In ICDM. 161–170.
  • Fellbaum (1998) Christiane Fellbaum. 1998. WordNet. Wiley Online Library.
  • Feng et al. (2013) Yinfu Feng, Jun Xiao, Yueting Zhuang, and Xiaoming Liu. 2013. Adaptive unsupervised multi-view feature selection for visual concept recognition. In ACCV. 343–357.
  • Fleuret (2004) François Fleuret. 2004. Fast binary feature selection with conditional mutual information. JMLR 5 (2004), 1531–1555.
  • Friedman et al. (2010) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2010. A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736 (2010).
  • Fukunaga (2013) Keinosuke Fukunaga. 2013. Introduction to statistical pattern recognition. Academic Press.
  • Gao et al. (2016) Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. 2016. Variational information maximization for feature selection. In NIPS. 487–495.
  • Gini (1912) CW Gini. 1912. Variability and mutability, contribution to the study of statistical distribution and relaitons. Studi Economico-Giuricici Della R (1912).
  • Golberg (1989) David E Golberg. 1989. Genetic algorithms in search, optimization, and machine learning. Addion Wesley 1989 (1989).
  • Gu et al. (2012) Quanquan Gu, Marina Danilevsky, Zhenhui Li, and Jiawei Han. 2012. Locality preserving feature learning. In AISTATS. 477–485.
  • Gu and Han (2011) Quanquan Gu and Jiawei Han. 2011. Towards feature selection in network. In CIKM. 1175–1184.
  • Gu et al. (2011a) Quanquan Gu, Zhenhui Li, and Jiawei Han. 2011a. Correlated multi-label feature selection. In CIKM. ACM, 1087–1096.
  • Gu et al. (2011b) Quanquan Gu, Zhenhui Li, and Jiawei Han. 2011b. Generalized fisher score for feature selection. In UAI. 266–273.
  • Gu et al. (2011c) Quanquan Gu, Zhenhui Li, and Jiawei Han. 2011c. Joint feature selection and subspace learning. In IJCAI. 1294–1299.
  • Guo and Nixon (2009) Baofeng Guo and Mark S Nixon. 2009. Gait feature subset selection by mutual information. IEEE TMSC(A) 39, 1 (2009), 36–46.
  • Guyon and Elisseeff (2003) Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection. JMLR 3 (2003), 1157–1182.
  • Guyon et al. (2008) Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lofti A Zadeh. 2008. Feature extraction: foundations and applications. Springer.
  • Hall and Smith (1999) Mark A Hall and Lloyd A Smith. 1999. Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. In FLAIRS. 235–239.
  • Hara and Maehara (2017) Satoshi Hara and Takanori Maehara. 2017. Enumerate lasso solutions for feature selection. In AAAI. 1985–1991.
  • Hastie et al. (2005) Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. 2005. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer 27, 2 (2005), 83–85.
  • He et al. (2005) Xiaofei He, Deng Cai, and Partha Niyogi. 2005. Laplacian score for feature selection. In NIPS. 507–514.
  • He and Yu (2010) Zengyou He and Weichuan Yu. 2010. Stable feature selection for biomarker discovery. Computational Biology and Chemistry 34, 4 (2010), 215–225.
  • Hou et al. (2011) Chenping Hou, Feiping Nie, Dongyun Yi, and Yi Wu. 2011. Feature selection via joint embedding learning and sparse regression. In IJCAI. 1324–1329.
  • Hu et al. (2013) Xia Hu, Jiliang Tang, Huiji Gao, and Huan Liu. 2013. ActNeT: active learning for networked texts in microblogging. In SDM. 306–314.
  • Huang et al. (2015) Hao Huang, Shinjae Yoo, and S Kasiviswanathan. 2015. Unsupervised feature selection on data streams. In CIKM. 1031–1040.
  • Huang et al. (2011) Junzhou Huang, Tong Zhang, and Dimitris Metaxas. 2011. Learning with structured sparsity. JMLR 12 (2011), 3371–3412.
  • Jacob et al. (2009) Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. 2009. Group lasso with overlap and graph lasso. In ICML. 433–440.
  • Jakulin (2005) Aleks Jakulin. 2005. Machine learning based on attribute interactions. Ph.D. Dissertation. Univerza v Ljubljani.
  • Jenatton et al. (2011) Rodolphe Jenatton, Jean-Yves Audibert, and Francis Bach. 2011. Structured variable selection with sparsity-inducing norms. JMLR 12 (2011), 2777–2824.
  • Jenatton et al. (2010) Rodolphe Jenatton, Julien Mairal, Francis R Bach, and Guillaume R Obozinski. 2010. Proximal methods for sparse hierarchical dictionary learning. In ICML. 487–494.
  • Jian et al. (2016) Ling Jian, Jundong Li, Kai Shu, and Huan Liu. 2016. Multi-label informed feature selection. In IJCAI. 1627–1633.
  • Jiang and Ren (2011) Yi Jiang and Jiangtao Ren. 2011. Eigenvalue sensitive feature selection. In ICML. 89–96.
  • Kalousis et al. (2007) Alexandros Kalousis, Julien Prados, and Melanie Hilario. 2007. Stability of feature selection algorithms: a study on high-dimensional spaces. KAIS 12, 1 (2007), 95–116.
  • Kim and Xing (2009) Seyoung Kim and Eric P Xing. 2009. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet 5, 8 (2009).
  • Kim and Xing (2010) Seyoung Kim and Eric P Xing. 2010. Tree-Guided Group Lasso for Multi-Task Regression with Structured Sparsity. In ICML. 543–550.
  • Kira and Rendell (1992) Kenji Kira and Larry A Rendell. 1992. A practical approach to feature selection. In ICML Workshop. 249–256.
  • Kohavi and John (1997) Ron Kohavi and George H John. 1997. Wrappers for feature subset selection. Artificial Intelligence 97, 1 (1997), 273–324.
  • Koller and Sahami (1995) Daphne Koller and Mehran Sahami. 1995. Toward optimal feature selection. In ICML. 284–292.
  • Lanckriet et al. (2004) Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. 2004. Learning the kernel matrix with semidefinite programming. JMLR 5 (2004), 27–72.
  • Lewis (1992) David D Lewis. 1992. Feature selection and feature extraction for text categorization. In Proceedings of the Workshop on Speech and Natural Language. 212–217.
  • Li et al. (2017a) Jundong Li, Harsh Dani, Xia Hu, and Huan Liu. 2017a. Radar: Residual Analysis for Anomaly Detection in Attributed Networks. In IJCAI. 2152–2158.
  • Li et al. (2016a) Jundong Li, Xia Hu, Ling Jian, and Huan Liu. 2016a. Toward time-evolving feature selection on dynamic networks. In ICDM. 1003–1008.
  • Li et al. (2015b) Jundong Li, Xia Hu, Jiliang Tang, and Huan Liu. 2015b. Unsupervised streaming feature selection in social media. In CIKM. 1041–1050.
  • Li et al. (2016b) Jundong Li, Xia Hu, Liang Wu, and Huan Liu. 2016b. Robust unsupervised feature selection on networked data. In SDM. 387–395.
  • Li and Liu (2017) Jundong Li and Huan Liu. 2017. Challenges of feature selection for big data analytics. IEEE Intelligent Systems 32, 2 (2017), 9–15.
  • Li et al. (2017b) Jundong Li, Jiliang Tang, and Huan Liu. 2017b. Reconstruction-based unsupervised feature selection: an embedded approach. In IJCAI. 2159–2165.
  • Li et al. (2017c) Jundong Li, Liang Wu, Osmar R Zaïane, and Huan Liu. 2017c. Toward personalized relational learning. In SDM. 444–452.
  • Li et al. (2015a) Yifeng Li, Chih-Yu Chen, and Wyeth W Wasserman. 2015a. Deep feature selection: theory and application to identify enhancers and promoters. In RECOMB. 205–217.
  • Li et al. (2012) Zechao Li, Yi Yang, Jing Liu, Xiaofang Zhou, and Hanqing Lu. 2012. Unsupervised feature selection using nonnegative spectral analysis. In AAAI. 1026–1032.
  • Liben-Nowell and Kleinberg (2007) David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. JASIST 58, 7 (2007), 1019–1031.
  • Lin and Tang (2006) Dahua Lin and Xiaoou Tang. 2006. Conditional infomax learning: an integrated framework for feature extraction and fusion. In ECCV. 68–82.
  • Liu et al. (2016a) Hongfu Liu, Haiyi Mao, and Yun Fu. 2016a. Robust multi-view feature selection. In ICDM. 281–290.
  • Liu and Motoda (2007) Huan Liu and Hiroshi Motoda. 2007. Computational methods of feature selection. CRC Press.
  • Liu and Setiono (1995) Huan Liu and Rudy Setiono. 1995. Chi2: Feature selection and discretization of numeric attributes. In ICTAI. 388–391.
  • Liu et al. (2016b) Hongfu Liu, Ming Shao, and Yun Fu. 2016b. Consensus guided unsupervised feature selection. In AAAI. 1874–1880.
  • Liu et al. (2009a) Jun Liu, Shuiwang Ji, and Jieping Ye. 2009a. Multi-task feature learning via efficient $\ell_{2,1}$-norm minimization. In UAI. 339–348.
  • Liu et al. (2009b) Jun Liu, Shuiwang Ji, and Jieping Ye. 2009b. SLEP: sparse learning with efficient projections. Arizona State University. http://www.public.asu.edu/~jye02/Software/SLEP
  • Liu and Ye (2010) Jun Liu and Jieping Ye. 2010. Moreau-Yosida regularization for grouped tree structure learning. In NIPS. 1459–1467.
  • Liu et al. (2014) Xinwang Liu, Lei Wang, Jian Zhang, Jianping Yin, and Huan Liu. 2014. Global and local structure preservation for feature selection. TNNLS 25, 6 (2014), 1083–1095.
  • Long et al. (2006) Bo Long, Zhongfei Mark Zhang, Xiaoyun Wu, and Philip S Yu. 2006. Spectral clustering for multi-type relational data. In ICML. 585–592.
  • Long et al. (2007) Bo Long, Zhongfei Mark Zhang, and Philip S Yu. 2007. A probabilistic framework for relational clustering. In KDD. 470–479.
  • Loscalzo et al. (2009) Steven Loscalzo, Lei Yu, and Chris Ding. 2009. Consensus group stable feature selection. In KDD. 567–576.
  • Ma et al. (2007) Shuangge Ma, Xiao Song, and Jian Huang. 2007. Supervised group Lasso with applications to microarray data analysis. BMC Bioinformatics 8, 1 (2007), 60.
  • Macskassy and Provost (2007) Sofus A Macskassy and Foster Provost. 2007. Classification in networked data: a toolkit and a univariate case study. JMLR 8 (2007), 935–983.
  • Marsden and Friedkin (1993) Peter V Marsden and Noah E Friedkin. 1993. Network studies of social influence. Sociological Methods & Research 22, 1 (1993), 127–151.
  • Masaeli et al. (2010) Mahdokht Masaeli, Yan Yan, Ying Cui, Glenn Fung, and Jennifer G Dy. 2010. Convex principal feature selection. In SDM. 619–628.
  • Maung and Schweitzer (2013) Crystal Maung and Haim Schweitzer. 2013. Pass-efficient unsupervised feature selection. In NIPS. 1628–1636.
  • McAuley et al. (2005) James McAuley, Ji Ming, Darryl Stewart, and Philip Hanna. 2005. Subband correlation and robust speech recognition. IEEE TSAP 13, 5 (2005), 956–964.
  • McPherson et al. (2001) Miller McPherson, Lynn Smith-Lovin, and James M Cook. 2001. Birds of a feather: Homophily in social networks. Annual Review of Sociology (2001), 415–444.
  • Meier et al. (2008) Lukas Meier, Sara Van De Geer, and Peter Bühlmann. 2008. The group lasso for logistic regression. JRSS(B) 70, 1 (2008), 53–71.
  • Meyer and Bontempi (2006) Patrick E Meyer and Gianluca Bontempi. 2006. On the use of variable complementarity for feature selection in cancer classification. In Applications of Evolutionary Computing. 91–102.
  • Meyer et al. (2008) Patrick Emmanuel Meyer, Colas Schretter, and Gianluca Bontempi. 2008. Information-theoretic feature selection in microarray data using variable complementarity. IEEE Journal of Selected Topics in Signal Processing 2, 3 (2008), 261–274.
  • Narendra and Fukunaga (1977) Patrenahalli M Narendra and Keinosuke Fukunaga. 1977. A branch and bound algorithm for feature subset selection. IEEE Trans. Comput. 100, 9 (1977), 917–922.
  • Netzer et al. (2009) Michael Netzer, Gunda Millonig, Melanie Osl, Bernhard Pfeifer, Siegfried Praun, Johannes Villinger, Wolfgang Vogel, and Christian Baumgartner. 2009. A new ensemble-based algorithm for identifying breath gas marker candidates in liver disease using ion molecule reaction mass spectrometry. Bioinformatics 25, 7 (2009), 941–947.
  • Nguyen et al. (2014) Xuan Vinh Nguyen, Jeffrey Chan, Simone Romano, and James Bailey. 2014. Effective global approaches for mutual information based feature selection. In KDD. 512–521.
  • Nie et al. (2010) Feiping Nie, Heng Huang, Xiao Cai, and Chris H Ding. 2010. Efficient and robust feature selection via joint $\ell_{2,1}$-norms minimization. In NIPS. 1813–1821.
  • Nie et al. (2008) Feiping Nie, Shiming Xiang, Yangqing Jia, Changshui Zhang, and Shuicheng Yan. 2008. Trace ratio criterion for feature selection. In AAAI. 671–676.
  • Nie et al. (2016) Feiping Nie, Wei Zhu, Xuelong Li, et al. 2016. Unsupervised feature selection with structured graph optimization. In AAAI. 1302–1308.
  • Obozinski et al. (2007) Guillaume Obozinski, Ben Taskar, and Michael Jordan. 2007. Joint covariate selection for grouped classification. Technical Report, Statistics Department, UC Berkeley.
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. JMLR 12 (2011), 2825–2830.
  • Peng and Fan (2016) Hanyang Peng and Yong Fan. 2016. Direct sparsity optimization based feature selection for multi-class classification. In IJCAI. 1918–1924.
  • Peng and Fan (2017) Hanyang Peng and Yong Fan. 2017. A general framework for sparsity regularized feature selection via iteratively reweighted least square minimization. In AAAI. 2471–2477.
  • Peng et al. (2005) Hanchuan Peng, Fuhui Long, and Chris Ding. 2005. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE TPAMI 27, 8 (2005), 1226–1238.
  • Peng et al. (2010) Jie Peng, Ji Zhu, Anna Bergamaschi, Wonshik Han, Dong-Young Noh, Jonathan R Pollack, and Pei Wang. 2010. Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. The Annals of Applied Statistics 4, 1 (2010), 53.
  • Perkins et al. (2003) Simon Perkins, Kevin Lacker, and James Theiler. 2003. Grafting: Fast, incremental feature selection by gradient descent in function space. JMLR 3 (2003), 1333–1356.
  • Perkins and Theiler (2003) Simon Perkins and James Theiler. 2003. Online feature selection using grafting. In ICML. 592–599.
  • Qian and Zhai (2013) Mingjie Qian and Chengxiang Zhai. 2013. Robust unsupervised feature selection. In IJCAI. 1621–1627.
  • Quattoni et al. (2009) Ariadna Quattoni, Xavier Carreras, Michael Collins, and Trevor Darrell. 2009. An efficient projection for $\ell_{1,\infty}$ regularization. In ICML. 857–864.
  • Robnik-Šikonja and Kononenko (2003) Marko Robnik-Šikonja and Igor Kononenko. 2003. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning 53, 1-2 (2003), 23–69.
  • Roy et al. (2015) Debaditya Roy, K Sri Rama Murty, and C Krishna Mohan. 2015. Feature selection using deep neural networks. In IJCNN. 1–6.
  • Saeys et al. (2008) Yvan Saeys, Thomas Abeel, and Yves Van de Peer. 2008. Robust feature selection using ensemble feature selection techniques. In ECML PKDD. 313–325.
  • Saeys et al. (2007) Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. 2007. A review of feature selection techniques in bioinformatics. Bioinformatics 23, 19 (2007), 2507–2517.
  • Sandler et al. (2009) Ted Sandler, John Blitzer, Partha P Talukdar, and Lyle H Ungar. 2009. Regularized learning with networks of features. In NIPS. 1401–1408.
  • Sen et al. (2008) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. 2008. Collective classification in network data. AI Magazine 29, 3 (2008), 93.
  • Shen et al. (2012) Qiang Shen, Ren Diao, and Pan Su. 2012. Feature selection ensemble. Turing-100 10 (2012), 289–306.
  • Shi and Malik (2000) Jianbo Shi and Jitendra Malik. 2000. Normalized cuts and image segmentation. IEEE TPAMI 22, 8 (2000), 888–905.
  • Shi et al. (2014) Lei Shi, Liang Du, and Yi-Dong Shen. 2014. Robust spectral learning for unsupervised feature selection. In ICDM. 977–982.
  • Shishkin et al. (2016) Alexander Shishkin, Anastasia Bezzubtseva, Alexey Drutsa, Ilia Shishkov, Ekaterina Gladkikh, Gleb Gusev, and Pavel Serdyukov. 2016. Efficient high-order interaction-aware feature selection based on conditional mutual information. In NIPS. 4637–4645.
  • Singh et al. (2009) Sameer Singh, Jeremy Kubica, Scott Larsen, and Daria Sorokina. 2009. Parallel large scale feature selection for logistic regression. In SDM. 1172–1183.
  • Tan et al. (2014) Mingkui Tan, Ivor W Tsang, and Li Wang. 2014. Towards ultrahigh dimensional feature selection for big data. JMLR 15, 1 (2014), 1371–1429.
  • Tang et al. (2014a) Jiliang Tang, Salem Alelyani, and Huan Liu. 2014a. Feature selection for classification: a review. Data Classification: Algorithms and Applications (2014), 37.
  • Tang et al. (2013) Jiliang Tang, Xia Hu, Huiji Gao, and Huan Liu. 2013. Unsupervised feature selection for multi-view data in social media. In SDM. 270–278.
  • Tang et al. (2014b) Jiliang Tang, Xia Hu, Huiji Gao, and Huan Liu. 2014b. Discriminant analysis for unsupervised feature selection. In SDM. 938–946.
  • Tang and Liu (2012a) Jiliang Tang and Huan Liu. 2012a. Feature selection with linked data in social media. In SDM. 118–128.
  • Tang and Liu (2012b) Jiliang Tang and Huan Liu. 2012b. Unsupervised feature selection for linked social media data. In KDD. 904–912.
  • Tang and Liu (2013) Jiliang Tang and Huan Liu. 2013. Coselect: Feature selection with instance selection for social media data. In SDM. 695–703.
  • Tang and Liu (2009) Lei Tang and Huan Liu. 2009. Relational learning via latent social dimensions. In KDD. 817–826.
  • Tibshirani (1996) Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (1996), 267–288.
  • Tibshirani et al. (2005) Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. 2005. Sparsity and smoothness via the fused lasso. JRSS(B) 67, 1 (2005), 91–108.
  • Tibshirani et al. (2001) Robert Tibshirani, Guenther Walther, and Trevor Hastie. 2001. Estimating the number of clusters in a data set via the gap statistic. JRSS(B) 63, 2 (2001), 411–423.
  • Vetterling et al. (1992) William T Vetterling, Saul A Teukolsky, and William H Press. 1992. Numerical recipes: example book (C). Press Syndicate of the University of Cambridge.
  • Vidal-Naquet and Ullman (2003) Michel Vidal-Naquet and Shimon Ullman. 2003. Object recognition with informative features and linear classification. In ICCV. 281–288.
  • Wang et al. (2013) Hua Wang, Feiping Nie, and Heng Huang. 2013. Multi-view clustering and feature learning via structured sparsity. In ICML. 352–360.
  • Wang et al. (2007) Huan Wang, Shuicheng Yan, Dong Xu, Xiaoou Tang, and Thomas Huang. 2007. Trace ratio vs. ratio trace for dimensionality reduction. In CVPR. 1–8.
  • Wang and Ye (2015) Jie Wang and Jieping Ye. 2015. Multi-layer feature reduction for tree structured group lasso via hierarchical projection. In NIPS. 1279–1287.
  • Wang et al. (2014b) Jialei Wang, Peilin Zhao, Steven CH Hoi, and Rong Jin. 2014b. Online feature selection and its applications. IEEE TKDE 26, 3 (2014), 698–710.
  • Wang et al. (2014a) Qian Wang, Jiaxing Zhang, Sen Song, and Zheng Zhang. 2014a. Attentional neural network: Feature selection using cognitive feedback. In NIPS. 2033–2041.
  • Wei et al. (2016a) Xiaokai Wei, Bokai Cao, and Philip S Yu. 2016a. Nonlinear joint unsupervised feature selection. In SDM. 414–422.
  • Wei et al. (2016b) Xiaokai Wei, Bokai Cao, and Philip S Yu. 2016b. Unsupervised feature selection on networks: a generative view. In AAAI. 2215–2221.
  • Wei et al. (2015) Xiaokai Wei, Sihong Xie, and Philip S Yu. 2015. Efficient partial order preserving unsupervised feature selection on networks. In SDM. 82–90.
  • Wei and Yu (2016) Xiaokai Wei and Philip S Yu. 2016. Unsupervised feature selection by preserving stochastic neighbors. In AISTATS. 995–1003.
  • Wu et al. (2017) Liang Wu, Jundong Li, Xia Hu, and Huan Liu. 2017. Gleaning wisdom from the past: Early detection of emerging rumors in social media. In SDM. 99–107.
  • Wu et al. (2010) Xindong Wu, Kui Yu, Hao Wang, and Wei Ding. 2010. Online streaming feature selection. In ICML. 1159–1166.
  • Xu et al. (2014) Zhixiang Xu, Gao Huang, Kilian Q Weinberger, and Alice X Zheng. 2014. Gradient boosted feature selection. In KDD. 522–531.
  • Yamada et al. (2014) Makoto Yamada, Avishek Saha, Hua Ouyang, Dawei Yin, and Yi Chang. 2014. N³LARS: minimum redundancy maximum relevance feature selection for large and high-dimensional data. arXiv preprint arXiv:1411.2331 (2014).
  • Yang and Mao (2011) Feng Yang and KZ Mao. 2011. Robust feature selection for microarray data based on multicriterion fusion. IEEE/ACM TCBB 8, 4 (2011), 1080–1092.
  • Yang and Moody (1999) Howard Hua Yang and John E Moody. 1999. Data visualization and feature selection: new algorithms for nongaussian data. In NIPS. 687–693.
  • Yang et al. (2012) Sen Yang, Lei Yuan, Ying-Cheng Lai, Xiaotong Shen, Peter Wonka, and Jieping Ye. 2012. Feature grouping and selection over an undirected graph. In KDD. 922–930.
  • Yang et al. (2011) Yi Yang, Heng Tao Shen, Zhigang Ma, Zi Huang, and Xiaofang Zhou. 2011. $\ell_{2,1}$-norm regularized discriminative feature selection for unsupervised learning. In IJCAI. 1589–1594.
  • Yang et al. (2010) Yi Yang, Dong Xu, Feiping Nie, Shuicheng Yan, and Yueting Zhuang. 2010. Image clustering using local discriminant models and global integration. IEEE TIP 19, 10 (2010), 2761–2773.
  • Yang et al. (2005) Yee Hwa Yang, Yuanyuan Xiao, and Mark R Segal. 2005. Identifying differentially expressed genes from microarray experiments via statistic synthesis. Bioinformatics 21, 7 (2005), 1084–1093.
  • Ye and Liu (2012) Jieping Ye and Jun Liu. 2012. Sparse methods for biomedical data. ACM SIGKDD Explorations Newsletter 14, 1 (2012), 4–15.
  • Yu et al. (2014) Kui Yu, Xindong Wu, Wei Ding, and Jian Pei. 2014. Towards scalable and accurate online feature selection for big data. In ICDM. 660–669.
  • Yu and Liu (2003) Lei Yu and Huan Liu. 2003. Feature selection for high-dimensional data: a fast correlation-based filter solution. In ICML. 856–863.
  • Yu and Shi (2003) Stella X Yu and Jianbo Shi. 2003. Multiclass spectral clustering. In ICCV. 313–319.
  • Yuan et al. (2011) Lei Yuan, Jun Liu, and Jieping Ye. 2011. Efficient methods for overlapping group lasso. In NIPS. 352–360.
  • Yuan and Lin (2006) Ming Yuan and Yi Lin. 2006. Model selection and estimation in regression with grouped variables. JRSS(B) 68, 1 (2006), 49–67.
  • Zadeh et al. (2017) Sepehr Abbasi Zadeh, Mehrdad Ghadiri, Vahab S Mirrokni, and Morteza Zadimoghaddam. 2017. Scalable feature selection via distributed diversity maximization. In AAAI. 2876–2883.
  • Zhang et al. (2008) Jian Zhang, Zoubin Ghahramani, and Yiming Yang. 2008. Flexible latent variable models for multi-task learning. Machine Learning 73, 3 (2008), 221–242.
  • Zhang et al. (2014) Miao Zhang, Chris HQ Ding, Ya Zhang, and Feiping Nie. 2014. Feature selection at the discrete limit. In AAAI. 1355–1361.
  • Zhang et al. (2015) Qin Zhang, Peng Zhang, Guodong Long, Wei Ding, Chengqi Zhang, and Xindong Wu. 2015. Towards mining trapezoidal data streams. In ICDM. 1111–1116.
  • Zhao et al. (2015) Lei Zhao, Qinghua Hu, and Wenwu Wang. 2015. Heterogeneous feature selection with multi-modal deep neural networks and sparse group lasso. IEEE TMM 17, 11 (2015), 1936–1948.
  • Zhao et al. (2009) Peng Zhao, Guilherme Rocha, and Bin Yu. 2009. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics (2009), 3468–3497.
  • Zhao et al. (2016) Zhou Zhao, Xiaofei He, Deng Cai, Lijun Zhang, Wilfred Ng, and Yueting Zhuang. 2016. Graph regularized feature selection with data reconstruction. IEEE TKDE 28, 3 (2016), 689–700.
  • Zhao and Liu (2007) Zheng Zhao and Huan Liu. 2007. Spectral feature selection for supervised and unsupervised learning. In ICML. 1151–1157.
  • Zhao and Liu (2008) Zheng Zhao and Huan Liu. 2008. Multi-source feature selection via geometry-dependent covariance analysis. In FSDM. 36–47.
  • Zhao et al. (2010) Zheng Zhao, Lei Wang, Huan Liu, et al. 2010. Efficient spectral feature selection with minimum redundancy. In AAAI. 673–678.
  • Zhao et al. (2013) Zheng Zhao, Ruiwen Zhang, James Cox, David Duling, and Warren Sarle. 2013. Massively parallel feature selection: an approach based on variance preservation. Machine Learning 92, 1 (2013), 195–220.
  • Zhou et al. (2005) Jing Zhou, Dean Foster, Robert Stine, and Lyle Ungar. 2005. Streaming feature selection using alpha-investing. In KDD. 384–393.
  • Zhou et al. (2012) Jiayu Zhou, Jun Liu, Vaibhav A Narayan, and Jieping Ye. 2012. Modeling disease progression via fused sparse group lasso. In KDD. 1095–1103.
  • Zhou and He (2017) Yao Zhou and Jingrui He. 2017. A randomized approach for crowdsourcing in the presence of multiple views. In ICDM.
  • Zhou (2012) Zhi-Hua Zhou. 2012. Ensemble methods: foundations and algorithms. CRC press.
  • Zhu et al. (2004) Ji Zhu, Saharon Rosset, Robert Tibshirani, and Trevor J Hastie. 2004. 1-norm support vector machines. In NIPS. 49–56.
  • Zhu et al. (2016) Pengfei Zhu, Qinghua Hu, Changqing Zhang, and Wangmeng Zuo. 2016. Coupled dictionary learning for unsupervised feature selection. In AAAI. 2422–2428.