Causal Multi-Label Feature Selection
in Federated Setting
Abstract
Multi-label feature selection serves as an effective means for dealing with high-dimensional multi-label data. To achieve satisfactory performance, existing methods for multi-label feature selection often require the centralization of substantial data from multiple sources. However, in federated setting, centralizing data from all sources and merging them into a single dataset is often infeasible. To tackle this issue, we study a new problem of causal multi-label feature selection in federated setting and propose a Federated Causal Multi-label Feature Selection (FedCMFS) algorithm with three novel subroutines. Specifically, FedCMFS first uses the FedCFL subroutine, which considers label-label, label-feature, and feature-feature correlations, to learn the relevant features (candidate parents and children) of each class label while preserving data privacy without centralizing data. Second, FedCMFS employs the FedCFR subroutine to selectively recover the missed true relevant features. Finally, FedCMFS utilizes the FedCFC subroutine to remove false relevant features. Extensive experiments on eight datasets validate the effectiveness of FedCMFS.
Index Terms:
Multi-label data, Causal feature selection, Federated learning, Privacy preserving data, Parallel Optimization.
I Introduction
Multi-label learning has become an important research direction in the field of machine learning [1, 2]. As information technology rapidly develops, multi-label data is becoming increasingly complex, potentially leading to the curse of dimensionality [3]. Therefore, feature selection, as one of the effective tools for mitigating the curse of dimensionality, is widely used in multi-label learning [4]. It aims to reduce the dimensionality of the feature space by designing a metric for feature importance, with the goal of selecting a subset that contains as few irrelevant or redundant features as possible.
Existing multi-label feature selection algorithms typically rely on statistical co-occurrence relationships to determine feature dependency, without explaining why the variables are dependent. To tackle this issue, researchers have proposed causal multi-label feature selection algorithms based on causal structures [5]. A causal relationship describes how one variable directly influences another, revealing the underlying mechanism by which the variables interact. A causal structure is represented by a Directed Acyclic Graph (DAG). In a DAG, a directed edge from A to B indicates that A is a parent (direct cause) of B and, conversely, that B is a child (direct effect) of A [6]. Learning the global causal structure to explore the relationships between variables can be computationally expensive, particularly for high-dimensional datasets. Local causal structure learning methods offer an efficient alternative: they directly identify the parents and children of a given label variable without learning the entire structure.
Existing multi-label feature selection algorithms typically require access to all the data to determine important features and do not consider data privacy. However, in many real-world application scenarios, data often originates from multiple sources, and aggregating the data raises privacy concerns. For instance, chronic disease research may require patient data from various hospitals, and pooling these data risks leaking private patient information.
To protect data privacy, federated learning has garnered considerable attention [7]. Federated learning builds machine learning models through multi-party collaboration and is primarily divided into vertical federated learning and horizontal federated learning. Vertical federated learning shares the same samples but each client holds different features, while clients in horizontal federated learning share the same feature space while holding different samples [8]. In this study, we consider the horizontal federated learning setting.
Since there are currently no studies on multi-label feature selection that consider data privacy, in this paper we fill this gap by proposing the Federated Causal Multi-label Feature Selection (FedCMFS) algorithm in federated setting, which comprises three subroutines: the Federated Causal Feature Learning subroutine (FedCFL), the Federated Causal Feature Retrieval subroutine (FedCFR), and the Federated Causal Feature Correction subroutine (FedCFC). We demonstrate the effectiveness of FedCMFS by simulating the federated setting with eight real datasets and comparing it against five algorithms.
II Related Work
In recent years, scholars have proposed various multi-label feature selection algorithms to address the curse of dimensionality and improve prediction accuracy [9, 10]. Multi-label feature selection algorithms are generally classified into four major categories [11]: mutual information-based, regularization-based, manifold learning-based, and causal structure learning-based methods. Methods based on mutual information, such as FIMF [12], SCLS [13], and SRLG-LMA [14], select the most relevant features by measuring the mutual information between features and labels. Methods based on regularization, such as SFUS [15], JFSC [16], and MLFS-GLOCAL [17], constrain the complexity of the model by introducing regularization terms, thereby selecting the most important features. Methods based on manifold learning, such as MCLS [18], MSSL [19], and MDFS [20], select informative features by considering the local geometric structure and manifold characteristics of the data. Methods based on causal structure learning, such as MB-MCF [5], explore causal relationships between variables to construct a potential causal structure and select causally relevant features.
Fan et al. proposed a new method called LCIFS [21], which integrates manifold learning, adaptive spectral graph, and redundancy analysis into an ensemble framework to learn relevant information for multi-label feature selection. LCIFS utilizes the structural correlation of labels and simultaneously controls the use of redundant features, thereby achieving multi-label feature selection with a clear objective function.
Recently, some scholars have focused on feature selection in federated learning, which allows algorithms to run over multiple datasets while preserving privacy. Specifically, Hu et al. proposed a multi-participant federated evolutionary feature selection algorithm [22] for imbalanced data under privacy protection. They introduced a multi-level joint sample filling strategy to address imbalanced or empty classes on each participant, and then proposed a federated evolutionary feature selection method based on supervised particle swarm optimization, in which the participants periodically share the optimal feature subset. Banerjee et al. introduced an information-theoretic federated feature selection method called Fed-FiS [23]. Fed-FiS evaluates feature-feature mutual information and feature-class mutual information to obtain local feature subsets and a global feature set.
In summary, some feature selection algorithms have been proposed in federated environments, but they have not considered feature selection in multi-label scenarios and have not addressed causal relationships between labels and labels, labels and features, and features and features. Therefore, this paper proposes a novel causal multi-label feature selection algorithm considering data privacy issues.
III Notations and Definitions
In this section, we initially present some key concepts and symbols related to causal structures and Bayesian networks in federated setting. We define as a set of features, as a set of labels, and as the node set encompassing all labels and features. Assuming that there are clients, with each client’s local data having a sample size of . During the learning process within the federated environment, the server sends a triplet to each client for performing the conditional independence test between and under the conditioning set of ( can be empty set ). Each client returns the correlation value and the P value , enabling the server to aggregate the results into the corresponding weighted correlation value and weighted P value . The significance level of the conditional independence test is denoted by (). The notation (where and ) denotes that is conditionally independent from given , while indicates that and are dependent given . The term represents the set of local causal structure (parent-child node set) for the label .
Symbol | Meaning |
---|---|
a set of labels | |
a set of features | |
the number of labels | |
the number of features | |
the number of clients | |
a label in | |
a feature in | |
a node in | |
the sample size of the local data for the th client () | |
the triplet sent by the server | |
the correlation value between and computed by the th client under the condition set | |
the weighted correlation value | |
the P value between and computed by the th client under the condition set | |
the weighted P value | |
the significance level of the CI test | |
and are independent under the condition set | |
and are dependent under the condition set | |
the parent and children set of the label variable |
Definition III.1.
(Bayesian Network, BN) [24]. Let be the joint probability distribution over and represent a directed acyclic graph (DAG) with nodes and edges . The triplet is called a BN if and only if satisfies the Markov condition: every node of is independent of any subset of its non-descendants conditioning on the parents of the node.
In the DAG of a BN, if every directed edge from A to B indicates that A is a direct cause of B and B is a direct effect of A, then the DAG is called a causal DAG (i.e., a causal structure) [6].
Definition III.2.
(Parent and Child, PC [6]). The parents and children of variable in a causal structure consists of the parents and children of , called .
Definition III.3.
(Markov blanket, MB [6]). The MB of a variable in a causal structure consists of the variable’s parents (direct causes), children (direct effects), and spouses (other parents of the variable’s children).
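Definitions III.2 and III.3 can be illustrated on a small hand-written DAG. The following is a minimal sketch, not part of FedCMFS itself; the edge list and the helper names `build_graph`, `pc_set`, and `markov_blanket` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Index a DAG given as (parent, child) pairs."""
    parents, children = defaultdict(set), defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)
        parents[child].add(parent)
    return parents, children


def pc_set(node, parents, children):
    """Parents and children (direct causes and direct effects) of a node."""
    return parents[node] | children[node]


def markov_blanket(node, parents, children):
    """Parents, children, and spouses (other parents of the node's children)."""
    spouses = set()
    for child in children[node]:
        spouses |= parents[child]
    return (pc_set(node, parents, children) | spouses) - {node}


# Toy structure (an assumption): A -> T -> C <- S, and C -> D.
edges = [("A", "T"), ("T", "C"), ("S", "C"), ("C", "D")]
parents, children = build_graph(edges)
pc_of_t = pc_set("T", parents, children)          # A and C
mb_of_t = markov_blanket("T", parents, children)  # A, C, and the spouse S
```

In this toy graph, D is a descendant of T but belongs to neither set, which is why restricting attention to the PC set or the MB shrinks the feature space.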
Theorem III.1.
[6] In a DAG, given the Markov blanket $MB(T)$ of a variable $T$, every other variable $X \notin MB(T) \cup \{T\}$ is conditionally independent of $T$ given $MB(T)$.
Theorem III.1 indicates that in classification tasks, the MB of a label variable is the optimal feature subset for predicting the label variable [25]. Furthermore, recent studies suggest that in real-world scenarios, the prediction quality of a label variable’s PC set is almost identical to that of the MB of the class variable [10, 26]. Therefore, we employ a well-established local causal structure learning algorithm, HITON-PC [26], to learn the PC set of a class variable (any other state-of-the-art local causal structure learning algorithms can be used here).

IV Proposed FedCMFS Algorithm
IV-A Overview of the FedCMFS Algorithm
In this paper, we simulate a federated environment using a client-server architecture, and propose the FedCMFS algorithm for causal multi-label feature selection in federated setting. FedCMFS is a horizontal federated learning algorithm, which uses a central server and multiple clients to perform causal feature selection on standard multi-label data. Specifically, FedCMFS sequentially executes the following three subroutines to select causal features: (1) Federated Causal Feature Learning algorithm (FedCFL); (2) Federated Causal Feature Retrieval algorithm (FedCFR); and (3) Federated Causal Feature Correction algorithm (FedCFC).
FedCFL treats both labels and features as ordinary variables, and it independently computes local causal variables for each label on each client. Throughout local causal structure learning, the clients interact with the server, ultimately obtaining causally relevant feature sets for all label variables: for all labels on the server.
To tackle the issue of missing true relevant features, in FedCFR the server communicates with each client to selectively identify potentially missed true relevant features, obtaining updated causally relevant feature sets for all label variables: .
Because data quality issues may introduce false causally relevant features, FedCFC utilizes the symmetry property of causal neighbors to selectively correct the false causally relevant features in and obtains the final feature set .
IV-B Federated Causal Feature Learning Algorithm (FedCFL)
HITON-PC [26] is a widely used algorithm for learning the PC set of a given variable from a single-label dataset. It adopts a forward-backward strategy and exhibits notable performance in causal feature learning. In this paper, we extend HITON-PC to learn the PC set of a label variable for multi-label data in federated setting and propose the federated causal feature learning algorithm, FedCFL, to address the causal multi-label feature selection problem in federated setting.
A simple strategy for applying the HITON-PC algorithm to multi-label feature selection in federated setting is that each client learns a PC set independently for each label variable, and the PC sets are then aggregated at the server. However, due to the differing quality of samples across clients, the PC sets of a given label variable learned from different clients are often different. To deal with this issue, we design the FedCFL algorithm consisting of two learning phases, as shown in Alg. 1.
In Phase I of FedCFL, we first identify potential PC variables for each class label. The candidate variables include both features and labels, so that label-feature, feature-feature, and label-label correlations can all be learned while learning the local causal structure. The computation in this phase is carried out at each client, and the results are aggregated at the server. Assume that there are N clients (denoted as client 1, client 2, …, client N) and one server. At the beginning of Phase I, upon the server’s request, each client independently runs conditional independence tests on its local data, with the empty set as the condition set, to compute the correlation value and the P value between each class label and every other variable. The node , correlation value and P value are then added to the initial correlation set (where represents the client number , and represents the label number ). After computing the correlations of all labels and all features, each client obtains a local initial correlation set , which includes the initial correlation sets for all class labels. Each client sends the learned local initial correlation set to the server.

The server receives the local initial correlation sets learned by all clients, merges and prunes them according to Eq. 1 to obtain the global initial correlation set . Specifically, for each node that may be dependent with the target label , the server computes the weighted average of the P value of the pair of variables and across all clients. Assuming that a client with a large number of data samples may have a higher probability of representing the true statistical patterns, we assign the weight of each client, , based on the proportion of data with regard to the total data samples across all clients.
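$$\bar{p}(X, L \mid Z) = \sum_{i=1}^{N} w_i \, p_i(X, L \mid Z), \qquad w_i = \frac{n_i}{\sum_{j=1}^{N} n_j} \tag{1}$$
Here $p_i$ is the P value returned by the $i$th client for node $X$ and target label $L$ under the condition set $Z$ (the empty set in Phase I), $n_i$ is the sample size of that client, and $N$ is the number of clients. If the weighted P value $\bar{p}$ is less than the significance level $\alpha$, the server determines that $X$ and $L$ are not independent given the empty set as the condition set and computes the weighted average of the correlation values $c_i$ of the pair across all clients according to Eq. 2. Subsequently, the server adds $X$, its weighted correlation value $\bar{c}$, and its weighted P value $\bar{p}$ to the global initial correlation set.
$$\bar{c}(X, L \mid Z) = \sum_{i=1}^{N} w_i \, c_i(X, L \mid Z) \tag{2}$$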
After the global initial correlation set for each label is completed, the server sorts the variables in by the weighted correlation value in descending order. Finally, the global initial correlation set is obtained at the server.
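The server-side aggregation of Phase I (weighted P value, pruning at the significance level, weighted correlation value, and sorting) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the dictionary layout of the client results, and the value `alpha=0.05` are all hypothetical.

```python
def weighted_average(values, sample_sizes):
    """Average per-client values, weighting each client by its share of the samples."""
    total = sum(sample_sizes)
    return sum(n / total * v for v, n in zip(values, sample_sizes))


def aggregate_initial_set(client_results, sample_sizes, alpha=0.05):
    """Build the global initial correlation set of one target label.

    client_results[i] maps node -> (correlation, p_value) as reported by
    client i for tests conditioned on the empty set (assumed layout).
    """
    kept = []
    for node in client_results[0]:
        p_bar = weighted_average([r[node][1] for r in client_results], sample_sizes)
        if p_bar < alpha:  # dependent on the label under the empty condition set
            c_bar = weighted_average([r[node][0] for r in client_results], sample_sizes)
            kept.append((node, c_bar, p_bar))
    # Strongest weighted correlation first, as required by Phase II.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept


# Two hypothetical clients holding 100 and 300 samples.
client_results = [
    {"F1": (0.5, 0.01), "F2": (0.10, 0.2), "F3": (0.8, 0.001)},
    {"F1": (0.3, 0.03), "F2": (0.05, 0.6), "F3": (0.6, 0.002)},
]
global_set = aggregate_initial_set(client_results, sample_sizes=[100, 300])
```

Only correlation values and P values cross the network in this sketch; the raw samples never leave the clients.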
In Phase II of FedCFL, we utilize a forward-backward strategy to progressively update the variables in the candidate parent and children set (initially is an empty set ) until a complete local causal structure is learned. To reduce the number of conditional independence tests, the computation performed by each client in Phase II is uniformly controlled by the server. The server sequentially adds the variable with the highest weighted correlation value in , along with its corresponding weighted correlation value and weighted P value , to the candidate set . Whenever a new variable is added to , the server needs to determine whether each variable in the current will be independent of the target label under the condition that the new variable is added, and prune the based on the above result. Therefore, the server sends the triplet (where , are the variables to be tested, and is the conditioning set) to all clients, requesting each to conduct the corresponding conditional independence test.
Once receiving the triplet , each client independently conducts conditional independence tests between and given the condition set , and subsequently returns the P value between and to the server.
Subsequently, the server receives the results from all clients and computes the weighted average of the P value under across all clients according to Eq. 1 (the weight of each client is the proportion of data contained in that client out of the total data volume of all clients.). If the weighted P value exceeds the significance level , is independent of , and along with its corresponding and are removed from , otherwise they are retained.

The server and each client interact to execute the aforementioned steps, and the server achieves the set , which is a collection that not only contains the causal feature sets of each class label, but also includes the weighted correlation of the variables with their corresponding labels with an empty conditioning set. The completion of this step signifies the ending of the FedCFL algorithm.
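The forward-backward loop of Phase II can be sketched as below. This is a simplified illustration and not Alg. 1 itself: the callback `weighted_p(x, target, cond)` is a hypothetical stand-in for "send the triplet to all clients, collect their P values, and aggregate them with Eq. 1", and `alpha` and `max_cond` are assumed parameters.

```python
from itertools import combinations


def is_separated(x, target, others, weighted_p, alpha, max_cond):
    """True if some non-empty subset of `others` makes x independent of target."""
    for size in range(1, min(len(others), max_cond) + 1):
        for cond in combinations(others, size):
            if weighted_p(x, target, frozenset(cond)) > alpha:
                return True
    return False


def federated_pc(target, ranked_candidates, weighted_p, alpha=0.05, max_cond=3):
    """Grow the candidate PC set in rank order and prune it after each addition."""
    cpc = []
    for new_var in ranked_candidates:      # forward step: strongest candidate first
        cpc.append(new_var)
        for x in list(cpc):                # backward step: re-check every member
            others = [v for v in cpc if v != x]
            if is_separated(x, target, others, weighted_p, alpha, max_cond):
                cpc.remove(x)
    return cpc


def toy_weighted_p(x, target, cond):
    """Toy oracle (an assumption): B becomes independent of T once A is conditioned on."""
    return 0.9 if (x == "B" and "A" in cond) else 0.001


learned_pc = federated_pc("T", ["A", "B"], toy_weighted_p)
```

The empty condition set is skipped in the backward step because Phase I has already tested it.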
IV-C Federated Causal Feature Retrieval Algorithm (FedCFR)
In FedCFL, all labels and features are treated as ordinary variables to simultaneously consider three types of correlations among variables in a multi-label dataset: feature-label, feature-feature, and label-label correlations. However, due to the correlation among labels, some true PC features may become independent of the labels, resulting in missing true PC features.

Taking the structure shown in Figure 4 as an example, suppose the PC set of the class label C consists of . Due to the correlation between class label C and class label D, feature E, which is a parent node of the class label C, is not selected into the PC set of C. To tackle this issue, we design the Federated Causal Feature Retrieval algorithm (FedCFR), whose pseudocode is shown in Alg. 2.
The initial step of Phase I in FedCFR involves identifying potential missed PC features from all discarded features. From Steps 3 to 15, the FedCFR subroutine determines whether each label needs to selectively retrieve missed PC features by judging whether its contains other class labels. If contains the label ( and ), the possible missed PC features are searched for in discarded variables other than the selected features (i.e., ). The judgment rule is: if and , it is considered to be a possible missed PC feature and added to the candidate feature set .
To aggregate the conditional independence test results among all clients in federated setting, the server sequentially sends the triplet to all clients, requesting each client to perform the conditional independence tests with the empty set as the conditioning set and returning their results and . Upon receiving the returned results from all clients, the server performs the calculations according to Eq. 1 and Eq. 2. If the weighted average of the returned results from all clients is less than the significance level , it is considered dependent. The variable and its corresponding and , are then added to . Otherwise, it is considered independent.
Recent research has shown that a causal structure in real-world scenarios is often relatively sparse [27, 28]. For instance, in a dataset containing 1000 features, a class variable may only have 10 PC features. This implies that the discarded set contains very few missed PC features. If all variables in are tested, it would cost a significant amount of time. Among those candidate variables, the variables with high correlations to a class variable are most likely to be the missed PC features. Thus in Step 8, FedCFR addresses this issue by sorting the candidate feature set in descending order according to the weighted correlation value with the empty conditioning set, then selects the top % variables.
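The ranking-and-truncation step described above can be sketched in a few lines. The function name `top_percent`, the list-of-pairs layout, and the choice to round the cut-off upwards are assumptions made for illustration only.

```python
import math


def top_percent(candidates, k_percent):
    """Keep the k% of candidates with the largest weighted correlation values.

    candidates: list of (node, weighted_correlation) pairs measured under
    the empty conditioning set. Rounding up is an assumed convention.
    """
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    keep = math.ceil(len(ranked) * k_percent / 100)
    return ranked[:keep]


candidates = [(f"F{i}", i / 10) for i in range(1, 11)]  # F1 .. F10
shortlist = top_percent(candidates, k_percent=20)
```

With ten discarded candidates and a 20% cut-off, only the two most strongly correlated candidates are passed on to the more expensive second phase.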
In the second phase, FedCFR utilizes the available structural information to determine candidate features that may have been missed by FedCFL. Taking Figure 4 as an example, suppose the learned PC set of label C is , and E has not been correctly added to the PC set. After the first phase of FedCFR, feature E is added to the candidate feature set . In this case, there must exist a set containing the class label D such that . If the class label D is removed from , C will be dependent on E.
Therefore, the server traverses , and when class label appears in of the class label , it examines the top % variables in the candidate feature set . If a set is found such that and , and , then is considered a missed PC feature. and its corresponding and are added to .
To coordinate all clients to complete the above operations in federated setting, the server sends the triplet to all clients and receives the results returned by each client. It then uses Eq. 1 to obtain the weighted P value . If the weighted P value exceeds the significance level , it is considered independent. The server then sends the triplet to each client and confirms whether the results are dependent. After the successful completion of both tests, , and are added to . When all operations against in are completed, the label is removed from . After the execution of the FedCFR algorithm, the set of the missed PC features is obtained, denoted as .
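The two-test retrieval rule of the second phase can be sketched as follows. Because the exact set notation is not recoverable here, this reflects one reading of the rule: a candidate is recovered if some condition set containing another label makes it independent of the target label, while the same set without that label leaves it dependent. The callback `weighted_p` (a stand-in for the aggregated federated test), `alpha`, and `max_cond` are hypothetical.

```python
from itertools import combinations


def is_missed(x, label, other_label, rest, weighted_p, alpha, max_cond):
    """Check the pair of tests for one candidate x and one label in the PC set."""
    for size in range(0, min(len(rest), max_cond) + 1):
        for subset in combinations(rest, size):
            without_label = frozenset(subset)
            with_label = without_label | {other_label}
            if (weighted_p(x, label, with_label) > alpha            # independent
                    and weighted_p(x, label, without_label) <= alpha):  # dependent
                return True
    return False


def retrieve_missed(label, pc, all_labels, candidates, weighted_p,
                    alpha=0.05, max_cond=2):
    """Return candidates that were screened out only because of another label."""
    recovered = []
    for other_label in [v for v in pc if v in all_labels and v != label]:
        rest = [v for v in pc if v != other_label]
        for x in candidates:
            if x not in recovered and is_missed(
                    x, label, other_label, rest, weighted_p, alpha, max_cond):
                recovered.append(x)
    return recovered


def toy_weighted_p(x, label, cond):
    """Toy oracle (an assumption): E is masked by label D; G is never dependent on C."""
    if x == "E":
        return 0.8 if "D" in cond else 0.001
    return 0.5


missed = retrieve_missed("C", ["A", "B", "D"], {"C", "D"}, ["E", "G"], toy_weighted_p)
```

In the toy oracle, E mirrors the example of Figure 4: it looks independent of C only when D is in the condition set, so it is recovered, whereas G fails the second test and stays discarded.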
IV-D Federated Causal Feature Correction Algorithm (FedCFC)
In the first two subroutines of FedCMFS, the uneven quality of data across clients may lead to erroneous aggregated results of the conditional independence tests, so both FedCFL and FedCFR may learn false PC variables. To remove these false PC variables from (obtained by FedCFL and FedCFR), we exploit the symmetry property of a parent and its children in a DAG (if A is a parent of B, then B must be a child of A) and design the FedCFC subroutine, which examines whether the PC of a variable in includes , to eliminate false PC features in . The pseudocode of the FedCFC algorithm is provided below.
Given that the true contains fewer false features, the features with smaller correlations to the class variable are more likely to have been mistakenly selected. To avoid calculating the PC of all variables in , the server first sorts the variables by their weighted correlation values under the empty conditioning set and stores them in ascending order. It then selects the top features with the smallest correlations to the class label to be included in the correction set .
In Steps 3 to 9, the FedCFC Algorithm applies the FedCFL algorithm to the feature in and obtains the result at the server. It then determines whether the set of the feature contains the label . If not, is removed from the set of . After the correction process for all labels is completed, the finally selected feature set is obtained.
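The symmetry check can be sketched as follows. This is an illustrative simplification: `learn_pc` is a hypothetical callback standing in for a FedCFL run on a feature, and `num_to_check` is an assumed parameter that plays the role of the size of the correction set.

```python
def correct_pc(label, pc_with_corr, learn_pc, num_to_check):
    """Drop weakly correlated PC members whose own PC set does not contain the label.

    pc_with_corr: dict mapping node -> weighted correlation with the label.
    learn_pc:     callable returning the learned PC set of a node.
    """
    # Correction set: the members with the smallest weighted correlations.
    weakest = sorted(pc_with_corr, key=pc_with_corr.get)[:num_to_check]
    corrected = dict(pc_with_corr)
    for node in weakest:
        if label not in learn_pc(node):  # symmetry violated: likely a false PC member
            del corrected[node]
    return corrected


# Hypothetical learned structure: F3 does not list label L among its neighbours.
pc_of_label = {"F1": 0.9, "F2": 0.2, "F3": 0.1}
learned = {"F2": {"L", "F1"}, "F3": {"F1"}}
final_pc = correct_pc("L", pc_of_label, lambda node: learned[node], num_to_check=2)
```

Restricting the check to the weakest members keeps the number of extra FedCFL runs small, which matches the motivation given above.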
IV-E Acceleration Method
The three subroutines of FedCMFS all involve conditional independence tests (CI tests). Thus the time complexity of FedCMFS can be measured by the number of CI tests performed by a single client. The time complexity of FedCMFS is relatively high, with the time complexity of the subroutine FedCFL being , the time complexity of the subroutine FedCFR being , and the time complexity of the subroutine FedCFC being . This indicates that the number of CI tests performed is exponentially related to the size of the learned PC set. As the dimensionality of the dataset increases, the learned PC set may become larger, and the number of tests correspondingly increases. Therefore, to improve the execution speed of FedCMFS, this paper proposes the following three strategies to accelerate CI tests by leveraging the parallel computing ability of GPU.
The first strategy involves data-level parallel processing for two types of CI tests: the G² test [29] for discrete data and the Fisher’s Z test [29] for continuous data. A brief introduction to these two CI testing methods is provided below.
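Assuming that variables $X$ and $Y$ are conditionally independent given $Z$, the statistic of the G² test is computed as follows:
$$G^2 = 2 \sum_{a, b, c} N_{abc} \ln \frac{N_{abc}\, N_{c}}{N_{ac}\, N_{bc}} \tag{3}$$
Here $G^2$ is the test statistic, and $N_{abc}$ is a random variable whose value is the number of times $X = a$, $Y = b$, and $Z = c$ simultaneously occur in a data sample; $N_{ac}$, $N_{bc}$, and $N_{c}$ are the corresponding marginal counts. Under the null hypothesis of conditional independence, $G^2$ asymptotically follows a $\chi^2$ distribution. The p-value of the statistic is computed and compared with a predetermined significance level, and the null hypothesis of independence is rejected when the p-value falls below that level.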
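The counting behind the G² statistic can be made concrete with a small CPU reference. The sketch below uses the standard form of the statistic and plain Python counters; the function name `g2_statistic` and the encoding of the conditioning configuration as a tuple are assumptions, and the conversion to a p-value (via the chi-square distribution) is deliberately left out.

```python
import math
from collections import Counter


def g2_statistic(x, y, z):
    """G^2 statistic for discrete samples x, y given conditioning configurations z.

    x, y: sequences of discrete values, one entry per sample.
    z:    sequence of hashable conditioning configurations (use () for the empty set).
    Cells with zero count contribute nothing and are skipped implicitly.
    """
    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)
    g2 = 0.0
    for (a, b, c), n_abc in n_xyz.items():
        expected_ratio = n_abc * n_z[c] / (n_xz[(a, c)] * n_yz[(b, c)])
        g2 += 2.0 * n_abc * math.log(expected_ratio)
    return g2


empty = [()] * 4
g2_independent = g2_statistic([0, 0, 1, 1], [0, 1, 0, 1], empty)  # balanced table
g2_identical = g2_statistic([0, 0, 1, 1], [0, 0, 1, 1], empty)    # y copies x
```

A perfectly balanced table yields a statistic of zero, while identical variables yield a large positive value, which is the behaviour the test relies on.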
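Assuming that variables $X$ and $Y$ are conditionally independent given $Z$, the statistic of Fisher’s Z test is computed as follows:
$$z = \frac{1}{2} \sqrt{n - |Z| - 3}\; \ln \frac{1 + r}{1 - r} \tag{4}$$
Here $z$ is the test statistic, $n$ is the sample size, $|Z|$ is the size of the condition set, and $r$ denotes a random variable whose value is the partial correlation coefficient of $X$ and $Y$ given the condition set $Z$. Under the null hypothesis of conditional independence, $z$ asymptotically follows a standard normal distribution. The p-value of the statistic is computed and compared with a predetermined significance level, and the null hypothesis of independence is rejected when the p-value falls below that level.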
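A single Fisher's Z test can be sketched with NumPy as below. It uses the standard form of the test (partial correlation read off the inverse correlation matrix, two-sided p-value from the normal distribution); the function name, the column-index interface, and the synthetic chain data are assumptions for illustration.

```python
import math

import numpy as np


def fisher_z_test(data, i, j, cond=()):
    """Return (partial correlation, two-sided p-value) for columns i and j given cond."""
    idx = [i, j, *cond]
    corr = np.corrcoef(data[:, idx], rowvar=False)
    precision = np.linalg.inv(corr)
    # Partial correlation of the first two variables given the remaining ones.
    r = -precision[0, 1] / math.sqrt(precision[0, 0] * precision[1, 1])
    r = min(max(r, -0.999999), 0.999999)  # keep the logarithm finite
    n = data.shape[0]
    z = 0.5 * math.sqrt(n - len(cond) - 3) * math.log((1 + r) / (1 - r))
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return r, p_value


# Synthetic chain x -> m -> y: x and y are dependent, but not once m is given.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
m = x + rng.normal(size=2000)
y = m + rng.normal(size=2000)
data = np.column_stack([x, y, m])
r_marginal, p_marginal = fisher_z_test(data, 0, 1)
r_conditional, p_conditional = fisher_z_test(data, 0, 1, cond=(2,))
```

On the chain data, the marginal test reports a strong dependence, while the partial correlation given the middle variable shrinks towards zero.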
GPUs have many processing cores that can handle multiple data streams simultaneously, enabling massively parallel computing. In this paper, we fully utilize the parallel computing capabilities of GPUs to implement the parallel computation of the G² test and Fisher’s Z test. Next, we will take the calculation of as an example to explain in detail.
When calculating , for each sample in the dataset, all possible combinations of are traversed to determine whether , , and are satisfied, and the count is updated accordingly. Assuming , , and each have 100 values, in the worst case, conditional judgments need to be performed to ensure that all possible combinations are considered.
On the GPU, is organized as a three-dimensional tensor, with each dimension corresponding to , , and respectively. For the dimension of the variable , the task of finding an index with the value is assigned in parallel to multiple processing units. Each processing unit is responsible for determining whether a value in the dimension is equal to , and if so, returning the index. Similarly, the indexes with values and are found in the and dimensions using the same method. After obtaining the indexes, is updated by incrementing the corresponding count. Using this parallelization method, only three parallel operations on the GPU are required to perform a task that would otherwise take operations on the CPU, resulting in a significant increase in computational speed.
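The idea of filling the three-dimensional count tensor without nested loops can be illustrated on the CPU with NumPy. This sketch is not the authors' GPU kernel: it swaps in a flattened-index histogram (`np.bincount`) as a simple vectorized stand-in, and it assumes that the values of each variable have already been encoded as integers starting from 0.

```python
import numpy as np


def count_tensor(x, y, z, kx, ky, kz):
    """Count co-occurrences of (x, y, z) into a (kx, ky, kz) tensor in one pass.

    x, y, z: integer arrays of equal length with values in [0, kx), [0, ky), [0, kz).
    """
    flat_index = (x * ky + y) * kz + z               # one cell id per sample
    counts = np.bincount(flat_index, minlength=kx * ky * kz)
    return counts.reshape(kx, ky, kz)


x = np.array([0, 1, 1, 1])
y = np.array([0, 0, 1, 1])
z = np.array([0, 0, 0, 0])
counts = count_tensor(x, y, z, kx=2, ky=2, kz=1)
```

Every sample is mapped to its cell independently of the others, which is the property that makes the counting step suitable for data-level parallelism.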
The second strategy is the vectorized parallelization of Fisher’s Z test. Function vectorization involves applying a function to each element of a vector, allowing multiple data elements to be processed simultaneously. For example, consider the simple function . If this function is vectorized, then for a vector containing multiple elements, the square of each element can be computed directly without the need for an explicit loop to compute them one by one.
In this paper, Fisher’s Z test is treated as a function and its vectorized parallelism is implemented on GPUs. When dealing with continuous data, the execution process of the FedCMFS algorithm is optimized such that the server sends all CI test requests of the current phase to the client in batches. The client then executes all test requests in parallel, instead of using an explicit loop to perform the tests sequentially, and returns all the results to the server at once. Next, the first phase of the FedCFL subroutine is described in detail as an example.
In the first phase of FedCFL, the server sends requests to the clients to perform tests on and under the empty set () one at a time. Clients perform these tests sequentially, with each client needing to conduct CI tests.
After using the vectorized parallelization of Fisher’s Z test, as shown in Algorithm 4, the server records all triples into and sends to all clients. Upon receiving it, clients perform CI tests in batches. The tests within the same batch are executed in parallel on the GPU, and the results are recorded into . Subsequently, clients extract and from and add them to . Finally, clients send back to the server. The batch size needs to be set according to the GPU’s performance. If the batch size is set to 100, each client only needs to execute batches of CI tests.
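The batched execution of Phase I on one client can be sketched with NumPy as a CPU stand-in for the GPU version. The sketch covers only tests with an empty condition set; the function name, the array layout (samples in rows), and the interpretation of a batch as a block of features tested against all labels are assumptions.

```python
import math

import numpy as np


def batched_fisher_z(features, labels, batch_size=100):
    """Marginal Fisher's Z tests for every (feature, label) pair, one batch at a time.

    features: array of shape (n, d); labels: array of shape (n, q).
    Returns two (d, q) arrays: correlation values and two-sided p-values.
    """
    n, d = features.shape
    f = (features - features.mean(axis=0)) / features.std(axis=0)
    l = (labels - labels.mean(axis=0)) / labels.std(axis=0)
    corr_blocks, p_blocks = [], []
    for start in range(0, d, batch_size):
        block = f[:, start:start + batch_size]
        # All Pearson correlations of the batch in a single matrix product.
        corr = np.clip(block.T @ l / n, -0.999999, 0.999999)
        z = 0.5 * math.sqrt(n - 3) * np.log((1 + corr) / (1 - corr))
        p = np.vectorize(math.erfc)(np.abs(z) / math.sqrt(2))
        corr_blocks.append(corr)
        p_blocks.append(p)
    return np.vstack(corr_blocks), np.vstack(p_blocks)


rng = np.random.default_rng(1)
features = rng.normal(size=(200, 5))
labels = features[:, :2] + rng.normal(size=(200, 2))  # labels driven by F0 and F1
corr, p_values = batched_fisher_z(features, labels, batch_size=2)
```

The explicit loop now runs once per batch instead of once per test, so the number of sequential steps drops by roughly the batch size.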
The third strategy is the CI test recording mechanism. Sometimes, two variables under the same condition set have already been tested, but a duplicate test is performed in a subsequent run, resulting in a waste of computational resources. We record the results of each test in the CI test record. Before executing a test, we first check the CI test record to see if a corresponding record already exists. If it does, the result in the record is used directly without repeating the test; if not, the test continues to be executed. This mechanism effectively reduces the number of CI tests and accelerates the algorithm.
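The recording mechanism amounts to memoizing test results. A minimal sketch is given below; the class name `CITestRecord`, the callback `run_test`, and the decision to treat a test as symmetric in its two variables are assumptions made for illustration.

```python
class CITestRecord:
    """Cache of CI test results keyed by the variable pair and the condition set."""

    def __init__(self, run_test):
        self._run_test = run_test  # callable(x, y, cond) -> test result
        self._record = {}
        self.executed = 0          # number of tests actually run

    def test(self, x, y, cond=()):
        # Order of x and y, and order within cond, do not matter (assumed symmetry).
        key = (frozenset((x, y)), frozenset(cond))
        if key not in self._record:
            self._record[key] = self._run_test(x, y, tuple(cond))
            self.executed += 1
        return self._record[key]


record = CITestRecord(lambda x, y, cond: 0.5)  # dummy test returning a fixed p-value
record.test("A", "B", ("C",))
record.test("B", "A", ("C",))  # served from the record, not re-run
record.test("A", "B")          # different condition set, so it is run
```

After the three calls above, only two tests have actually been executed.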
V Experiment
V-A Datasets
To evaluate the effectiveness of FedCMFS, we use eight real datasets to simulate the federated setting. The specific information of the eight datasets is shown in Table II.
Dataset | Instances | Labels | Features | Domain | Data type |
---|---|---|---|---|---|
CHD_49 | 555 | 6 | 49 | Medicine | Continuous |
VirusGo | 207 | 6 | 749 | Biology | Discrete |
Yeast | 2417 | 14 | 103 | Biology | Continuous |
Flags | 194 | 7 | 19 | Image | Discrete |
Image | 2000 | 5 | 294 | Image | Continuous |
Slashdot | 3782 | 22 | 1079 | Text | Discrete |
Business | 5000 | 30 | 438 | Text | Continuous |
Education | 5000 | 33 | 550 | Text | Continuous |
V-B Evaluation Metrics
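We select six commonly used multi-label metrics to evaluate the performance of the methods: Average precision (AP), Coverage (CV), Hamming loss (HL), Ranking loss (RL), Macro-F1 (Fma), and Micro-F1 (Fmi). Suppose there is a multi-label dataset $D = \{(x_i, Y_i)\}_{i=1}^{n}$, where $x_i$ and $Y_i$ are the feature set and label set corresponding to the $i$th sample, respectively, and $\bar{Y}_i$ is the set of labels that are irrelevant to that sample. The ranking function $rank(x_i, l)$ denotes the predicted rank of label $l$ for the $i$th sample, with rank 1 assigned to the label predicted as most relevant. The six evaluation metrics are defined as follows:
(1) Average precision: the average fraction of relevant labels that are ranked at or above each relevant label, where $AP \in [0, 1]$; the higher the value, the better the performance.
$$AP = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|Y_i|} \sum_{l \in Y_i} \frac{\left|\{\, l' \in Y_i : rank(x_i, l') \le rank(x_i, l) \,\}\right|}{rank(x_i, l)} \tag{5}$$
(2) Coverage: the average number of steps needed to move down the ranked label list in order to cover all relevant labels of a sample; the smaller the value, the better the performance.
$$CV = \frac{1}{n} \sum_{i=1}^{n} \max_{l \in Y_i} rank(x_i, l) - 1 \tag{6}$$
(3) Hamming loss: the proportion of sample-label pairs that are incorrectly predicted. Here, $m$ stands for the number of labels, $Y'_i$ is the predicted label set, $Y_i$ represents the true label set of the $i$th sample, and $\Delta$ is the symmetric difference between the predicted and the true label sets. $HL \in [0, 1]$; the smaller the value, the better the performance.
$$HL = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y'_i \,\Delta\, Y_i|}{m} \tag{7}$$
(4) Ranking loss: the fraction of label pairs in which an irrelevant label is ranked above a relevant label. $RL \in [0, 1]$; the smaller the value, the better the performance.
$$RL = \frac{1}{n} \sum_{i=1}^{n} \frac{\left|\{\, (l, l') \in Y_i \times \bar{Y}_i : rank(x_i, l) > rank(x_i, l') \,\}\right|}{|Y_i|\,|\bar{Y}_i|} \tag{8}$$
(5) Macro-F1: the arithmetic mean of the per-label F1 scores.
$$F_{ma} = \frac{1}{m} \sum_{j=1}^{m} \frac{2\,TP_j}{2\,TP_j + FP_j + FN_j} \tag{9}$$
(6) Micro-F1: the F1 score computed from the counts pooled over all labels.
$$F_{mi} = \frac{2 \sum_{j=1}^{m} TP_j}{2 \sum_{j=1}^{m} TP_j + \sum_{j=1}^{m} FP_j + \sum_{j=1}^{m} FN_j} \tag{10}$$
The two metrics mentioned above consider both the recall and precision of the model. Here, $TP_j$ represents the number of true positives in the model’s predictions for the $j$th label, while $FP_j$ and $FN_j$ represent the numbers of false positives and false negatives, respectively. $F_{ma}, F_{mi} \in [0, 1]$, and a larger value indicates better performance.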
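For reference, the six metrics can be computed with NumPy as sketched below, using their standard definitions. The function names, the 0/1 matrix encoding of label sets, and the assumption that every sample has at least one relevant and one irrelevant label are choices made for this illustration only.

```python
import numpy as np


def hamming_loss(y_true, y_pred):
    """Fraction of sample-label pairs that are predicted incorrectly."""
    return float(np.mean(y_true != y_pred))


def f1_scores(y_true, y_pred):
    """Return (macro-F1, micro-F1) from binary label matrices of shape (n, m)."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)
    denom = 2 * tp + fp + fn
    per_label = np.divide(2 * tp, denom, out=np.zeros(len(tp)), where=denom > 0)
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    return float(per_label.mean()), float(micro)


def ranking_metrics(y_true, scores):
    """Return (average precision, coverage, ranking loss) from real-valued scores."""
    n = y_true.shape[0]
    # Rank 1 goes to the label with the highest score.
    ranks = (-scores).argsort(axis=1).argsort(axis=1) + 1
    ap = cov = rl = 0.0
    for i in range(n):
        rel = ranks[i, y_true[i] == 1]  # ranks of relevant labels
        irr = ranks[i, y_true[i] == 0]  # ranks of irrelevant labels
        ap += np.mean([(rel <= rk).sum() / rk for rk in rel])
        cov += rel.max() - 1
        rl += np.sum(rel[:, None] > irr[None, :]) / (len(rel) * len(irr))
    return ap / n, cov / n, rl / n


y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
scores = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.3]])
hl = hamming_loss(y_true, y_pred)
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
ap, cv, rl = ranking_metrics(y_true, scores)
```

In the toy example the scores rank every relevant label above every irrelevant one, so the ranking-based metrics take their best values even though two thresholded predictions are wrong.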
V-C Comparison Algorithms
No existing multi-label feature selection method is designed for federated environments that preserve data privacy. To validate FedCMFS, we select five state-of-the-art feature selection methods: MB-MCF (Markov blanket-based multi-label causal feature selection) [5], GLFS (Group-preserving label-specific feature selection for multi-label learning) [30], PDMFS (Parallel Dual-channel Multi-label Feature Selection) [31], GRROOR (Global Redundancy and Relevance Optimization in Orthogonal Regression for Embedded Multi-label Feature Selection) [32], and PMFS (Pareto-based feature selection algorithm for multi-label classification) [33]. We then adopt a simple strategy to make these five comparison algorithms work in the federated setting. Specifically, for each comparison algorithm, each client independently executes the algorithm locally. Once all clients have completed their execution, they send their evaluation metric results to the server. The server then calculates a weighted average, with each client's weight being the proportion of its data in the total data volume, and records this weighted average as the final result for comparison.
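The server-side aggregation used for the comparison algorithms can be sketched as follows. This is a minimal illustration of a size-weighted average; the function name and the example numbers are hypothetical and only show the arithmetic.

```python
from typing import Sequence


def weighted_client_average(metric_values: Sequence[float],
                            client_sizes: Sequence[int]) -> float:
    """Average per-client metric values, weighting each client by its data share."""
    if not metric_values or len(metric_values) != len(client_sizes):
        raise ValueError("need one metric value and one size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total data volume must be positive")
    return sum(v * n for v, n in zip(metric_values, client_sizes)) / total


# Three hypothetical clients holding 100, 300, and 100 samples.
result = weighted_client_average([0.80, 0.70, 0.90], [100, 300, 100])
```

Here the second client holds 60% of the data, so its value dominates and the aggregate is 0.76 rather than the unweighted mean of 0.80.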
V-D Experimental Environment and Parameter Settings
All experiments were conducted on an Ubuntu server equipped with an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz, 64GB of memory, and an NVIDIA A100-SXM 40GB GPU. The programming environment was Python 3.8 with PyTorch 2.1.0. The batch size was set to 100.
For FedCMFS algorithm proposed in this paper, parameters and for FedCFR and FedCFC are set within . The significance level for conditional independence tests is . All other comparison algorithms used default parameter settings.
The ML-KNN [34] classifier, which is widely adopted in multi-label learning, is used to evaluate the results of federated feature selection (any other multi-label classification algorithm can be used here), and its parameter $k$ is set to 10 according to the default setting. Among all the algorithms, only MB-MCF selects a fixed number of features, while the other comparison algorithms select features randomly. To effectively track metric variations in the final feature subset, all algorithms except MB-MCF use the same number of features as FedCMFS selects.
To simulate the federated setting, this study assumes a total of clients, where , each with data items. The data division method for the simulated federated setting is as follows: (1) For smaller datasets: CHD_49, VirusGo, Yeast, Flags, and Image, each client randomly extracts of the original training set data from the real dataset without repetition to construct their dataset. (2) For larger datasets: Slashdot, Business, and Education, each client randomly extracts of the original training set data from the real dataset without repetition to construct their dataset. Although the data within a client is unique, there may be data overlap among different clients.
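The data division described above can be sketched as follows. This is only an assumed reading of the procedure: each client draws its own subset of training indices without repetition, independently of the other clients, so overlap across clients is possible. The function name, the `seed` argument, and the sampling fraction used in the usage line are hypothetical.

```python
import random
from typing import List


def simulate_clients(n_samples: int, n_clients: int, fraction: float,
                     seed: int = 0) -> List[List[int]]:
    """Return one list of training-set indices per simulated client."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    size = max(1, int(round(fraction * n_samples)))
    # random.sample draws without replacement, so indices are unique per client;
    # separate draws per client mean that different clients may share samples.
    return [sorted(rng.sample(range(n_samples), size)) for _ in range(n_clients)]


# Illustrative call: 5 clients, each holding half of a 200-sample training set.
clients = simulate_clients(n_samples=200, n_clients=5, fraction=0.5)
```

Sampling independently per client keeps the simulation simple, at the cost of clients not forming a disjoint partition of the training set.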
V-E Experimental Results and Analysis
The comparison results of the six algorithms, FedCMFS, MB-MCF, GLFS, PDMFS, GRROOR, and PMFS, are shown in Tables III to VIII. In each table, the optimal results are shown in bold, ↑ indicates that higher values are better, ↓ indicates that lower values are better, and Average denotes the average rank of each method among the six methods.
Table III: Average precision (↑)
Datasets | Clients | FedCMFS | MB-MCF | GLFS | PDMFS | GRROOR | PMFS |
---|---|---|---|---|---|---|---|
CHD_49 | 3 | 0.7722 | 0.7708 | 0.7707 | 0.7537 | 0.7634 | 0.7693 |
CHD_49 | 5 | 0.7783 | 0.7647 | 0.7671 | 0.7577 | 0.7567 | 0.7608 |
CHD_49 | 10 | 0.7732 | 0.7664 | 0.7685 | 0.7525 | 0.7571 | 0.7607 |
VirusGo | 3 | 0.9452 | 0.9432 | 0.6712 | 0.6333 | 0.6340 | 0.6279 |
VirusGo | 5 | 0.9442 | 0.9341 | 0.6516 | 0.6354 | 0.6330 | 0.6321 |
VirusGo | 10 | 0.9442 | 0.9353 | 0.6620 | 0.6334 | 0.6320 | 0.6291 |
Yeast | 3 | 0.7571 | 0.7503 | 0.8210 | 0.7306 | 0.7395 | 0.8156 |
Yeast | 5 | 0.7542 | 0.7523 | 0.8170 | 0.7304 | 0.7340 | 0.8210 |
Yeast | 10 | 0.7590 | 0.7535 | 0.8172 | 0.7286 | 0.7364 | 0.8190 |
Flags | 3 | 0.8251 | 0.8113 | 0.8283 | 0.8534 | 0.7709 | 0.7719 |
Flags | 5 | 0.8224 | 0.8019 | 0.8195 | 0.8476 | 0.7899 | 0.7786 |
Flags | 10 | 0.7856 | 0.7992 | 0.8397 | 0.8343 | 0.7850 | 0.7812 |
Image | 3 | 0.7199 | 0.6893 | 0.5688 | 0.5277 | 0.6771 | 0.6585 |
Image | 5 | 0.7133 | 0.6838 | 0.5657 | 0.5386 | 0.6993 | 0.7025 |
Image | 10 | 0.7247 | 0.6883 | 0.5303 | 0.5296 | 0.6415 | 0.6911 |
Slashdot | 3 | 0.7477 | 0.7465 | 0.7413 | 0.7397 | 0.7411 | 0.7398 |
Slashdot | 5 | 0.7480 | 0.7452 | 0.7406 | 0.7397 | 0.7399 | 0.7427 |
Slashdot | 10 | 0.7461 | 0.7457 | 0.7399 | 0.7388 | 0.7401 | 0.7414 |
Business | 3 | 0.8793 | 0.8759 | 0.8692 | 0.8787 | 0.8688 | 0.8743 |
Business | 5 | 0.8798 | 0.8768 | 0.8690 | 0.8733 | 0.8688 | 0.8719 |
Business | 10 | 0.8769 | 0.8770 | 0.8719 | 0.8806 | 0.8682 | 0.8700 |
Education | 3 | 0.5735 | 0.5668 | 0.5436 | 0.5622 | 0.5239 | 0.5345 |
Education | 5 | 0.5705 | 0.5704 | 0.5376 | 0.5354 | 0.5278 | 0.5295 |
Education | 10 | 0.5763 | 0.5696 | 0.5427 | 0.5317 | 0.5316 | 0.5323 |
Average | - | 1.5833 | 2.7083 | 3.2500 | 4.4167 | 5.0000 | 4.0417 |
Table IV: Coverage (↓)
Datasets | Clients | FedCMFS | MB-MCF | GLFS | PDMFS | GRROOR | PMFS |
---|---|---|---|---|---|---|---|
CHD_49 | 3 | 0.4702 | 0.4891 | 0.4911 | 0.5069 | 0.5010 | 0.5089 |
CHD_49 | 5 | 0.4702 | 0.4940 | 0.4952 | 0.5238 | 0.5119 | 0.5101 |
CHD_49 | 10 | 0.4881 | 0.4917 | 0.4908 | 0.5205 | 0.5199 | 0.5137 |
VirusGo | 3 | 0.0703 | 0.0730 | 0.2296 | 0.2450 | 0.2456 | 0.2423 |
VirusGo | 5 | 0.0622 | 0.0719 | 0.2373 | 0.2450 | 0.2442 | 0.2390 |
VirusGo | 10 | 0.0622 | 0.0735 | 0.2301 | 0.2454 | 0.2452 | 0.2428 |
Yeast | 3 | 0.4642 | 0.4631 | 0.4081 | 0.4775 | 0.4722 | 0.4128 |
Yeast | 5 | 0.4618 | 0.4623 | 0.4122 | 0.4783 | 0.4761 | 0.4127 |
Yeast | 10 | 0.4531 | 0.4604 | 0.4122 | 0.4804 | 0.4730 | 0.4129 |
Flags | 3 | 0.5165 | 0.5253 | 0.5465 | 0.5289 | 0.5568 | 0.5802 |
Flags | 5 | 0.5077 | 0.5543 | 0.5464 | 0.5323 | 0.5508 | 0.5824 |
Flags | 10 | 0.5407 | 0.5512 | 0.5251 | 0.5391 | 0.5637 | 0.5635 |
Image | 3 | 0.2180 | 0.2257 | 0.3523 | 0.4133 | 0.2490 | 0.2767 |
Image | 5 | 0.2185 | 0.2293 | 0.3602 | 0.3993 | 0.2285 | 0.2358 |
Image | 10 | 0.2130 | 0.2298 | 0.3905 | 0.4065 | 0.2635 | 0.2439 |
Slashdot | 3 | 0.0346 | 0.0347 | 0.0396 | 0.0401 | 0.0388 | 0.0400 |
Slashdot | 5 | 0.0349 | 0.0354 | 0.0400 | 0.0410 | 0.0395 | 0.0390 |
Slashdot | 10 | 0.0365 | 0.0348 | 0.0405 | 0.0411 | 0.0403 | 0.0394 |
Business | 3 | 0.0761 | 0.0767 | 0.0798 | 0.0768 | 0.0800 | 0.0788 |
Business | 5 | 0.0756 | 0.0770 | 0.0804 | 0.0780 | 0.0805 | 0.0803 |
Business | 10 | 0.0765 | 0.0772 | 0.0795 | 0.0766 | 0.0804 | 0.0798 |
Education | 3 | 0.1130 | 0.1138 | 0.1206 | 0.1158 | 0.1248 | 0.1207 |
Education | 5 | 0.1145 | 0.1139 | 0.1199 | 0.1221 | 0.1230 | 0.1218 |
Education | 10 | 0.1132 | 0.1139 | 0.1199 | 0.1241 | 0.1224 | 0.1216 |
Average | - | 1.4583 | 2.4583 | 3.3750 | 4.8750 | 4.7917 | 4.0417 |
Table V: Hamming loss (↓)
Datasets | Clients | FedCMFS | MB-MCF | GLFS | PDMFS | GRROOR | PMFS |
---|---|---|---|---|---|---|---|
CHD_49 | 3 | 0.3006 | 0.3214 | 0.3284 | 0.3333 | 0.3284 | 0.3284 |
CHD_49 | 5 | 0.3095 | 0.3173 | 0.3095 | 0.3494 | 0.3405 | 0.3363 |
CHD_49 | 10 | 0.3065 | 0.3235 | 0.3193 | 0.3443 | 0.3339 | 0.3366 |
VirusGo | 3 | 0.0462 | 0.0469 | 0.2001 | 0.1988 | 0.1988 | 0.2001 |
VirusGo | 5 | 0.0402 | 0.0498 | 0.1996 | 0.1988 | 0.1988 | 0.2000 |
VirusGo | 10 | 0.0402 | 0.0528 | 0.1990 | 0.1988 | 0.1988 | 0.1990 |
Yeast | 3 | 0.1985 | 0.2010 | 0.1692 | 0.2155 | 0.2099 | 0.1724 |
Yeast | 5 | 0.2038 | 0.2019 | 0.1721 | 0.2148 | 0.2125 | 0.1726 |
Yeast | 10 | 0.1982 | 0.2012 | 0.1724 | 0.2157 | 0.2116 | 0.1721 |
Flags | 3 | 0.2857 | 0.2872 | 0.2989 | 0.3011 | 0.3267 | 0.3480 |
Flags | 5 | 0.2989 | 0.3116 | 0.3059 | 0.3059 | 0.2941 | 0.3323 |
Flags | 10 | 0.3165 | 0.3046 | 0.2958 | 0.3066 | 0.3198 | 0.3345 |
Image | 3 | 0.2065 | 0.2015 | 0.2320 | 0.2312 | 0.2117 | 0.2015 |
Image | 5 | 0.1995 | 0.2191 | 0.2255 | 0.2288 | 0.2009 | 0.1892 |
Image | 10 | 0.2145 | 0.2147 | 0.2321 | 0.2327 | 0.2296 | 0.1935 |
Slashdot | 3 | 0.0177 | 0.0179 | 0.0249 | 0.0215 | 0.0202 | 0.0262 |
Slashdot | 5 | 0.0183 | 0.0179 | 0.0217 | 0.0215 | 0.0209 | 0.0270 |
Slashdot | 10 | 0.0177 | 0.0177 | 0.0215 | 0.0213 | 0.0215 | 0.0232 |
Business | 3 | 0.0267 | 0.0273 | 0.0283 | 0.0265 | 0.0281 | 0.0266 |
Business | 5 | 0.0270 | 0.0272 | 0.0283 | 0.0274 | 0.0283 | 0.0275 |
Business | 10 | 0.0271 | 0.0272 | 0.0277 | 0.0264 | 0.0284 | 0.0278 |
Education | 3 | 0.0400 | 0.0402 | 0.0418 | 0.0399 | 0.0426 | 0.0422 |
Education | 5 | 0.0400 | 0.0404 | 0.0422 | 0.0420 | 0.0422 | 0.0426 |
Education | 10 | 0.0401 | 0.0404 | 0.0419 | 0.0417 | 0.0425 | 0.0426 |
Average | - | 1.8750 | 2.6250 | 3.8958 | 4.0417 | 4.2917 | 4.2708 |
Table VI: Ranking loss (↓)
Datasets | Clients | FedCMFS | MB-MCF | GLFS | PDMFS | GRROOR | PMFS |
---|---|---|---|---|---|---|---|
CHD_49 | 3 | 0.2277 | 0.2369 | 0.2373 | 0.2547 | 0.2473 | 0.2459 |
CHD_49 | 5 | 0.2177 | 0.2442 | 0.2393 | 0.2603 | 0.2544 | 0.2532 |
CHD_49 | 10 | 0.2369 | 0.2452 | 0.2384 | 0.2623 | 0.2588 | 0.2505 |
VirusGo | 3 | 0.0429 | 0.0449 | 0.2324 | 0.2536 | 0.2544 | 0.2506 |
VirusGo | 5 | 0.0352 | 0.0445 | 0.2426 | 0.2536 | 0.2526 | 0.2464 |
VirusGo | 10 | 0.0352 | 0.0457 | 0.2336 | 0.2541 | 0.2538 | 0.2509 |
Yeast | 3 | 0.1761 | 0.1781 | 0.1284 | 0.1929 | 0.1869 | 0.1332 |
Yeast | 5 | 0.1773 | 0.1763 | 0.1318 | 0.1946 | 0.1901 | 0.1300 |
Yeast | 10 | 0.1717 | 0.1757 | 0.1318 | 0.1966 | 0.1889 | 0.1318 |
Flags | 3 | 0.2087 | 0.2126 | 0.2079 | 0.1828 | 0.2535 | 0.2648 |
Flags | 5 | 0.1910 | 0.2325 | 0.2159 | 0.1894 | 0.2402 | 0.2571 |
Flags | 10 | 0.2354 | 0.2368 | 0.1945 | 0.2027 | 0.2479 | 0.2459 |
Image | 3 | 0.2304 | 0.2394 | 0.3958 | 0.4702 | 0.2685 | 0.3021 |
Image | 5 | 0.2281 | 0.2445 | 0.4039 | 0.4520 | 0.2429 | 0.2518 |
Image | 10 | 0.2254 | 0.2451 | 0.4432 | 0.4612 | 0.2883 | 0.2608 |
Slashdot | 3 | 0.0381 | 0.0374 | 0.0427 | 0.0430 | 0.0423 | 0.0431 |
Slashdot | 5 | 0.0375 | 0.0382 | 0.0427 | 0.0432 | 0.0424 | 0.0422 |
Slashdot | 10 | 0.0395 | 0.0377 | 0.0428 | 0.0435 | 0.0428 | 0.0423 |
Business | 3 | 0.0403 | 0.0413 | 0.0432 | 0.0404 | 0.0436 | 0.0422 |
Business | 5 | 0.0397 | 0.0411 | 0.0439 | 0.0418 | 0.0438 | 0.0434 |
Business | 10 | 0.0406 | 0.0412 | 0.0431 | 0.0403 | 0.0441 | 0.0433 |
Education | 3 | 0.0867 | 0.0873 | 0.0936 | 0.0890 | 0.0978 | 0.0936 |
Education | 5 | 0.0876 | 0.0873 | 0.0931 | 0.0953 | 0.0964 | 0.0951 |
Education | 10 | 0.0866 | 0.0874 | 0.0932 | 0.0968 | 0.0955 | 0.0949 |
Average | - | 1.6667 | 2.5417 | 3.3542 | 4.6667 | 4.8542 | 3.9167 |
Table VII: Macro-F1 (↑)
Datasets | Clients | FedCMFS | MB-MCF | GLFS | PDMFS | GRROOR | PMFS |
---|---|---|---|---|---|---|---|
CHD_49 | 3 | 0.4736 | 0.4245 | 0.3956 | 0.3473 | 0.4007 | 0.3564 |
CHD_49 | 5 | 0.4855 | 0.4418 | 0.4287 | 0.3908 | 0.3403 | 0.3851 |
CHD_49 | 10 | 0.4619 | 0.4277 | 0.4310 | 0.4199 | 0.3543 | 0.3711 |
VirusGo | 3 | 0.5862 | 0.6125 | 0.0000 | 0.0108 | 0.0217 | 0.0000 |
VirusGo | 5 | 0.6743 | 0.6124 | 0.0000 | 0.0195 | 0.0000 | 0.0000 |
VirusGo | 10 | 0.6743 | 0.5477 | 0.0103 | 0.0130 | 0.0065 | 0.0000 |
Yeast | 3 | 0.3545 | 0.3472 | 0.4237 | 0.2755 | 0.3080 | 0.4017 |
Yeast | 5 | 0.3508 | 0.3467 | 0.3961 | 0.2759 | 0.2911 | 0.4052 |
Yeast | 10 | 0.3530 | 0.3472 | 0.3957 | 0.2699 | 0.2949 | 0.4066 |
Flags | 3 | 0.5470 | 0.4900 | 0.4449 | 0.4785 | 0.4944 | 0.4431 |
Flags | 5 | 0.5619 | 0.4700 | 0.4349 | 0.4608 | 0.4937 | 0.4290 |
Flags | 10 | 0.3951 | 0.4792 | 0.4628 | 0.4502 | 0.4768 | 0.4312 |
Image | 3 | 0.4598 | 0.4418 | 0.1491 | 0.1448 | 0.4346 | 0.3731 |
Image | 5 | 0.5032 | 0.4334 | 0.1825 | 0.1362 | 0.4441 | 0.4375 |
Image | 10 | 0.4593 | 0.4300 | 0.1508 | 0.1501 | 0.3769 | 0.4352 |
Slashdot | 3 | 0.0833 | 0.0795 | 0.0356 | 0.0393 | 0.0408 | 0.0357 |
Slashdot | 5 | 0.0514 | 0.0856 | 0.0387 | 0.0392 | 0.0395 | 0.0333 |
Slashdot | 10 | 0.0679 | 0.0873 | 0.0389 | 0.0393 | 0.0393 | 0.0374 |
Business | 3 | 0.0941 | 0.0943 | 0.0641 | 0.0741 | 0.0644 | 0.0630 |
Business | 5 | 0.0998 | 0.0954 | 0.0623 | 0.0681 | 0.0647 | 0.0549 |
Business | 10 | 0.0991 | 0.0984 | 0.0658 | 0.0727 | 0.0583 | 0.0624 |
Education | 3 | 0.0780 | 0.0744 | 0.0594 | 0.0571 | 0.0395 | 0.0511 |
Education | 5 | 0.0749 | 0.0773 | 0.0559 | 0.0387 | 0.0464 | 0.0503 |
Education | 10 | 0.0830 | 0.0759 | 0.0587 | 0.0399 | 0.0475 | 0.0495 |
Average | - | 1.6667 | 2.2500 | 3.9792 | 4.5208 | 4.0625 | 4.5208 |
Table VIII: Micro-F1 (↑)
Datasets | Clients | FedCMFS | MB-MCF | GLFS | PDMFS | GRROOR | PMFS |
---|---|---|---|---|---|---|---|
CHD_49 | 3 | 0.6599 | 0.6242 | 0.5993 | 0.5988 | 0.6217 | 0.5926 |
CHD_49 | 5 | 0.6623 | 0.6258 | 0.6368 | 0.6075 | 0.5837 | 0.6060 |
CHD_49 | 10 | 0.6532 | 0.6193 | 0.6250 | 0.6313 | 0.5965 | 0.6059 |
VirusGo | 3 | 0.8757 | 0.8746 | 0.0000 | 0.0249 | 0.0498 | 0.0000 |
VirusGo | 5 | 0.8958 | 0.8684 | 0.0000 | 0.0449 | 0.0000 | 0.0000 |
VirusGo | 10 | 0.8958 | 0.8580 | 0.0143 | 0.0299 | 0.0150 | 0.0000 |
Yeast | 3 | 0.6316 | 0.6275 | 0.6998 | 0.5804 | 0.5997 | 0.6891 |
Yeast | 5 | 0.6276 | 0.6239 | 0.6887 | 0.5819 | 0.5917 | 0.6877 |
Yeast | 10 | 0.6287 | 0.6274 | 0.6870 | 0.5766 | 0.5922 | 0.6898 |
Flags | 3 | 0.7032 | 0.6749 | 0.6583 | 0.6774 | 0.6444 | 0.6132 |
Flags | 5 | 0.6909 | 0.6604 | 0.6452 | 0.6636 | 0.6756 | 0.6438 |
Flags | 10 | 0.6230 | 0.6673 | 0.6637 | 0.6631 | 0.6534 | 0.6326 |
Image | 3 | 0.4643 | 0.4450 | 0.1522 | 0.1637 | 0.4378 | 0.4106 |
Image | 5 | 0.5006 | 0.4156 | 0.1925 | 0.1525 | 0.4688 | 0.4918 |
Image | 10 | 0.4211 | 0.4237 | 0.1583 | 0.1644 | 0.3718 | 0.4717 |
Slashdot | 3 | 0.7848 | 0.7793 | 0.6750 | 0.7628 | 0.7722 | 0.6476 |
Slashdot | 5 | 0.7807 | 0.7819 | 0.7464 | 0.7619 | 0.7638 | 0.6257 |
Slashdot | 10 | 0.7864 | 0.7853 | 0.7502 | 0.7627 | 0.7617 | 0.7197 |
Business | 3 | 0.7028 | 0.6954 | 0.6760 | 0.6943 | 0.6804 | 0.6910 |
Business | 5 | 0.7011 | 0.6957 | 0.6761 | 0.6847 | 0.6774 | 0.6812 |
Business | 10 | 0.7007 | 0.6970 | 0.6810 | 0.6944 | 0.6757 | 0.6807 |
Education | 3 | 0.2501 | 0.2442 | 0.1594 | 0.2144 | 0.1168 | 0.1479 |
Education | 5 | 0.2454 | 0.2350 | 0.1479 | 0.1217 | 0.1320 | 0.1368 |
Education | 10 | 0.2582 | 0.2394 | 0.1657 | 0.1315 | 0.1291 | 0.1339 |
Average | - | 1.5833 | 2.5000 | 3.9792 | 4.0417 | 4.4167 | 4.3958 |
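The Average row of the result tables is an average rank over the dataset and client combinations. A minimal sketch of how such a row can be computed is given below; the function name is hypothetical, and giving tied methods the mean of the tied positions is an assumption, although the fractional averages reported (for example 3.8958) are consistent with it.

```python
from typing import List, Sequence


def average_ranks(rows: Sequence[Sequence[float]], higher_is_better: bool) -> List[float]:
    """Average rank of each method (column) over all rows; rank 1 is best."""
    k = len(rows[0])
    totals = [0.0] * k
    for row in rows:
        keyed = [-v if higher_is_better else v for v in row]
        order = sorted(range(k), key=lambda j: keyed[j])
        i = 0
        while i < k:
            j = i
            # Extend j over a run of tied values.
            while j + 1 < k and keyed[order[j + 1]] == keyed[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
            for t in range(i, j + 1):
                totals[order[t]] += shared
            i = j + 1
    return [t / len(rows) for t in totals]


# CHD_49 rows of the average-precision results (3, 5, and 10 clients);
# columns: FedCMFS, MB-MCF, GLFS, PDMFS, GRROOR, PMFS.
chd49_ap = [
    [0.7722, 0.7708, 0.7707, 0.7537, 0.7634, 0.7693],
    [0.7783, 0.7647, 0.7671, 0.7577, 0.7567, 0.7608],
    [0.7732, 0.7664, 0.7685, 0.7525, 0.7571, 0.7607],
]
chd49_ranks = average_ranks(chd49_ap, higher_is_better=True)
```

On these three rows FedCMFS is ranked first each time, so its average rank is 1.0; applying the same function to all 24 rows of a table yields that table's Average row under the stated tie-handling assumption.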
The purpose of these experiments is to evaluate the performance of the FedCMFS algorithm on various metrics and datasets. The results show that FedCMFS achieves the best average rank on all six evaluation metrics. In most cases, its performance is very close to that of the MB-MCF algorithm. FedCMFS performs slightly worse on the Yeast and Flags datasets, where it mostly remains in the top three, but it performs well on high-dimensional datasets such as Slashdot, Business, and Education. This difference in performance can be attributed to the effect of data sparsity on the Yeast and Flags datasets. This sparsity reduces the accuracy of the statistical methods used for conditional independence tests, which affects the ability of FedCMFS to accurately learn the causal structure. In contrast, on high-dimensional datasets, FedCMFS effectively identifies the three different types of correlations between labels and features, which explains its superior performance.
V-F Parameter Sensitivity Analysis
This section investigates the impact of the parameters of FedCFR and FedCFC on FedCMFS using two datasets, Flags and Image. The analysis covers different numbers of clients, with detailed results presented in Figure 5.
In the case of the low-dimensional Flags dataset, it was found that the parameter has a relatively small effect on the results, while has a negligible effect on the results. Specifically, when the number of clients is 3 and 5, setting in the range can obtain better experimental results. In contrast, for 10 customers, the optimal range of . This variation in results is attributed to the limited number of features in the Flags dataset, resulting in fewer feature bases filtered out by FedCFR and FedCFC, thus reducing the impact of and . Therefore, it may be more effective to use only FedCFL and FedCFR for feature selection. In addition, the experimental results are affected by the number of clients, with an inversion of the effective parameter range at 10 clients. This variation may be caused by the uneven distribution of samples in the smaller dataset when simulating the federated environment.
In the case of the high-dimensional Image dataset, the parameters and have a significant impact on the experimental results. In all three client scenarios, the results improve significantly when is set within and is set within . This sensitivity is attributed to the large number of features of the dataset, which allows FedCFR and FedCFC to effectively process and screen more features due to the complex associations present in the high-dimensional data. By correcting the top 10% to 30% of possible erroneous features, FedCFR achieves the best performance. However, increasing the number of corrected features may lead to counterproductive adjustments, caused by excessive noise and strict symmetry constraints. Moreover, the complexity of the image dataset means that the data distribution is unlikely to be affected by the sampling method used to simulate the federated environment, making the number of clients irrelevant to the experimental results. The consistency of parameter sensitivities across different clients further demonstrates the effectiveness of FedCFR and FedCFC in the federated context.
V-G Statistical Hypothesis Testing
To further establish the superiority of FedCMFS over prior methods, we conduct the Friedman test (at significance level $\alpha = 0.05$) on the six metrics [35]. Table IX shows the specific results. We observe that the Friedman statistic values on all metrics are higher than the critical value, which means that the null hypothesis of no significant difference among the algorithms is rejected.
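Table IX: Friedman statistics on the six evaluation metrics
Metric | $F_F$ | Critical value ($\alpha = 0.05$) |
---|---|---|
AvP | 18.2674 | 2.293 |
Cov | 25.0398 | 2.293 |
HaL | 9.4003 | 2.293 |
RaL | 17.9430 | 2.293 |
Macro-F1 | 17.4678 | 2.293 |
Micro-F1 | 15.0658 | 2.293 |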
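As a sanity check on the Friedman statistics, the sketch below computes the Friedman chi-square and its F-distributed variant as described by Demšar [35] from a vector of average ranks. That the paper uses this variant with N = 24 (8 datasets times 3 client settings) is an assumption; it is supported only by the fact that the average ranks reported for average precision give a value close to the reported 18.2674.

```python
from typing import Sequence, Tuple


def friedman_statistics(avg_ranks: Sequence[float], n_datasets: int) -> Tuple[float, float]:
    """Return (chi-square, F_F) for k methods compared on n_datasets problems."""
    k = len(avg_ranks)
    n = n_datasets
    chi2 = 12.0 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0
    )
    # F_F follows an F distribution with (k - 1) and (k - 1)(n - 1) degrees of freedom.
    f_f = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_f


# Average ranks of the six methods for average precision; N = 24 is an assumption.
ap_ranks = [1.5833, 2.7083, 3.2500, 4.4167, 5.0000, 4.0417]
chi2_ap, ff_ap = friedman_statistics(ap_ranks, n_datasets=24)
# ff_ap is roughly 18.27, well above the critical value 2.293.
```

The critical value itself is taken from the paper rather than recomputed here, which keeps the sketch free of external dependencies.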
Since the null hypothesis is rejected, we further employ the Nemenyi test [35] as a post-hoc test. The Nemenyi test indicates a significant difference in the performance of two methods if the difference between their mean ranks exceeds a critical difference (CD). The results are shown in Figure 6, where each rank is sequentially marked on the axis and the lowest is on the right. Notably, FedCMFS achieves the lowest rank across all metrics and significantly outperforms other methods.
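The critical difference used by the Nemenyi test can be computed as in the following sketch, which follows the formula and the critical values at $\alpha = 0.05$ tabulated by Demšar [35]. The function names are hypothetical, and N = 24 (8 datasets times 3 client settings) is an assumption about the experimental setup rather than a value stated in the paper.

```python
import math
from typing import Optional

# Critical values q_alpha for the two-tailed Nemenyi test at alpha = 0.05,
# indexed by the number of compared methods k (values as tabulated in [35]).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def nemenyi_cd(k: int, n_datasets: int, q_alpha: Optional[float] = None) -> float:
    """Critical difference CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    q = Q_ALPHA_005[k] if q_alpha is None else q_alpha
    return q * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


def significantly_different(rank_a: float, rank_b: float, cd: float) -> bool:
    """Two methods differ significantly if their mean ranks differ by more than CD."""
    return abs(rank_a - rank_b) > cd


cd = nemenyi_cd(k=6, n_datasets=24)
# Average-precision mean ranks of FedCMFS (1.5833) and PDMFS (4.4167).
fedcmfs_vs_pdmfs = significantly_different(1.5833, 4.4167, cd)
```

With six methods and 24 comparisons the critical difference is about 1.54, so a rank gap of roughly 2.83, as between FedCMFS and PDMFS on average precision, counts as significant under these assumptions.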
V-H Acceleration Effect
To evaluate the effectiveness of the acceleration methods, we compare the runtime of FedCMFS before and after applying them. As shown in Table X, the acceleration effect is particularly noticeable for high-dimensional datasets such as Slashdot and Business. However, for the low-dimensional dataset Flags with 3 and 5 clients, the runtime after acceleration is longer than before. This anomaly is due to the relatively high overhead of GPU initialization and communication, which outweighs the savings in computation time. Next, we compare the runtime of the accelerated FedCMFS against that of the other algorithms. The results are summarized in Table XI. After acceleration, the runtime of FedCMFS is close to that of PDMFS and lower than that of MB-MCF on every dataset except Flags, and FedCMFS performs well on high-dimensional datasets.
Table X: Runtime of FedCMFS before and after acceleration
Datasets | Clients | Original FedCMFS (s) | Accelerated FedCMFS (s) | Speedup |
---|---|---|---|---|
VirusGo | 3 | 35.0103 | 6.1075 | 5.7323 |
VirusGo | 5 | 63.6809 | 9.5575 | 6.6630 |
VirusGo | 10 | 114.7019 | 18.0596 | 6.3513 |
Flags | 3 | 2.6883 | 3.0977 | 0.8678 |
Flags | 5 | 3.0073 | 3.5928 | 0.8370 |
Flags | 10 | 8.0959 | 4.9475 | 1.6364 |
Image | 3 | 107.6387 | 10.9457 | 9.8339 |
Image | 5 | 296.3440 | 20.4531 | 14.4890 |
Image | 10 | 503.1186 | 27.3849 | 18.3721 |
Slashdot | 3 | 5121.2536 | 62.9691 | 81.3296 |
Slashdot | 5 | 8985.2121 | 103.4271 | 86.8748 |
Slashdot | 10 | 18305.3374 | 195.4889 | 93.6388 |
Business | 3 | 4323.8704 | 103.6646 | 41.7102 |
Business | 5 | 6527.8086 | 144.6297 | 45.1346 |
Business | 10 | 16983.4118 | 337.8335 | 50.2715 |
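The Speedup column is the ratio of the original runtime to the accelerated runtime, so values below 1 indicate a slowdown. A minimal sketch, with a hypothetical function name, applied to two rows of the runtime comparison:

```python
def speedup(original_seconds: float, accelerated_seconds: float) -> float:
    """Ratio of runtime before acceleration to runtime after acceleration."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated runtime must be positive")
    return original_seconds / accelerated_seconds


virusgo_3 = speedup(35.0103, 6.1075)  # VirusGo, 3 clients
flags_3 = speedup(2.6883, 3.0977)     # Flags, 3 clients: below 1, i.e. slower
```

The Flags value below 1 reflects the fixed overhead described in the text: when the computation itself takes only a few seconds, the start-up cost dominates.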
Table XI: Runtime (s) of the accelerated FedCMFS and the comparison algorithms
Datasets | Clients | FedCMFS | MB-MCF | GLFS | PDMFS | GRROOR | PMFS |
---|---|---|---|---|---|---|---|
VirusGo | 3 | 6.1075 | 7.2146 | 2.6246 | 42.1955 | 17.7764 | 1.1384 |
VirusGo | 5 | 9.5575 | 10.8515 | 2.3958 | 79.6399 | 29.3349 | 0.6189 |
VirusGo | 10 | 18.0596 | 19.6982 | 5.1917 | 149.7295 | 63.3173 | 1.3123 |
Flags | 3 | 3.0977 | 0.3931 | 0.1060 | 0.2187 | 0.2147 | 0.0837 |
Flags | 5 | 3.5928 | 0.5043 | 0.1291 | 0.2988 | 0.5543 | 0.0270 |
Flags | 10 | 4.9475 | 1.3031 | 0.3207 | 0.7390 | 0.8962 | 0.1349 |
Image | 3 | 10.9457 | 102.3162 | 2.2511 | 44.1345 | 4.0944 | 0.1340 |
Image | 5 | 20.4531 | 221.9948 | 4.7947 | 86.7157 | 8.0327 | 0.2245 |
Image | 10 | 27.3849 | 370.4468 | 8.7092 | 166.2046 | 14.4684 | 0.4853 |
Slashdot | 3 | 62.9691 | 1872.0302 | 9.3750 | 597.7361 | 66.8130 | 1.5003 |
Slashdot | 5 | 103.4271 | 2579.8462 | 20.7131 | 900.6759 | 104.1713 | 2.3110 |
Slashdot | 10 | 195.4889 | 5953.3369 | 28.4496 | 1842.9548 | 228.8878 | 4.9612 |
Business | 3 | 103.6646 | 945.8063 | 2.3636 | 117.6500 | 10.2650 | 0.2526 |
Business | 5 | 144.6297 | 1504.0571 | 2.8750 | 191.7148 | 17.1086 | 0.4465 |
Business | 10 | 337.8335 | 3096.6454 | 5.5434 | 408.8154 | 36.7804 | 0.9499 |
VI Conclusion and Further Work
To solve the problem of causal multi-label feature selection in the federated setting, this paper proposes the FedCMFS algorithm, which is based on a local causal structure learning method and the horizontal federated learning framework.
The experimental results show that FedCMFS achieves the best overall results among the compared methods in the federated setting. First, FedCMFS directly determines the parent-child relationships of variables by mining the causal relationships between them, which provides good interpretability. Second, FedCMFS operates in a distributed data environment and maintains data privacy during the transmission of encrypted semantics. Finally, FedCMFS effectively corrects the effects of noise and of differences in client data quality on algorithm performance. However, some issues still need further investigation. We observe that FedCMFS, which is based on statistical methods, may learn incorrect causal relationships when a dataset contains only a small number of samples, leading to incorrectly selected features. Therefore, exploring federated causal multi-label feature selection on small-sized datasets is a promising research direction [36].
References
- [1] M.-L. Zhang and Z.-H. Zhou, “A review on multi-label learning algorithms,” IEEE transactions on knowledge and data engineering, vol. 26, no. 8, pp. 1819–1837, 2013.
- [2] Y. Tian, K. Bai, X. Yu, and S. Zhu, “Causal multi-label learning for image classification,” Neural Networks, vol. 167, pp. 626–637, 2023.
- [3] S. Kashef, H. Nezamabadi-pour, and B. Nikpour, “Multilabel feature selection: A comprehensive review and guiding experiments,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 2, p. e1240, 2018.
- [4] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu, “Feature selection: A data perspective,” ACM computing surveys (CSUR), vol. 50, no. 6, pp. 1–45, 2017.
- [5] X. Wu, B. Jiang, K. Yu, H. Chen, and C. Miao, “Multi-label causal feature selection,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 04, 2020, pp. 6430–6437.
- [6] J. Pearl, Causality. Cambridge university press, 2009.
- [7] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics. PMLR, 2017, pp. 1273–1282.
- [8] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: Concept and applications,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 10, no. 2, pp. 1–19, 2019.
- [9] R. Cai, Z. Zhang, and Z. Hao, “Bassum: A bayesian semi-supervised method for classification feature selection,” Pattern Recognition, vol. 44, no. 4, pp. 811–820, 2011.
- [10] K. Yu, X. Guo, L. Liu, J. Li, H. Wang, Z. Ling, and X. Wu, “Causality-based feature selection: Methods and evaluations,” ACM Computing Surveys (CSUR), vol. 53, no. 5, pp. 1–36, 2020.
- [11] R. B. Pereira, A. Plastino, B. Zadrozny, and L. H. Merschmann, “Categorizing feature selection methods for multi-label classification,” Artificial intelligence review, vol. 49, pp. 57–78, 2018.
- [12] J. Lee and D.-W. Kim, “Fast multi-label feature selection based on information-theoretic feature ranking,” Pattern Recognition, vol. 48, no. 9, pp. 2761–2771, 2015.
- [13] ——, “Scls: Multi-label feature selection based on scalable criterion for large label set,” Pattern Recognition, vol. 66, pp. 342–352, 2017.
- [14] J. Dai, W. Huang, C. Zhang, and J. Liu, “Multi-label feature selection by strongly relevant label gain and label mutual aid,” Pattern Recognition, vol. 145, p. 109945, 2024.
- [15] Z. Ma, F. Nie, Y. Yang, J. R. Uijlings, and N. Sebe, “Web image annotation via subspace-sparsity collaborated feature selection,” IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 1021–1030, 2012.
- [16] J. Huang, G. Li, Q. Huang, and X. Wu, “Joint feature selection and classification for multilabel learning,” IEEE transactions on cybernetics, vol. 48, no. 3, pp. 876–889, 2017.
- [17] M. Faraji, S. A. Seyedi, F. A. Tab, and R. Mahmoodi, “Multi-label feature selection with global and local label correlation,” Expert Systems with Applications, vol. 246, p. 123198, 2024.
- [18] R. Huang, W. Jiang, and G. Sun, “Manifold-based constraint laplacian score for multi-label feature selection,” Pattern Recognition Letters, vol. 112, pp. 346–352, 2018.
- [19] Z. Cai and W. Zhu, “Multi-label feature selection via feature manifold learning and sparsity regularization,” International journal of machine learning and cybernetics, vol. 9, pp. 1321–1334, 2018.
- [20] J. Zhang, Z. Luo, C. Li, C. Zhou, and S. Li, “Manifold regularized discriminative feature selection for multi-label learning,” Pattern Recognition, vol. 95, pp. 136–150, 2019.
- [21] Y. Fan, J. Liu, J. Tang, P. Liu, Y. Lin, and Y. Du, “Learning correlation information for multi-label feature selection,” Pattern Recognition, vol. 145, p. 109899, 2024.
- [22] Y. Hu, Y. Zhang, D. Gong, and X. Sun, “Multi-participant federated feature selection algorithm with particle swarm optimization for imbalanced data under privacy protection,” IEEE Transactions on Artificial Intelligence, 2022.
- [23] S. Banerjee, E. Elmroth, and M. Bhuyan, “Fed-fis: A novel information-theoretic federated feature selection for learning stability,” in International Conference on Neural Information Processing. Springer, 2021, pp. 480–487.
- [24] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan kaufmann, 1988.
- [25] K. Yu, L. Liu, and J. Li, “A unified view of causal and non-causal feature selection,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 15, no. 4, pp. 1–46, 2021.
- [26] C. F. Aliferis, A. Statnikov, I. Tsamardinos, S. Mani, and X. D. Koutsoukos, “Local causal and markov blanket induction for causal discovery and feature selection for classification part i: algorithms and empirical evaluation,” Journal of Machine Learning Research, vol. 11, no. 1, 2010.
- [27] N. Friedman, M. Linial, I. Nachman, and D. Pe’er, “Using bayesian networks to analyze expression data,” in Proceedings of the fourth annual international conference on Computational molecular biology, 2000, pp. 127–135.
- [28] J. Peng, P. Wang, N. Zhou, and J. Zhu, “Partial correlation estimation by joint sparse regression models,” Journal of the American Statistical Association, vol. 104, no. 486, pp. 735–746, 2009.
- [29] R. E. Neapolitan et al., Learning bayesian networks. Pearson Prentice Hall Upper Saddle River, 2004, vol. 38.
- [30] J. Zhang, H. Wu, M. Jiang, J. Liu, S. Li, Y. Tang, and J. Long, “Group-preserving label-specific feature selection for multi-label learning,” Expert Systems with Applications, vol. 213, p. 118861, 2023.
- [31] J. Miao, Y. Wang, Y. Cheng, and F. Chen, “Parallel dual-channel multi-label feature selection,” Soft Computing, vol. 27, no. 11, pp. 7115–7130, 2023.
- [32] J. Zhang, Y. Lin, M. Jiang, S. Li, Y. Tang, and K. C. Tan, “Multi-label feature selection via global relevance and redundancy optimization,” in IJCAI, 2020, pp. 2512–2518.
- [33] A. Hashemi, M. B. Dowlatshahi, and H. Nezamabadi-pour, “An efficient pareto-based feature selection algorithm for multi-label classification,” Information Sciences, vol. 581, pp. 428–447, 2021.
- [34] M.-L. Zhang and Z.-H. Zhou, “Ml-knn: A lazy learning approach to multi-label learning,” Pattern recognition, vol. 40, no. 7, pp. 2038–2048, 2007.
- [35] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” The Journal of Machine learning research, vol. 7, pp. 1–30, 2006.
- [36] G. Xiang, H. Wang, K. Yu, X. Guo, F. Cao, and Y. Song, “Bootstrap-based layer-wise refining for causal structure learning,” IEEE Transactions on Artificial Intelligence, 2023.