Multi-VFL: A Vertical Federated Learning System for Multiple Data and Label Owners

Vaikkunth Mugunthan vaik@mit.edu Massachusetts Institute of TechnologyCambridgeMAUSA , Pawan Goyal pawan14@mit.edu Massachusetts Institute of TechnologyCambridgeMAUSA and Lalana Kagal lkagal@mit.edu Massachusetts Institute of TechnologyCambridgeMAUSA

Abstract.

Vertical Federated Learning (VFL) refers to the collaborative training of a model on a dataset where the features of the dataset are split among multiple data owners, while label information is owned by a single label owner. In this paper, we propose a novel method, Multi Vertical Federated Learning (Multi-VFL), to train VFL models when there are multiple data and label owners. Our approach is the first to consider the setting where $D$ -data owners (across which features are distributed) and $K$ -label owners (across which labels are distributed) exist. This proposed configuration allows different entities to train and learn optimal models without having to share their data. Our framework makes use of split learning and adaptive federated optimizers to solve this problem. For empirical evaluation, we run experiments on the MNIST and FashionMNIST datasets. Our results show that using adaptive optimizers for model aggregation fastens convergence and improves accuracy.

Federated Learning, Vertical Federated Learning, Deep Learning, Split Learning

^†^†ccs: Computing methodologies Machine learning^†^†ccs: Computing methodologies Distributed artificial intelligence^†^†ccs: Computing methodologies Modeling and simulation

1. Introduction

Deep learning models benefit from large training datasets. However, for high-quality models, training needs to happen on large diverse datasets which usually are spread across multiple organizations. Deep learning solutions are extremely useful for for several domains including healthcare organizations and financial companies. While dealing with sensitive data (e.g.) patient information, credit card details, etc., government regulations such as GDPR (Voigt and Von dem Bussche, 2017) and CCPA (Pardau, 2018) restrict how entities can share data with other organizations. To solve this problem, Federated Learning (FL) was proposed. FL (Konečnỳ et al., 2016; McMahan et al., 2017) is a collaborative learning paradigm that allows distributed clients to train machine learning models without having to share their sensitive data.

There are two main categories of FL. In horizontal FL, clients have the same set of features for a different set of clients, whereas, vertical FL handles the case when multiple clients possess the data of the same set of individuals, but each client has a unique set of features. There has been a lot of research in horizontal FL, however, there are few approaches for vertical FL (Liu et al., 2019; Nock et al., 2018). Split learning is the most commonly used approach to address vertical FL (Vepakomma et al., 2018; Ceballos et al., 2020; Angelou et al., 2020). Split learning is a technique to collaboratively train a deep learning network that is split across different clients.

Usually, vertical FL is used to solve the problem when dataset features are distributed across multiple data owners and there exists a single label owner. However, in reality, there might be multiple label owners. In this paper, we consider the setting where $D$ data owners (across which dataset features are distributed) and $K$ label owners (across which dataset labels are distributed) exist. For example, in the case of healthcare, the $D$ data owners may correspond to different specialist hospitals (heart disease hospital, lung disease hospital, cancer hospital, etc.) that have different data for the same set of patients and $K$ label owners correspond to COVID-testing centers that have the COVID results corresponding to these patients.

To solve this problem, we make use of split learning and adaptive optimizer-based horizontal FL algorithms. Split learning solves the problem when $D$ data owners and $1$ label owner exists. However, when we have to aggregate the results from $K$ label owners, we need an aggregation mechanism. The most widely used FL algorithm for aggregation is FedAvg (McMahan et al., 2017), where the weights of the models obtained from different label owners are just averaged. FedAvg is successful when label owners have IID (independent and identically distributed) data. However, in the real-world, label owners have non-IID data leading to client drift (Hsu et al., 2019), which is what happens when gradients computed by different label owners are skewed causing local models to move away from globally optimal models. This substantially affects the performance of FedAvg, especially in scenarios where label owners have a high degree of variance in their data distributions. Under such scenarios, adaptive learning rates and momentum are beneficial as they incorporate knowledge from prior iterations. When participating label owners have sparse data distributions and possess a limited subset of labels, SGD with server momentum improves the FL performance (Hsu et al., 2019). (Reddi et al., 2020) proposed federated versions of adaptive optimizers and showed that these optimizers improved the performance of FL under non-iid settings. To improve the performance of Adam in local settings, (Chen et al., 2019) applied a decaying momentum (Demon) rule to Adam. We extend this approach to the federated setting.

An entity resolution protocol such as Private Set Intersection (PSI) (Angelou et al., 2020) is used to identify intersecting sets of individuals across different data owners and label owners by comparing the encrypted versions of the sets.

2. Related Work

Vertical FL is a distributed learning paradigm on vertically partitioned datasets. Conventional approaches for vertical FL make use of Multi-Party Computation and Homomorphic Encryption (Hardy et al., 2017). However, these approaches suffer from performance and communication problems. To address these issues, split learning was proposed. The primary benefit of using split learning is that participants don’t have to share their sensitive data with each other while learning the shared model.

(Ceballos et al., 2020) proposed a VFL technique using split neural networks for $D$ data owners and $1$ label owner. In their approach, each data owner trains a partial network up to a cut layer and sends the activations (outputs) of the cut layer to the label owner. The label owner concatenates the activations coming from different data owners and completes the forward pass. As the label owner has the labels corresponding to the individuals on the datasets possessed by the data owners, it carries out its backpropagation up to the cut layer. The gradients at the cut layer are split and passed to the corresponding data owners. This process repeats for a fixed number of rounds or till convergence is achieved.

SplitFed, an approach that combined FL and split learning, was proposed by (Thapa et al., 2020). Their framework was proposed for $D$ data owners and $1$ label owner as well. In their framework, a 3rd party server aggregates the models of different data owners using Federated Averaging (FedAvg). In addition, data owners update their models sequentially, hence increasing the total time per federated round. Also, they expect data owners to have the same model architecture which is a hard constraint on resource-deficit data owners.

Both papers consider $D$ data owners and $1$ label owner in their setting. In this paper, we propose a solution to train vertical FL models in which $D$ data owners and $K$ label owners exist.

3. Framework Description

The primary goal of our framework is to learn an optimized vertical FL model when $D$ data owners and $K$ label owners exist.

In our framework, each data owner has a unique model architecture and all label owners have the same model architecture. Data owners perform forward propagation on their respective partial neural networks up to the cut layer and send their activations to label owners. Label owners concatenate the activations coming from different data owners and complete their forward pass. Label owners compute the loss, perform back-propagation, and send the corresponding gradients to data owners using which they complete their back-propagation. Label owners send their weights to the aggregation server which aggregates their weights using one of the following techniques: FedAvg (McMahan et al., 2017), FedAdam (Reddi et al., 2020), FedYogi (Reddi et al., 2020), etc. We also extend FedAdam to FedDemonAdam, a federated learning algorithm in which ”Demon(Chen et al., 2019) to Adam (DemonAdam)” is used for the server updates. The aggregation server returns the updated weights to the label owners which they use in the next round. The variables used in the Multi-VFL algorithm (Algorithm 1) are provided in Table 1.

An example real-life use case is depicted in Figure 1. Let us consider two hospitals (data owners) and two COVID testing centers (label owners). Our goal is to develop a model that would be able to predict how likely a patient is to get COVID based on the information available from cancer and lung disease hospitals. Algorithm 1 provides an optimized solution for this scenario.

Variable	Description
$r$	Federated Learning Round
$d,D$	$D$ data owners, each indexed by $d\in{1,..,D}$
$k,K$	$K$ label owners, each indexed by $k\in{1,..,K}$
$\textbf{W}_{k}^{S,r}$	Label Owner-side model of label owner $k$ at round $r$
$\textbf{W}_{d,k}^{C,r}$	Owner-side model of data owner $d$ with label owner $k$ at round $r$
$\textbf{W}_{d,k}^{C,r}$	Owner-side model of data owner $d$ with label owner $k$ at round $r$
$\textbf{A}_{d,k}^{S,r}$	Activations from data owner $d$ to label owner $k$ at round $r$
$\textbf{Y}_{d}$	True labels held by label owner corresponding to individuals in Data owner $d$

Table 1. Variables used in our algorithm.

Algorithm 1 Multi-VFL with DemonAdam / Adam Server Aggregation.

Initialization: Initialize $\textbf{W}_{k}^{S,r},\textbf{W}_{d,k}^{C,r}$ using Gaussian/Xavier initializer, Momentum parameters $\beta_{1},\beta_{2}\in[0,1)$ , Server learning rate $\eta_{s}$ , Stability constant $\mathcal{s}$ .

1:for global round

r=1,...R

2: for each label owner

k\in K

in parallel do

\textbf{A}_{d,k}^{r}\leftarrow

DataOwnerUpdate(d,r)

,\forall d\in D

\textbf{A}_{k}^{r}\leftarrow

Concatenate

\textbf{A}_{d,k}^{r},\forall d\in D

5: Forward propagation with

\textbf{A}_{k}^{r}

\textbf{W}_{k}^{S,r}

and compute

\hskip 30.00005pt\textbf{Y'}_{d}

6: Calculate loss with

\textbf{Y'}_{d}

and

\textbf{Y}_{d}

7: Back-Propagation :

\textbf{W}_{k}^{S,r+1}\leftarrow\textbf{W}_{k}^{S,r}-\eta_{k}\nabla l(\textbf{W}_{k}^{S,r};\textbf{A}_{k}^{r})

8: Send corresponding gradients to data owners by invoking

DataOwnerBackProp

\textbf{A}_{d,k}^{r}

\forall d\in D

\textbf{t}_{k}^{S,r+1}=\textbf{W}_{k}^{S,r+1}-\textbf{W}_{k}^{S,r}

10:

\textbf{W}_{k}^{S,r}\leftarrow

ServerAggregation

(

\textbf{t}_{k}^{S,r+1},opt

)

11: for each data owner

d\in D

in parallel do

12:

\textbf{W}_{d}^{C,r+1}\leftarrow\frac{\sum_{k=1}^{K}\textbf{W}_{d,k}^{C,r}}{K}

13:

\textbf{W}_{d,k}^{C,r+1}\leftarrow\textbf{W}_{d}^{C,r+1},\forall k\in K

$DataOwnerUpdate(d,r):$

\textbf{A}_{d}^{r}=\phi

2:for local epoch

e=1,..E

3: for batch

b\in B

4: Perform Forward propagation on

\textbf{W}_{d}^{C,r}

5: Concatenate final layer activations to

\textbf{A}_{d}^{r}

6:Return

\textbf{A}_{d}^{r}

$DataOwnerBackProp$ (d $\textbf{A}_{d,k}^{r}$ )

1:for batch

b\in B

2: Perform back propagation with d

\textbf{A}_{d,k}^{r}

and update

\textbf{W}_{d}^{C,r}

$ServerAggregation(t_{k}^{S,r},opt)$

1:if

opt=demon

then

\beta_{1}\leftarrow\beta_{1}\frac{(1-\frac{r}{R})}{(1-\beta_{1})+\beta_{1}(1-\frac{r}{R})}

m\leftarrow\beta_{1}m+\textbf{t}_{k}^{S,r}

4:else

m\leftarrow\beta_{1}m+(1-\beta_{1})\textbf{t}_{k}^{S,r}

v\leftarrow\beta_{2}v+(1-\beta_{2})m^{2}

\textbf{W}^{A}_{r+1}=\textbf{W}^{A}_{r}+\frac{\sqrt{1-\beta_{2}^{r}}}{1-\beta_{1}^{r}}\eta_{s}\frac{m}{\sqrt{v}+\mathcal{s}}

8:Return

\textbf{W}^{A}_{r+1}

Refer to caption — Figure 1. Real-Life Use Case for Multi-VFL

4. Experiments

To verify the validity of our proposed model, we ran experiments on two major datasets: MNIST (LeCun et al., 1998) and FashionMNIST (Xiao et al., 2017). Both datasets have 60,000 training data points and 10,000 testing data points with each image as $28\times 28$ . For our experiments, we consider 5 label owners and 4 data owners. The datasets (excluding labels) are vertically partitioned among the data owners, with each of them receiving images of dimension $(\frac{28}{4}=7)\times 28$ . The dataset labels are with the label owners. Each data owner has one convolutional layer with input and output channels as 1 and 32 respectively, and kernel size as 3. The output from the data owners are aggregated to form a tensor with 32 channels which are then passed to the label owner’s model which has one convolutional and two linear layers. All the experiments are run for 500 epochs with a batch size of 64. The local learning rate was set to 0.001. We study four major federated learning aggregation algorithms: FedAvg (Base Case), FedAdam, FedYogi, and FedDemonAdam. The values of the hyperparameter for these algorithms are summarized in Table 2.

Optimizer	beta 1	beta 2	lr	tau
FedAdam, FedYogi FedDemonAdam	0.9	0.99	1e-3	1e-3

Table 2. Optimizer’s Hyperparameters

We also allow different label owners to have different labels, thus mimicking the non-iid nature of the real world. Each label owner is given 5000 data points to train upon. However, some label owners are given an iid dataset, meaning all labels from 0 to 9 are given, while others are given only two labels each. We primarily considered 4 different ways to distribute labels and they are formulated in Table 3.

Scenario	Label Owner 1	Label Owner 2	Label Owner 3	Label Owner 4	Label Owner 5
1niid	0-9	0-9	0-9	0-9	{0,1}
2niid	0-9	0-9	0-9	{0,1}	{2,3}
3niid	0-9	0-9	{0,1}	{2,3}	{4,5}
4niid	0-9	{0,1}	{2,3}	{4,5}	{6,7}

Table 3. non-iid setup

4.1. Variation with number of non-IID servers

We first study the variation of accuracy with increase in the number of non-iid label owners. Primarily, we study the 1niid, 2niid, 3niid and 4niid scenarios and use the FedAvg algorithm to solely demonstrate the effect of non-iid-ness without any optimizers. In FedAvg, the aggregation server averages the weights collected from different label owners and sends them back. The label owners then update their weights. The results for MNIST and FashionMNIST are plotted in Fig 2.

As seen from Fig 2, the accuracy drops with and increase in the number of non-iid label owners. In addition, for the 4niid case, FedAvg fails to converge even after 500 iterations.

4.2. Variation with optimizers

In this experiment, we study the variation of accuracy for the 4niid scenario with different choices of optimizers. The results for MNIST and FashionMNIST are plotted in Fig 3. As clear from the figure, adaptive optimizers perform better and converge faster than the FedAvg algorithm, with greater effect in the FashionMNIST dataset where they improve the accuracy by 2-3%. Also, we see that FedAvg fails to converge for the FashionMNIST dataset.

5. Conclusion and Future Work

To the best of our knowledge, Multi-VFL is the first solution to address vertical federated learning when $D$ data owners and $K$ label owners exist. In addition, we ran experiments on the MNIST and FashionMNIST datasets for different non-IID label distribution scenarios and demonstrated the importance of the adaptive optimizer based aggregation. We plan to integrate differential privacy (Dwork, 2008) into our framework to thwart potential model inversion attacks and plan to conduct a detailed analysis on the privacy-accuracy trade-off.

References

(1)
Angelou et al. (2020) Nick Angelou, Ayoub Benaissa, Bogdan Cebere, William Clark, Adam James Hall, Michael A Hoeh, Daniel Liu, Pavlos Papadopoulos, Robin Roehm, Robert Sandmann, et al. 2020. Asymmetric Private Set Intersection with Applications to Contact Tracing and Private Vertical Federated Machine Learning. arXiv preprint arXiv:2011.09350 (2020).
Ceballos et al. (2020) Iker Ceballos, Vivek Sharma, Eduardo Mugica, Abhishek Singh, Alberto Roman, Praneeth Vepakomma, and Ramesh Raskar. 2020. SplitNN-driven Vertical Partitioning. arXiv preprint arXiv:2008.04137 (2020).
Chen et al. (2019) John Chen, Cameron Wolfe, Zhao Li, and Anastasios Kyrillidis. 2019. Demon: Momentum Decay for Improved Neural Network Training. arXiv preprint arXiv:1910.04952 (2019).
Dwork (2008) Cynthia Dwork. 2008. Differential privacy: A survey of results. In International conference on theory and applications of models of computation. Springer, 1–19.
Hardy et al. (2017) Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Richard Nock, Giorgio Patrini, Guillaume Smith, and Brian Thorne. 2017. Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. arXiv preprint arXiv:1711.10677 (2017).
Hsu et al. (2019) Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. 2019. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335 (2019).
Konečnỳ et al. (2016) Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016).
LeCun et al. (1998) Yann LeCun, Corinna Cortes, and Christopher JC Burges. 1998. The MNIST database of handwritten digits, 1998. URL http://yann. lecun. com/exdb/mnist 10 (1998), 34.
Liu et al. (2019) Yang Liu, Yan Kang, Liping Li, Xinwei Zhang, Yong Cheng, Tianjian Chen, Mingyi Hong, and Qiang Yang. 2019. A communication efficient vertical federated learning framework. Unknown Journal (2019).
McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics. PMLR, 1273–1282.
Nock et al. (2018) Richard Nock, Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Giorgio Patrini, Guillaume Smith, and Brian Thorne. 2018. Entity resolution and federated learning get a federated resolution. arXiv preprint arXiv:1803.04035 (2018).
Pardau (2018) Stuart L Pardau. 2018. The California Consumer Privacy Act: Towards a European-Style Privacy Regime in the United States. J. Tech. L. & Pol’y 23 (2018), 68.
Reddi et al. (2020) Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv Kumar, and H Brendan McMahan. 2020. Adaptive federated optimization. arXiv preprint arXiv:2003.00295 (2020).
Thapa et al. (2020) Chandra Thapa, Mahawaga Arachchige Pathum Chamikara, and Seyit Camtepe. 2020. Splitfed: When federated learning meets split learning. arXiv preprint arXiv:2004.12088 (2020).
Vepakomma et al. (2018) Praneeth Vepakomma, Otkrist Gupta, Tristan Swedish, and Ramesh Raskar. 2018. Split learning for health: Distributed deep learning without sharing raw patient data. arXiv preprint arXiv:1812.00564 (2018).
Voigt and Von dem Bussche (2017) Paul Voigt and Axel Von dem Bussche. 2017. The eu general data protection regulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International Publishing (2017).
Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).