Fast User Adaptation for Human Motion Prediction in Physical Human–Robot Interaction
Abstract
Accurate prediction of human movements is required to enhance the efficiency of physical human–robot interaction. Behavioral differences across users are a crucial factor that limits the prediction of human motion. Although recent neural network-based modeling methods have improved prediction accuracy, most do not consider effective adaptation to different users and employ the same model parameters for all users. To address this insufficiently explored challenge, we introduce a meta-learning framework to facilitate the rapid adaptation of the model to unseen users. In this study, we propose a model structure and a meta-learning algorithm specialized to enable fast user adaptation in predicting human movements in cooperative situations with robots. The proposed prediction model comprises shared and adaptive parameters, which address the users’ general and individual movements, respectively. Using only a small amount of data from an individual user, the adaptive parameters are adjusted to enable user-specific prediction through a two-step process: initialization via a separate network and adaptation via a few gradient steps. On a motion dataset of 20 users collaborating with a robotic device, the proposed method outperforms existing meta-learning and non-meta-learning baselines in predicting the movements of unseen users.
Index Terms:
Physical human-robot interaction, deep learning methods, human motion prediction, fast user adaptation, meta-learning.
I Introduction
With recent advancements in robotics, collaborative robots are now expected to move in physical contact with humans [1, 2, 3, 4]. One representative example of physical human–robot interaction (pHRI) is a situation in which a person performs a task while receiving physical assistance from a robot [5]. Under robotic guidance, humans can exploit the repeatability and accuracy of the robot, which leads to improved productivity and reduced workload [6]. The movement of robots for human assistance can be planned based on human behavior prediction [7, 8]. However, if the robot mispredicts the next human motion and a conflict arises between the human’s intention and the robotic guidance, collaborative task performance can decrease and the discomfort of the human operator can increase [9]. Therefore, expanding the robot’s capability to predict human motion is garnering considerable interest in the HRI field [10, 11].
The different behavioral patterns of individual human operators (i.e., users) are contributing factors that limit the prediction of human motion. People have different behaviors owing to a variety of factors, such as their motor skills or personal preferences [12]. Recent deep learning-based approaches have succeeded in enhancing the accuracy of human motion prediction; however, only a few studies have addressed how to respond to different users. Most of the previous methods employed a neural network model with fixed parameters to predict the movements of various users. A straightforward alternative that can cope with various users is to train a new model from scratch or to fine-tune a pretrained model for each new user. However, acquiring sufficient data for every new user is time-consuming and impractical in real-world applications. Therefore, training a single prediction model with user-adaptive characteristics is crucial for further advancement in human motion prediction.
To address this challenge, a meta-learning approach can be considered as a possible solution. Meta-learning, learning to learn, is a promising machine learning technique for solving the fast adaptation problem, i.e., enabling a model to rapidly adapt to previously unseen tasks with small amounts of data. In the wake of model-agnostic meta-learning (MAML) [13], optimization-based meta-learning algorithms have demonstrated significant success in fast adaptation problems, such as few-shot image classification tasks.
The meta-learning approach is effective for fast adaptation problems in general. However, it is unclear whether it is also effective for the fast user adaptation problem, that is, enabling a single model to swiftly adapt to previously unseen users with a small amount of their behavioral data. Adapting to different user behaviors in a cooperative situation with a robot differs in character from the problems previously addressed by meta-learning. For example, the movements of users performing a specific task can be divided into general movements (i.e., with low variance between users) that achieve the goal and individual movements (i.e., with high variance between users) affected by individual factors. Adapting effectively to different user behaviors therefore requires distinguishing between the two, a distinction that has not been the focus of previous applications of the meta-learning approach.
We have recently demonstrated in [8] that the parameter adaptation of the user prediction model using MAML is effective in providing personalized haptic guidance to each user. This was the first attempt to apply meta-learning to a fast user adaptation problem. However, that study was limited to the direct application of MAML, which is not specifically designed to solve fast user adaptation problems. Therefore, the applicability of various meta-learning algorithms to fast user adaptation still remains unclear.
In this study, we demonstrate the feasibility of various existing meta-learning algorithms for solving fast user adaptation problems. Moreover, we propose a model structure and a meta-learning algorithm that is specialized for fast user adaptation. We focus on dividing human motions into common user movements and additional movements triggered by individual differences. Therefore, the proposed prediction model consists of shared and adaptive parameters, each of which is responsible for general and individual movements, respectively (Fig. 1(a)). In particular, the proposed method has two unique approaches: determining the user-specific initialization of the adaptive parameters via a separate network and enforcing the adaptive parameters to handle individual differences via a meta-loss function.
We evaluated the human motion prediction performance of the proposed method and compared it with several major meta-learning methods using a dataset acquired in a situation wherein a user performed a task while being assisted by a robotic device. The meta-learning methods exhibited better prediction performance than the non-meta-learning methods. This implies that meta-learning methods can be applied to solve fast user adaptation problems. In particular, our proposed meta-learning method with the initialization network and meta-loss function exhibited the best prediction accuracy among the meta-learning algorithms. In addition, we analyzed how the proposed model distinguished different users by visualizing the adaptive parameters that were adjusted to different users.
The contributions of this study are presented as follows: (1) We validate the applicability of existing meta-learning algorithms for fast user adaptation. To the best of our knowledge, this is the first attempt to compare the performance of meta-learning methods in solving fast user adaptation problems. (2) We propose a novel model structure and meta-learning algorithm specialized for enabling fast user adaptation in predicting human movements during pHRI. (3) We experimentally demonstrate that the proposed meta-learning method with our initialization network and meta-loss function exhibits the best accuracy compared to other meta-learning algorithms for predicting user movements during pHRI.
II Related Work
II-A Human Motion Prediction
Early studies on human motion prediction were developed based on probabilistic models such as the hidden Markov model [14, 15], Gaussian mixture regression [10], and probabilistic movement primitives [16]. The recent use of deep neural network architectures has resulted in a remarkable improvement in prediction performance. Fragkiadaki et al. [17] first proposed an encoder-recurrent-decoder model structure, which successfully predicted human body movements using a dataset spanning multiple subjects and activity domains. Subsequent studies have focused on utilizing context information as clues for prediction, such as the motion data of nearby people [18], or multimodal responses of the user [19]. In addition, there have been attempts to consider other learning techniques that increase prediction accuracy and robustness. For example, variational autoencoder [20] or adversarial learning [21, 22] frameworks have been adopted in motion prediction.
We focus on a learning framework that enables the proposed model to effectively respond to differences between individuals for human motion prediction, which has not been addressed in the aforementioned studies that employed fixed model parameters for all users. To solve this problem, we present a meta-learning approach that can quickly adapt the prediction model to novel users.
There have been attempts to apply a meta-learning technique to adapt the prediction model to novel tasks. Proactive and adaptive meta-learning (PAML) [23] integrates MAML and model regression networks [24] to learn an effective adaptation strategy. MoPredNet [25] is a method with a parameter generation module that utilizes external memory. These two previous studies [23, 25] focused on predicting human motion across specific categories (e.g., walking, eating, or smoking) without any interaction with a robot.
Our work is distinguished from the prior works in that we consider a situation wherein a human and a robot physically interact. In the pHRI situation of our work (i.e., a virtual air hockey environment, described in Section IV-A), the user behavior is strongly affected by the robotic guidance at every timestep, making it difficult to predict the user behavior over a long time horizon because the interacting robot’s guidance over that horizon, which depends on the opponent’s play, is itself unpredictable. In [8], it was shown that even one-step human motion prediction can improve the user’s task performance in a certain pHRI situation. Therefore, we focus on predicting the user’s movement at the immediate next timestep given the dynamically changing and unpredictable robotic guidance of the current timestep. It is worth mentioning that we tested and compared the prediction performance when using the data of the past several timesteps as the input and when using the data of the current timestep only, and found no significant performance difference. Therefore, we decided to use only the knowledge at the current timestep for the prediction.
II-B Optimization-based Meta-learning
Optimization-based meta-learning is a technique that allows a model to learn a new task quickly via an optimization procedure based on small data samples. A powerful and representative example is MAML [13], a meta-learning algorithm with a dual-structured training procedure consisting of inner and outer loop updates. The training procedure for MAML aims to attain model parameters that can reach task-specific parameters of a new task within a few gradient steps. Since MAML outperformed previous methods, such as the meta-learner with recurrent layers [26], various other optimization-based algorithms have followed, for example, Reptile [27], which simplifies the second-order gradient computation of the MAML; LEO [28], which performs adaptation in the low-dimensional embedding of model parameters; and multimodal MAML [29], which pursues a more diverse task distribution through parameter modulation.
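To make the inner/outer-loop structure concrete, the following is a minimal first-order sketch of MAML-style training in PyTorch; the function names and the first-order simplification are ours, not from [13].

```python
# A minimal first-order MAML sketch (illustrative only; function names and the
# first-order simplification are ours, not from [13]). Each task supplies a support
# batch for the inner-loop update and a query batch for the outer (meta) update.
import copy
import torch

def inner_adapt(model, loss_fn, support, inner_lr=0.01, steps=5):
    """Clone the model and take a few SGD steps on the task's support batch."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    x_s, y_s = support
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(adapted(x_s), y_s).backward()
        opt.step()
    return adapted

def meta_step(model, meta_opt, loss_fn, tasks, inner_lr=0.01, steps=5):
    """First-order meta-update: average the query-set gradients of the adapted copies."""
    meta_opt.zero_grad()
    for support, query in tasks:
        adapted = inner_adapt(model, loss_fn, support, inner_lr, steps)
        x_q, y_q = query
        grads = torch.autograd.grad(loss_fn(adapted(x_q), y_q), adapted.parameters())
        for p, g in zip(model.parameters(), grads):
            p.grad = g / len(tasks) if p.grad is None else p.grad + g / len(tasks)
    meta_opt.step()
```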
The aforementioned meta-learning algorithms update all of the parameters of every layer during the adaptation process. However, Raghu et al. [30] discovered that such full parameter adaptation mainly changes the parameters of the last layer. Therefore, the authors proposed the ANIL algorithm, which adapts only the parameters of the last layer while fixing the parameters of the body layers. Even with this simplification, ANIL exhibited a performance comparable to that of the other meta-learning algorithms that adapt all parameters. CAVIA [31] separates adaptive parameters (which are updated during adaptation and condition the body layers) from shared parameters (which are fixed). By separating these parameters, CAVIA outperformed MAML despite adapting fewer parameters.
Fast user adaptation can be a new application of existing meta-learning algorithms. However, the applicability of these algorithms to this problem and a comparison of their performance have not yet been studied. We propose a novel meta-learning framework specialized for user movement prediction after investigating the user prediction performance of existing meta-learning algorithms. Inspired by CAVIA, our prediction model consists of shared and adaptive parameters, which predict general and individual user movements, respectively. Our model responds to movement differences between individual users by adjusting the adaptive parameters. In the CAVIA algorithm, the adaptive parameters are updated for each new task starting from a zero vector. This initialization can impede adaptation to various tasks within a few gradient steps. Hence, we propose a model structure that determines an effective user-specific initialization of the adaptive parameters. In addition, we suggest a meta-loss function for the meta-update that induces the adaptive parameters to handle individual differences exclusively.
III Proposed Method
III-A Problem Definition
The goal of our meta-learning approach is to train a human motion prediction model that can quickly adapt to a new user with small data samples. In a situation wherein a user performs a task while being guided by a robot, the prediction problem can be formulated as follows: to predict $a_{t+1}^{H}$, the human action at the next timestep, when given an input $x_t$, the knowledge at the current timestep consisting of the interaction state $s_t$ (e.g., the state of the cooperative task), the robot action $a_t^{R}$, and the human action $a_t^{H}$. We let $D_i$ denote the dataset collected from one user of index $i$. $D_i$ consists of the interaction data of timestep length $T_i$ (i.e., $T_i$ pairs of $(x_t, a_{t+1}^{H})$).
For the fast user adaptation problem, two batches $D_i^{tr}$ and $D_i^{val}$ are given, each consisting of pairs of size $K$ ($K \ll T_i$) and sampled from the same dataset $D_i$ without overlap. $D_i^{tr}$ is employed to adapt the parameters of the prediction model to user $i$, and the accuracy of the mapping $x_t \mapsto a_{t+1}^{H}$ of the adapted model is investigated on $D_i^{val}$. For the model training, a meta-dataset $\mathcal{D}_{train}$ is employed, which is composed of datasets spanning multiple users (e.g., $\mathcal{D}_{train} = \{D_1, D_2, \ldots\}$). Another meta-dataset $\mathcal{D}_{test}$, collected from users who do not overlap with the users of $\mathcal{D}_{train}$, is adopted to evaluate the performance of fast user adaptation on previously unseen users.
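For concreteness, the per-user batch construction described above could be sketched as follows; the array layout and function names are our assumptions.

```python
# A small sketch of the batch construction: for one user i, sample two disjoint K-sized
# batches D_tr (for adaptation) and D_val (for evaluating the adapted model).
import numpy as np

def sample_user_batches(user_data, k, rng=None):
    """user_data: tuple (x, y) of arrays with one row per timestep for a single user."""
    rng = rng or np.random.default_rng()
    x, y = user_data
    idx = rng.permutation(len(x))
    tr, val = idx[:k], idx[k:2 * k]           # disjoint index sets, K samples each
    return (x[tr], y[tr]), (x[val], y[val])   # (D_tr, D_val)
```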
III-B Model Overview
Our model parameters consist of $\theta$, which is shared across all users, and $\phi$, which is adapted to each user. Because $\phi$ exhibits user-specific values after a series of initialization and adaptation processes, we refer to $\phi$ as the user embedding (UE). To implement a prediction model conditioned by $\phi$, we concatenated $\phi$ to a hidden state vector of the model’s middle layer and adopted the concatenated vector as the input of the next layer, as originally proposed in [31]. In addition, we propose a model structure that determines effective initial values of the user embedding for user $i$ based on the small data sample $D_i^{tr}$. Accordingly, as illustrated in Fig. 2, the entire structure of our model is composed of two parts: a prediction network that outputs the user-specific predicted movements of the user, and an initialization network that outputs an effective initial point of the user embedding before adaptation. The shared parameters $\theta$ contain all trainable parameters of the prediction network and the initialization network, except for the user embedding $\phi$ (see Fig. 3(c)).
In the prediction network, the three types of inputs (i.e., $s_t$, $a_t^{R}$, and $a_t^{H}$) were fed into separate multilayer perceptron (MLP) blocks (with parameters $\theta_s$, $\theta_R$, and $\theta_H$, which are subsets of $\theta$) that extract each feature. Subsequently, the feature vectors passed through the integrating layers to produce $\hat{a}_{t+1}^{H}$, which is the predicted user movement. The user embedding $\phi$ was concatenated with the input vector of the second integrating layer to allow user-specific prediction.
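A minimal PyTorch sketch of how such a prediction network could be organized is given below; the layer widths and the 12-D state vector are our assumptions rather than the paper’s exact architecture.

```python
# Sketch of the prediction network as we read Fig. 2: three input branches, two
# integrating layers, and the user embedding phi concatenated before the second
# integrating layer. Layer widths and the 12-D state are assumptions.
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

class PredictionNetwork(nn.Module):
    def __init__(self, state_dim=12, act_dim=2, feat_dim=64, ue_dim=32):
        super().__init__()
        self.state_enc = mlp(state_dim, 64, feat_dim)   # interaction state s_t
        self.robot_enc = mlp(act_dim, 64, feat_dim)     # robot action a_t^R
        self.human_enc = mlp(act_dim, 64, feat_dim)     # human action a_t^H
        self.integrate1 = nn.Sequential(nn.Linear(3 * feat_dim, 128), nn.ReLU())
        # The user embedding joins the hidden state as the input of the second layer.
        self.integrate2 = nn.Sequential(nn.Linear(128 + ue_dim, 64), nn.ReLU(),
                                        nn.Linear(64, act_dim))

    def forward(self, s, a_r, a_h, phi):
        h = torch.cat([self.state_enc(s), self.robot_enc(a_r), self.human_enc(a_h)], dim=-1)
        h = self.integrate1(h)
        h = torch.cat([h, phi.expand(h.shape[0], -1)], dim=-1)  # condition on one user's phi
        return self.integrate2(h)                                # predicted a_{t+1}^H
```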
The initialization network received the entire $D_i^{tr}$, consisting of $K$ pairs of $(x_t, a_{t+1}^{H})$, as the input. To extract the features of $x_t$, MLPs sharing their parameters with the corresponding feature-extraction blocks of the prediction network were employed, and the human-action MLP was adopted again to extract the features of $a_{t+1}^{H}$. Note that $a_t^{H}$ and $a_{t+1}^{H}$ are vectors of the same format that represent human actions. To obtain a representative $\phi_i^{0}$ corresponding to the $K$ pairs of $(x_t, a_{t+1}^{H})$, we first obtained user embedding candidates and weight values, each corresponding to one pair, by feeding the feature vectors into a UE encoder and a weight encoder, respectively. The weight value indicates the extent to which the corresponding pair expresses the user characteristics. The relative importance of the corresponding pair within the batch was determined by passing the weight values through the softmax function. The representative $\phi_i^{0}$ was then acquired by the matrix multiplication of the weight values (of size $1 \times K$, where $K$ is the size of $D_i^{tr}$) and the user embedding candidates (of size $K \times d$, where $d$ is the dimension of the user embedding). We employed three-layered MLPs for each of the UE and weight encoders. For the UE encoder, because it is unnecessary to embed different users with the same bias, we deleted the bias term of the last layer.
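The weighted pooling performed by the UE and weight encoders could be sketched as follows; the per-pair feature extraction (shared with the prediction network in the paper) is abstracted into a single feature vector here, and the encoder widths are assumed.

```python
# Sketch of the initialization network's weighted pooling: each (x_t, a_{t+1}^H) pair
# yields a user-embedding candidate and a scalar weight, and a softmax over the batch
# produces the initial embedding phi^0 as the weighted average.
import torch
import torch.nn as nn

class UEInitializer(nn.Module):
    def __init__(self, feat_dim=64, ue_dim=32):
        super().__init__()
        self.ue_encoder = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, ue_dim, bias=False))   # no bias in the last layer
        self.weight_encoder = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, pair_features):                  # shape (K, feat_dim)
        candidates = self.ue_encoder(pair_features)     # shape (K, ue_dim)
        weights = torch.softmax(self.weight_encoder(pair_features), dim=0)  # (K, 1)
        return (weights * candidates).sum(dim=0)        # representative phi^0, shape (ue_dim,)
```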
III-C Meta-learning Procedure
The meta-training process of the proposed model is realized in three steps: (1) UE initialization, (2) UE adaptation, and (3) meta-update, as summarized in Algorithm 1 and Fig. 3. User-specific parameters (i.e., $\phi_i$) are determined via the UE initialization and UE adaptation steps, and the parameters shared across various users (i.e., $\theta$) are learned via the meta-update step.
During the first step, UE initialization, $\phi_i^{0}$ is obtained by passing $D_i^{tr}$ through the initialization network described in Section III-B. Subsequently, in the UE adaptation step, $\phi_i^{0}$ is updated to $\phi_i$ by backpropagating $\mathcal{L}_{D_i^{tr}}(\theta, \phi_i^{0})$, which is the regression loss of the prediction network conditioned by $\phi_i^{0}$ when using $D_i^{tr}$. Either one or a few gradient steps relative to $\phi$ can be taken. For example, using one gradient step, $\phi_i$ can be computed as follows:
$\phi_i = \phi_i^{0} - \alpha \nabla_{\phi_i^{0}} \mathcal{L}_{D_i^{tr}}(\theta, \phi_i^{0})$ (1)
where $\alpha$ denotes the inner learning rate. In the case of adopting multiple gradient steps, decaying the inner learning rate by a decay rate for every update can benefit a delicate adaptation.
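A hedged sketch of this adaptation step, updating only the user embedding while keeping the shared parameters fixed, could look as follows (function and argument names are ours).

```python
# Sketch of the UE adaptation step in (1): only phi is updated by a few gradient steps
# on the regression loss over D_tr, optionally decaying the inner learning rate.
# create_graph=True keeps the adaptation differentiable for the later meta-update.
import torch
import torch.nn.functional as F

def adapt_user_embedding(pred_net, phi0, d_tr, inner_lr=0.05, steps=5, decay=1.0):
    s, a_r, a_h, y = d_tr                                      # adaptation batch of one user
    phi = phi0 if phi0.requires_grad else phi0.clone().requires_grad_(True)
    lr = inner_lr
    for _ in range(steps):
        loss = F.mse_loss(pred_net(s, a_r, a_h, phi), y)
        (grad,) = torch.autograd.grad(loss, phi, create_graph=True)
        phi = phi - lr * grad                                  # gradient step on phi only
        lr *= decay
    return phi
```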
In the meta-update step, we update $\theta$ using $D_i^{val}$, while fixing the $\phi_i$ determined by $D_i^{tr}$. The meta-update is performed by overseeing the user-specific prediction results and the values of the adapted $\phi_i$ from multiple users. The meta-loss we propose consists of three terms, expressed as:
$\mathcal{L}_{meta} = \frac{1}{N} \sum_{i=1}^{N} \left[ \mathcal{L}_{D_i^{val}}(\theta, \phi_i) + \lambda_1 \left\| \phi_i^{0} - \mathrm{sg}(\phi_i) \right\|^2 \right] + \lambda_2 \left\| \frac{1}{N} \sum_{i=1}^{N} \phi_i \right\|^2$ (2)
where $N$ represents the number of users sampled for one meta-update, whereas $\lambda_1$ and $\lambda_2$ are the weights of each loss term. The first term aims to update $\theta$ to reduce the regression loss on $D_i^{val}$ of the prediction network with the adapted user-specific parameters $\phi_i$, which allows $\theta$ to infer general human movements across the users. The second term, which is inspired by [28], enables the initialization network to output an effective $\phi_i^{0}$ close to the adapted $\phi_i$. The stop-gradient operator $\mathrm{sg}(\cdot)$ denotes that its argument is treated as a constant; therefore, the derivative of $\mathrm{sg}(\phi_i)$ with respect to $\theta$ is zero. The last term encourages the average of the multiple $\phi_i$, each adapted to a different user, to move toward a zero vector. A common nonzero bias of the multiple $\phi_i$ indicates a general movement tendency of the users; therefore, by forcibly reducing this bias, we induce this general tendency to be learned by $\theta$. In addition, the $\phi_i$ of different users are induced to be disentangled around zero and eventually learn to address individual differences. Taken together, $\theta$ is updated with a gradient step of the meta-loss; therefore,
$\theta \leftarrow \theta - \beta \nabla_{\theta} \mathcal{L}_{meta}$ (3)
where $\beta$ indicates the outer learning rate. The prediction and initialization networks were gradually trained by repeatedly performing the three steps using a sampled batch for each iteration. During the evaluation phase, the meta-update step is omitted, and only the UE initialization and UE adaptation processes are performed to achieve user-adapted motion prediction.
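Putting the three steps together, one meta-update under the meta-loss in (2) and the update in (3) could be sketched as follows; adapt_user_embedding refers to the sketch above, extract_pair_features is a hypothetical helper standing in for the shared feature extraction of the initialization network, and the loss weights are placeholders.

```python
# Sketch of one meta-update under the three-term meta-loss in (2) and the update in (3).
# extract_pair_features is a hypothetical helper; lam1 and lam2 are placeholder weights.
import torch
import torch.nn.functional as F

def meta_update(pred_net, init_net, extract_pair_features, meta_opt,
                user_batches, lam1=0.1, lam2=0.1):
    meta_opt.zero_grad()
    adapted_phis, total = [], 0.0
    for d_tr, d_val in user_batches:                            # N sampled users
        phi0 = init_net(extract_pair_features(d_tr))            # UE initialization
        phi_i = adapt_user_embedding(pred_net, phi0, d_tr)      # UE adaptation (see above)
        s, a_r, a_h, y = d_val
        pred_loss = F.mse_loss(pred_net(s, a_r, a_h, phi_i), y)    # first term
        init_loss = F.mse_loss(phi0, phi_i.detach())                # second term (stop-gradient)
        total = total + pred_loss + lam1 * init_loss
        adapted_phis.append(phi_i)
    bias_loss = torch.stack(adapted_phis).mean(dim=0).pow(2).sum()  # third term: mean UE -> 0
    total = total / len(user_batches) + lam2 * bias_loss
    total.backward()                                  # backprop into the shared parameters
    meta_opt.step()
```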
IV Experiment and Results
IV-A Data Acquisition
To evaluate our meta-learning approach for the fast user adaptation problem, it is necessary to collect a multi-person motion dataset acquired in a situation wherein a human physically interacts with a robot. A representative pHRI situation occurs when a user performs a target task with haptic guidance from a robot. The haptic guidance system, which has been recognized as a promising human–machine interface [5], can be defined as a system in which the control input of the target task is determined by the interaction between the force exerted by the user and the guiding force of the robot [32]. Because users are free to decide how much guiding force they will accept every moment, different users exhibit different responses to the guiding force. Therefore, it is possible to obtain a wide variety of human motion data, which is suitable for evaluating fast user adaptation performance.
We utilized a dataset consisting of motion data from 20 participants performing a target task with haptic guidance, which was collected in our previous work [8]. In the experimental environment, as illustrated in Fig. 4(a), the participants were instructed to play a virtual air hockey game controlled with a haptic device. In the hockey environment, a user receives points by smashing the puck with their paddle and putting the puck into the opponent’s goalpost. Conversely, the user loses points if they fail to defend a puck heading to their goal. Through the haptic device, the participants consistently received a guiding force to assist them; however, they were allowed to choose whether to follow the guidance. The robotic guiding force dynamically changed according to the opponent’s play.
The entire dataset is composed of the following three data types suitable for our prediction model structure, as described in Section III-B. First, the human action data (corresponding to $a_t^{H}$ and $a_{t+1}^{H}$) consist of the 2-D position vectors of the end-effector of the haptic device determined by the user, which were transmitted to the target task as the control input. Second, the robot action data (corresponding to $a_t^{R}$) consist of the 2-D force vectors that the robot exerts on the user. Finally, the state data of the target task (corresponding to $s_t$) consist of the 2-D position and velocity vectors of the paddles and the puck in the virtual air hockey environment. Fig. 4(b) presents an example of the interaction process between a user and a robot occurring on the haptic device. At timestep $t$, the user action $a_t^{H}$ corresponds to the position vector of the end-effector of the haptic device (marked as an orange dot). Simultaneously, the user receives a guiding force $a_t^{R}$ (marked as a yellow arrow). The example illustrates the next user action $a_{t+1}^{H}$ (i.e., the position vector at timestep $t+1$) when the user follows the guiding force. If the guiding force does not match the user’s intention, the user is allowed to move the end-effector in the desired direction by applying a force exceeding the guiding force.
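One timestep of such a dataset could be laid out as follows; the field names and the exact composition of the state vector are our assumptions.

```python
# Illustrative layout of one timestep of the dataset described above.
from dataclasses import dataclass
import numpy as np

@dataclass
class InteractionStep:
    state: np.ndarray              # s_t: 2-D positions and velocities of the paddles and puck
    robot_action: np.ndarray       # a_t^R: 2-D guiding force exerted by the robot
    human_action: np.ndarray       # a_t^H: 2-D end-effector position set by the user
    next_human_action: np.ndarray  # a_{t+1}^H: prediction target
```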
A total of 1.52 M timesteps of data (i.e., 1.52 M pairs of $(x_t, a_{t+1}^{H})$) were collected from 20 participants aged (mean standard deviation across participants) years. Each participant performed trials, and the length of data collected per participant (i.e., $T_i$) was K timesteps. The total duration of the data collection per participant was approximately 1–1.5 h, and the details of the procedure are described in [8]. All collected data were normalized. The mean values of each participant’s action (i.e., $a_t^{H}$) were (horizontal direction) and (vertical direction).
According to Article 15 (2) of the Bioethics and Safety Act and Article 13 of the Enforcement Rule of Bioethics and Safety Act in Korea, a research project “which utilizes a measurement equipment with simple physical contact that does not cause any physical change in the subject” (Korean to English translation by the authors) is exempted from the approval. The entire experimental procedure was designed to use only a haptic device and a monitor that did not cause any physical changes in the subject.
IV-B Experimental Setting
Evaluation: To evaluate the fast user adaptation performance, we measured the prediction performance on $\mathcal{D}_{test}$, which consisted of user datasets that were not used for training; that is, the users in $\mathcal{D}_{test}$ were not included in $\mathcal{D}_{train}$. We assumed a realistic sampling situation during the evaluation procedure. If $D_i^{tr}$ (i.e., the batch for adaptation) and $D_i^{val}$ (i.e., the batch for prediction) were randomly sampled from the entire user dataset $D_i$ as in the training procedure (Algorithm 1), there could be unrealistic cases of predicting the current human motion while utilizing later motion data within the same episode for adaptation. To prevent this, for the evaluation, we utilized data from different episodes to adapt the model and to validate the prediction; that is, $D_i^{tr}$ and $D_i^{val}$ consisted of data from separate episodes.
As a metric of prediction performance, we adopted the mean squared error (MSE) between the predicted value and the ground-truth value of the next user action. Five-fold cross-validation was conducted to reduce the effect of the discrepancy between the training and test datasets. The 20-user dataset was therefore divided into five sub-datasets consisting of four users each, and a total of five training–validation processes were performed using the five ($\mathcal{D}_{train}$, $\mathcal{D}_{test}$) pairs. The averaged values of the resulting five prediction errors (MSE) were used to compare the learning methods.
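The user-level split could be implemented roughly as follows (a sketch; user indexing is illustrative).

```python
# Sketch of the user-level five-fold split described above: 20 users are divided into
# five groups of four, and each fold trains on 16 users and evaluates on the held-out four.
def user_folds(user_ids, n_folds=5):
    fold_size = len(user_ids) // n_folds
    for k in range(n_folds):
        test_users = user_ids[k * fold_size:(k + 1) * fold_size]
        train_users = [u for u in user_ids if u not in test_users]
        yield train_users, test_users

# Example: list(user_folds(list(range(20)))) gives five (16-user, 4-user) splits.
```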
Baselines: We set up baseline methods comprising non-meta-learning and meta-learning approaches. In addition to both approaches, we adopted the zero-velocity baseline [33], which assumes that the user maintains the previous action; this provides an appropriate scale for interpreting the prediction performance of the learning-based methods.
As a non-meta-learning approach, we trained the same structured prediction model using a conventional supervised learning method. For a fair comparison with the adaptation process of the meta-learning approach, we aimed to determine the extent to which performance improves when the model trained with the non-meta-learning method goes through a parameter update process with $D_i^{tr}$ (i.e., is fine-tuned). Therefore, we implemented the non-meta-learning baselines in two ways: one in which the model parameters were fixed and one in which they were fine-tuned with a few gradient steps, similar to the adaptation steps of the meta-learning approach.
For the meta-learning approaches, we evaluated the most representative methods, MAML and Reptile. In addition, we tested the performance of integrating the model regression network (MRN) [24] into the adaptation process of MAML, which is equivalent to PAML [23]. Our approach to solving the fast user adaptation problem is to separate the adaptive parameters and enforce them to represent only user-specific movements. Therefore, we also considered two other meta-learning baselines, ANIL and CAVIA, that distinguish between adaptive and shared parameters. The two methods differ in how they divide the parameters: ANIL fixes the body layers and adapts only the last layer, whereas CAVIA adopts adaptive parameters that join the body layers as an auxiliary input.
Implementation details: All learning-based baseline methods were implemented using our prediction network structure (Fig. 2), except for the user embedding $\phi$. Among the baselines, only the CAVIA method adopts adaptive parameters as an additional input to the body layers; therefore, it can be implemented using the same structure as our prediction network. We set the hyper-parameters of our method and the baseline methods to be as similar as possible. For the meta-learning approaches, including our method, the adaptation process was conducted through five gradient steps based on a stochastic gradient descent optimizer. CAVIA and the proposed method, which update $\phi$ (we set the size of $\phi$ to 32), adopted an inner learning rate of 0.05, and the other methods, which directly update the parameters of the network layers, employed an inner learning rate of 0.01, for both the training and evaluation phases. Exceptionally, in the training phase of the MAML and ANIL algorithms, a smaller inner learning rate of 0.003 was adopted because it exhibited more stable learning. For the meta-update, an Adam optimizer with an outer learning rate of 0.001 was adopted for all methods, and each model was trained for 500 K steps. The batch sizes of $D^{tr}$ and $D^{val}$ were set to 1 K samples. For the methods adopting multi-user data for one meta-update, such as Reptile or our method, the total number of data samples used for one meta-update was maintained at 1 K by adopting 200 samples each from five different users. Regarding the non-meta-learning approaches, an Adam optimizer with a learning rate of 0.001 was employed to train the model for 500 K steps, using batches of 1 K samples. In the fine-tuned baseline case, five gradient steps with a learning rate of 0.01 were taken in the same manner as the adaptation process of the meta-learning approaches.
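For reference, the hyperparameters listed above can be collected in one place; this is a convenience summary, not the authors’ configuration.

```python
# Hyperparameters as listed in the implementation details (summary only).
HPARAMS = {
    "ue_dim": 32,                  # size of the user embedding phi
    "adaptation_steps": 5,         # SGD steps during adaptation / fine-tuning
    "inner_lr_ue": 0.05,           # CAVIA and the proposed method (update phi)
    "inner_lr_layers": 0.01,       # methods that update layer parameters directly
    "inner_lr_maml_train": 0.003,  # MAML / ANIL during training, for stability
    "outer_lr": 0.001,             # Adam learning rate for the meta-update
    "train_steps": 500_000,
    "batch_size": 1_000,           # samples in D_tr and D_val
    "users_per_meta_update": 5,    # 200 samples each for multi-user meta-updates
}
```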
IV-C Results
| | Methods | MSE |
|---|---|---|
| | Zero-velocity [33] | |
| Non-meta-learning | Fixed | |
| | Fine-tuned | |
| Meta-learning | MAML [13] | |
| | Reptile [27] | |
| | MAML + MRN [24] | |
| | ANIL [30] | |
| | CAVIA [31] | |
| Ablation | w/o UE initialization | |
| | w/o UE bias reduction | |
| | Our method | |
The quantitative results are presented in Table IV-C. Compared to the zero-velocity baseline, all learning-based baselines exhibited a significantly lower prediction error. Meta-learning approaches exhibited more accurate performance than non-meta-learning approaches. The fine-tuned baseline did not exhibit a significant difference in prediction performance when compared to the fixed baseline. In contrast, the superior performance of the meta-learning approaches indicates that they succeeded in adapting rapidly to predict the movements of previously unseen users with only a few gradient steps. In other words, a meta-learning approach can be an effective solution for fast user adaptation problems.
A performance difference existed within the meta-learning baselines, and CAVIA exhibited the best performance among the baselines. As stated in [31], CAVIA outperformed MAML in solving various problems, such as image classification or reinforcement learning, and we observed the same results for the fast user adaptation problem. Notably, CAVIA and ANIL both applied separate parameters for adaptation; however, the performance of ANIL did not differ from that of MAML, whereas CAVIA exhibited better performance. This implies that designing the stage at which the model divides shared and adaptive parameters plays an important role in improving user adaptation performance. For example, in ANIL, because all body layers are fixed across the users and only the last layer is adapted, shared features are obtained from input data and user-specific computation is performed in the process of assembling the shared features. However, in CAVIA, because the adaptive parameters join in the middle stage, the model can consider both the shared and user-specific features, which is consistent with our approach that considers user motion as a combination of general motion across users and motion with individual differences.
Our method outperformed all other baseline methods. In particular, the superior performance over CAVIA indicates the benefits of the two components we proposed: (i) the user embedding initialization and (ii) the meta-update reducing the non-zero bias of multiple user embeddings. We verified the contribution of each component by conducting an ablation study. We implemented the prediction models by excluding each component as follows:
- Without UE initialization: Each user embedding was initialized with a zero vector. The meta-loss in (2) without the second term was utilized.
- Without UE bias reduction: Each meta-update was performed with a batch sampled from a single user. The meta-loss in (2) without the last term was utilized.
As shown in Table IV-C, both ablated models (without UE initialization or without UE bias reduction) still outperformed all the baselines but fell short of the full proposed method, indicating that each component contributed to the performance improvement of the proposed method with UE initialization and UE bias reduction. Of the two components, the bias reduction of user embeddings contributed more than the UE initialization.
IV-D Analysis of User Embeddings
Implementing the adaptation with independent embedding parameters, rather than updating the entire model parameters as in MAML, has the advantage of being easily interpretable [31]. Moreover, the embedding can efficiently reflect the behavioral characteristics of each user in a low-dimensional space because the embeddings are induced to learn exclusively the individual differences.
We qualitatively investigated the user adaptation performance of our method by visualizing user embeddings adapted to different users. Within the dataset of 20 participants, we sampled 50 batches, each comprising 1 K timesteps of data, for each user. Using our trained model, one user embedding per batch was produced via UE initialization and adaptation processes (i.e., five gradient steps). Fig. 5(a) presents the t-SNE [34] projection results of the 32-dimensional user embeddings onto 2-D space. It can be observed that user embeddings generated from the same user dataset (i.e., same-colored dots) are clustered together, and user embeddings from different user datasets are disentangled. This indicates that our user adaptation approach succeeded in rapidly extracting user characteristics using only small data samples. Furthermore, Fig. 5(b) highlights the embedding generation results of four users whose data were not employed for model training. Our method effectively responded to previously unseen users, as evidenced by the well-disentangled embeddings of test users.
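The visualization step could be reproduced roughly as follows, assuming scikit-learn and matplotlib; this is a sketch, not the authors’ plotting code.

```python
# Project the 32-D adapted user embeddings to 2-D with t-SNE and color points by user.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_user_embeddings(embeddings, user_labels):
    """embeddings: array of shape (n_batches, 32); user_labels: user index per embedding."""
    xy = TSNE(n_components=2).fit_transform(np.asarray(embeddings))
    plt.scatter(xy[:, 0], xy[:, 1], c=user_labels, cmap="tab20", s=8)
    plt.title("t-SNE projection of adapted user embeddings")
    plt.show()
```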
V Discussion and Conclusion
We focused on the one-step prediction of human behavior because we considered a situation in which a human was guided by a robot at every timestep, and the robotic guidance was dynamically changing and unpredictable. However, our proposed meta-learning algorithm can be applied to train a model that predicts human motion over a longer time horizon (assuming that the robotic guidance over the horizon is known in advance). In this case, the prediction network (Fig. 2) needs to be modified based on recurrent neural networks instead of the current fully connected layers [17, 23]. The modified model can be trained using the same meta-learning procedure as in Algorithm 1.
In this study, we introduced a meta-learning approach to train a human prediction model that facilitates fast user adaptation, that is, allowing the model to swiftly respond to previously unseen users with small data samples. We proposed a meta-learning algorithm and a model structure that predicts the movement of individual users in pHRI situations. The superior prediction performance of the proposed method was quantitatively verified using a 20-user dataset. We also qualitatively validated the fast user adaptation performance of the proposed method by investigating the disentanglement of user embeddings adapted to various users. The proposed meta-learning framework for fast user adaptation can be useful when robots cannot obtain sufficient data from new users, such as service robots that encounter many people in a short period of time.
References
- [1] J. R. Medina, M. Lawitzky, A. Mörtl, D. Lee, and S. Hirche, “An experience-driven robotic assistant acquiring human knowledge to improve haptic cooperation,” in Proc. IROS, 2011, pp. 2416–2422.
- [2] A. Mörtl, M. Lawitzky, A. Kucukyilmaz, M. Sezgin, C. Basdogan, and S. Hirche, “The role of roles: Physical cooperation between humans and robots,” Int. J. Robot. Res., vol. 31, no. 13, pp. 1656–1674, 2012.
- [3] V. V. Unhelkar, P. A. Lasota, Q. Tyroller, R.-D. Buhai, L. Marceau, B. Deml, and J. A. Shah, “Human-aware robotic assistant for collaborative assembly: Integrating human motion prediction with planning in time,” IEEE Robot. Automat. Lett., vol. 3, no. 3, pp. 2394–2401, 2018.
- [4] H.-S. Moon and J. Seo, “Sample-efficient training of robotic guide using human path prediction network,” arXiv preprint arXiv:2008.05054, 2020.
- [5] D. A. Abbink, M. Mulder, and E. R. Boer, “Haptic shared control: smoothly shifting control authority?” Cogn. Technol. Work, vol. 14, no. 1, pp. 19–28, 2012.
- [6] P. Salvini, M. Nicolescu, and H. Ishiguro, “Benefits of human-robot interaction,” IEEE Robot. Automat. Mag., vol. 18, no. 4, pp. 98–99, 2011.
- [7] J. R. Medina, T. Lorenz, and S. Hirche, “Synthesizing anticipatory haptic assistance considering human behavior uncertainty,” IEEE Trans. Robot., vol. 31, no. 1, pp. 180–190, 2015.
- [8] H.-S. Moon and J. Seo, “Optimal action-based or user prediction-based haptic guidance: Can you do even better?” in Proc. CHI, 2021, pp. 1–12.
- [9] C. Passenberg, A. Glaser, and A. Peer, “Exploring the design space of haptic assistants: The assistance policy module,” IEEE Trans. Haptics, vol. 6, no. 4, pp. 440–452, 2013.
- [10] A. Kanazawa, J. Kinugawa, and K. Kosuge, “Adaptive motion planning for a collaborative robot based on prediction uncertainty to enhance human safety and work efficiency,” IEEE Trans. Robot., vol. 35, no. 4, pp. 817–832, 2019.
- [11] Y. Cheng, L. Sun, C. Liu, and M. Tomizuka, “Towards efficient human-robot collaboration with robust plan recognition and trajectory prediction,” IEEE Robot. Automat. Lett., vol. 5, no. 2, pp. 2602–2609, 2020.
- [12] N. Mitsunaga, C. Smith, T. Kanda, H. Ishiguro, and N. Hagita, “Adapting robot behavior for human–robot interaction,” IEEE Trans. Robot., vol. 24, no. 4, pp. 911–916, 2008.
- [13] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. ICML, 2017, pp. 1126–1135.
- [14] D. Kulić, C. Ott, D. Lee, J. Ishikawa, and Y. Nakamura, “Incremental learning of full body motion primitives and their sequencing through human motion observation,” Int. J. Robot. Res., vol. 31, no. 3, pp. 330–345, 2012.
- [15] M. Power, H. Rafii-Tari, C. Bergeles, V. Vitiello, and G.-Z. Yang, “A cooperative control framework for haptic guidance of bimanual surgical tasks based on learning from demonstration,” in Proc. ICRA, 2015, pp. 5330–5337.
- [16] A. Paraschos, C. Daniel, J. Peters, and G. Neumann, “Probabilistic movement primitives,” in Proc. NeurIPS, 2013.
- [17] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik, “Recurrent network models for human dynamics,” in Proc. ICCV, 2015, pp. 4346–4354.
- [18] V. Adeli, E. Adeli, I. Reid, J. C. Niebles, and H. Rezatofighi, “Socially and contextually aware human motion and pose forecasting,” IEEE Robot. Automat. Lett., vol. 5, no. 4, pp. 6033–6040, 2020.
- [19] H.-S. Moon and J. Seo, “Prediction of human trajectory following a haptic robotic guide using recurrent neural networks,” in Proc. WHC, 2019, pp. 157–162.
- [20] J. Bütepage, H. Kjellström, and D. Kragic, “Anticipating many futures: Online human motion prediction and generation for human-robot interaction,” in Proc. ICRA, 2018, pp. 4563–4570.
- [21] L.-Y. Gui, K. Zhang, Y.-X. Wang, X. Liang, J. M. Moura, and M. Veloso, “Teaching robots to predict human motion,” in Proc. IROS, 2018, pp. 562–567.
- [22] M. S. Yasar and T. Iqbal, “A scalable approach to predict multi-agent motion for human-robot collaboration,” IEEE Robot. Automat. Lett., vol. 6, no. 2, pp. 1686–1693, 2021.
- [23] L.-Y. Gui, Y.-X. Wang, D. Ramanan, and J. M. Moura, “Few-shot human motion prediction via meta-learning,” in Proc. ECCV, 2018, pp. 432–450.
- [24] Y.-X. Wang and M. Hebert, “Learning to learn: Model regression networks for easy small sample learning,” in Proc. ECCV, 2016, pp. 616–634.
- [25] C. Zang, M. Pei, and Y. Kong, “Few-shot human motion prediction via learning novel motion dynamics.” in Proc. IJCAI, 2020, pp. 846–852.
- [26] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in Proc. ICLR, 2016.
- [27] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning algorithms,” arXiv preprint arXiv:1803.02999, 2018.
- [28] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell, “Meta-learning with latent embedding optimization,” in Proc. ICLR, 2019.
- [29] R. Vuorio, S.-H. Sun, H. Hu, and J. J. Lim, “Multimodal model-agnostic meta-learning via task-aware modulation,” in Proc. NeurIPS, 2019.
- [30] A. Raghu, M. Raghu, S. Bengio, and O. Vinyals, “Rapid learning or feature reuse? Towards understanding the effectiveness of MAML,” in Proc. ICLR, 2020.
- [31] L. Zintgraf, K. Shiarli, V. Kurin, K. Hofmann, and S. Whiteson, “Fast context adaptation via meta-learning,” in Proc. ICML, 2019, pp. 7693–7702.
- [32] D. A. Abbink and M. Mulder, “Neuromuscular analysis as a guideline in designing shared control,” Advances in Haptics, pp. 499–516, 2010.
- [33] J. Martinez, M. J. Black, and J. Romero, “On human motion prediction using recurrent neural networks,” in Proc. CVPR, 2017, pp. 4674–4683.
- [34] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, no. 11, 2008.