
To Ask or Not To Ask: Human-in-the-loop Contextual Bandits
with Applications in Robot-Assisted Feeding

Rohan Banerjee1, Rajat Kumar Jenamani∗1, Sidharth Vasudev∗1,
Amal Nanavati2, Katherine Dimitropoulou3, Sarah Dean1, Tapomayukh Bhattacharjee1
*Equal Contribution, Equal Advising. This work was partly funded by NSF CCF 2312774 and NSF OAC-2311521, a LinkedIn Research Award, a gift from Wayfair, NSF IIS #2132846, CAREER #2238792, and NIH #T32HD113301. Full acknowledgements in Appendix.
1Department of Computer Science, Cornell University {rbb242, rj277, sv355, sdean, tapomayukh}@cornell.edu
2Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195 {amaln}@cs.washington.edu
3Columbia University, New York City, NY, USA, kd2524@cumc.columbia.edu
Abstract

Robot-assisted bite acquisition involves picking up food items with varying shapes, compliance, sizes, and textures. Fully autonomous strategies may not generalize efficiently across this diversity. We propose leveraging feedback from the care recipient when encountering novel food items. However, frequent queries impose a workload on the user. We formulate human-in-the-loop bite acquisition within a contextual bandit framework and introduce LinUCB-QG, a method that selectively asks for help using a predictive model of querying workload based on query types and timings. This model is trained on data collected in an online study involving 14 participants with mobility limitations, 3 occupational therapists simulating physical limitations, and 89 participants without limitations. We demonstrate that our method better balances task performance and querying workload compared to autonomous and always-querying baselines and adjusts its querying behavior to account for higher workload in users with mobility limitations. We validate this through experiments in a simulated food dataset and a user study with 19 participants, including one with severe mobility limitations. Please check out our project website at: emprise.cs.cornell.edu/hilbiteacquisition/.

I INTRODUCTION

Feeding, an essential Activity of Daily Living (ADL) [1], is challenging for individuals with mobility limitations, often requiring caregiver support that may not be readily available. Approximately 1 million people in the U.S. cannot eat without assistance [2], so robotic systems could empower these individuals to feed themselves, promoting independence. Robot-assisted feeding comprises two key tasks [3]: bite acquisition [4, 5, 6, 7], the task of picking up food items, and bite transfer [8, 9], delivering them to the user's mouth. Our work focuses on bite acquisition, aiming to learn policies that robustly acquire novel food items with diverse properties.

With over 40,000 unique food items in a typical grocery store [10], bite acquisition strategies must generalize well to novel food items. Current approaches formulate bite acquisition as a contextual bandit problem [5, 6], using online learning techniques like LinUCB [11] to adapt to unseen food items. However, these methods may need many samples to learn the optimal action, which is problematic due to food fragility and potential user dissatisfaction from failures [12].

Our key insight is that we can leverage the care recipient’s presence to develop querying strategies that can adapt to novel foods, giving users a sense of agency. Under the assumption that the human can specify the robot’s acquisition action, we extend the contextual bandit formulation to include human-in-the-loop querying [13]. However, excessive querying may overwhelm users and reduce acceptance [14]. Thus, our key research question is: How do we balance imposing minimal querying workload on the user while achieving maximal bite acquisition success in a human-in-the-loop contextual bandit framework?

Refer to caption
Figure 1: We present a human-in-the-loop contextual bandit-based framework for robot-assisted bite acquisition and propose a novel method that decides whether to ask for help or act autonomously by considering both its certainty about action performance and the estimated workload imposed on the human by querying.
Refer to caption
Figure 2: Human-in-the-loop contextual bandit pipeline for deciding whether to query (a_q) or autonomously select a robot action a_r. Our proposed method, LinUCB-QG, takes into account action uncertainty as measured by the performance gap G between the best and the second-best robot actions, and incorporates a learned querying workload model f to predict the workload WL_t of querying the human.

Balancing querying workload requires accurately estimating the workload imposed on users. Existing methods rely on physiological sensors [15, 16, 17], which can be invasive. We develop a non-intrusive, data-driven method by conducting a study on 14 participants with mobility limitations, 3 occupational therapists (OTs) who simulate the physical limitations of those with mobility limitations, and 89 participants without mobility limitations. We gather data on how different query types and timings impact self-reported workload using a modified NASA-TLX scale [18]. We tailor queries to common autonomous system failures, ranging from multiple-choice questions to open-ended prompts for acquisition strategy suggestions.

Using this data, we train a model to predict user workload in response to queries, considering the nature of the query and the user’s prior query interactions with the system. We observe statistically significant differences in self-reported workload between users with and without mobility limitations, which are accounted for by our learned models.

Building on our workload model, we propose LinUCB-QueryGap (LinUCB-QG), a novel algorithm for human-in-the-loop contextual bandits. It decides when to query based on the estimated querying workload and performance uncertainty of candidate actions. Simulated experiments on a dataset of 16 food items [4] show that LinUCB-QG is able to better balance workload and bite acquisition success compared to three baselines: (i) LinUCB [5], a state-of-the-art fully autonomous approach; (ii) AlwaysQuery, which always queries; and (iii) LinUCB-ExpDecay, a naive querying algorithm. LinUCB-QG achieves higher task performance than LinUCB and LinUCB-ExpDecay, with lower workload than AlwaysQuery. It also queries less when using a workload model trained on users with mobility limitations, indicating sensitivity to the workload differences across diverse user populations.

We validate our method in a real-world user study with 19 individuals, including one with Multiple Sclerosis, using three foods with variable material properties: banana slices, baby carrots, and cantaloupe. LinUCB-QG achieves a statistically significant 26% higher task success compared to LinUCB, and a 47% lower change in querying workload compared to AlwaysQuery.

Our contributions are as follows:

  • A human-in-the-loop contextual bandit framework incorporating querying workload for bite acquisition.

  • A dataset (including users with mobility limitations) on how feeding-related queries affect workload, and a predictive workload model without exteroceptive sensor inputs.

  • A novel method, LinUCB-QG, that balances acquisition success and querying frequency using our workload model, shown through simulations and a real-world study with 19 users, including one with severe mobility limitations.

II RELATED WORK

Human-in-the-loop algorithms. Our problem is an instance of learning to defer to an expert [19], where an agent decides whether to act autonomously or defer to experts like humans or oracle models. Most approaches operate in supervised learning [19, 20, 21, 22] or reinforcement learning domains [23]. Given the constraints of robot-assisted bite acquisition (observing independent contexts and receiving sparse feedback), we focus on learning-to-defer policies in an online contextual bandit setting. While prior methods often assume a fixed cost for querying the expert [20, 21, 22] or deal with imperfect experts [21], we propose estimating a time-varying deferral penalty using a data-driven model.

Other research explores active learning in robotics [24, 25, 26], where agents decide which instances to query [25, 26] or which type of feedback to solicit [24]. Some algorithms balance task performance and human querying cost, but our work differs by using a data-driven model of querying cost and applying it to the complex task of bite acquisition.

Querying workload modeling. Our algorithms need to monitor workload to avoid over-querying users. For individuals with mobility limitations, querying workload may involve physical [27] and cognitive components [18], depending on the query type. Most workload estimation literature relies on neurological/physiological signals (ECG/heart rate, respiration [27], physical posture [28] for physical workload; EEG [15], EDA [29], pupil metrics [30, 17] for cognitive workload), training supervised models based on these signals. However, measuring such signals requires specialized, often invasive equipment. Alternatively, workload can be estimated post-hoc using subjective metrics like NASA-TLX [15].

In contrast, we develop a data-driven predictive model of querying workload using self-reported, modified NASA-TLX survey results, without relying on specialized sensors. Our model estimates the expected workload of a query based on interaction history, timing, and query type. Additionally, unlike prior work [15, 17, 29, 30], we focus on workload estimation for users with mobility limitations.

III PROBLEM FORMULATION

Following previous work [5, 6], we formulate bite acquisition as a contextual bandit problem. At each timestep, the learner receives a context x_t (food item) and selects actions a_t ∈ A that minimize regret over T timesteps:

\sum_{t=1}^{T}\mathbb{E}[r_{t}\,|\,x_{t},a^{*}_{t},t]-\sum_{t=1}^{T}\mathbb{E}[r_{t}\,|\,x_{t},a_{t},t]

where r_t is the reward, a_t^* ≜ a^*(x_t) is the optimal action maximizing r_t for x_t, and expectations are over the stochasticity in r_t. In our problem setting, we assume access to a dataset D from a robotic manipulator, consisting of observations o ∈ O, actions a ∈ A, and rewards r. This dataset is used to pretrain and validate our algorithms, but is not required for our online setting. Our contextual bandit setting is characterized by (more details in Appendix):

  • Observation space O: RGB images of single bite-sized food items on a plate, sampled from 16 food types [4].

  • Action space A: 7 actions in total: 6 robot actions a_r shown in Fig. 2 (bottom) [31, 4], and 1 query action a_q.

  • Context space X: Lower-dimensional context x derived from observations o using SPANet [4], with dimensionality d = 2048. As our bandit algorithms assume a linear relationship between x and the expected reward E[r], we use the penultimate activations of SPANet as our context, since the final layer of SPANet is linear [5].

When the learner selects a robot action a_r, it receives a binary reward indicating success for context x_t. If the learner selects a_q, it receives the optimal robot action a^*(x_t) from the human. It then executes that action and receives a reward. Querying imposes a penalty on the learner, given by the human's latent querying workload state WL_t at time t.

We model the total expected reward E[r_t | x_t, a_t, t] as the sum of the expected task reward r_task(x_t, a_t) (the probability that action a_t succeeds for food context x_t) and the expected querying workload reward r_WL(x_t, a_t, t) associated with selecting query actions a_q:

\mathbb{E}[r_{t}\,|\,x_{t},a_{t},t]\triangleq w_{task}\,r_{task}(x_{t},a_{t})+(1-w_{task})\,r_{WL}(x_{t},a_{t},t)

where w_task is a scalar weight in [0, 1]. We define r_task(x_t, a_t) to be s_x(x_t, a_t) if a_t = a_r, and s_x(x_t, a^*(x_t)) if a_t = a_q, where s_x(x_t, a_t) is the probability that action a_t succeeds for context x_t. We define r_WL(x_t, a_t, t) to be 0 if a_t = a_r, and -WL_t if a_t = a_q. Our goal is to learn a policy π(a | x, t) that maximizes the cumulative expected reward Σ_{t=1}^{T} E[r_t | x_t, a_t, t].
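For concreteness, the sketch below shows how this combined reward could be computed; it is a minimal illustration, not our implementation. The helper callables success_prob (playing the role of s_x) and optimal_action (playing the role of a^*) and the literal "query" action label are hypothetical placeholders.

```python
def total_expected_reward(x_t, a_t, wl_t, w_task, success_prob, optimal_action):
    """Combined reward E[r_t | x_t, a_t, t] = w_task * r_task + (1 - w_task) * r_WL (sketch).

    success_prob(x, a) stands in for s_x(x, a); optimal_action(x) stands in for a*(x).
    Both are placeholders for models learned from data.
    """
    if a_t == "query":
        # Querying: the human supplies a*(x_t), but the workload penalty WL_t is incurred.
        r_task = success_prob(x_t, optimal_action(x_t))
        r_wl = -wl_t
    else:
        # Autonomous robot action: task reward only, no workload penalty.
        r_task = success_prob(x_t, a_t)
        r_wl = 0.0
    return w_task * r_task + (1.0 - w_task) * r_wl
```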

Refer to caption
Figure 3: Left: Querying workload user study setup, illustrating the format of the distraction task and robot query task, and the modified NASA-TLX survey recorded after every robot query. Right: Independent variables varied in the study that affect the user's querying workload.

IV QUERYING WORKLOAD MODELING

Our human-in-the-loop contextual bandit algorithms rely on estimating the querying workload state WL_t (Sec. III). To estimate WL_t, we first conduct a user study to understand how workload responds to queries (Sec. IV-A). We then learn a predictive workload model using this data (Sec. IV-C).

IV-A Querying Workload User Study

We design a user study (Fig. 3) to capture how different query types affect an individual's querying workload in robotic caregiving scenarios. In such scenarios, a feeding system may request various assistance types from the user, such as semantic labels, bounding boxes for food items, or explanations for failed acquisition attempts. These response types can have varying impacts on workload [13, 32].

For instance, a semantic label response may impose less workload than an open-ended response. In addition, the end-user of the system and the caregiver may be busy with other tasks, such as watching TV while eating. We conduct an online user study that captures the above factors to determine how an individual’s querying workload changes in response to requests for assistance from an autonomous system.

Participants perform a distraction task while periodically receiving queries from a “robot,” simulating realistic caregiving situations where users might be engaged in other activities. The robot queries represent realistic assistance requests during bite acquisition. We vary four independent variables hypothesized to impact querying workload:

  1. Robot query difficulty (d_t): 2 options – easy or hard.

  2. Distraction task difficulty (dist_t): 3 options – no distraction task, an easy numeric addition task, or a hard video transcription task.

  3. Time interval between queries (ΔT): 2 options – either 1 minute or 2 minutes.

  4. Response type (resp_t): 3 options – (a) a multiple-choice question (MCQ) asking for a semantic label, (b) asking the user to draw a bounding box (BB) around a food item, or (c) an open-ended (OE) question asking the user to explain why an acquisition attempt failed.

Each participant experiences 12 study conditions, one for each combination of the 2 settings of d_t, 3 settings of dist_t, and 2 settings of ΔT, with the entire study taking roughly 1.5 hr. During each condition, lasting 5.5 minutes, participants engage in distraction tasks and respond to robot queries (with varying response types resp_t) as shown in Fig. 3. We measure self-reported workload (our dependent variable) after each robot query and at the end of each condition using 5 modified NASA-TLX subscales [18, 33] (mental/physical, temporal, performance, effort, frustration), each measured on a 5-point Likert scale (see Appendix for exact question wording).
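As a concrete check of this factorial design, a minimal sketch enumerating the 2 × 3 × 2 = 12 conditions; the short labels are illustrative shorthands, not the exact wording used in the study.

```python
from itertools import product

# 2 query difficulties x 3 distraction tasks x 2 query intervals (minutes) = 12 conditions.
QUERY_DIFFICULTY = ["easy", "hard"]
DISTRACTION_TASK = ["none", "addition", "transcription"]
QUERY_INTERVAL_MIN = [1, 2]

conditions = list(product(QUERY_DIFFICULTY, DISTRACTION_TASK, QUERY_INTERVAL_MIN))
assert len(conditions) == 12  # one 5.5-minute block per condition
```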

We collect data from 14 users (7 male, 7 female; ages 27-50) with mobility limitations resulting from a diverse range of medical conditions: spinal muscular atrophy (SMA), quadriplegia, arthrogryposis, cerebral palsy, Ullrich congenital muscular dystrophy, schizencephaly with spastic quadriplegia, and spinal cord injury (including C2 and C3 quadriplegia). We additionally collect data from 3 occupational therapists (OTs) (3 female; ages 23-26), trained to simulate right/left hemiplegia (stroke). The OTs positioned one side of their body in full shoulder adduction and elbow flexion, allowing only 30 degrees of functional elbow joint range. Finally, we collect data from 89 users without mobility limitations (40 male, 49 female; ages 19-68). The number of participants with mobility limitations is larger than the median (n = 6) in studies of assistive robots [34, 35].

IV-B Data Analysis

We consider three distinct datasets on which to train the workload model: D_1, which includes only the data from the 89 users without mobility limitations; D_2, which includes only the data from the 14 users with mobility limitations and 3 OTs; and D_{1,2}, which includes both D_1 and D_2. We convert the modified NASA-TLX responses into a single querying workload score by normalizing the responses to 3 of the questions and taking a weighted average (weight details in Appendix). We find a statistically significant difference in mean querying workload between D_1 (0.36 ± 0.29) and D_2 (0.40 ± 0.28), as described in the Appendix.

IV-C Querying Workload Predictive Model

We develop a predictive model of querying workload based on the user study data (Sec. IV-A). Our models operate in discrete time: the time variable t refers to an integer number of timesteps since the beginning of a condition, where each timestep corresponds to a fixed time spacing Δt (set to 10 s in our experiments). The model has the form WL_t = f(WL_0, Q_t; θ), where WL_0 is the initial workload, Q_t = {(d_{t'}, resp_{t'}, dist_{t'})}_{t'=1}^{t} is the history of previous queries (d_t, resp_t, and dist_t are the query variables defined in Sec. IV-A), and θ are the model parameters.

Linear discrete-time models. To capture the dependency between WL_t and Q_t, we use a linear predictive workload model inspired by Granger causality models [36, 37]:

WL_{t}=\gamma WL_{0}+\sum_{i=0}^{H-1}\mathbf{w}_{i}^{T}[d_{t-i},resp_{t-i},dist_{t-i}]+w_{0}

where w_i represents the effect of the query asked i timesteps in the past on the workload at the current time t, H is the history length, and w_0 is a bias term. We train the model using linear regression, where we convert d_t, resp_t, and dist_t into features using one-hot encodings and generate a training pair for each robot task query in our study (see Appendix for more details about the Granger models).
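The sketch below illustrates one way such a model could be fit; it is a simplified sketch under stated assumptions, not our exact pipeline. The category sets, feature layout, and the (wl0, history, wl_t) training-triple format are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed category sets for the one-hot encodings (illustrative labels).
DIFFICULTY = ["easy", "hard"]
RESPONSE = ["MCQ", "BB", "OE"]
DISTRACTION = ["none", "addition", "transcription"]


def one_hot(value, categories):
    return [1.0 if value == c else 0.0 for c in categories]


def featurize(wl0, history, H):
    """history: list of (d, resp, dist) tuples, most recent last; zero-padded to length H."""
    feats = [wl0]  # multiplies gamma
    padded = [None] * max(0, H - len(history)) + history[-H:]
    for q in padded:
        if q is None:  # no query in this window (zero padding)
            feats += [0.0] * (len(DIFFICULTY) + len(RESPONSE) + len(DISTRACTION))
        else:
            d, resp, dist = q
            feats += one_hot(d, DIFFICULTY) + one_hot(resp, RESPONSE) + one_hot(dist, DISTRACTION)
    return feats


def fit_workload_model(samples, H=10):
    """samples: iterable of (wl0, history, wl_t) triples derived from the user study."""
    X = np.array([featurize(wl0, hist, H) for wl0, hist, _ in samples])
    y = np.array([wl for _, _, wl in samples])
    return LinearRegression().fit(X, y)  # learns gamma, the w_i blocks, and the bias w_0
```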

Model selection. We train models f_1, f_2, and f_{1,2}, corresponding to the datasets D_1, D_2, and D_{1,2}, respectively (Sec. IV-B). (When training these models, our assumptions about the time taken for users to complete the modified TLX surveys, particularly for users without mobility limitations, are given in the Appendix.) The models f_{i,sim} and f_{i,real} denote models learned for the simulation experiments and the real-world user study, respectively. For our simulation experiments, we learn 3 separate models f_{1,sim}, f_{2,sim}, and f_{1,2,sim} (details in the Appendix). For our real-world user study, we use only 2 of the 3 real models: f_{1,real} for the users without limitations and f_{1,2,real} for the users with limitations. In our setting, f_{1,real} and f_{1,2,real} are chosen based on cross-validation median test MSE on D_1 and D_{1,2}, respectively (full model selection details in Appendix). We also train a model f_{2,real} (cross-validation score: 0.066 ± 0.022), which has a higher mean and standard deviation of test MSE than f_{1,2,real} (cross-validation score: 0.058 ± 0.010). This suggests that training only on D_2 leads to a higher-variance, less accurate model than training on D_{1,2}. Thus, it is still helpful to include the data from users without limitations (D_1) when training our workload models.

V Human-in-the-Loop Algorithms

We develop decision-making algorithms that decide whether to ask for help or act autonomously, using the learned workload models in Sec. IV-C. Specifically, we consider four human-in-the-loop contextual bandit algorithms: one fully autonomous, and three that can query the human.

Fully autonomous algorithm. Our fully autonomous baseline is LinUCB [11], the state of the art for acquiring unseen bite-sized food items like banana and apple slices [5, 6]. LinUCB selects the action that maximizes a reward upper-confidence bound (UCB) estimate for each robot action, given by UCB_{a_r} = θ_{a_r}^T x + α b_{a_r}. Here, θ_{a_r} is the linear parameter vector learned through regression on contexts X_{a_r} and observed rewards for action a_r, where X_{a_r} includes contexts seen for a_r during pretraining and online validation. The term α > 0 corresponds to a confidence level, and b_{a_r} = (x^T (X_{a_r}^T X_{a_r} + λI)^{-1} x)^{1/2} is the UCB bonus with L2 regularization strength λ. The size of α b_{a_r} reflects the reward estimate uncertainty for the given context-action pair.
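A minimal per-arm LinUCB sketch is shown below, assuming a simple per-arm ridge statistic store; class and function names are illustrative, and a practical implementation would maintain the inverse incrementally rather than recomputing it.

```python
import numpy as np


class LinUCBArm:
    """Ridge-regression statistics for one robot action a_r (sketch)."""

    def __init__(self, dim, lam=1.0):
        self.A = lam * np.eye(dim)  # X_{a_r}^T X_{a_r} + lam * I
        self.b = np.zeros(dim)      # X_{a_r}^T r

    def ucb(self, x, alpha):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                 # estimate of theta_{a_r}
        bonus = float(np.sqrt(x @ A_inv @ x))  # b_{a_r}
        return theta @ x + alpha * bonus, theta @ x, bonus

    def update(self, x, r):
        self.A += np.outer(x, x)
        self.b += r * x


def linucb_select(arms, x, alpha=1.0):
    """Return the index of the robot action with the highest UCB value."""
    return int(np.argmax([arm.ucb(x, alpha)[0] for arm in arms]))
```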

Algorithm 1 LinUCB-QG
1: Inputs: Context x, scaling factor w, time t, initial workload WL_0, workload model f, model parameters θ
2: For all a_r: compute UCB bonus and value (b_{a_r}, UCB_{a_r}).
3: Let a^* = argmax_{a_r} θ_{a_r}^T x, a^- = argmax_{a_r ≠ a^*} θ_{a_r}^T x
4: Define G ≜ (θ_{a^-}^T x + α b_{a^-}) − (θ_{a^*}^T x − α b_{a^*})
5: Set a = a_q if G > w · f(WL_0, Q_t; θ); otherwise a = argmax_{a_r} UCB_{a_r}
6: return a

Querying algorithms. The querying algorithms decide whether to query the human or select the action that maximizes UCB_{a_r}.

  1. AlwaysQuery: always queries the user. (For a particular food context x, we assume that the user provides the optimal action a^*(x) when queried. However, the expert action may sometimes fail due to inherent action uncertainty, e.g. due to food property variability. Thus, if a^*(x) initially fails, AlwaysQuery will repeatedly execute a^*(x) until success.)

  2. LinUCB-ExpDecay: queries with exponentially-decaying probability (decay rate c) depending on the number of food items seen in an episode (see Appendix).

  3. LinUCB-Query-Gap (LinUCB-QG), defined in Algorithm 1: queries if the worst-case performance gap between the best action a^* and second-best action a^- exceeds the predicted workload, with scaling factor w.

Method              r_task,avg        M_wt              f_q
Workload Model f_{1,sim}
   – LinUCB          0.292 ± 0.113     0.354 ± 0.079     –
   – AlwaysQuery     0.727 ± 0.195     0.591 ± 0.130     1 ± 0
   – LinUCB-ExpDecay 0.399 ± 0.129     0.425 ± 0.090     0.560 ± 0.150
   – LinUCB-QG       0.670 ± 0.237     0.593 ± 0.130     0.867 ± 0.189
Workload Model f_{2,sim}
   – LinUCB          0.292 ± 0.113     0.354 ± 0.079     –
   – AlwaysQuery     0.727 ± 0.195     0.359 ± 0.137     1 ± 0
   – LinUCB-ExpDecay 0.303 ± 0.116     0.353 ± 0.082     0.160 ± 0.150
   – LinUCB-QG       0.336 ± 0.069     0.342 ± 0.069     0.333 ± 0.094
Workload Model f_{1,2,sim}
   – LinUCB          0.292 ± 0.113     0.343 ± 0.079     –
   – AlwaysQuery     0.727 ± 0.195     0.359 ± 0.137     1 ± 0
   – LinUCB-ExpDecay 0.303 ± 0.116     0.330 ± 0.097     0.160 ± 0.150
   – LinUCB-QG       0.503 ± 0.041     0.376 ± 0.131     0.667 ± 0.189
TABLE I: Querying algorithms simulated on the test set, corresponding to w_task = 0.7, for 3 different data regimes in the workload model, averaged across 3 seeds. Values for r_task,avg and f_q correspond to the hyperparameter setting with maximal M_wt on the validation set (for each separate data setting).

In LinUCB-QG, the worst-case gap is the difference between a pessimistic estimate of a^*'s reward and an optimistic estimate of a^-'s reward, considering their confidence intervals (Fig. 2, top). A larger gap indicates a higher risk that the predicted best arm may be suboptimal, increasing the odds that the benefit of querying outweighs the workload penalty WL_t. Therefore, querying only when the gap is sufficiently large helps balance task reward and workload.

We define the predicted workload penalty to be the counterfactual workload WL_t if we were to query at the current time. To estimate this, we condition our workload model on the specific query type that we consider in our experiments. We use WL_t = f(WL_0, Q_t; θ), where we set the query type variables to d_t = easy, resp_t = MCQ, and dist_t = "no distraction task" for all t where the selected action a_t = a_q.
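A minimal sketch of the Algorithm 1 decision rule, built on the LinUCBArm sketch above; the "query" return label and the predicted_workload argument (the output of the workload model f) are illustrative assumptions.

```python
import numpy as np


def linucb_qg_select(arms, x, alpha, w, predicted_workload):
    """Algorithm 1 decision rule (sketch): query iff the worst-case gap G exceeds
    the scaled workload prediction w * f(WL_0, Q_t; theta)."""
    stats = [arm.ucb(x, alpha) for arm in arms]  # (ucb, mean, bonus) per robot action
    means = [s[1] for s in stats]
    best = int(np.argmax(means))                 # a^*
    second = int(np.argmax([m if i != best else -np.inf for i, m in enumerate(means)]))  # a^-
    # Optimistic estimate of a^- minus pessimistic estimate of a^*.
    gap = (means[second] + alpha * stats[second][2]) - (means[best] - alpha * stats[best][2])
    if gap > w * predicted_workload:
        return "query"                           # defer to the human (a_q)
    return int(np.argmax([s[0] for s in stats]))  # otherwise act like LinUCB
```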

VI EXPERIMENTS

We evaluate the performance of our human-in-the-loop algorithms using two setups: (i) a simulation testbed (Sec. VI-A), and (ii) a real-world user study with 19 subjects (Sec. VI-B). We evaluate our algorithms on a surrogate objective, adapted from E[r_t | x_t, a_t, t] (Sec. III; details in Appendix):

w_{task}\left[\frac{1}{T}\sum_{t=1}^{T}r_{task}(x_{t},a_{t})\right]-(1-w_{task})(WL_{T}-WL_{0})

VI-A Simulated Testbed

Refer to caption
Figure 4: Simulation and real-world results. Left (bottom): Workload comparison between three querying algorithms in simulation (D_2 setting), with a sequence of 5 food items. Vertical lines indicate the start of a new food item, while red circles indicate query timesteps. LinUCB-QG reduces querying workload WL_t compared to the other two methods, while offering competitive convergence times compared to LinUCB-ExpDecay. Left (top): LinUCB-QG balances real-world performance and user workload, significantly reducing workload compared to Always-Query (which achieves slightly higher success at greater user cost), as shown by three post-method metrics (Post) and three change-in-workload metrics (Delta). Right: LinUCB-QG significantly outperforms LinUCB, our autonomous baseline, in real-world bite acquisition success, both subjectively (Post-Performance) and objectively (r_task,avg). Error bars indicate standard error.

We simulate food interaction using a dataset of bite-sized food items on plates, including 16 food types [4], each with 30 trials for the 6 robot actions a_r. We first draw a random image of a given food type from the dataset and generate the corresponding context x using SPANet. When the policy selects a robot action a_r, we sample a binary task reward r from a Bernoulli distribution with success probability s_x(x, a_r). If r = 1, we declare that the bandit has converged and move to a new food type at the next timestep. If r = 0, we draw a new random image of the same food type from the dataset until we exceed the maximum number of attempts N_att = 10. When the policy selects the query action a_q, the bandit policy repeatedly executes the optimal action a^*(x) until convergence, or until the limit N_att is reached.
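The loop below sketches this per-food-item rollout under stated assumptions: sample_image, spanet_context, success_prob, and optimal_action are hypothetical placeholders for the dataset sampler, the SPANet featurizer, the success-probability table s_x, and the human oracle a^*(x), and the policy is assumed to expose select/update methods.

```python
def simulate_food_item(policy, food_type, rng, n_att=10,
                       sample_image=None, spanet_context=None,
                       success_prob=None, optimal_action=None):
    """Roll out one food item until success or n_att attempts (sketch).

    Returns (converged, attempts, queried)."""
    queried = False
    for attempt in range(1, n_att + 1):
        x = spanet_context(sample_image(food_type, rng))
        a = policy.select(x)
        if a == "query":
            queried = True
            a = optimal_action(x)                              # human supplies a*(x)
        r = 1.0 if rng.random() < success_prob(x, a) else 0.0  # Bernoulli task reward
        policy.update(x, a, r)
        if r == 1.0:
            return True, attempt, queried                      # converged; next food type
    return False, n_att, queried
```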

Metrics. We compute 5 objective metrics to evaluate the tradeoff between task performance and querying workload:

  • The mean task reward r_task,avg = (1/T) Σ_{t=1}^{T} r_task(x_t, a_t), measuring the efficiency of successful food acquisition.

  • The episodic change in workload ΔWL = WL_T − WL_0.

  • Our surrogate objective M_wt = w_task · r_task,avg − (1 − w_task) · ΔWL, where w_task represents a user-specific preference for the task performance/workload tradeoff (see the sketch after this list).

  • The number of timesteps taken to converge, t_conv, which provides additional insight into task performance.

  • The fraction of food items for which we queried, f_q, which provides insight into querying workload.
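A small sketch of how these episode-level metrics could be computed from a rollout trace; the argument names and trace format are illustrative assumptions.

```python
def evaluation_metrics(task_rewards, wl_trace, queried_flags, w_task=0.7):
    """Compute r_task,avg, ΔWL, M_wt, and f_q for one episode (sketch).

    task_rewards: per-timestep r_task values; wl_trace: WL_t over the episode
    (wl_trace[0] = WL_0, wl_trace[-1] = WL_T); queried_flags: one bool per food item."""
    r_task_avg = sum(task_rewards) / len(task_rewards)
    delta_wl = wl_trace[-1] - wl_trace[0]
    m_wt = w_task * r_task_avg - (1.0 - w_task) * delta_wl
    f_q = sum(queried_flags) / len(queried_flags)
    return {"r_task_avg": r_task_avg, "delta_wl": delta_wl, "m_wt": m_wt, "f_q": f_q}
```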

Evaluating querying workload in simulation. In our simulation environment, we use the learned workload model f both for LinUCB-QG action selection and for computing the evaluation metrics ΔWL and M_wt. During the rollout of a particular bandit policy, we evolve the workload model state WL_t according to the learned model whenever we query. We consider the three dataset scenarios and corresponding workload models f_{1,sim}, f_{2,sim}, and f_{1,2,sim} (Sec. IV-C), and set WL_0 = 0.5 (a median initial workload).

Pretrain, validation, and test sets. In our experiments, we partition the set of 16 food types into a pretraining set (for the contextual bandit and for training SPANet), a validation set (for tuning the querying algorithm hyperparameters), and a test set (for metric evaluation). Our validation set (cantaloupe, grape) and test set (banana, carrot) include food items with varying material properties. Our pretraining set includes the remaining 12 food types in the food dataset. We use the validation set to select the hyperparameter that maximizes the weighted metric M_wt given a w_task setting.

Results. We investigate how LinUCB-QG balances the tradeoff between task reward and querying workload. Table I compares the four algorithms for a setting with a slight preference for maximizing task performance over minimizing workload (w_task = 0.7). Across all three workload data settings, LinUCB-QG has a higher mean task reward r_task,avg than LinUCB and LinUCB-ExpDecay, and a mean weighted metric M_wt that is better than or competitive with the other methods (within the uncertainty of the simulation), suggesting that it achieves the best overall tradeoff between task reward and querying workload. Additionally, the querying fraction f_q for LinUCB-QG increases from D_2 to D_{1,2} to D_1, suggesting that LinUCB-QG is sensitive to the higher workload predictions with care recipient and OT data, while still trading off task reward and workload.

Experiment: generalization on D_2. We run an additional experiment to ablate D_2, where at test time we use either f_{1,sim} or f_{1,2,sim} in LinUCB-QG action selection, but use f_{2,sim} for metric evaluation (w_task = 0.4). We find that M_wt is lower in the first setting (0.269 ± 0.040) than in the second setting (0.373 ± 0.108), suggesting that including D_2 is crucial to achieve the best tradeoff between task reward and workload for users with mobility limitations. See Appendix for more details and comparison metrics.

VI-B Real-World User Study

To evaluate the real-world performance-workload tradeoff of LinUCB-QG, we conduct a user study with 19 users: 18 users without mobility limitations (8 male, 10 female; ages 19-31; 66% with prior robot interaction experience), and one 46-year-old user who has had Multiple Sclerosis since age 19. We investigate whether LinUCB-QG improves task performance compared to LinUCB, while minimizing workload compared to Always-Query. This study was approved by the IRB at Cornell University (Protocol #IRB0010477).

Study setup. We use a Kinova Gen-3 6-DoF robot arm with a custom feeding tool (details in Appendix). For each method, we present users with a plate of 3 food items (banana, carrot, cantaloupe) with diverse characteristics: cantaloupe works with all robot actions; carrots require sufficient penetration force and fork tines perpendicular to the food's major axis; bananas are soft, requiring tilted-angled actions [5]. We set N_att = 3 for a reasonable study duration. The user provides feedback using a speech-to-text interface when queried, without any additional distraction tasks. Users evaluate each of the three methods in a counterbalanced order. We measure workload before/after each method by asking 5 modified NASA-TLX subscale questions: mental/physical, temporal, performance, effort, and frustration (question wording in Appendix). We conduct 2 repetitions of the 3 food items for all 3 methods (18 total trials).

Metrics. We define 7 subjective and 3 objective metrics to compare the methods. 4 subjective metrics correspond to the modified post-method NASA-TLX questions: Post-Mental Workload, Post-Temporal Workload, Post-Performance, and Post-Effort Workload. The other 3 subjective metrics measure workload changes during each method: Delta-Mental Workload, Delta-Temporal Workload, and Delta-Querying Workload (the change in querying workload score; weighting function in Appendix). The 3 objective metrics are mean task reward per plate (r_task,avg), mean successes per plate (n_success), and mean query timesteps per plate (n_q).

Overall results. Among the three methods, LinUCB-QG offers the most balanced approach to human-in-the-loop bite acquisition. It is more efficient than LinUCB and imposes less querying workload than Always-Query. While LinUCB typically picks up the carrot or cantaloupe within 1-2 timesteps but struggles with the banana, Always-Query successfully picks up food in the first timestep by always querying the user, leading to higher workload. Our method, LinUCB-QG, selectively asks for help with the banana and acts autonomously for the other foods in most trials, balancing task performance with querying workload.

(1) Task success results: Fig. 4 (right) shows that LinUCB-QG achieves higher objective (r_task,avg) and subjective (Post-Performance) success ratings versus LinUCB, our autonomous baseline, showing greater efficiency. (2) Querying workload results: As shown in Fig. 4 (left, top), LinUCB-QG has lower subjective querying workload scores (mental, temporal, effort) compared to Always-Query, indicating that selective querying reduces workload. This is the case for both the post-method metrics (Post) and the workload change metrics (Delta). All results are statistically significant (Wilcoxon paired signed-rank test, α = 0.05). See Appendix for full comparisons across all metrics.

Results: user with mobility limitations. We highlight observations from the user with mobility limitations. First, they provided a higher mean subjective success rating for LinUCB-QG than for LinUCB, aligning with the aggregate results. However, the user provided consistently low physical/mental, temporal and effort ratings across all methods. Potential reasons for this include the relative simplicity of providing feedback in our setting, and the fact that this particular user frequently requests assistance in daily life, reducing their querying workload. The user also commented that relying on external physiological signals (such as EEG) to estimate workload would add stress (due to the additional hardware required), reinforcing the benefit of non-intrusive workload models for human-in-the-loop querying algorithms.

Discussion. Future work could incorporate different query types in the real-world study, beyond the best acquisition action, such as the other query types in our workload model. In our setting, task complexity depends on physical food properties such as compliance or shape. We did not explicitly focus on such variables when modeling workload, although query difficulty d_t does correlate with task complexity. The NASA-TLX scale has known limitations, such as subjectivity [38], over-emphasizing task difficulty [39], and problems with workload score calculation [40, 41], meaning that the modified NASA-TLX scores used to train our workload models are inherently noisy. We mitigate this by focusing on relative workload changes when evaluating how well our algorithms balance performance and workload. Additionally, we do not take measurements between queries in our workload dataset, so our workload models interpolate between queries using their learned weights, leading to rollouts that are not always intuitive (e.g. due to assigning greater weight to past queries). Future directions include predictive modeling with objective, non-intrusive workload measurements, and extensions to other assistive tasks where performance must be balanced with workload, using domain-agnostic workload features.

References

  • [1] Sidney Katz et al. “Studies of illness in the aged: the index of ADL: a standardized measure of biological and psychosocial function” In JAMA 185.12, American Medical Association, 1963, pp. 914–919
  • [2] Matthew W Brault “Americans with disabilities: 2010” In Current population reports 7, 2012, pp. 70–131
  • [3] Rishabh Madan et al. “Sparcs: Structuring physically assistive robotics for caregiving with stakeholders-in-the-loop” In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022, pp. 641–648 IEEE
  • [4] Ryan Feng et al. “Robot-assisted feeding: Generalizing skewering strategies across food items on a plate” In The International Symposium of Robotics Research, 2019, pp. 427–442 Springer
  • [5] Ethan K Gordon et al. “Adaptive robot-assisted feeding: An online learning framework for acquiring previously unseen food items” In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 9659–9666 IEEE
  • [6] Ethan K Gordon et al. “Leveraging post hoc context for faster learning in bandit settings with applications in robot-assisted feeding” In 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 10528–10535 IEEE
  • [7] Priya Sundaresan, Suneel Belkhale and Dorsa Sadigh “Learning visuo-haptic skewering strategies for robot-assisted feeding” In 6th Annual Conference on Robot Learning, 2022
  • [8] Daniel Gallenberger, Tapomayukh Bhattacharjee, Youngsun Kim and Siddhartha S Srinivasa “Transfer depends on acquisition: Analyzing manipulation strategies for robotic feeding” In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2019, pp. 267–276 IEEE
  • [9] Suneel Belkhale et al. “Balancing efficiency and comfort in robot-assisted bite transfer” In 2022 International Conference on Robotics and Automation (ICRA), 2022, pp. 4757–4763 IEEE
  • [10] Alessandra Malito “Grocery stores carry 40,000 more items than they did in the 1990s” In MarketWatch, June 17, 2017
  • [11] Lihong Li, Wei Chu, John Langford and Robert E Schapire “A contextual-bandit approach to personalized news article recommendation” In Proceedings of the 19th international conference on World wide web, 2010, pp. 661–670
  • [12] Tapomayukh Bhattacharjee et al. “Is more autonomy always better? exploring preferences of users with mobility impairments in robot-assisted feeding” In Proceedings of the 2020 ACM/IEEE international conference on human-robot interaction, 2020, pp. 181–190
  • [13] Yuchen Cui et al. “Understanding the relationship between interactions and outcomes in human-in-the-loop machine learning” In International Joint Conference on Artificial Intelligence, 2021
  • [14] Terrence Fong, Charles Thorpe and Charles Baur “Robot, asker of questions” In Robotics and Autonomous systems 42.3-4 Elsevier, 2003, pp. 235–243
  • [15] Shayan Shayesteh and Houtan Jebelli “Investigating the impact of construction robots autonomy level on workers’ cognitive load” In Canadian Society of Civil Engineering Annual Conference, 2021, pp. 255–267 Springer
  • [16] Ayca Aygun et al. “Investigating Methods for Cognitive Workload Estimation for Assistive Robots” In Sensors, 2022
  • [17] Lex Fridman, Bryan Reimer, Bruce Mehler and William T Freeman “Cognitive load estimation in the wild” In Proceedings of the 2018 chi conference on human factors in computing systems, 2018, pp. 1–9
  • [18] Sandra G Hart and Lowell E Staveland “Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research” In Advances in psychology 52 Elsevier, 1988, pp. 139–183
  • [19] Maithra Raghu et al. “The algorithmic automation problem: Prediction, triage, and human effort” In arXiv preprint arXiv:1903.12220, 2019
  • [20] Vijay Keswani, Matthew Lease and Krishnaram Kenthapadi “Towards unbiased and accurate deferral to multiple experts” In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 2021, pp. 154–165
  • [21] Harikrishna Narasimhan et al. “Post-hoc estimators for learning to defer to an expert” In Advances in Neural Information Processing Systems 35, 2022, pp. 29292–29304
  • [22] Hussein Mozannar and David Sontag “Consistent estimators for learning to defer to an expert” In International Conference on Machine Learning, 2020, pp. 7076–7087 PMLR
  • [23] Shalmali Joshi, Sonali Parbhoo and Finale Doshi-Velez “Learning-to-defer for sequential medical decision-making under uncertainty” In arXiv preprint arXiv:2109.06312, 2021
  • [24] Tesca Fitzgerald et al. “INQUIRE: INteractive querying for user-aware informative REasoning” In 6th Annual Conference on Robot Learning, 2022
  • [25] Mattia Racca, Antti Oulasvirta and Ville Kyrki “Teacher-aware active robot learning” In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2019, pp. 335–343 IEEE
  • [26] Amber Li and Tom Silver “Embodied Active Learning of Relational State Abstractions for Bilevel Planning” In arXiv preprint arXiv:2303.04912, 2023
  • [27] Joshua Bhagat Smith, Prakash Baskaran and Julie A Adams “Decomposed physical workload estimation for human-robot teams” In 2022 IEEE 3rd International Conference on Human-Machine Systems (ICHMS), 2022, pp. 1–6 IEEE
  • [28] Caroline E Harriott, Tao Zhang and Julie A Adams “Assessing physical workload for human–robot peer-based teams” In International Journal of Human-Computer Studies 71.7-8 Elsevier, 2013, pp. 821–837
  • [29] Akilesh Rajavenkatanarayanan, Harish Ram Nambiappan, Maria Kyrarini and Fillia Makedon “Towards a real-time cognitive load assessment system for industrial human-robot cooperation” In 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 2020, pp. 698–705 IEEE
  • [30] Muneeb Imtiaz Ahmad, Jasmin Bernotat, Katrin Lohan and Friederike Eyssel “Trust and cognitive load during human-robot interaction” In arXiv preprint arXiv:1909.05160, 2019
  • [31] Tapomayukh Bhattacharjee, Gilwoo Lee, Hanjun Song and Siddhartha S Srinivasa “Towards robotic feeding: Role of haptics in fork-based food manipulation” In IEEE Robotics and Automation Letters 4.2 IEEE, 2019, pp. 1485–1492
  • [32] Pallavi Koppol, Henny Admoni and Reid G Simmons “Interaction Considerations in Learning from Humans.” In IJCAI, 2021, pp. 283–291
  • [33] Morten Hertzum “Reference values and subscale patterns for the task load index (TLX): a meta-analytic review” In Ergonomics 64.7 Taylor & Francis, 2021, pp. 869–878
  • [34] Jennifer Mankoff, Gillian R Hayes and Devva Kasnitz “Disability studies as a source of critical inquiry for the field of assistive technology” In Proceedings of the 12th international ACM SIGACCESS conference on Computers and accessibility, 2010, pp. 3–10
  • [35] Amal Nanavati, Vinitha Ranganeni and Maya Cakmak “Physically assistive robots: A systematic review of mobile and manipulator robots that physically assist people with disabilities” In Annual Review of Control, Robotics, and Autonomous Systems 7 Annual Reviews, 2023
  • [36] Clive WJ Granger “Investigating causal relations by econometric models and cross-spectral methods” In Econometrica: journal of the Econometric Society JSTOR, 1969, pp. 424–438
  • [37] Tengyuan Liang and Benjamin Recht “Randomization Inference When N Equals One” In arXiv preprint arXiv:2310.16989, 2023
  • [38] Sandra G Hart “NASA-task load index (NASA-TLX); 20 years later” In Proceedings of the human factors and ergonomics society annual meeting 50.9, 2006, pp. 904–908 Sage publications Sage CA: Los Angeles, CA
  • [39] Ryan D McKendrick and Erin Cherry “A deeper look at the NASA TLX and where it falls short” In Proceedings of the Human Factors and Ergonomics Society Annual Meeting 62.1, 2018, pp. 44–48 SAGE Publications Sage CA: Los Angeles, CA
  • [40] Kai Virtanen, Heikki Mansikka, Helmiina Kontio and Don Harris “Weight watchers: NASA-TLX weights revisited” In TheoreTical issues in ergonomics science 23.6 Taylor & Francis, 2022, pp. 725–748
  • [41] Matthew L Bolton, Elliot Biltekoff and Laura Humphrey “The mathematical meaninglessness of the NASA task load index: A level of measurement analysis” In IEEE Transactions on Human-Machine Systems 53.3 IEEE, 2023, pp. 590–599
  • [42] Rajat Kumar Jenamani et al. “FLAIR: Feeding via Long-Horizon AcquIsition of Realistic dishes” Under submission to Robotics: Science and Systems (RSS) 2024.
  • [43] Shilong Liu et al. “Grounding dino: Marrying dino with grounded pre-training for open-set object detection” In arXiv preprint arXiv:2303.05499, 2023
  • [44] Alexander Kirillov et al. “Segment anything” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026

APPENDIX

VI-C Problem Formulation

VI-C1 Details on Contextual Bandit Setting

Below we provide more details on the contextual bandit setting.

  • Observation space O: In each RGB image, the food item is either isolated, close to the plate edge, or on top of another food item [4].

  • Action space A: Each robot action a_r is a pair consisting of one of three pitch configurations (tilted-angled (TA), vertical skewer (VS), tilted-vertical (TV)) and one of two roll configurations (0°, 90°), relative to the orientation of the food [31, 4].

  • Context space X: SPANet, the network from which we derive our contexts x, is pretrained in a fully-supervised manner to predict s_o(o, a), the probability that action a succeeds for observation o.

VI-D Querying Workload User Study

VI-D1 NASA-TLX weighting function

Throughout the paper, we define the querying workload state WL to be the following weighted sum of the raw NASA-TLX subscales: WL = 0.4 · mental + 0.2 · temporal + 0.4 · effort.

VI-D2 Modified NASA-TLX Subscale Questions

Below are the question wordings that we use in our online workload user study, which are modified versions of the NASA-TLX subscales:

  • Mental/Physical Demand: “How mentally or physically demanding was the task?”

  • Temporal Demand: “How hurried or rushed was the pace of the task?”

  • Performance: “How successful were you in accomplishing what you were asked to do?”

  • Effort: “How hard did you have to work to accomplish your level of performance?”

  • Frustration: “How irritated, stressed, and annoyed were you?”

Note that for the users without mobility limitations, the wording of the Mental/Physical Demand task was the following: “How mentally demanding was the task?”. This is because we assumed that the physical demand to answer the workload questions was minimal for users without mobility limitations.

VI-D3 Baseline conditions

We include 2 baseline conditions. Each baseline condition corresponds to a fixed setting of dist_t (either the easy addition task or the hard video transcription task), without any robot query tasks.

VI-D4 Sample study questions

Below we provide examples of the robot queries that we ask in the online querying workload user study, covering each of the possible response types r_t and question difficulties d_t:

  • r_t = MCQ, d_t = easy: “What kind of food item is outlined in the image below?” (Responses: “Raspberry”, “Strawberry”, “Grape”, “Apple”, “I don’t know.”)

  • r_t = BB, d_t = easy: “Draw a box around only the strawberry.”

  • r_t = OE, d_t = easy: “Why did the Robot fail in acquiring the cantaloupe?”

  • r_t = MCQ, d_t = hard: “Which of the following images is tofu?” (Responses: “Left”, “Right”, “I don’t know”)

  • r_t = BB, d_t = hard: “Draw a box around only the carrot in the bottom right.”

  • r_t = OE, d_t = hard: “How would you skewer the following item with a fork?”

VI-D5 Compensation

Users without mobility limitations were compensated at a rate of $12/hr, for a total compensation of $18 for the entire study (as the study duration was 1.5 hr). The occupational therapists who simulated mobility limitations were compensated at a rate of $10/hr, for a total compensation of $15 for the entire study. The users with mobility limitations were compensated at a rate of $20/hr, for a total compensation of $30 for the entire study.

VI-D6 Mean querying workload difference, D_1 vs D_2

In Section IV-B, we describe finding a statistically significant difference in mean querying workload between D_1 (users without limitations) and D_2 (users with limitations and OTs), where D_1 had a mean querying workload of 0.36 ± 0.29 and D_2 had a mean querying workload of 0.40 ± 0.28. We run a Mann-Whitney U-test on the querying workload scores from D_1 and D_2. The alternative hypothesis that we test is that the distribution of workload values for D_1 is stochastically less than the distribution of workload values for D_2. For this alternative hypothesis, we find p = 5.410e-05, indicating a statistically significant result for α = 0.05.

VI-E Querying Workload Predictive Model

VI-E1 Modified TLX survey timing information

When training the workload models, we make the following assumptions about the amount of time taken for users to complete the modified TLX surveys. In the D_1 setting, we assume 0 seconds for the full user population (which only includes users without mobility limitations). In the D_{1,2} setting, we assume 5 seconds for users without mobility limitations (and for two users in the D_2 population for whom we did not collect time data); for the rest of the users with mobility limitations, we use the logged time taken to complete the surveys. Our justification for this design choice is that in the D_{1,2} setting, we would like the assumed time for users without mobility limitations to be appropriately scaled relative to the logged times for users with limitations, who took non-zero times to complete the surveys.

VI-E2 Discrete-time Models

We consider different values of H, with one memoryless Granger model (where H = 1) and a set of Granger models with H ranging from 5 to 30. We chose the upper bound of H = 30 to roughly correspond to the study condition length in Section IV-A, since we set each discrete time step to correspond to 10 s. We zero-pad the query variables [d_{t-i}, resp_{t-i}, dist_{t-i}] if there are no queries in the discrete time window t − i, or if we run out of history.

For each of these models, we also consider variants where we impose non-negativity constraints and/or ridge-regression penalties on the weights γ and w_i^T. Specifically, we consider 4 different variants:

  • Granger: no nonnegativity or ridge-regression penalty

  • Granger-Nonnegative (Granger-N): nonnegativity constraint only

  • Granger-Ridge (Granger-R): ridge-regression penalty only

  • Granger-Ridge-Nonnegative (Granger-RN): non-negativity constraint and ridge-regression penalty

VI-E3 Continuous-time Models

We also consider a continuous-time setting, where WL(t) represents the querying workload at time t. In this setting, we consider one model (denoted Exp-Impulse) that models the workload at the current time t as composed of a series of impulses at the query times, with an exponential decay in workload between queries. The Exp-Impulse model is defined recursively as follows:

WL(t)=WL(t_{prev})\,e^{-\lambda(t-t_{prev})}+\beta(d_{t},r_{t},dist_{t})

where t_prev is the previous timestep at which we queried, β(d_t, r_t, dist_t) is the magnitude of the impulse, and λ is the workload decay rate.
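A one-step sketch of this recursion is below; the dictionary-based impulse lookup beta and the (d, resp, dist) key format are illustrative assumptions.

```python
import math


def exp_impulse_workload(wl_prev, t_prev, t, query, beta, lam):
    """One step of the Exp-Impulse model (sketch): decay the workload from the
    previous query time, then add the impulse for the current query.

    query is a (d, resp, dist) tuple; beta maps it to an impulse magnitude."""
    decayed = wl_prev * math.exp(-lam * (t - t_prev))
    return decayed + beta[query]
```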

VI-E4 Model Selection

For each of the three dataset settings (D_1, D_2, D_{1,2}), we perform 4-fold cross-validation to select a model from the above set of linear models. Dataset D_1 consists of N = 4272 unique question/workload pairs, while dataset D_2 consists of N = 705 unique question/workload pairs. For tuning the ridge-regression regularization parameter λ_r, we perform 5-fold cross-validation to select the optimal parameter value over a logarithmic range from 10^-3 to 10^2.

Table II shows predictive MSE statistics on the held-out test splits, computed across the 4 folds, for the set of learned models. Note that we also considered a constant baseline (denoted Constant), whose predicted workload is WL_t = WL_0, and an average baseline (denoted Average), whose predicted workload is the average value of WL_0 in the training set.

Real-world user study models. We describe the two real-world workload models (f_{1,real}, f_{1,2,real}) that are used for the real-world experiments in Section VI-B, along with the mobility-limitation-only model f_{2,real} described in Section IV-C. The model f_{1,real} is a Granger model with H = 5 and a ridge-regression penalty on γ and w_i^T, where the final selected ridge-regression penalty was λ_r = 10. The model f_{2,real} is a Granger model with H = 1. The model f_{1,2,real} is a Granger model with H = 1 and a ridge-regression penalty on γ and w_i^T, where the final selected ridge-regression penalty was λ_r = 10. The model f_{1,real} achieves a cross-validation mean test MSE of 0.0573 ± 0.0127 on D_1, while f_{1,2,real} achieves a cross-validation mean test MSE of 0.0582 ± 0.0099 on the D_{1,2} dataset.

We use these models because they are the models with the lowest median test MSE on each dataset. We initially considered using mean test MSE to select the best-performing model, but we discovered that the mean test MSE had a very large magnitude for certain models. This is because for the model settings that do not have weight constraints, linear regression would overfit to the training set and learn model parameters with large weight magnitudes. Because of this, we use the median test MSE to select the model. Note that while the models have differences in their mean test MSE, the standard deviations in test MSE for the models overlap because of the variance across folds.

Simulated models. We describe the three simulated workload models (f_{1,sim}, f_{2,sim}, f_{1,2,sim}) that are used in the experiments in Section VI-A. For all three models, we use Granger models with H = 10, where we place the following constraints on the parameters during model training: γ, {w_i^T}_{i=0}^{H-1} ∈ [0.05, 1], w_0 ≤ 1. Placing these additional constraints guarantees that the predicted change in workload associated with a query is non-negative.

Model | $H$ | $D_{1}$ Test MSE ($\mu \pm \sigma$) | $D_{1}$ Test MSE (median) | $D_{2}$ Test MSE ($\mu \pm \sigma$) | $D_{2}$ Test MSE (median) | $D_{1,2}$ Test MSE ($\mu \pm \sigma$) | $D_{1,2}$ Test MSE (median)
Constant | 0 | 0.1271 ± 0.0173 | 0.1304 | 0.104 ± 0.0245 | 0.112 | 0.123 ± 0.0164 | 0.124
Average | 0 | 0.0937 ± 0.0086 | 0.0922 | 0.089 ± 0.0319 | 0.0933 | 0.0919 ± 0.00806 | 0.0944
Granger | 1 | 0.0572 ± 0.0126 | 0.0507 | 0.066 ± 0.0219 | **0.0761** | 0.0583 ± 0.00988 | 0.0573
Granger | 5 | 7.59e+15 ± 1.31e+16 | 0.0656 | 3.95e+23 ± 6.68e+23 | 1.38e+22 | 3.66e+16 ± 6.34e+16 | 0.0581
Granger | 10 | 3.16e+18 ± 5.47e+18 | 0.0661 | 1.08e+24 ± 1.88e+24 | 2.24e+20 | 3.38e+15 ± 5.86e+15 | 0.058
Granger | 15 | 1.87e+17 ± 3.23e+17 | 0.0660 | 2.08e+21 ± 3.23e+21 | 3.23e+20 | 3.45e+15 ± 5.98e+15 | 0.0582
Granger | 20 | 5.41e+17 ± 9.37e+17 | 0.0663 | 2.43e+24 ± 4.21e+24 | 1.72e+21 | 1.92e+12 ± 3.32e+12 | 0.0584
Granger | 25 | 1.39e+18 ± 2.41e+18 | 0.0668 | 5.05e+23 ± 7.95e+23 | 7.11e+22 | 3.07e+21 ± 5.32e+21 | 0.0587
Granger | 30 | 1.09e+20 ± 1.89e+20 | 0.0677 | 1e+24 ± 1.34e+24 | 3.64e+23 | 5.78e+21 ± 1e+22 | 0.059
Granger-N | 1 | 0.0573 ± 0.0126 | 0.0507 | 0.0662 ± 0.0217 | 0.0762 | 0.0584 ± 0.00997 | 0.0576
Granger-N | 5 | 0.0575 ± 0.0128 | 0.0508 | 0.0665 ± 0.0215 | 0.0763 | 0.0586 ± 0.0101 | 0.0576
Granger-N | 10 | 0.0575 ± 0.0128 | 0.0508 | 0.0682 ± 0.0199 | 0.0769 | 0.0585 ± 0.00997 | 0.0576
Granger-N | 15 | 0.0574 ± 0.0126 | 0.0508 | 0.0698 ± 0.0196 | 0.0784 | 0.0586 ± 0.00985 | 0.0578
Granger-N | 20 | 0.0575 ± 0.0127 | 0.0508 | 0.0705 ± 0.0204 | 0.0798 | 0.0588 ± 0.00982 | 0.0581
Granger-N | 25 | 0.0577 ± 0.0126 | 0.0508 | 0.0726 ± 0.0211 | 0.0819 | 0.0589 ± 0.00991 | 0.0582
Granger-N | 30 | 0.0578 ± 0.0125 | 0.0510 | 0.0735 ± 0.0205 | 0.083 | 0.0591 ± 0.00994 | 0.0583
Granger-R | 1 | 0.0572 ± 0.0127 | 0.0507 | 0.066 ± 0.0219 | 0.0763 | 0.0582 ± 0.00988 | **0.0572**
Granger-R | 5 | 0.0573 ± 0.0127 | **0.0506** | 0.0663 ± 0.0216 | 0.0772 | 0.0584 ± 0.0099 | 0.0576
Granger-R | 10 | 0.0574 ± 0.0125 | 0.0509 | 0.0675 ± 0.0217 | 0.0783 | 0.0584 ± 0.00964 | 0.0574
Granger-R | 15 | 0.0574 ± 0.0124 | 0.0512 | 0.0691 ± 0.0219 | 0.0801 | 0.0584 ± 0.00948 | 0.0575
Granger-R | 20 | 0.0575 ± 0.0124 | 0.0512 | 0.0687 ± 0.0221 | 0.0795 | 0.0588 ± 0.00939 | 0.0582
Granger-R | 25 | 0.0579 ± 0.0123 | 0.0514 | 0.0694 ± 0.0223 | 0.0801 | 0.0589 ± 0.00951 | 0.0583
Granger-R | 30 | 0.0582 ± 0.0122 | 0.0515 | 0.0694 ± 0.0222 | 0.0801 | 0.0591 ± 0.00956 | 0.0583
Granger-RN | 1 | 0.0573 ± 0.0126 | 0.0508 | 0.0719 ± 0.0246 | 0.0819 | 0.0589 ± 0.0105 | 0.0574
Granger-RN | 5 | 5.66e+24 ± 9.80e+24 | 0.0655 | 9.96e+21 ± 1.73e+22 | 1.6e+15 | 5.88e+24 ± 1.02e+25 | 4.02e+15
Granger-RN | 10 | 0.6120 ± 0.9559 | 0.0658 | 5.16e+14 ± 8.94e+14 | 0.0894 | 0.0607 ± 0.0112 | 0.061
Granger-RN | 15 | 0.0573 ± 0.0125 | 0.0508 | 5.62e+12 ± 9.73e+12 | 0.0862 | 0.0589 ± 0.0104 | 0.0574
Granger-RN | 20 | 0.0574 ± 0.0125 | 0.0509 | 3.46e+13 ± 3.49e+13 | 3.15e+13 | 4.45e+04 ± 7.71e+04 | 0.0676
Granger-RN | 25 | 0.1189 ± 0.1179 | 0.0514 | 1.1e+14 ± 1.9e+14 | 0.0895 | 0.059 ± 0.0104 | 0.0575
Granger-RN | 30 | 0.1195 ± 0.1021 | 0.0662 | 1.92e+14 ± 2.7e+14 | 5.65e+13 | 0.059 ± 0.0104 | 0.0575
Exp-Impulse | – | 0.0737 ± 0.0128 | 0.0678 | 0.0809 ± 0.0199 | 0.0878 | 0.0753 ± 0.0112 | 0.0766
TABLE II: Test mean-squared error (MSE) values for querying workload models, for all three workload data settings ($D_{1}$, $D_{2}$, $D_{1,2}$). For each data setting, the model with the lowest median test MSE is indicated in bold.

VI-F Human-in-the-Loop Algorithms

We provide the full algorithmic description for LinUCB-ExpDecay in Algorithm 2.

Algorithm 2 LinUCB-ExpDecay.
1: Inputs: Context $x$, decay rate $c$, number of food items seen $N$, time $t$
2: For all robot actions $a_{r}$, compute UCB bonus $b_{a_{r}}$ and UCB value $UCB_{a_{r}}$.
3: Set $P(query) = e^{-cN}$ if $t$ is the first timestep to observe $x$, and $P(query) = 0$ otherwise.
4: Set $a = a_{q}$ with probability $P(query)$, and $a = \arg\max_{a_{r}} UCB_{a_{r}}$ with probability $1 - P(query)$.
5: return $a$
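The Python sketch below mirrors Algorithm 2 under standard LinUCB assumptions (per-action statistics $A_{a}$ and $b_{a}$, bonus scale $\alpha$); the data structures and the string returned for the query action $a_{q}$ are illustrative choices, not our exact implementation.

```python
import numpy as np

def linucb_expdecay_action(x, A, b, alpha, c, n_items_seen, first_time_seeing_x, rng):
    """Sketch of Algorithm 2 (LinUCB-ExpDecay). A[a] and b[a] are the usual
    per-action LinUCB statistics for each robot action a_r; alpha scales the
    UCB bonus; c is the decay rate; n_items_seen is N in the algorithm."""
    ucb = {}
    for a in A:  # robot actions a_r
        A_inv = np.linalg.inv(A[a])
        theta = A_inv @ b[a]
        bonus = alpha * np.sqrt(x @ A_inv @ x)  # UCB bonus b_{a_r}
        ucb[a] = theta @ x + bonus              # UCB value UCB_{a_r}
    # Query probability decays with the number of food items seen, and applies
    # only on the first timestep at which the context x is observed.
    p_query = np.exp(-c * n_items_seen) if first_time_seeing_x else 0.0
    if rng.random() < p_query:
        return "query"                          # the query action a_q
    return max(ucb, key=ucb.get)                # argmax over robot actions
```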
Figure 5: Example LinUCB-QG rollout in the simulation environment, illustrating the evolution of the performance gap $G$ and the workload $WL_{t}$, along with the timesteps corresponding to query times. LinUCB-QG decides to ask the human for help when $G > w \cdot WL_{t}$ (in this rollout, $w = 4$).

VI-G Experiments: Simulated Testbed

VI-G1 Surrogate Objective

Here we justify our choice of the surrogate objective outlined in Section VI. Recall that the expected reward $\mathbb{E}[r_{t} | x_{t}, a_{t}, t]$ defined in Section III depends explicitly on the querying workload $WL_{t}$ for all times $t$ at which $a_{t} = a_{q}$. However, measuring the intermediate workload values would require administering a survey after every query, which is impractical in the real world. Therefore, in our experimental formulation in both simulation and the real study, we assume that we cannot observe the intermediate workload values. Instead, we observe only the initial and final workload values, $WL_{0}$ and $WL_{T} = f(WL_{0}, Q_{T}; \theta)$, respectively.

VI-G2 Additional Results

First, we include a set of additional metrics for the experimental setting described in Section VI-A, which focus on the observed convergence for each food item. We define the following metrics: $f_{fail,food}$, the fraction of food items for which the algorithm was unable to converge, and $f_{auto,food}$, the fraction of food items for which the algorithm converged autonomously. We also report $\Delta WL$, the change in querying workload across an episode, and $t_{conv}$, the number of timesteps required to converge to the optimal action. Table III shows these metrics for $w_{task} = 0.4$.
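For concreteness, the sketch below computes these four metrics from per-food-item episode records; the dictionary keys are hypothetical names for quantities logged during a rollout.

```python
import numpy as np

def convergence_metrics(episodes):
    """Each episode dict is assumed to record whether the algorithm converged
    for that food item, whether it queried the human, the timesteps to
    convergence, and the workload at the start and end of the episode."""
    n = len(episodes)
    f_fail = sum(not e["converged"] for e in episodes) / n
    f_auto = sum(e["converged"] and not e["queried"] for e in episodes) / n
    conv_times = [e["t_conv"] for e in episodes if e["converged"]]
    t_conv = float(np.mean(conv_times)) if conv_times else float("nan")
    delta_wl = float(np.mean([e["wl_end"] - e["wl_start"] for e in episodes]))
    return f_fail, f_auto, t_conv, delta_wl
```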

Method | $\Delta WL$ | $t_{conv}$ | $f_{fail,food}$ | $f_{auto,food}$
Workload model $f_{1,sim}$:
LinUCB | -0.500 ± 0.000 | 3.467 ± 1.087 | 0.067 ± 0.094 | 0.933 ± 0.094
AlwaysQuery | -0.274 ± 0.024 | 1.467 ± 0.340 | 0.000 ± 0.000 | 0.000 ± 0.000
LinUCB-ExpDecay | -0.485 ± 0.037 | 2.413 ± 0.463 | 0.040 ± 0.080 | 0.400 ± 0.146
LinUCB-QG | -0.414 ± 0.122 | 1.667 ± 0.499 | 0.000 ± 0.000 | 0.133 ± 0.189
Workload model $f_{2,sim}$:
LinUCB | -0.500 ± 0.000 | 3.467 ± 1.087 | 0.067 ± 0.094 | 0.933 ± 0.094
AlwaysQuery | 0.500 ± 0.000 | 1.467 ± 0.340 | 0.000 ± 0.000 | 0.000 ± 0.000
LinUCB-ExpDecay | -0.471 ± 0.107 | 3.297 ± 1.048 | 0.067 ± 0.094 | 0.773 ± 0.177
LinUCB-QG | -0.357 ± 0.101 | 3.133 ± 0.754 | 0.000 ± 0.000 | 0.667 ± 0.094
Workload model $f_{1,2,sim}$:
LinUCB | -0.463 ± 0.000 | 3.467 ± 1.087 | 0.067 ± 0.094 | 0.933 ± 0.094
AlwaysQuery | 0.500 ± 0.000 | 1.467 ± 0.340 | 0.000 ± 0.000 | 0.000 ± 0.000
LinUCB-ExpDecay | -0.395 ± 0.240 | 3.297 ± 1.048 | 0.067 ± 0.094 | 0.773 ± 0.177
LinUCB-QG | -0.078 ± 0.411 | 2.000 ± 0.163 | 0.000 ± 0.000 | 0.333 ± 0.189
TABLE III: Additional convergence metrics in simulation on the test set, corresponding to $w_{task} = 0.7$, for 3 different data regimes in the workload model, averaged across 3 seeds (for LinUCB-ExpDecay, we also average across 5 policy random seeds). Metric values for LinUCB-ExpDecay and LinUCB-QG correspond to the hyperparameter setting with maximal $M_{wt}$ on the validation set (for each separate data setting).

Next, we include full results for multiple settings of $w_{task}$ and $w_{cl}$, for the same querying workload models used in Section VI-A, ranging over $w_{task} \in \{0.2, 0.3, \dots, 0.9\}$; these are shown in Tables IV, V, and VI (excluding $w_{task} = 0.4$, whose results are shown in Table I). In the $f_{1,sim}$ setting, LinUCB-QG generally performs best for intermediate values of $w_{task}$ (corresponding to an intermediate emphasis on minimizing workload), while AlwaysQuery performs best for higher values of $w_{task}$ (corresponding to a strong emphasis on task performance, which AlwaysQuery achieves because it does not need to explore, unlike LinUCB-QG). In the $f_{2,sim}$ and $f_{1,2,sim}$ settings, LinUCB performs best for lower values of $w_{task}$ (because it never asks for help), while AlwaysQuery performs best for higher values of $w_{task}$ (where task performance is more critical). Nevertheless, in all data settings and across different $w_{task}$ values, LinUCB-QG offers better mean $r_{task}$ than LinUCB and LinUCB-ExpDecay, and competitive $M_{wt}$ compared to the other methods.

$w_{task}$ | Method | $r_{task,avg}$ | $M_{wt}$ | $f_{q}$
0.2 | LinUCB | 0.292 ± 0.113 | 0.458 ± 0.023 | 0.000 ± 0.000
0.2 | AlwaysQuery | 0.727 ± 0.195 | 0.365 ± 0.021 | 1.000 ± 0.000
0.2 | LinUCB-ExpDecay | 0.374 ± 0.143 | **0.475 ± 0.029** | 0.440 ± 0.150
0.2 | LinUCB-QG | 0.434 ± 0.049 | 0.458 ± 0.049 | 0.733 ± 0.094
0.3 | LinUCB | 0.292 ± 0.113 | 0.438 ± 0.034 | 0.000 ± 0.000
0.3 | AlwaysQuery | 0.727 ± 0.195 | 0.410 ± 0.042 | 1.000 ± 0.000
0.3 | LinUCB-ExpDecay | 0.374 ± 0.143 | **0.462 ± 0.043** | 0.440 ± 0.150
0.3 | LinUCB-QG | 0.434 ± 0.049 | 0.455 ± 0.048 | 0.733 ± 0.094
0.4 | LinUCB | 0.292 ± 0.113 | 0.417 ± 0.045 | 0.000 ± 0.000
0.4 | AlwaysQuery | 0.727 ± 0.195 | **0.455 ± 0.064** | 1.000 ± 0.000
0.4 | LinUCB-ExpDecay | 0.374 ± 0.143 | 0.449 ± 0.057 | 0.440 ± 0.150
0.4 | LinUCB-QG | 0.434 ± 0.049 | 0.452 ± 0.047 | 0.733 ± 0.094
0.5 | LinUCB | 0.292 ± 0.113 | 0.396 ± 0.056 | 0.000 ± 0.000
0.5 | AlwaysQuery | 0.727 ± 0.195 | 0.501 ± 0.086 | 1.000 ± 0.000
0.5 | LinUCB-ExpDecay | 0.374 ± 0.143 | 0.437 ± 0.071 | 0.440 ± 0.150
0.5 | LinUCB-QG | 0.670 ± 0.237 | **0.542 ± 0.059** | 0.867 ± 0.189
0.6 | LinUCB | 0.292 ± 0.113 | 0.375 ± 0.068 | 0.000 ± 0.000
0.6 | AlwaysQuery | 0.727 ± 0.195 | 0.546 ± 0.108 | 1.000 ± 0.000
0.6 | LinUCB-ExpDecay | 0.374 ± 0.143 | 0.424 ± 0.086 | 0.440 ± 0.150
0.6 | LinUCB-QG | 0.670 ± 0.237 | **0.567 ± 0.094** | 0.867 ± 0.189
0.8 | LinUCB | 0.292 ± 0.113 | 0.333 ± 0.090 | 0.000 ± 0.000
0.8 | AlwaysQuery | 0.727 ± 0.195 | **0.636 ± 0.152** | 1.000 ± 0.000
0.8 | LinUCB-ExpDecay | 0.399 ± 0.129 | 0.417 ± 0.103 | 0.560 ± 0.150
0.8 | LinUCB-QG | 0.670 ± 0.237 | 0.619 ± 0.166 | 0.867 ± 0.189
0.9 | LinUCB | 0.292 ± 0.113 | 0.313 ± 0.101 | 0.000 ± 0.000
0.9 | AlwaysQuery | 0.727 ± 0.195 | **0.682 ± 0.173** | 1.000 ± 0.000
0.9 | LinUCB-ExpDecay | 0.399 ± 0.129 | 0.408 ± 0.116 | 0.560 ± 0.150
0.9 | LinUCB-QG | 0.670 ± 0.237 | 0.644 ± 0.201 | 0.867 ± 0.189
TABLE IV: Workload data setting $f_{1,sim}$: querying bandit algorithm metrics in simulation on the test set for different values of $w_{task}$. Averages are across 3 random seeds (for LinUCB-ExpDecay, we also average across 5 policy random seeds). Metric values for LinUCB-ExpDecay and LinUCB-QG correspond to the hyperparameter setting with maximal $M_{wt}$ on the validation set (for each separate data setting).
$w_{task}$ | Method | $r_{task,avg}$ | $M_{wt}$ | $f_{q}$
0.2 | LinUCB | 0.292 ± 0.113 | **0.458 ± 0.023** | 0.000 ± 0.000
0.2 | AlwaysQuery | 0.727 ± 0.195 | -0.255 ± 0.039 | 1.000 ± 0.000
0.2 | LinUCB-ExpDecay | 0.303 ± 0.116 | 0.438 ± 0.084 | 0.160 ± 0.150
0.2 | LinUCB-QG | 0.336 ± 0.069 | 0.353 ± 0.088 | 0.333 ± 0.094
0.3 | LinUCB | 0.292 ± 0.113 | **0.438 ± 0.034** | 0.000 ± 0.000
0.3 | AlwaysQuery | 0.727 ± 0.195 | -0.132 ± 0.059 | 1.000 ± 0.000
0.3 | LinUCB-ExpDecay | 0.303 ± 0.116 | 0.421 ± 0.076 | 0.160 ± 0.150
0.3 | LinUCB-QG | 0.336 ± 0.069 | 0.351 ± 0.083 | 0.333 ± 0.094
0.4 | LinUCB | 0.292 ± 0.113 | **0.417 ± 0.045** | 0.000 ± 0.000
0.4 | AlwaysQuery | 0.727 ± 0.195 | -0.009 ± 0.078 | 1.000 ± 0.000
0.4 | LinUCB-ExpDecay | 0.303 ± 0.116 | 0.404 ± 0.072 | 0.160 ± 0.150
0.4 | LinUCB-QG | 0.336 ± 0.069 | 0.349 ± 0.078 | 0.333 ± 0.094
0.5 | LinUCB | 0.292 ± 0.113 | **0.396 ± 0.056** | 0.000 ± 0.000
0.5 | AlwaysQuery | 0.727 ± 0.195 | 0.113 ± 0.098 | 1.000 ± 0.000
0.5 | LinUCB-ExpDecay | 0.303 ± 0.116 | 0.387 ± 0.071 | 0.160 ± 0.150
0.5 | LinUCB-QG | 0.336 ± 0.069 | 0.347 ± 0.074 | 0.333 ± 0.094
0.6 | LinUCB | 0.292 ± 0.113 | **0.375 ± 0.068** | 0.000 ± 0.000
0.6 | AlwaysQuery | 0.727 ± 0.195 | 0.236 ± 0.117 | 1.000 ± 0.000
0.6 | LinUCB-ExpDecay | 0.303 ± 0.116 | 0.370 ± 0.075 | 0.160 ± 0.150
0.6 | LinUCB-QG | 0.336 ± 0.069 | 0.344 ± 0.071 | 0.333 ± 0.094
0.8 | LinUCB | 0.292 ± 0.113 | 0.333 ± 0.090 | 0.000 ± 0.000
0.8 | AlwaysQuery | 0.727 ± 0.195 | **0.481 ± 0.156** | 1.000 ± 0.000
0.8 | LinUCB-ExpDecay | 0.303 ± 0.116 | 0.336 ± 0.091 | 0.160 ± 0.150
0.8 | LinUCB-QG | 0.560 ± 0.051 | 0.466 ± 0.037 | 0.800 ± 0.163
0.9 | LinUCB | 0.292 ± 0.113 | 0.313 ± 0.101 | 0.000 ± 0.000
0.9 | AlwaysQuery | 0.727 ± 0.195 | **0.604 ± 0.176** | 1.000 ± 0.000
0.9 | LinUCB-ExpDecay | 0.399 ± 0.129 | 0.399 ± 0.112 | 0.560 ± 0.150
0.9 | LinUCB-QG | 0.560 ± 0.051 | 0.513 ± 0.044 | 0.800 ± 0.163
TABLE V: Workload data setting $f_{2,sim}$: querying bandit algorithm metrics in simulation on the test set for different values of $w_{task}$. Averages are across 3 random seeds (for LinUCB-ExpDecay, we also average across 5 policy random seeds). Metric values for LinUCB-ExpDecay and LinUCB-QG correspond to the hyperparameter setting with maximal $M_{wt}$ on the validation set (for each separate data setting).
$w_{task}$ | Method | $r_{task,avg}$ | $M_{wt}$ | $f_{q}$
0.2 | LinUCB | 0.292 ± 0.113 | **0.429 ± 0.023** | 0.000 ± 0.000
0.2 | AlwaysQuery | 0.727 ± 0.195 | -0.255 ± 0.039 | 1.000 ± 0.000
0.2 | LinUCB-ExpDecay | 0.303 ± 0.116 | 0.377 ± 0.188 | 0.160 ± 0.150
0.2 | LinUCB-QG | 0.503 ± 0.041 | 0.163 ± 0.330 | 0.667 ± 0.189
0.3 | LinUCB | 0.292 ± 0.113 | **0.411 ± 0.034** | 0.000 ± 0.000
0.3 | AlwaysQuery | 0.727 ± 0.195 | -0.132 ± 0.059 | 1.000 ± 0.000
0.3 | LinUCB-ExpDecay | 0.303 ± 0.116 | 0.367 ± 0.164 | 0.160 ± 0.150
0.3 | LinUCB-QG | 0.503 ± 0.041 | 0.206 ± 0.289 | 0.667 ± 0.189
0.4 | LinUCB | 0.292 ± 0.113 | **0.394 ± 0.045** | 0.000 ± 0.000
0.4 | AlwaysQuery | 0.727 ± 0.195 | -0.009 ± 0.078 | 1.000 ± 0.000
0.4 | LinUCB-ExpDecay | 0.303 ± 0.116 | 0.358 ± 0.142 | 0.160 ± 0.150
0.4 | LinUCB-QG | 0.503 ± 0.041 | 0.248 ± 0.249 | 0.667 ± 0.189
0.5 | LinUCB | 0.292 ± 0.113 | **0.377 ± 0.056** | 0.000 ± 0.000
0.5 | AlwaysQuery | 0.727 ± 0.195 | 0.113 ± 0.098 | 1.000 ± 0.000
0.5 | LinUCB-ExpDecay | 0.303 ± 0.116 | 0.349 ± 0.122 | 0.160 ± 0.150
0.5 | LinUCB-QG | 0.503 ± 0.041 | 0.291 ± 0.209 | 0.667 ± 0.189
0.6 | LinUCB | 0.292 ± 0.113 | **0.360 ± 0.068** | 0.000 ± 0.000
0.6 | AlwaysQuery | 0.727 ± 0.195 | 0.236 ± 0.117 | 1.000 ± 0.000
0.6 | LinUCB-ExpDecay | 0.303 ± 0.116 | 0.340 ± 0.106 | 0.160 ± 0.150
0.6 | LinUCB-QG | 0.503 ± 0.041 | 0.333 ± 0.170 | 0.667 ± 0.189
0.8 | LinUCB | 0.292 ± 0.113 | 0.326 ± 0.090 | 0.000 ± 0.000
0.8 | AlwaysQuery | 0.727 ± 0.195 | **0.481 ± 0.156** | 1.000 ± 0.000
0.8 | LinUCB-ExpDecay | 0.303 ± 0.116 | 0.321 ± 0.096 | 0.160 ± 0.150
0.8 | LinUCB-QG | 0.503 ± 0.041 | 0.418 ± 0.093 | 0.667 ± 0.189
0.9 | LinUCB | 0.292 ± 0.113 | 0.309 ± 0.101 | 0.000 ± 0.000
0.9 | AlwaysQuery | 0.727 ± 0.195 | **0.604 ± 0.176** | 1.000 ± 0.000
0.9 | LinUCB-ExpDecay | 0.399 ± 0.129 | 0.383 ± 0.111 | 0.560 ± 0.150
0.9 | LinUCB-QG | 0.503 ± 0.041 | 0.461 ± 0.059 | 0.667 ± 0.189
TABLE VI: Workload data setting $f_{1,2,sim}$: querying bandit algorithm metrics in simulation on the test set for different values of $w_{task}$. Averages are across 3 random seeds (for LinUCB-ExpDecay, we also average across 5 policy random seeds). Metric values for LinUCB-ExpDecay and LinUCB-QG correspond to the hyperparameter setting with maximal $M_{wt}$ on the validation set (for each separate data setting).

VI-G3 Additional Details: Generalization on $D_{2}$

In this experiment, the validation process is analogous to that of the previous simulation results, but on the test set we use separate workload models for LinUCB-QG action selection and for computing the evaluation metrics $\Delta WL$ and $M_{wt}$. We use either $f_{1,sim}$ or $f_{1,2,sim}$ as the action-selection workload model, while we fix the evaluation workload model to be $f_{2,sim}$, and refer to the two settings as $f_{1}$-on-$f_{2}$ and $f_{1,2}$-on-$f_{2}$, respectively. Table VII shows the results for LinUCB-QG in each of the two settings.

Metric | $f_{1}$-on-$f_{2}$ | $f_{1,2}$-on-$f_{2}$
$r_{task,avg}$ | 0.434 ± 0.049 | 0.503 ± 0.041
$\Delta WL$ | -0.159 ± 0.092 | -0.286 ± 0.159
$M_{wt}$ | 0.269 ± 0.040 | **0.373 ± 0.108**
$f_{q}$ | 0.733 ± 0.094 | 0.667 ± 0.189
TABLE VII: LinUCB-QG performance in the cross-data setting at test time, showing algorithm metrics in simulation on the test set, with $w_{task} = 0.4$. Averages are across 3 random seeds. Values for $r_{task,avg}$ and $\Delta WL$ correspond to the hyperparameter setting with maximal $M_{wt}$ on the validation set (for each separate data setting).

As mentioned in Section VI-A, we find that the mean weighted metric $M_{wt}$ is higher in the $f_{1,2}$-on-$f_{2}$ setting than in the $f_{1}$-on-$f_{2}$ setting (and $\Delta WL$ and $f_{q}$ are both lower).

VI-G4 Sample Workload Model Rollout

Finally, Figure 5 shows a rollout of the workload model in simulation for LinUCB-QG, illustrating how the workload $WL_{t}$ evolves over time, how the UCB estimate gap variable $G$ varies, and when LinUCB-QG decides to query. In this example, LinUCB-QG decides to ask the human for help when $G > w \cdot WL_{t}$ (in this rollout, $w = 4$); recall also that once LinUCB-QG has asked for help for the current food item, it continues to execute the expert action until it has converged.
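The querying logic illustrated in Figure 5 can be summarized with the small sketch below; here the gap $G$, the predicted workload $WL_{t}$, and the convergence flag are assumed to be computed elsewhere by LinUCB-QG, and the returned strings are placeholders for the corresponding actions.

```python
def linucb_qg_decision(G, WL_t, w=4.0, committed_to_expert=False, converged=False):
    """Sketch of the LinUCB-QG querying rule: ask the human when the performance
    gap exceeds w times the predicted querying workload; once the expert has been
    queried for the current food item, keep executing the expert action until
    the algorithm has converged."""
    if committed_to_expert and not converged:
        return "execute_expert_action"
    if G > w * WL_t:
        return "query_human"
    return "execute_best_ucb_action"
```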

VI-H Experiments: Real-world User Study

VI-H1 Feeding Tool

In our work, we use a custom feeding tool developed in prior work [\citeconsecutiverajatpriyarss2024, belkhale2022balancing], which is mounted on the Kinova arm's Robotiq 2F-85 gripper. The tool consists of a motorized fork with two degrees of freedom: pitch $\beta$ and roll $\gamma$. Each of the six robot actions $a_{r}$ (for $r \in \{1, \dots, 6\}$) corresponds to a unique combination of $\beta$ and $\gamma$. In particular, the pitch configuration (TA, VS, or TV) determines the value of $\beta$, while the roll configuration ($0^{\circ}$ or $90^{\circ}$) determines the value of $\gamma$.
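As an illustration, the action set can be thought of as a lookup table like the one below; the assignment of action indices to (pitch, roll) pairs is our own assumption, and the numeric $\beta$, $\gamma$ values are omitted.

```python
# Hypothetical mapping from robot actions a_r to fork configurations: the pitch
# setting (TA, VS, TV) determines beta and the roll setting (0 or 90 degrees)
# determines gamma. The index-to-configuration assignment is illustrative only.
ACTION_TO_FORK_CONFIG = {
    1: ("TA", 0), 2: ("TA", 90),
    3: ("VS", 0), 4: ("VS", 90),
    5: ("TV", 0), 6: ("TV", 90),
}

def fork_config(action_index: int) -> dict:
    pitch, roll_deg = ACTION_TO_FORK_CONFIG[action_index]
    return {"pitch": pitch, "roll_deg": roll_deg}
```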

VI-H2 Details on Perception Modules

To more accurately determine skewering-relevant geometric information for each bite-sized food item, we leverage perception modules developed in prior work [42]. We use GroundingDINO [43] to produce an initial bounding box around the food item, followed by SAM [44] to segment the food item, and finally extract the minimum-area rectangle around the segmentation mask to obtain a refined bounding box. From this bounding box, we derive the centroid and major axis of the food item, which are used by the low-level action executor to execute the desired robot action $a_{r}$.
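A sketch of the final geometric step, assuming a binary segmentation mask from SAM is already available, might look as follows; the function name and return format are ours, and the detection and segmentation calls themselves are omitted.

```python
import cv2
import numpy as np

def food_geometry_from_mask(mask: np.ndarray):
    """Given a binary segmentation mask (H x W), return the centroid, the
    major-axis angle, and the refined (rotated) bounding box corners obtained
    from the minimum-area rectangle around the mask."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)        # largest region = food item
    (cx, cy), (w, h), angle = cv2.minAreaRect(contour)  # refined rotated box
    # The major axis runs along the longer side of the rotated rectangle.
    major_axis_angle = angle if w >= h else angle + 90.0
    corners = cv2.boxPoints(((cx, cy), (w, h), angle))
    return (cx, cy), major_axis_angle, corners
```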

VI-H3 Additional Details and Analysis

Here we present additional details related to the real-world user study experiments outlined in Section VI-B.

Pre-method questions. We ask the users with mobility limitations the following questions prior to each method:

  • “How mentally/physically burdened do you feel currently?”

  • “How hurried or rushed do you feel currently?”

For users without mobility limitations, we ask the following questions:

  • “How mentally burdened do you feel currently?”

  • “How hurried or rushed do you feel currently?”

Post-method questions. We ask the users with mobility limitations the following questions after each method:

  • “For the last method, how mentally/physically burdened do you feel currently because of the robot querying you?”

  • “For the last method, how hurried or rushed do you feel currently because of the robot querying you?”

  • “For the last method, how hard did you have to work to make the robot pick up food items?”

  • “For the last method, how successful was the robot in picking up food items?”

For users without mobility limitations, we ask the following questions:

  • “For the last method, how mentally burdened do you feel currently because of the robot querying you?”

  • “For the last method, how hurried or rushed do you feel currently because of the robot querying you?”

  • “For the last method, how hard did you have to work to make the robot feed you?”

  • “For the last method, how successful was the robot in picking up the food item and bringing it to your mouth?”

Compensation. Users were compensated at a rate of $12/hr, for a total compensation of $18 for the entire study (as the study duration was 1.5 hr).

Note on initial workload $WL_{0}$. Although we obtain an initial estimate of the user's workload from the pre-method questions above, in our real-world experiments we provide LinUCB-QG with a fixed initial workload value of $WL_{0} = 0.5$. This value is used by LinUCB-QG for its internal workload predictions and decision-making. However, when calculating the subjective metrics in Section VI-B that measure changes in workload (Delta-Mental, Delta-Temporal, and Delta-Querying Workload), we use the responses to the pre-method questions to estimate the user's initial workload.

Method comparison across all metrics. Tables VIII and IX compare the three methods used in the real-world user study (LinUCB, AlwaysQuery, LinUCB-QG) on the 7 subjective and 3 objective metrics mentioned in Section VI-B, respectively. For all metrics, we find that LinUCB-QG achieves an intermediate value relative to the other two algorithms, suggesting that our method finds a balance between querying workload and task performance (which these metrics cover). Additionally, Table X shows the full results of the Wilcoxon paired signed-rank tests that we used to determine whether LinUCB-QG exhibited statistically significant differences on the full set of metrics compared to LinUCB and AlwaysQuery. For the metrics related to querying workload, our alternative hypothesis was that the value of the metric for LinUCB-QG is less than the value for AlwaysQuery. For the metrics related to task performance, our alternative hypothesis was that the value of the metric for LinUCB-QG is greater than the value for LinUCB. For all metrics, we find that LinUCB-QG shows a statistically significant improvement over the corresponding baseline.
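As a sketch of the statistical procedure (using made-up paired ratings rather than our study data), the one-sided tests can be run with SciPy as follows.

```python
import numpy as np
from scipy import stats

# Hypothetical paired per-participant ratings (one entry per participant).
linucb_qg = np.array([2, 3, 1, 2, 4, 2, 3, 1, 2, 3], dtype=float)
always_query = np.array([4, 4, 3, 3, 5, 4, 4, 2, 3, 5], dtype=float)

# Workload-related metric: one-sided test that LinUCB-QG < AlwaysQuery.
print(stats.wilcoxon(linucb_qg, always_query, alternative="less").pvalue)
# Task-performance metrics would instead test LinUCB-QG > LinUCB
# with alternative="greater".
```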

Method | Post-Mental | Post-Temporal | Post-Performance | Post-Effort | Delta-Mental | Delta-Temporal | Delta-Querying Workload
LinUCB | 1.24 ± 0.59 | 1.37 ± 0.97 | 3.03 ± 0.77 | 1.24 ± 0.75 | -0.34 ± 0.88 | -0.26 ± 0.92 | -0.02 ± 0.16
AlwaysQuery | 3.03 ± 1.38 | 2.58 ± 1.55 | 4.67 ± 0.72 | 3.03 ± 1.33 | 1.42 ± 1.39 | 0.87 ± 1.76 | 0.39 ± 0.31
LinUCB-QG | 2.26 ± 1.11 | 2.08 ± 1.22 | 4.06 ± 1.04 | 2.24 ± 1.05 | 0.63 ± 1.05 | 0.37 ± 1.08 | 0.21 ± 0.20
TABLE VIII: Subjective metric results for the real-world user study.
Method | $r_{task,avg}$ | $n_{success}$ | $n_{q}$
LinUCB | 0.38 ± 0.08 | 2.00 ± 0.23 | –
AlwaysQuery | 0.87 ± 0.25 | 2.68 ± 0.61 | 3 ± 0
LinUCB-QG | 0.68 ± 0.29 | 2.53 ± 0.72 | 1.74 ± 0.78
TABLE IX: Objective metric results for the real-world user study.
Metric | Hypothesis | $p$
Post-Mental | LinUCB-QG < AlwaysQuery | 3.28e-4
Post-Temporal | LinUCB-QG < AlwaysQuery | 1.42e-2
Post-Effort | LinUCB-QG < AlwaysQuery | 1.2e-4
Delta-Mental | LinUCB-QG < AlwaysQuery | 7.68e-4
Delta-Temporal | LinUCB-QG < AlwaysQuery | 1.82e-2
Delta-Querying Workload | LinUCB-QG < AlwaysQuery | 1.47e-4
$n_{q}$ | LinUCB-QG < AlwaysQuery | 2.93e-7
Post-Performance | LinUCB-QG > LinUCB | 1.59e-4
$r_{task,avg}$ | LinUCB-QG > LinUCB | 1.56e-4
$n_{success}$ | LinUCB-QG > LinUCB | 2.18e-4
TABLE X: Wilcoxon paired signed-rank test results for the real-world user study, showing $p$-values associated with querying workload metrics (top) and task performance metrics (bottom). For all tests, we used a significance level of $\alpha = 0.05$.

Comments on Post-Performance metric. When looking at the Post-Performance metric, we find that LinUCB-QG achieves an intermediate value, higher than LinUCB but lower than Always-Query. When asking users to rate subjective success, we asked them to rate the overall success of each algorithm, which depends on both whether the algorithm successfully picked up food items and how many attempts were required for success. For this reason, we believe that users rated the success of Always-Query highly because it efficiently selects the optimal action for each food item within one timestep, whereas LinUCB-QG sometimes takes multiple timesteps if it does not query for a particular food item.

ACKNOWLEDGMENT

This work was partly funded by NSF CCF 2312774 and NSF OAC-2311521, a LinkedIn Research Award, and a gift from Wayfair, and by NSF IIS 2132846 and CAREER 2238792. Research reported in this publication was additionally supported by the Eunice Kennedy Shriver National Institute Of Child Health & Human Development of the National Institutes of Health and the Office of the Director of the National Institutes of Health under Award Number T32HD113301. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

The authors would like to thank Ethan Gordon for his assistance with the food dataset, Ziang Liu, Pranav Thakkar and Rishabh Madan for their help with running the robot user study, Janna Lin for providing the voice interface, Shuaixing Chen for helping with figure creation, Tom Silver for paper feedback, and all of the participants in our two user studies.