Multi-User Augmented Reality with Infrastructure-free Collaborative Localization

John Miller, Elahe Soltanaghai, Raewyn Duvall,
Jeff Chen, Vikram Bhat, Nuno Pereira, Anthony Rowe Carnegie Mellon University

Abstract.

Multi-user augmented reality (AR) could someday empower first responders with the ability to see team members around corners and through walls. For this vision of people tracking in dynamic environments to be practical, we need a relative localization system that is nearly instantly available across wide-areas without any existing infrastructure or manual setup. In this paper, we present LocAR, an infrastructure-free 6-degrees-of-freedom (6DoF) localization system for AR applications that uses motion estimates and range measurements between users to establish an accurate relative coordinate system. We show that not only is it possible to perform collaborative localization without infrastructure or global coordinates, but that our approach provides nearly the same level of accuracy as fixed infrastructure approaches for AR teaming applications. LocAR uses visual-inertial odometry (VIO) in conjunction with ultra-wideband (UWB) ranging radios to estimate the relative position of each device in an ad-hoc manner. The system leverages a collaborative 6DoF particle filtering formulation that operates on sporadic messages exchanged between nearby users. Unlike map or landmark sharing approaches, this allows for collaborative AR sessions even if users do not overlap the same spaces. LocAR consists of an open-source UWB firmware and reference mobile phone application that can display the location of team members in real-time using mobile AR. We evaluate LocAR across multiple buildings under a wide-variety of conditions including a contiguous 30,000 square foot region spanning multiple floors and find that it achieves median geometric error in 3D of less than 1 meter between five users freely walking across 3 floors.

^†^†conference: ; ;

1. Introduction

Driven by advances in visual-inertial odometry (VIO), simultaneous localization and mapping (SLAM) and miniaturized depth sensing technologies, we are seeing augmented reality (AR) technologies becoming more accessible on a wide variety of platforms. Mobile phones are being equipped with specific dedicated hardware to enable richer AR experiences, including multiple cameras, specialized processors, UWB ranging radios (Apple, 2021) and small LIDAR depth sensors. Navigation applications like Google Maps AR mode, Ikea Place, and games like Pokemon Go have shown some of the early potential of AR on mobile phones.

While these single-user applications have largely been successful, developing interactive multi-user applications has proven to be substantially more difficult. Multi-user AR presents a unique set of challenges involving communication, synchronization, and localization between users. Overcoming such challenges opens the door for truly groundbreaking applications to emerge, both in mobile AR and for wearable headsets. Take for example a first responder or firefighter application, where teams of users navigate through a previously unexplored or harsh (damaged/modified) environment. With the availability of a robust multi-user AR platform, it would be possible to annotate people and their paths, and drop virtual markers in the environment in an entirely infrastructure-free manner. First responders could use headset AR to see the status and position of fellow teammates and the location of support vehicles even through walls without any a priori scene information. In the mobile phone context, this same type of platform could even help you find a friend at a concert venue or your keys at home.

In order to overlay virtual content that appears from the user’s perspective to be ”anchored” to the physical world, it is necessary to track the pose of the user’s display relative to the world. As the user rotates/translates the display, the projected content needs to move accordingly, which requires accurate 6-degree-of-freedom (6DoF) motion tracking. With a single user, it is sufficient to perform this tracking with respect to any arbitrary starting pose. The position and orientation of the origin is irrelevant, as long as the current pose is accurate with respect to that origin. With multiple users, the problem becomes more challenging – in order for each user to see the same virtual content at the same physical location, each tracking instance must share the same 6DoF origin. This requires some collaboration between the devices. While gravity provides a reference direction for one axis (provided the devices are equipped with accelerometers), magnetic field readings are inconsistent in indoor environments and thus unreliable as a yaw reference (Rajagopal et al., 2019).

Current AR frameworks, such as Apple’s ARKit and Google’s ARCore, as well as off-the-shelf headsets like Microsoft’s Hololens 2, now have provisions to enable multi-user applications (Apple, 2018). While each uses a slightly different approach, all rely on sharing visual (and depth) features between users in order to establish a common coordinate system. As each user detects distinguishable features in the environment, these features are collected into a map using SLAM. By sharing this map, other users are able to localize themselves if they detect the same visual features. While we are optimistic about the future of multi-user vision-based SLAM, the current frameworks currently fall short in terms of reliability and scalability across large areas. Finding a visual feature match that is robust enough to provide a common origin currently requires the two users to view the surrounding scene from a very similar perspective. However, in many scenarios, visual matching between the users is not possible even in close proximity. For example, if the users view the scene from different directions or under different lighting conditions, or if objects in the scene have been moved or become occluded, visual feature matching struggles. Additionally, in search and rescue applications, users are often purposefully taking disjoint paths through the environment to improve coverage, so there will be no common visual features for map matching (either because they are separated by walls or are in a visual denied environment with smoke). In order to provide building-scale coverage and beyond, it is necessary to maintain a large and (ideally) dense feature map, which quickly becomes impractical to store and share.

Refer to caption — Figure 1. LocAR offers a distributed infrastructure-free relative localization framework that allows multiple mobile users to create a collaborative AR session.

This paper addresses these limitations by proposing LocAR, a distributed relative localization framework that allows multiple AR users to create an on-demand collaborative AR session without any prior infrastructure. LocAR uses motion information from VIO and ranging measurements between users from UWB radios, which are now available on the latest generation of mobile phones and specialized AR headsets, to estimate the relative pose (6DoF) of each AR user. LocAR’s key innovation is the design of a collaborative particle filter that jointly estimates the 6DoF pose of all AR users relative to each other without requiring any map sharing or pre-existing localization infrastructure. Since it does not rely on sharing visual features, this approach is broadly applicable to static or dynamic environments, both indoor and outdoor, including search-and-rescue scenarios where the need for visual pre-mapping is a nonstarter. LocAR is an alternative technique for setting up multi-user AR sessions by sharing inertial data and ranges as opposed to sharing nearby landmarks.

To achieve this, LocAR captures the local inertial information from each individual AR user, providing the 6DoF pose of this user over time. While tracking motion, LocAR collects distance ranges (using UWB) to other users, and combines these information sources using a particle filter. Like most inertial tracking systems, VIO tracking estimates are smooth and locally accurate, but drift over time and provide no initial pose estimate. UWB ranges are infrequent and noisy, but provide absolute distance information that does not drift over time. By combining these complementary sensors, we achieve the best of both worlds. The absolute nature of UWB ranges allows us to correct VIO drift over time, while noise of UWB readings is smoothed by the VIO. In addition, the distributed architecture of LocAR allows each user to locally estimates the pose of other AR users with minimal message exchange between users in favor of scalability.

One core challenge, however, is the state-space explosion of the particle filter as the number of AR users grows, since they all must be tracked simultaneously. To address this challenge, LocAR takes a collaborative particle filtering approach that still tracks all nodes jointly, but uses Rao-Blackwell factorization to reduce the number of particles required to a tractable level. Compared to a more traditional particle filter where each user is independent, we show that the collaborative approach is able to leverage the synergistic information present between ranges to different nodes to improve accuracy, while maintaining a reasonable memory footprint that grows linearly with the number of tracked nodes. In addition, our filter formulation allows for sporadic UWB and VIO updates, loosening communication constraints in the system design over methods that rely on fixed-rate updates.

In order to demonstrate relative user tracking and a prototype teaming use-case, we developed a mobile AR application for iOS. Our technique is fundamental to any relative tracking system that has inertial data along with ranging estimates and hence could be applied to AR headset in hands-free applications like aiding first responders (firefighters would not use mobile phones inside burning buildings). Since UWB APIs are currently not available to mobile phone developers, we created a peer-to-peer ranging firmware for the MDEK1001 evaluation module from Decawave. The firmware allows a phone to pair with the MDEK module over BLE which in turn discovers and ranges with any number of nearby UWB devices. Each module can be paired with mobile phones or powered by batteries to act as a stand-alone tag or beacon. The firmware is able to multiplex a BLE connection with the phone while simultaneously performing low-power neighborhood discovery using a scalable rate-adaptive round-robin protocol for ranging (discussed in more detail in Section 5.1).

We evaluated the performance of our system in a number of environments and in four different buildings. We tested in static as well as more dynamic environments with moving people, furniture and changing lighting. One of our test included 5 users moving around a large (30,000+ sq ft) contiguous 3-floors area within an office building. The test environment includes long corridors, different sizes of rooms separated by concrete, drywall and various other construction materials. In a number of tests, we moved furniture and toggled lighting to simulate more dynamic elements often found in the wild. In each test, the users walked freely creating many NLoS scenarios with multiple walls between users. Across all these experiments, LocAR provides a mean 3D geometric error performance of 0.9 m between users across 12 different random walking traces, creating over 200 groundtruth-ed points and walking periods between 5-20 minutes per test. In addition, we observe that the quality of AR performance is sensitive to more than just geometric localization error. Camera lens parameters, bearing and distance combine to define visual registration errors that are highly dependant on the scene geometry. To better capture these effects, we also evaluate our system in terms of pixel error, which more accurately captures the visual displacement errors experienced by users. We observe that LocAR provides significantly lower pixel error compared to baseline methods that only rely on visual features. Our application source and UWB firmware is all open-source and will be available on GitHub.

Contributions: Our core technical contributions are:

•

A distributed Rao-Blackwellized Particle Filter (RBPF) formulation and implementation for real-time 6DoF relative localization.
•

An energy-efficient peer-to-peer UWB protocol with open-source firmware tailored toward wide-area relative localization.
•

An end-to-end implementation and thorough evaluation of an infrastructure-free AR localization system that can support multiple users. A short demo of the system in real-time can be seen here: https://www.youtube.com/watch?v=5vjKUjgLqhc.

2. Related Work

Approach	Infrastructure-free	Robustness (light, motion, etc)	Computational Complexity
Beacons / Markers (Wang and Olson, 2016a; Zhao et al., 2020)	✗	✗	✓
SLAM w beacons (Song et al., 2019)	✗	✓	✓
Dead Reckoning¹¹1requires known start point for localization(Kendall et al., 2015)	✓	✗	✓
SLAM map sharing (Apple, 2020)	✓	✗	✗
LocAR	✓	✓	✓

Table 1. Existing localization techniques vs. LocAR

The topic of indoor localization has seen many solutions over the past decade for mobile phones, in robotics, and more recently for AR. These methods can be broadly categorized under vision-based solutions that mainly rely on cameras, LIDARs, along with supporting sensors such as IMUs, and range-based solutions that use RF, acoustic, infrared, or UWB beacons for ranging and location estimates. In the following sections, we discuss the related work in each of these categories and explain their limitations for multi-user AR applications.

2.1. Vision-based Localization

Fiducial markers such as ARTags (Mulloni et al., 2011; Klopschitz and Schmalstieg, 2007) and AprilTags (Wang and Olson, 2016a; Kallwies et al., 2020; Zhao et al., 2020) are frequently used in AR systems to provide a reference between the physical environment and virtual objects. While these passive markers can be accurately localized with only a camera and low computational requirements, they are not suitable for real-time tracking of AR users, especially in mobile and NLoS scenarios.

Simultaneous localization and mapping (SLAM) techniques are widely used in robotics and AR systems for identifying and then leveraging features in an environment to track the position of a moving device. These methods use either monocular cameras (Bae et al., 2016; Engel et al., 2014; Mur-Artal et al., 2015; Aider et al., 2005), depth cameras (Bae et al., 2014; Lu and Kambhamettu, 2014; Yuan et al., 2016), or stereo cameras (Bellavia et al., 2013; Brand et al., 2014; You et al., 1999) to extract visual features from the scene and extract the 3D coordinates of the features and the device 6DoF pose. These coordinates, however, are only relative to an origin point, typically the start point. More recently, ARKit (Apple, 2020) by Apple and ARCore (Developers, 2019) by Google have shown persistent AR by providing 6DoF pose estimation with respect to a previously acquired map by combining vision or point-clouds with VIO. However, the biggest challenge facing vision-based localization is that it relies heavily on recognizing known images or clusters of unique feature points in the environment. This results in slow convergence and a high sensitivity to environment dynamics such as displacement of furniture, ambient lighting, and requires rich visual (or depth map) features. Even with advanced hardware platforms such as the Hololens 2 that uses depth sensors, users must often walk around and view several areas of a scene before localization is able to take effect. Due to the limitations of purely vision-based approaches, we advocate combining visual localization approaches with range-based beacons, and mainly focus on relative location of the users with respect to each other.

2.2. Range-based Localization

Beacon-based solutions provide continuous localization using UWB (Olsson et al., 2014), BLE (Shao et al., 2018; Cheung et al., 2006), or ultrasound (Gómez et al., 2013; Lazik et al., 2015; Li et al., 2018) ranging. However, all of these methods rely on pre-installed infrastructure and dense deployments throughout the building, which is not suitable for all AR applications. An alternative approach is to combine vision and ranging mechanisms, where ranging information from on-board radios such as UWB, Bluetooth, or WiFi is used to eliminate the accumulated errors of odometry sensors (Song et al., 2019; Olsson et al., 2014; Liu et al., 2017; Dhekne et al., 2019; Gentner and Ulmschneider, 2017; Wang et al., 2017; Rajagopal et al., 2019). While probably the best systems in terms of performance, the existing solutions still require pre-installed infrastructure or known starting points to link local maps to the physical space. LocAR addresses these limitations by combining relative ranging of the users with local motion information of each users, thus achieving the best of both worlds. One recent example of using VIO with UWB for ranging was introduced by the Apple AirTag platform. AirTag leverages a single moving phone along with UWB ranging (from the Apple U1 chipset) to detect a small battery-powered tag. The system currently only operates in 2D and hence does not work across multiple floors and does not support a network of moving users each localizing each other.

2.3. Multi-User Localization

While typical SLAM-based solutions assume a single user, recent developments in AR frameworks, such as Google’s ARCore / Cloud Anchors, Apple’s ARKit and Microsoft’s Spatial Anchors, have enabled multi-user capabilities. In these systems, each AR device individually performs SLAM to capture the visual features of the physical space relative to its local coordinate system. The users then share these visual maps to establish a common coordinate system and estimate the pose of other users. To share these maps between the users, Google ARCore uses a cloud-based architecture, which combines these maps centrally and sends the updated maps to all the users. However, Apple ARKit uses a peer-to-peer architecture, where the host of the AR session shares its current map with the users joining the session. However, any of these techniques impose significant communication overhead (Ran et al., 2019). These maps consist of dense visual features, 3D meshes, or raw point clouds, which are usually large and difficult to transfer. In addition, these map matching approaches assume a significant overlap between all of the users, which becomes unwieldy in terms of network traffic and computation in large areas. On the other hand, in search and rescue environments, where multi-user AR can bring significant added value, users are often purposefully taking disjoint paths through the environment to improve coverage, thus making map matching much more challenging and substantially increasing the convergence time.

2.4. Relative Localization

Traditional localization systems typically have the goal of estimating the ”absolute” location in a fixed coordinate system that is mapped to the physical space using external systems. In this sense, the idea of ”localization” is inherently tied to the existence of some form of infrastructure from which to base the measurements. However, reliance on infrastructure is infeasible in many AR scenarios, especially in the presence of multiple users. In these cases, it is instead possible to determine the relative locations between users to establish a common coordinate system for multi-user AR applications. For example, to display a virtual overlay of a target on the screen, only the knowledge of the target’s position relative to the display system is required.

The concept of relative localization is first used in sensor network localization for collectively locating stationary (Nagpal et al., 2003; Savvides et al., 2001) and mobile (Moore et al., 2004; Rad et al., 2011; Eren et al., 2004) nodes with respect to each other. These works provide the theoretical foundation for network localization using graph theory (Eren et al., 2004; Popescu et al., 2012; Ferner et al., 2008; Čapkun et al., 2002; Jamali-Rad et al., 2012; Barooah and Hespanha, 2007), information theory (Savic and Zazo, 2013; Ross, 2014; Charrow et al., 2014; Hoffmann et al., 2006), or Bayesian inference methods (Nilsson and Händel, 2013; Carlone et al., 2011). However, the majority of these systems are only evaluated in simulation and do not provide the desired AR performance in terms of latency (real-time operation), accuracy, and degrees of rigid body movement in space (6DoF).

Relative localization has also been explored in robotics for localizing a swarm of drones or multiple robots with respect to each other (Coppola et al., 2018; Carlone et al., 2011; Olsson et al., 2014; Liu et al., 2017). However, all of these systems assume short ranges with all the nodes in Line-of-Site (LOS) and only focus on 2D or 3D localization, instead of 6DoF. In addition, most of these methods require the prior knowledge of the initial position, which suffers from accumulated error over time. Therefore, they can not be directly extended to wide-area AR applications. In this paper, we present a distributed relative localization framework that provides the real-time relative pose estimates of AR users without requiring any pre-existing infrastructure, prior mapping, or known initial position.

Relative Localization Architectures In terms of information sharing, two types of relative estimation architecture can be employed for a relative localization system – centralized and distributed. In a centralized architecture (Nagpal et al., 2003), all the nodes in a network collect and combine the information in a central server for fusion. This requires all nodes to be in constant communication with the central node, which results in a large communication overhead. A distributed architecture (Moore et al., 2004) doesn’t require a central server; instead, each node processes the information locally using their on-board solver. The main advantage is that it is more scalable to larger network sizes at the expense of a slight reduction in accuracy.

3. System Overview

In this paper, we propose a distributed relative localization framework that allows multiple users to create a collaborative and persistent AR session in a completely infrastructure-free setup. The system is computationally practical to a large number of users and wide-area environments. We now describe our formulation of the localization problem in greater detail and introduce the components and algorithms we use in the system, with details being discussed in greater depth in later sections.

3.1. Problem Formulation

We consider an indoor scenario consisting of $N$ mobile users (nodes), all with unknown positions and orientations. Each mobile user has an AR display device, defined as display device, and wants to localize all other users, defined as target devices, with respect to itself, without requiring any a priori knowledge of the physical space or pre-installed infrastructure. The localization framework has to work in real-time on limited compute platforms and needs to scale feasibly with number of devices being tracked. All AR devices are equipped with 2 sensors:

•

VIO tracking: currently many smart phones and most AR headsets in the market have built-in VIO. VIO essentially tracks the motion of camera by fusing detected visual feature points with inertial sensor data. The output of VIO is the position and orientation of the device with respect to the reference frame at startup by performing conventional dead-reckoning. Even though VIO provides the camera displacement over time, there is no common origin between multiple users to extract their relative positions. Another challenge of VIO data is the accumulated drift error over time and sensitivity of visual dead-reckoning to environment conditions, such as lighting and motion.
•

UWB ranging: Among various wireless technologies that can go through obstacles such as Bluetooth, UWB, WiFi, etc, UWB is the most promising technology to combat multipath propagation in cluttered environments(Rajagopal et al., 2019). As a result, we are seeing appearance of UWB chips on the latest mobile phones, providing peer-to-peer ranging. However, each UWB node is only capable of measuring the distance to neighboring nodes that are in range. Given the mobility of users, we cannot assume that range measurements occur synchronously or with any sort of regularity, resulting in sparse and inadequate data for real-time localization.
•

Data Communication: We assume that each user’s device can communicate their state information with any neighbors in a peer-to-peer manner. This requires relatively low data rate exchanges and could leverage the UWB transmissions or use an ad-hoc method like WiFi Direct, Bluetooth or dedicated emergency responder radios. One of the key benefits of our collaborative filtering approach is that devices only needs to exchange data with their neighbors that are replying to UWB messages (not a fully connected network). In our experimental platform, each node communicates using WiFi or LTE from the mobile device, but this could be easily replaced in a production implementation. In our experiments, the state data transmitted on each active neighbor link between neighbors was below 16 kbit/s (assuming 10Hz updates). Our system is also resilient to message drops and reasonable levels of jitter (tens of ms). With message latencies on the order of 100ms, our system appears to perform well and is within common bounds for most single-hop wireless communication systems.

3.2. System Architecture

We demonstrate that the synergy between VIO and UWB data allows us to overcome the challenges and limitations of each. This results in a distributed relative localization framework that allows multiple mobile users to create a collaborative AR session. First, LocAR adopts a distributed architecture in favor of scalability by providing each node with peer-to-peer distance measurements to other nodes. Next, to deal with the sparsity of UWB reading, range measurements are combined with local camera VIO traces. The absolute nature of UWB ranging allows us to correct VIO drift over time. Finally, LocAR leverages the presence of multiple users and their mobility to collaboratively estimate the relative position of all users, improving the overall localization accuracy while maintaining low computational overhead. An overview of LocAR framework is depicted in Figure 2. Upon startup of an AR app, the AR session tracks the pose of this device using VIO from the AR API and collects the UWB ranges from neighboring nodes, which are then passed to a particle filter state-estimation solver to extract the location and orientation of other nodes with respect to itself. The next section elaborates on LocAR’s collaborative pose estimation technique and the underlying challenges.

3.3. Overlaying Virtual Objects

To display a virtual object, three matrices are required:

•

$K$ : The 3x4 matrix encoding the intrinsic properties of the virtual camera such as resolution, focal length, and centerpoint, which which are assumed to be known
•

$D_{O}$ : The 4x4 matrix encoding the 6DOF pose of the display relative to some arbitrary origin
•

$V_{O}$ : The 4x1 vector encoding the 3DOF position of the target object relative to that same origin.

From these, we can calculate the pixel coordinates of the virtual object on the display $[u,v]^{T}$ as:

(1)

[u^{\prime},v^{\prime},w^{\prime}]^{T}=K*D_{O}^{-1}*V_{O}

(2)

[u,v]^{T}=[u^{\prime},v^{\prime}]^{T}/w^{\prime}

We can simply combine the latter two matrices as:

(3)

V_{D}=D_{O}^{-1}*V_{O},

where $V_{D}$ is now the 4x1 vector representing the position of the target object relative to the display. There is no longer a requirement for a fixed origin. The coordinate system is typically chosen with +x pointing towards the right edge of the display, +y pointing towards the top edge of the display, and +z pointing out of the display towards the viewers eyes.

4. Relative Pose Estimation

Here we describe our localization framework, which uses a particle filter for tracking the $N-1$ target devices $V_{D}^{(i)}$ relative to the display device. First, we start by explaining a simple approach that tracks each $V_{D}^{(i)}$ independently and then demonstrate how it can be enhanced by tracking all $V_{D}^{(i)}$ jointly, using Rao-Blackwellized particle filtering (RBPF) to ensure that the problem remains tractable as $N$ grows.

4.1. Particle Filter (PF) Formulation

We begin by describing the state-space representation and our error models for VIO and UWB measurements, for a basic particle filter formulation. A particle filter for our state estimation has the following benefits: (i) it is computationally easy to run online, (ii) it allows us to use arbitrary noise models to describe VIO and UWB errors, (iii) it can work with as few as 1-2 beacons in under-defined cases, (iv) as it is agnostic to update rate, it allows handling of asynchronous ranges from the beacons and does not require receiving synchronized ranges to perform trilateration.

4.1.1. State-Space

We wish to track each device $V_{D}^{(i)}$ relative to the display. Each $V_{D}$ consists of three positional components, $x$ , $y$ , and $z$ . In addition, since the VIO estimates from each device are with reference to a separate origin with a separate orientation, we need to add components to the state-space to track the orientation of each device as well. However, since VIO provides an orientation estimate that is gravity-aligned (thanks to the inclusion of accelerometer measurements), we need only track a single angle $\theta$ about the vertical axis for each $V_{D}$ , where $\theta$ is the angular offset between the target device’s origin orientation and the display device’s. Thus, our state-space for each tracked device has 4 dimensions: $x^{(i)}$ , $y^{(i)}$ , $z^{(i)}$ , and $\theta^{(i)}$ .

4.1.2. VIO Measurements

VIO, like other forms of odometry, tracks a device’s motion over time relative to some arbitrary origin. It measures $dx$ , $dy$ , and $dz$ . Although AR frameworks on mobile devices normally perform loop closure to help mitigate drift, there is still a steady accumulation of integration error that occurs in practice, both in position and orientation about the vertical axis (Rajagopal et al., 2019). We model these errors as Gaussian with small standard deviations $\sigma_{xyz}$ and $\sigma_{\theta}$ , respectively. The state update equations for VIO at time $t$ are:

(4)

x^{(i)}(t+1)=x^{(i)}(t)+dx*\cos\theta^{(i)}+dz*\sin\theta^{(i)}+N(0,\sigma_{xyz}^{2})

(5)

y^{(i)}(t+1)=y^{(i)}(t)+dy+N(0,\sigma_{xyz}^{2})

(6)

z^{(i)}(t+1)=z^{(i)}(t)+dz*\cos\theta^{(i)}-dx*\sin\theta^{(i)}+N(0,\sigma_{xyz}^{2})

(7)

\theta^{(i)}(t+1)=\theta^{(i)}(t)+N(0,\sigma_{\theta}^{2})

We note that, although the linear velocity error $\sigma_{xyz}$ and rotational velocity error $\sigma_{\theta}$ are modeled as Gaussian in our formulation, any unexpected error in VIO (such as large jumps) can be recovered from using resampling techniques that account for the possibility of these jumps (see ”kidnapped robot problem” in (Thrun, 2002)).

4.1.3. UWB Measurements

UWB measurements occur frequently but sporadically between pairs of nodes. They give a measurement of the distance between a pair of nodes, with an error that is roughly Gaussian with standard deviation $\sigma_{r}$ . In the particle filter, we use a uniform model for UWB range error that extends $\pm 3\sigma_{r}$ , and assume there is a $P_{nlos}$ chance that the UWB range is entirely wrong due to non-LOS (NLOS) errors. The probability model for obtaining a UWB range $z$ to node $V_{D}^{(i)}$ is:

(8)

P(z)=\begin{cases}P_{nlos}&\text{if $|\|V_{D}^{(i)}-D\|-z|>3\sigma_{r}$}\\ 1-P_{nlos}&\text{otherwise}\end{cases}

where $D$ is the position of the display device relative to its starting point, as measured by its own VIO tracking.

The reason for using a uniform probability model instead of a Gaussian is to account for the conditional dependence between consecutive measurements between the same pair of nodes. Since the particle filter assumes consecutive measurements are independent, it will interpret repeated measurements as ”new” information, averaging their errors together. If a Gaussian error model is used, these repeated measurements will lead to false convergence and particle impoverishment, which can be avoided by using a uniform model.

4.2. Collaborative Estimation with RBPF

In our naive baseline implementation, we run a completely independent particle filter for each device we wish to track. As a result, the computational load scales linearly with the number of devices $N$ . However, since the particle filters are run independently, any error that is accumulated due to noise in the display device’s own VIO tracking cannot be mitigated through collaboratively ranging to multiple devices and leveraging the synergistic information that arises.

With unlimited computational resources, it would be possible to jointly model the states of all $N$ moving devices. This way, every range could be used to improve the state estimation of all nodes in the joint distribution. However, sampling from $4N$ dimensional state-space would require a number of samples exponential in $N$ in order to adequately sample the growing dimensionality.

A solution to this problem arises when some state variables $Y^{(i)}$ are always conditionally independent given some other state variable $X$ . When this is the case, it is possible to factorize the joint probability distribution and independently track each $Y^{(i)}|X$ . This approach, called Rao-Blackwellization (RBPF), is common in the SLAM literature (Thrun, 2002) as a means of estimating a map whose elements are conditionally independent given a user’s location. As illustrated in Figure 3, our formulation of the relative localization problem fits this framework, since UWB provides measurements of device locations that are conditionally independent given the location of the display device.

In the RBPF formulation, a particle filter is used to represent the belief of the display device $D$ , where each target device $V_{D}^{(i)}$ can be represented by any probabilistic distribution. We chose to also represent the target device estimates using particle filters. In Section 6.7, we demonstrate the benefit of the collaborative nature of our joint RBPF formulation over the more common naive independent particle filter approach.

5. System Implementation

There are three main components to the implementation of our system: UWB ranging platform, AR application, and large-scale ground-truth collection. The UWB ranging platform allows collection of range data between users in a dynamically sized ad-hoc network. The AR application overlays a digital objects on the estimated relative locations in the field of view of the display, allowing users such as firefighters to know where their team is without having a direct visual. It also collects visual data from VIO and ground-truth by decoding AprilTags, which are placed strategically around the building to determine error in our system during data collection.

5.1. UWB Ranging Prototype

Though not the focus of this paper, we realized that many localization researchers have struggled to find a UWB solution that can easily operate in peer-to-peer mode at scale. Unfortunately, most of the freely available reference implementations are designed for fixed infrastructure scenarios that support mobile devices, as opposed to fully peer-to-peer operation. We imagine that phones with UWB hardware could eventually implement this functionality on-board once APIs become available.

In order to perform neighborhood discovery and ad-hoc ranging with UWB, we developed an open-source and easy-to-use firmware image for the MDEK1001 modules from DecaWave. The MDEK1001 is an all in 1 battery- or USB-powered module with an enclosure that pairs a Nordic nRF52832 MCU with a DW1000 chip. The Nordic chip has a 64 MHz Arm Cortex-M4 processor with integrated BLE radio that can be programmed to act like a stand-alone beacon or pair with a mobile phone. Our firmware image exposes a standard serial interface (the AT command set) with the ability to store default parameters to flash memory, making it easy to configure addresses, sleep modes, neighborhood discovery polling rates, and UWB ranging options. Our firmware provides three basic functionalities: (1) Neighborhood discovery using BLE’s GAP discovery protocol, (2) coordinated Double-sided Two-way Ranging (DS-TWR) using UWB and (3) an interface to external systems using either USB serial or a standard BLE GATT server. We design our protocol under the assumption that we have a highly dynamic mesh of nodes with hidden terminals and asymmetric links that change on the order of seconds.

The neighborhood discovery protocol is BLE’s standard device discovery protocol that frequency hops across three channels. We allow users to define a custom advertisement period $T_{BLE}$ (default period of 200 ms) and a configurable signal strength (RSSI) threshold for determining the most recent and closest neighbors. To save power when nodes are idle, we duty-cycle background scanning and disable the UWB radio. A node in the system can announce that it wants to participate in active ranging through its BLE advertisements. This in turn will wake-up all nearby nodes and activate their UWB radios. Figure 5 shows an overview of the BLE and UWB transactions required to perform neighborhood discovery and ranging. Note that the BLE discovery uses three channels and not just a single channel. Once activated, each node initiates a DS-TWR request (detailed in the upper right of the figure and in this application note (Corporation, 2015) ) over UWB at a user-configurable timing interval $T_{UWB}$ , with a default value of 100 ms. Each $T_{UWB}$ period the node performs a new DS-TWR request to the next node in its local neighbor list.

If DS-TWR messages are dropped either due to collision or packet corruption, the next polling interval is randomly offset to avoid repeated collisions. We use an exponential random distribution across $T_{UWB}$ similar in nature to Slotted ALOHA. As one would expect, as the number of neighbors increases, the polling rate of each individual neighbor decreases. We provide users with a lookup table for $T_{UWB}$ values needed to support particular maximum node densities within a single collision domain. In Figure 5, you can see that Beacon $B_{1}$ transmits every $T_{UWB}$ to Beacon $B_{2}$ since it has no other neighbors. Beacon $B_{2}$ cycles through 3 total neighbors in its neighborhood list (the neighbor graph shown on the left). After nodes stop transmitting active ranging advertisements for a defined timeout, nodes return to their lower powered duty-cycled listening state. As shown in the bottom line of Figure 5, we also support simultaneously pairing an actively scanning node with a mobile device using a standard BLE GATT server. It is also possible to connect the MDEK1001 to a host device over USB serial or through its built-in RPI header. The default parameters of our firmware support 16-bit addressing (over 30K nodes) with cluster densities of 10 nodes at approximately 1 Hz update rates for each neighbor. Our low power sleep energy is on the order of 10 mW (mostly consumed by background BLE scanning) with an average active ranging energy of 800 mW. In practice, we see BLE neighbor discovery on the order of a 1-2 s with a typical 10-20 s eviction timeout. All source and documentation are available on GitHub.

5.2. Prototype AR Application

We developed a prototype of LocAR as a mobile AR application running on iOS. This application provides two main features: (1) it shows the relative location and orientation of other users in the scene in AR (shown in Figure 4-c), and (2) it coordinated ground-truth data collection among mobile users (shown in Figure 4-b). The mobile app collected VIO data using Apple’s ARKit and UWB ranging data using a MDEK1001 module from DecaWave over BLE. All ranging and communication information was shared using MQTT over WiFi, but this could conceptually be replaced by WiFi Direct or some similar peer-to-peer protocol. ARKit captures VIO data at 60 Hz and we collected UWB ranges with a polling rate of 10 Hz. While it is difficult to exactly isolate how much energy VIO consumes on iOS and Android, the Intel T265 stands at a good reference consuming less than 1.5W. This is low enough that it does not significantly impact interactive usage over a few hours during an AR session. As described above, the actual rate at which UWB data was received by each mobile user depends on the distance and number of neighbors around a particular node.

5.3. Ground-Truth Collection

One of the biggest challenges for assessing the performance of a 6DOF localization system at scale is accurately collecting ground-truth poses. We developed a data collection framework that periodically guides users to converge on ”check-in” locations where AprilTags could be used to accurately record pose. We first installed over a dozen 8.5 by 11 in AprilTags (Wang and Olson, 2016b) across the multiple floors of our test building with retro-reflective markers on each corner. We surveyed the corners of each AprilTag using a total station with an estimated accuracy on the order of millimeters. To coordinate synchronized ground-truth readings between different users, we integrated AprilTag decoder into the AR application, in which the users are instructed to move to the nearest AprilTag and wait until all users across the building had a high-confidence ground-truth measurement. Given the known tag location and the pose estimated by the AprilTag decoder (Wang and Olson, 2016b), the application computes the ground-truth location which is then published over MQTT to a central logging service.

6. Evaluation

In this section, we first explain our experimental setup, which is performed across several environments, and define our evaluation metrics. We then present the performance of LocAR and analyze the sensitivity of our system under various real-world conditions. Then, in Section 7, we describe additional tests that we performed to evaluate the sensitivity of the system to different environments (including changes in lighting and background motion in the scene) as well as other factors such as user walking patterns and RF line-of-sight conditions.

6.1. Experimental Setup

Our primary evaluation of LocAR consisted of a deployment across a 30,000 sq ft area spanning 3 floors of an office building, with 3 to 5 users walking in an arbitrary fashion, and 9 static tags deployed for baseline comparison, as shown in Figure 6. We also stress tested LocAR in a diverse set of environments both indoor and outdoor, different lighting conditions, as well as dynamic environments. The snapshots of these environments are shown in Figure 7. In all these experiments, each user carries an iPad or iPhone with a built-in VIO tracking, and a UWB node attached to the back of the device (as shown in Figure 4-a), while the static tags consist of just the UWB platform. As noted before, LocAR does not require any pre-installed infrastructure or static beacons with known location for localization, and here the static tags are only used for our baseline comparison. Unless otherwise specified, all of our presented results only use ranges from mobile nodes.

The experiments consist of both LOS and heavy NLOS situations, with many instances where users are spread across 3 different floors with one or more dry/concrete walls between them. No instructions are provided to users on how to walk or how to hold the tablets. For 7 different experiments and 10-15 min per run, the users walk with different speeds and periodically stand stationary, resulting in a total of about 40 min worth of data per person. This data is divided into an ”evaluation” set, where users are walking normally, and a ”sensitivity analysis” set, where users are walking in pre-defined patterns (evaluated in Section 7). As explained in Section 5.3, the ground-truth was obtained with a number of AprilTags surveyed in a global coordinate frame using a total station. To synchronize the ground-truth measurements between users, the AR application guides users to scan a nearby AprilTag every 5-30 s over the course of each experiment.

6.2. Evaluation Metrics

The quality of AR performance is sensitive to more than just geometric error. Camera lens parameters, bearing, and distance combine to create the visual error seen by a user. To better capture these effects, we introduce a new AR-specific metric, which we call display-proportional error (DPE), that combines distance, bearing, and the camera field-of-view as a single cohesive benchmark. Figure 8 shows a typical example where 3 virtual cubes are overlaid at a fixed distance from a set of (real) physical orange cones. The cones are located at a distance of 1, 5, and 10 m, respectively, away from the camera. The green cube has no error, the yellow cube is offset by 0.5 m and the red cube is offset by 1 m. Notice that, due to perspective, the cubes that are further from the camera appear closer to the cone, even though their relative error in meters is the same. This simple example highlights why geometric error alone does not do justice to AR localization performance. Instead, display-proportional error computes the AR error as the distance between an object’s true location and its estimated location when projected onto a 2D display, as a proportion of the display’s horizontal size. In the example in Figure 8, the closest yellow box has a DPE of 0.23²²2after accounting for the height that the user is holding the device off the ground. This error corresponds to approximately 1/4 of the screen width, while the farthest yellow box has a DPE of only .03, or about 1/33 of the screen width. In this sense, DPE captures the reprojected error of the estimated 3D locations, and can easily be used to calculate pixel error by simply multiplying by the display’s horizontal resolution.

Therefore, We formalize our error metric definitions as:

•

3D geometric error: We calculate the average pair-wise Euclidean distance in 3D between all pairs of mobile nodes in meters.
•

Display-Proportional Error:

(9) $\frac{\epsilon_{xy}}{|dist+\epsilon_{z}|}*\frac{f_{x}}{H_{x}},$

Where $\epsilon_{xy}$ is the $xy$ component of the 3D geometric error, $\epsilon_{z}$ is the $z$ component, $dist$ is the true distance between the display and the target object, $f_{x}$ is the camera’s focal length (in pixels), and $H_{x}$ is its horizontal resolution (in pixels)

6.3. Baselines

We compare the performance of LocAR with two baselines:

(1) VIO-Only: a typical infrastructure-free localization method (Bloesch et al., 2015) that uses VIO for pose estimation relative to the start point. This is a common localization method in robotics, but it requires initialization and a priori knowledge of users’ start points. Even though this assumption is not feasible for most multi-user AR applications, it allows us to more easily isolate the performance contributions from VIO and UWB ranging in our system.

(2) UWB-VIO infrastructure-based oracle: an infrastructure-based localization technique, which uses VIO to estimate 6DOF motion and UWB ranging to fixed beacons. We assume each fixed beacon (9 total) has a known global location in order to provide a baseline (Wang et al., 2017). As shown in Figure 4-b, static UWB nodes are placed at a fixed offset from each AprilTag to provide a set of known UWB locations. We consider this technique as our oracle and show that LocAR can achieve performance at nearly the same accuracy without relying on any pre-installed infrastructure.

It should be noted that both of these baselines are originally proposed for absolute localization, either relative to origin or relative to the physical coordinate system, so we obtain the relative localization for comparison with our system using Equation 3.

6.4. Localization Accuracy

We evaluate the localization accuracy of LocAR across our evaluation dataset with 5 mobile users, on both single and multiple floors, and with a mixture of LOS and NLOS situations. Figure 10 shows the overall 3D relative localization error and compares it with our two baseline approaches. LocAR achieves a median 3D error of 0.9 m, compared to 2.5 m and 0.8 m in the VIO-Only baseline and UWB-VIO oracle, respectively. We can see that LocAR outperforms VIO-Only baseline by leveraging the UWB ranging and collaborative pose estimation which eliminates drift over time. In addition, LocAR achieves relatively similar accuracy to the UWB-VIO oracle, which relies on pre-installed infrastructure and a priori knowledge of beacons for trilateration that is unnecessary for LocAR.

As mentioned in Section 6.2, the 3D geometric error does not necessarily quantify the localization performance specific to AR application, hence our proposed AR metric, display-proportional error, which captures the relative object displacement error on the screen. Figure 10 compares the AR performance of the three methods using our defined metric, and shows that LocAR can satisfy a high AR quality in 99% of cases with less than 0.5 fractional error on display and a median quality 0.1 fractional error. This means that the users are able to steer in the right direction toward a user 99% of the time across varying ranges and angles.

6.5. Error vs. Separation Distance

Next, we evaluate LocAR’s performance as a function of distance. The ground-truth relative distance of users varies between 0.2 m to 27 m including many instances of completely NLOS. Figure 12-a demonstrates the 3D relative error of each sample test (any pair of users at every 5 s interval) grouped by the ground-truth pair-wise distances. As we can see, error in positioning tends to increase slightly with distance, either due to UWB nodes going out of range or inherent VIO drift. However, unlike geometric error, DPE actually improves with distance. This suggests that a visual display showing an overlay with faraway users’ locations would still be effective at portraying those users’ locations.

It can be seen in Figure 12 that there is a linear correlation between error and true distance of nodes and that error stays below 10% the distance on average. This means that at 20 m, the average error is less than 2 m, which is a reasonable amount of error when localizing 2 users in a building with NLOS. This can be better captured with the AR-metric shown in Figure 12 with a decreasing error over extended ranges. As we can see, the trend of geometric error and AR error with respect to true distance is opposite, confirming the AR-specific behavior of localization systems explained in Section 6.2.

6.6. Drift Over Time

In many localization systems, including VIO tracking, error increases with time. Dead reckoning systems have inherent drift that is inevitable, and small errors in individual state estimation accumulate over time. As seen in Figure 14, LocAR does not have this problem, and keeps an almost constant error distribution throughout every experimental run, while VIO has linear drift over time. This success can be attributed to the use of UWB ranging between nodes to keep the drift bounded to within the UWB error.

6.7. Impact of Collaborative Localization

To evaluate the impact of our collaborative particle filter formulation, we compare the 3D relative error of the naive independent PF and collaborative RBPF, explained in Section 4. To isolate the impact of other parameters, including user mobility, number of users, etc, we perform controlled experiments with a single user and 9 static tags deployed for baseline comparisons. Then we estimate the relative location of static tags with respect to the user for different subsets of tags changing from 1 to 9 randomly selected tags. As seen in Figure 14, collaborative RBPF has a clear advantage over Independent PF. While with 1 static node, they are very similar in 3D relative error of around 0.9 m, independent PF localization remains at this error while the error of collaborative RBPF decreases from 0.85 m to 0.22 m with the addition of static tags.

This was expected as collaborative RBPF takes advantage of other system nodes’ estimates. With this method, the drift of the mobile node is able to be somewhat mitigated by the averaging of noise across measurements to multiple other nodes. As more nodes are able to perform these ”averaging corrections” together, the localization system is able to converge to a more precise estimate than it could with nodes localizing individually. As a side conclusion, we can leverage this feature to further improve the localization performance by deploying some static UWB tags with unknown locations. For example, in a first response operation, the users can deploy some static nodes at random locations as they move around the building to enhance their relative localization performance. Even though LocAR can operate completely infrastructure-free, it can nicely integrate with the infrastructure if one is present.

7. Sensitivity Analysis

In this section, we elaborate on the computational overhead of LocAR’s collaborative localization algorithms. We also describe additional tests we performed in other campus environments and evaluate the sensitivity of LocAR to varying user mobility patterns and NLOS conditions in these environments.

7.1. Computational Overhead

Real-time convergence and computation are two of the critical requirements of an AR localization system specially in mobile applications. Compared to independent particle filtering, our collaborative formulation achieves higher accuracy at the cost of higher computational overhead. Table 2, however, shows that LocAR can still operate on a wide variety of platforms and in practice, converges in real-time. It should be noted that our implementation is not heavily optimized, and our compute time includes significant system overhead. The key takeaway is that the run-time overhead increases almost linearly with the number of users.

Number of users	2	3	4	5
CPU Usage (ms)	2.3%	8.0%	16%	25%
Memory Usage (MB)	4	8	12	16

Table 2. Single threaded runtime performance on 2.4GHz i7 CPU

7.2. Performance Across Diverse Environments

In addition to the multi-story building tests described in Section 6, we also performed a series of tests across several other environments under different conditions, many of which are shown in Figure 7. These environments included:

•

A ”busy” office (with furniture being moved and lights being turned on and off to simulate ordinary office commotion)
•

A campus cafe with a large atrium and spiral staircase
•

A hallway intersection near some elevators inside a brick building
•

A dimly lit parking garage with height variation and lots of concrete and metal blocking line of sight.
•

An outdoor area between campus buildings

The results of these experiments are shown in Figure 15. We see that performance is consistent across all of these environments, with the parking garage performance suffering slightly due to the heavy NLOS conditions and low light. Note that the performance in all of these environments is slightly better than the primary multi-story building test, which was the most challenging due to its immense scale.

7.3. Impact of Mobility Pattern

Another factor affecting the performance of LocAR’s localization accuracy is the high dynamics of the environment and mobility of users. To this end, we compared the system performance in 3 different walking scenarios: (1) when users were walking in pairs, which represents the near-best performance as the collaborative algorithm can take advantage of clean ranging estimates between each pair of users walking near each other, (2) normal walking when users randomly move in the space with a usual walking speed, (3) when all the users were performing fast movements such as running, jumping, crawling, etc, for the purpose of stress testing the algorithm. Figure 15-b confirms the expected trend for different walking scenarios, and demonstrates that LocAR is resilient to fast motions and therefore suitable for applications that involves fast motions, such as rescue operations or gaming.

7.4. NLOS Performance

Next, we study LocAR’s localization performance in NLOS scenarios. Previous analyses shows that UWB ranging degrades in complete NLOS (Rajagopal et al., 2019) due to noisy Time of Flight (TOF) estimates that mainly capture multipath reflections instead of the direct distance between nodes. To evaluate this effect, we performed 3 different experiments with different levels of NLOS scenarios. The first experiment includes 5 users that walk mostly in LOS of each other, all on the same floor. We then repeated this experiment, while users were walking in a larger space including both LOS and NLOS conditions. Finally, we performed the experiment while users were spread out across 3 floors with some heavy NLOS conditions such as multiple concrete walls between users, or being apart by more than 1 floor. As we can see in Figure 15-b, the 3D relative localization drops slightly with the increase of NLOS conditions, but we can still maintain a median accuracy of 1 m even in NLOS and extended ranges over 10-20 m.

8. Discussion

In this section, we discuss the mechanisms to relax assumptions made in our current implementation of LocAR and the potential future extensions.

Number of users: While the current evaluations are done with maximum of 5 mobile users and 9 stationary nodes, LocAR’s collaborative approach can significantly benefit from larger number of users to improve the localization accuracy (as shown in Figure 14). This is mainly due to drift mitigation by averaging the noises across measurements to multiple nodes. However, the higher performance will be achieved at the expense of higher computation and communication overhead. While our current implementation is not optimized for scalability, this trade-off can be balanced by using heuristics that clusters users based on their proximity. Designing a scalable and real-time version of LocAR is part of our future work.

On a related note, our current algorithm only uses the direct measurements from each neighbor without sharing any higher-level state information (such as estimates of other users etc). This reduces the communication complexity and computational overhead at the expense of slightly lower performance. Depending on the application and required localization performance, the collaboration capabilities of LocAR can be further enhanced by sharing ranging and pose estimations between users.

User Interactions: In practice, LocAR works best when users occasionally pass near each other, resulting in high-confident ranging and particle filter updating. So, the algorithm cannot benefit from collaboration if users are at the limits of the UWB range (100m in LoS and about 30m in sever NLoS). To avoid the performance degradation, one can add (arbitrarily placed). Such ”breadcrumb dropping” techniques (Li et al., 2018) has been widely used in rescue operations or first response operations, and are compatible with LocAR. It is worth noting that without close interaction the system will converge, just over a longer period of time or at a lower accuracy.

Camera Occlusion: A limitation that is common among vision-based IMU and localization methods is sensitivity to low visibility conditions, such as smoke-filled rooms or extreme darkness, as would be commonplace for firefighters. Our current experiments show that LocAR is resilient to partial camera blockages by leveraging the UWB ranging between users. An ultimate solution could be to replace vision with RF imaging technologies, such as millimeter wave (mmWave) (Doer and Trommer, 2020) or Infrared cameras. As part of our future work, we are integrating our infrastructure-free relative localization with mmWave cameras that are not affected by smoke.

Groundtruth Collection: One of the main challenges of any localization research is groundtruth collection especially in mobile indoor setups where GPS does not work well. As such, we designed a novel mechanism using April tags and coordinated synchronous check pointing that guided users as they walked through our test environments. However, When collecting the ground truth, participants had to wait until all of them had high-confidence ground truth measurements for synchronization. While we don’t expect this approach to impact the evaluation, it could slightly improve the UWB ranging as the users will be static for a few seconds when collecting GT. However, given the high sampling rate of UWB ranging and the heavily NLoS conditions of our experiments, we imagine this impact to be negligible.

9. Conclusion

This paper proposes LocAR, a collaborative AR localization system that allows multiple users to estimate the relative 6DOF position of each other in real-time. This system is free of infrastructure; is robust to environment dynamics and NLoS conditions; and maintains relatively low computational complexity to reduce power and update time. LocAR uses a variant of RBPF to perform state estimation of the nodes jointly by using UWB ranging and VIO tracking. LocAR then displays the tracked nodes in AR on the tablet in the coordinate frame of the user. Using the AR application, users can see where others are in the building despite walls, floors, and other obstacles creating NLOS. We also present an AR metric that captures the quality of localization with respect to the user’s display specifications, and is optimized for augmented reality applications.

As future work, we are interested in using the LocAR approach to bootstrap and correct mapped locations within fixed infrastructure systems. There is the potential to create a hybrid infrastructure-based and infrastructure-free AR localization environment that could provide the best of both worlds where rapidly deployed relative content could persist in the environment once fixed infrastructure is encountered.

References

(1)
Aider et al. (2005) Omar Ait Aider, Philippe Hoppenot, and Etienne Colle. 2005. A model-based method for indoor mobile robot localization using monocular vision and straight-line correspondences. Robotics and Autonomous Systems 52, 2-3 (2005), 229–246.
Apple (2018) Apple. 2018. HSwiftShot: creating a game for augmented reality. https://developer.apple.com/documentation/arkit/swiftshot_creating_a_game_for_augmented_reality Online. Accessed: 2018-10-20.
Apple (2020) Apple. 2020. Introducing ARKit 4. https://developer.apple.com/augmented-reality/arkit/ Online. Accessed: 2020-10-27.
Apple (2021) Apple. 2021. U1 Chipset. https://support.apple.com/en-us/HT212274 Online. Accessed: 2021-5-10.
Bae et al. (2014) Hyojoon Bae, Mani Golparvar-Fard, and Jules White. 2014. Rapid image-based localization using clustered 3d point cloud models with geo-location data for aec/fm mobile augmented reality applications. In Computing in Civil and Building Engineering (2014). 841–849.
Bae et al. (2016) Hyojoon Bae, Michael Walker, Jules White, Yao Pan, Yu Sun, and Mani Golparvar-Fard. 2016. Fast and scalable structure-from-motion based localization for high-precision mobile augmented reality systems. mUX: The Journal of Mobile User Experience 5, 1 (2016), 4.
Barooah and Hespanha (2007) Prabib Barooah and Joao P Hespanha. 2007. Estimation on graphs from relative measurements. IEEE Control Systems Magazine 27, 4 (2007), 57–74.
Bellavia et al. (2013) Fabio Bellavia, Marco Fanfani, Fabio Pazzaglia, and Carlo Colombo. 2013. Robust selective stereo SLAM without loop closure and bundle adjustment. In International Conference on Image Analysis and Processing. Springer, 462–471.
Bloesch et al. (2015) Michael Bloesch, Sammy Omari, Marco Hutter, and Roland Siegwart. 2015. Robust visual inertial odometry using a direct EKF-based approach. In 2015 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 298–304.
Brand et al. (2014) Christoph Brand, Martin J Schuster, Heiko Hirschmüller, and Michael Suppa. 2014. Stereo-vision based obstacle mapping for indoor/outdoor SLAM. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 1846–1853.
Čapkun et al. (2002) Srdjan Čapkun, Maher Hamdi, and Jean-Pierre Hubaux. 2002. GPS-free positioning in mobile ad hoc networks. Cluster Computing 5, 2 (2002), 157–167.
Carlone et al. (2011) Luca Carlone, Miguel Kaouk Ng, Jingjing Du, Basilio Bona, and Marina Indri. 2011. Simultaneous localization and mapping using rao-blackwellized particle filters in multi robot systems. Journal of Intelligent & Robotic Systems 63, 2 (2011), 283–307.
Charrow et al. (2014) Benjamin Charrow, Vijay Kumar, and Nathan Michael. 2014. Approximate representations for multi-robot control policies that maximize mutual information. Autonomous Robots 37, 4 (2014), 383–400.
Cheung et al. (2006) Kenneth C Cheung, Stephen S Intille, and Kent Larson. 2006. An inexpensive bluetooth-based indoor positioning hack. In Proceedings of UbiComp, Vol. 6.
Coppola et al. (2018) Mario Coppola, Kimberly N McGuire, Kirk YW Scheper, and Guido CHE de Croon. 2018. On-board communication-based relative localization for collision avoidance in Micro Air Vehicle teams. Autonomous robots 42, 8 (2018), 1787–1805.
Corporation (2015) DecaWave Corporation. 2015. The implementation of two-way ranging with the DW1000.
Developers (2019) Google Developers. 2019. https://developers.google.com/ar/discover Online. Accessed: 2020-10-27.
Dhekne et al. (2019) Ashutosh Dhekne, Ayon Chakraborty, Karthikeyan Sundaresan, and Sampath Rangarajan. 2019. TrackIO: tracking first responders inside-out. In 16th $\{$ USENIX $\}$ Symposium on Networked Systems Design and Implementation ( $\{$ NSDI $\}$ 19). 751–764.
Doer and Trommer (2020) Christopher Doer and Gert F. Trommer. 2020. An EKF Based Approach to Radar Inertial Odometry. In 2020 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI). 152–159. https://doi.org/10.1109/MFI49285.2020.9235254
Engel et al. (2014) Jakob Engel, Thomas Schöps, and Daniel Cremers. 2014. LSD-SLAM: Large-scale direct monocular SLAM. In European conference on computer vision. Springer, 834–849.
Eren et al. (2004) Tolga Eren, OK Goldenberg, Walter Whiteley, Yang Richard Yang, A Stephen Morse, Brian DO Anderson, and Peter N Belhumeur. 2004. Rigidity, computation, and randomization in network localization. In IEEE INFOCOM 2004, Vol. 4. IEEE, 2673–2684.
Ferner et al. (2008) Ulric Ferner, Henk Wymeersch, and Moe Z Win. 2008. Cooperative anchor-less localization for large dynamic networks. In 2008 IEEE International Conference on Ultra-Wideband, Vol. 2. IEEE, 181–185.
Gentner and Ulmschneider (2017) Christian Gentner and Markus Ulmschneider. 2017. Simultaneous localization and mapping for pedestrians using low-cost ultra-wideband system and gyroscope. In 2017 International Conference on Indoor Positioning and Indoor Navigation (IPIN). IEEE, 1–8.
Gómez et al. (2013) David Gómez, Paula Tarrío, Juan Li, Ana M Bernardos, and José R Casar. 2013. Indoor augmented reality based on ultrasound localization systems. In International Conference on Practical Applications of Agents and Multi-Agent Systems. Springer, 202–212.
Hoffmann et al. (2006) Gabriel M Hoffmann, Steven L Waslander, and Claire J Tomlin. 2006. Mutual information methods with particle filters for mobile sensor network control. In Proceedings of the 45th IEEE Conference on Decision and Control. IEEE, 1019–1024.
Jamali-Rad et al. (2012) Hadi Jamali-Rad, Hamid Ramezani, and Geert Leus. 2012. Cooperative localization in partially connected mobile wireless sensor networks using geometric link reconstruction. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2633–2636.
Kallwies et al. (2020) Jan Kallwies, Bianca Forkel, and Hans-Joachim Wuensche. 2020. Determining and Improving the Localization Accuracy of AprilTag Detection. In 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 8288–8294.
Kendall et al. (2015) Alex Kendall, Matthew Grimes, and Roberto Cipolla. 2015. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision. 2938–2946.
Klopschitz and Schmalstieg (2007) Manfred Klopschitz and Dieter Schmalstieg. 2007. Automatic reconstruction of wide-area fiducial marker models. In 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality. IEEE, 71–74.
Lazik et al. (2015) Patrick Lazik, Niranjini Rajagopal, Oliver Shih, Bruno Sinopoli, and Anthony Rowe. 2015. ALPS: A bluetooth and ultrasound platform for mapping and localization. In Proceedings of the 13th ACM conference on embedded networked sensor systems. 73–84.
Li et al. (2018) Jinyang Li, Zhiheng Xie, Xiaoshan Sun, Jian Tang, Hengchang Liu, and John A Stankovic. 2018. An automatic and accurate localization system for firefighters. In 2018 IEEE/ACM Third International Conference on Internet-of-Things Design and Implementation (IoTDI). IEEE, 13–24.
Liu et al. (2017) Ran Liu, Chau Yuen, Tri-Nhut Do, Dewei Jiao, Xiang Liu, and U-Xuan Tan. 2017. Cooperative relative positioning of mobile users by fusing IMU inertial and UWB ranging information. In 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 5623–5629.
Lu and Kambhamettu (2014) Guoyu Lu and Chandra Kambhamettu. 2014. Image-based indoor localization system based on 3d sfm model. In Intelligent Robots and Computer Vision XXXI: Algorithms and Techniques, Vol. 9025. International Society for Optics and Photonics, 90250H.
Moore et al. (2004) David Moore, John Leonard, Daniela Rus, and Seth Teller. 2004. Robust distributed network localization with noisy range measurements. In Proceedings of the 2nd international conference on Embedded networked sensor systems. 50–61.
Mulloni et al. (2011) Alessandro Mulloni, Hartmut Seichter, and Dieter Schmalstieg. 2011. Handheld augmented reality indoor navigation with activity-based instructions. In Proceedings of the 13th international conference on human computer interaction with mobile devices and services. 211–220.
Mur-Artal et al. (2015) Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. 2015. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE transactions on robotics 31, 5 (2015), 1147–1163.
Nagpal et al. (2003) Radhika Nagpal, Howard Shrobe, and Jonathan Bachrach. 2003. Organizing a global coordinate system from local information on an ad hoc sensor network. In Information processing in sensor networks. Springer, 333–348.
Nilsson and Händel (2013) John-Olof Nilsson and Peter Händel. 2013. Recursive Bayesian initialization of localization based on ranging and dead reckoning. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 1399–1404.
Olsson et al. (2014) Fredrik Olsson, Jouni Rantakokko, and Jonas Nygårds. 2014. Cooperative localization using a foot-mounted inertial navigation system and ultrawideband ranging. In 2014 International Conference on Indoor Positioning and Indoor Navigation (IPIN). IEEE, 122–131.
Popescu et al. (2012) Dan C Popescu, Mark Hedley, and Thuraiappah Sathyan. 2012. Tracking in dynamic anchorless wireless networks based on Manifold Flattening. In Proceedings of the 2012 IEEE/ION Position, Location and Navigation Symposium. IEEE, 321–327.
Rad et al. (2011) Hadi Jamali Rad, Alon Amar, and Geert Leus. 2011. Cooperative mobile network localization via subspace tracking. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2612–2615.
Rajagopal et al. (2019) Niranjini Rajagopal, John Miller, Krishna Kumar Reghu Kumar, Anh Luong, and Anthony Rowe. 2019. Improving augmented reality relocalization using beacons and magnetic field maps. In 2019 International Conference on Indoor Positioning and Indoor Navigation (IPIN). IEEE, 1–8.
Ran et al. (2019) Xukan Ran, Carter Slocum, Maria Gorlatova, and Jiasi Chen. 2019. ShareAR: Communication-efficient multi-user mobile augmented reality. In Proceedings of the 18th ACM Workshop on Hot Topics in Networks. 109–116.
Ross (2014) Brian C Ross. 2014. Mutual information between discrete and continuous data sets. PloS one 9, 2 (2014), e87357.
Savic and Zazo (2013) Vladimir Savic and Santiago Zazo. 2013. Cooperative localization in mobile networks using nonparametric variants of belief propagation. Ad Hoc Networks 11, 1 (2013), 138–150.
Savvides et al. (2001) Andreas Savvides, Chih-Chieh Han, and Mani B Strivastava. 2001. Dynamic fine-grained localization in ad-hoc networks of sensors. In Proceedings of the 7th annual international conference on Mobile computing and networking. 166–179.
Shao et al. (2018) Chong Shao, Bashima Islam, and Shahriar Nirjon. 2018. Marble: Mobile augmented reality using a distributed ble beacon infrastructure. In 2018 IEEE/ACM Third International Conference on Internet-of-Things Design and Implementation (IoTDI). IEEE, 60–71.
Song et al. (2019) Yang Song, Mingyang Guan, Wee Peng Tay, Choi Look Law, and Changyun Wen. 2019. UWB/LiDAR Fusion for cooperative range-only SLAM. In 2019 International Conference on Robotics and Automation (ICRA). IEEE, 6568–6574.
Thrun (2002) Sebastian Thrun. 2002. Probabilistic robotics. Commun. ACM 45, 3 (2002), 52–57.
Wang et al. (2017) Chen Wang, Handuo Zhang, Thien-Minh Nguyen, and Lihua Xie. 2017. Ultra-wideband aided fast localization and mapping system. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 1602–1609.
Wang and Olson (2016a) John Wang and Edwin Olson. 2016a. AprilTag 2: Efficient and robust fiducial detection. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 4193–4198.
Wang and Olson (2016b) John Wang and Edwin Olson. 2016b. AprilTag 2: Efficient and robust fiducial detection. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 4193–4198.
You et al. (1999) Suya You, Ulrich Neumann, and Ronald Azuma. 1999. Hybrid inertial and vision tracking for augmented reality registration. In Proceedings IEEE Virtual Reality (Cat. No. 99CB36316). IEEE, 260–267.
Yuan et al. (2016) Wang Yuan, Zhijun Li, and Chun-Yi Su. 2016. RGB-D sensor-based visual SLAM for localization and navigation of indoor mobile robot. In 2016 International Conference on Advanced Robotics and Mechatronics (ICARM). IEEE, 82–87.
Zhao et al. (2020) Boxin Zhao, Zongzhe Li, Jun Jiang, and Xiaolin Zhao. 2020. Relative Localization for UAVs Based on April-Tags. In 2020 Chinese Control And Decision Conference (CCDC). IEEE, 444–449.