
Listen to Your Map:
An Online Representation for Spatial Sonification

Lan Wu1, Craig Jin2, Monisha Mushtary Uttsha1 and Teresa Vidal-Calleja1
1Authors are with the Robotics Institute, University of Technology Sydney, Ultimo, NSW 2007, Australia. 2Author is with the Computing and Audio Research Laboratory, University of Sydney, Camperdown, NSW 2050, Australia. This work was supported by ARIA Research and the Australian Government via the Department of Industry, Science, and Resources CRC-P program (CRCPXI000007). Lan.Wu-2@uts.edu.au
Abstract

Robotic perception is becoming a key technology for navigation aids, especially for helping individuals with visual impairments through spatial sonification. This paper introduces a mapping representation that accurately captures scene geometry for sonification, turning physical spaces into auditory experiences. Using depth sensors, we encode an incrementally built 3D scene into a compact 360-degree representation with angular and distance information, aligning with human auditory spatial perception. The proposed framework performs localisation and mapping via VDB-Gaussian Process Distance Fields for efficient online scene reconstruction. The key aspect is a sensor-centric structure that maintains either a 2D circular or a 3D cylindrical raster-based projection. This spatial representation is then converted into binaural auditory signals using simple pre-recorded responses from a representative room. Quantitative and qualitative evaluations show improvements in accuracy, coverage, timing and suitability for sonification compared to other approaches, as well as effective handling of dynamic objects. An accompanying video demonstrates spatial sonification in room-like environments: https://tinyurl.com/ListenToYourMap

Keywords: Representation, Sonification, Spatial Perception.

Figure 1: Our sensor-centric representation for spatial sonification. a) 2D circular representation and b) 3D cylindrical representation for the Cow and Lady dataset. The circle is coloured by the distance to the obstacle along each azimuthal angle. We also show the selected close points on the surface. Similarly, for the cylinder, we illustrate the distance up to a certain height.

I Introduction

Robotic perception has the potential to interpret the environment on behalf of humans. Over the past two decades, robotic sensors such as cameras or LiDARs have acted as the eyes of robots, becoming essential tools for gathering information on unknown environments. For individuals living with visual impairments, auditory perception is crucial. This need could be fulfilled by transforming the information accumulated by the robotic perception algorithms via sonification. One of the key steps in representing an environment for sonification is to convert spatial data into an organised format that can be understood through sound. This process facilitates the transformation of physical spaces into auditory experiences, allowing users to understand and navigate through sound. The primary goal of this work is to build a robust framework that can precisely represent an environment’s spatial properties such that it can serve as a basis for sonification.

The ability to take lightweight sensor information and incrementally build and maintain a compact representation will enable real-time auditory feedback to support spatial awareness and environment interaction, especially for visually impaired individuals. As we want our representation to cater towards sonification for humans, we take motivation from how human beings perceive an environment through sound. Thus, we transform an incrementally built 3D scene captured with depth data (from an RGB-D camera, for instance) into a 360-degree projection of distance-to-surface information. This approach naturally aligns with human auditory perception while simplifying the information to focus on the most relevant spatial points. To achieve this, we propose a mapping representation based on a sensor-centric organised structure that maintains 2D circular and/or 3D cylindrical rasterised representations (see Fig. 1). These structures capture the geometry of the 3D environment and offer flexibility in how spatial locations are sonified by leveraging VDB-GPDF [1], an efficient online mapping framework based on Gaussian Process (GP) distance and gradient fields and a fast-access VDB data structure (a volumetric, dynamic grid that shares several characteristics with B+trees [2, 3]). In addition, a localisation framework based on VINS-RGBD [4] is combined with our mapping approach to provide a full online Simultaneous Localisation and Mapping (SLAM) solution with synchronised poses and raw dense depth data. Moreover, by incorporating custom binaural room impulse responses (BRIRs), our online incrementally built scene representation offers perceptually robust auditory augmentation of the mapped environment for both near-field and far-field obstacles and unknown space.

We quantitatively evaluate the performance of the proposed mapping representation in terms of accuracy, coverage and timing, demonstrating its ability to convey the required spatial information better than naively using the current sensor information or a state-of-the-art Euclidean Distance Field. In addition, we qualitatively show its performance in the presence of dynamic objects and benchmark it against the information provided by distance field mapping. Finally, the accompanying video showcases spatial sonification for room-like environments.

II Literature Review

Several studies have explored sonification from vision-based representations. While earlier works focused mostly on 2D vision-to-audio transformations [5], more recent approaches utilise 3D vision-based sensors [6, 7]. Although 3D representations generally offer improved accuracy for localisation tasks, 2D systems can achieve comparable performance for navigation with sonification [8]. Some works have directly used 3D raw data for environment representation [9, 10]. In [9], the authors presented a real-time auditory feedback system that assists blind users in obstacle navigation using a depth camera. The system processes depth data to detect obstacles and sonifies this information. However, using depth data directly makes it difficult to turn 3D information into clear auditory cues, as the information can be inconsistent over time. To tackle this, we propose to reconstruct the environment with sequential depth data and encode it into a consistent 360-degree circular/cylindrical raster-based projection.

More recently, there has been a growing trend towards leveraging machine learning techniques, such as generative models, to enhance environment representation for spatial sonification. These methods allow for more adaptive and context-aware sonification systems, improving the overall user experience. In [11], the authors introduced “SEE-2-SOUND,” a zero-shot framework that generates spatial audio from visual inputs such as images, GIFs, and videos without pre-training. A similar task was performed in [12], with the additional use of a large language model to generate context-aware synchronised audio-visual content based on the event. While these systems offer adaptability, they often require substantial computational resources. In contrast, our method provides a lightweight, direct mapping representation of spatial data to auditory cues. This ensures real-time responsiveness while preserving essential spatial details and allows deployment on modest embedded devices.

There are different techniques to sonify a virtual environment. Gao et al. [13] showed quantitative evidence of how different sonification strategies can influence user perception and behaviour. They compared four elevation sonification mapping strategies and found that binary relative mapping performs better than the others and, in combination with azimuth sonification, gives clear directional information, leading to more accurate and efficient task completion. With this in mind, our proposed representation is designed to effectively capture both angular and distance information, thus ensuring an effective translation of spatial data into auditory cues for the sonification process.

There have been many works on sonification dedicated to visually impaired persons [14]. The type of sound used for sonification also varies depending on the task and the type of information it carries. While discrete sounds are good for simpler information and objects that are near the user, continuous sounds are more suited to delivering complex elements in the scene [15]. Schwartz et al. [14] proposed a mobile application that can reconstruct and sonify a scene to guide users with visual impairment in real-time. The work in [16] built a 3D sonification framework to sonify images from video sequences, sending the user a predefined sonic sensorial description through bone-conduction headphones. They combined YOLOv3 [17] for object detection with a probabilistic occupancy map from OctoMap [18] to detect environmental anomalies. These works primarily focus on sonifying immediate surroundings or predefined objects with limited adaptability. They do not offer full 360-degree spatial coverage, focusing mostly on near-field obstacles. Our proposed approach, in contrast, is flexible and robust, offering greater adaptability by conveying spatial information for both near-field and far-field over the full 360-degree range.

III Preliminaries

III-A Localisation

We employ a modified VINS-RGBD [4] localisation framework to generate precise camera poses as inputs for the mapping. A key aspect of our approach is the improved synchronisation between the estimated poses and the raw dense point cloud data captured by the RGB-D camera. By refining the VINS framework, we ensure consistent spatial alignment, which enhances the accuracy and reliability of the mapping process, ultimately enabling more precise and detailed scene reconstruction.

III-B VDB-GPDF Mapping

Our representation is based on the VDB-GPDF mapping framework proposed in [1]. It couples the VDB grid structure with the Gaussian Process Distance Field [19, 20, 21] to gradually construct and maintain a large-scale, dense distance field map of the environment. The depth sensor data (from depth cameras or LiDARs), expressed in the world coordinate frame, is voxelised to create a local VDB structure with 3D points. In every VDB leaf node, the local voxel centres act as training points for GPs. The collection of GPs across all leaf nodes forms the temporary latent Local GP Signed Distance Field (L-GPDF).

For testing, a set of query points is generated by ray-casting from the sensor origin through the voxels in the current reference frame. Each test point is used to query the L-GPDF for the distance field and variance estimates. The queried values from L-GPDF are then fused into a global VDB grid map by updating the voxel distances using the weighted sum method. Following the fusion of the distances and surface properties, the marching cubes algorithm [22] is used to reconstruct a dense surface by identifying the active voxels in the global VDB. The Global Gaussian Process Distance Field (G-GPDF) is made up of the zero-crossing points from active leaf nodes. At this stage, queries for the G-GPDF are computed by averaging the inferred distances from neighbouring GP nodes. For further details refer to [1].
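As a minimal sketch of the weighted-sum fusion step, the snippet below (in C++, with a simplified, hypothetical Voxel record and an inverse-variance weight that are our assumptions rather than the exact scheme of [1]) fuses one L-GPDF query into a voxel of the global grid; the actual implementation operates on OpenVDB leaf nodes and also fuses surface properties such as colour.

// Simplified voxel record; the real framework stores such data in an OpenVDB grid.
struct Voxel {
  float distance = 0.0f;  // fused distance value
  float weight = 0.0f;    // accumulated fusion weight
};

// Weighted-sum update of a voxel with a new L-GPDF distance estimate.
// A lower predictive variance yields a larger weight for the new measurement.
inline void fuseDistance(Voxel& v, float d_new, float variance) {
  const float w_new = 1.0f / (variance + 1e-6f);
  v.distance = (v.weight * v.distance + w_new * d_new) / (v.weight + w_new);
  v.weight += w_new;
}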

IV Circular/Cylindrical Rasterisation

We aim to incrementally build a scene representation that facilitates spatial sonification of an unknown environment. We propose a representation based on VDB-GPDF that is encoded to cater for the sonification requirements of future human sensory augmentation. Given the sequentially built G-GPDF model and the extracted 3D surface, we encode the scene into a 2D or 3D 360-degree raster-based projection. With this representation, pre-recorded binaural room impulse response filters are used to create the rendered spatial sound.

IV-A 2D Circular Representation

Spatial audio feedback requires accurate but simple information; thus, we propose to represent the incrementally built dense 3D map of the environment as a 2D circular sensor-centric representation. To ensure our representation aligns with human navigation requirements, we segment the ground and non-ground points. Any point above the ground and within the height range of a human, which can potentially be an obstacle, is classified as a non-ground point. Then, we project all these 3D non-ground points onto a 2D plane. On this plane, every 3D point is reduced to its corresponding 2D coordinates, maintaining spatial relationships within the scene.

A circular grid centred at the location of the depth sensor is created by rasterising these 2D points. Rasterisation is faster and more efficient than raycasting, making it ideal for real-time processes and handling complex scenes. We do this by calculating the radial distance and azimuthal angle for every point with respect to the sensor. The azimuthal angle is derived by calculating the cross-product between the sensor’s directional vector and the vector extending from the sensor to the point of interest.

After that, this angle is normalised to fit inside a 360° circular representation. Each degree denotes a distinct angle surrounding the position of the sensor. The distance between the point and the sensor is used to calculate the radial distance. This enables us to rasterise each point, according to its distance from the sensor, onto a circular plane.

This process of rasterisation yields a complete 360° depiction of the scene, with every point corresponding to a distinct angle and distance from the sensor, ensuring that the closest visible points surrounding the sensor's position are captured. This minimalist circular representation is especially advantageous for sonifying the three-dimensional scene. As we explain in the following section, each point on the circle can be associated with a specific sound, allowing for intuitive sound cues that represent the 3D environment. By keeping only the closest point at each angle, we preserve the most pertinent spatial information, which improves the clarity of the audio feedback.
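The per-angle selection can be sketched as follows. This is a hypothetical, simplified C++ snippet: the type and function names are ours, the azimuth is computed with atan2 rather than the cross-product formulation described above, and unscanned directions are marked with infinity.

#include <algorithm>
#include <array>
#include <cmath>
#include <limits>
#include <vector>

struct Point2D { float x, y; };

// Rasterise the projected non-ground points into 360 one-degree azimuth bins,
// keeping only the closest radial distance for each bin.
std::array<float, 360> rasteriseCircle(const std::vector<Point2D>& points,
                                       const Point2D& sensor, float yaw) {
  constexpr float kPi = 3.14159265f;
  std::array<float, 360> closest;
  closest.fill(std::numeric_limits<float>::infinity());  // unscanned directions
  for (const auto& p : points) {
    const float dx = p.x - sensor.x, dy = p.y - sensor.y;
    const float range = std::sqrt(dx * dx + dy * dy);      // radial distance
    float az = (std::atan2(dy, dx) - yaw) * 180.0f / kPi;  // azimuth in degrees
    az = std::fmod(az + 720.0f, 360.0f);                   // wrap to [0, 360)
    const int bin = static_cast<int>(az);
    closest[bin] = std::min(closest[bin], range);          // keep the closest point
  }
  return closest;
}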

IV-B 3D Cylindrical Representation

We take our representation further by generating a 3D cylindrical structure instead of only a circle. This approach is guided by the intuition that the human is positioned within the cylinder. The reconstructed 3D mesh is rasterised onto a cylindrical surface to create the 3D cylindrical representation. This representation provides a structured, sensor-centric view of the surroundings. The cylindrical grid is discretised into vertical (elevation) and azimuthal intervals. The elevation spans heights between 0.1 and 2 meters above the sensor, while the azimuth is divided into 360 one-degree intervals.

Each 3D point in the scene is mapped to a specific azimuth-elevation pair, and only the closest point (in terms of radial distance) for each pair is retained. This process effectively reprojects the 3D points onto a simplified 2D representation of the scene. The reprojection occurs by collapsing the radial dimension, projecting all points onto a central 2D plane or line located at the center of the cylinder. The resulting 2D map provides a compact representation of the scene’s geometry, where each point in the azimuth-elevation grid corresponds to the nearest surface point along the cylinder’s radial direction.
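This cylindrical rasterisation can be sketched by extending the circular case with elevation bins. The snippet below is a hypothetical simplification: the 0.1 m elevation band width, the type names and the use of infinity for unknown cells are our assumptions, not parameters stated by the framework.

#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct Point3D { float x, y, z; };

constexpr int kAzBins = 360;   // one-degree azimuth resolution
constexpr int kElBins = 19;    // assumed 0.1 m elevation bands from 0.1 m to 2 m
constexpr float kMinZ = 0.1f;  // metres above the sensor
constexpr float kMaxZ = 2.0f;

using CylinderMap = std::vector<std::vector<float>>;  // [azimuth][elevation] -> range

CylinderMap rasteriseCylinder(const std::vector<Point3D>& points,
                              const Point3D& sensor, float yaw) {
  constexpr float kPi = 3.14159265f;
  CylinderMap cyl(kAzBins, std::vector<float>(
      kElBins, std::numeric_limits<float>::infinity()));
  for (const auto& p : points) {
    const float dz = p.z - sensor.z;
    if (dz < kMinZ || dz > kMaxZ) continue;        // outside the cylinder height
    const float dx = p.x - sensor.x, dy = p.y - sensor.y;
    const float r = std::sqrt(dx * dx + dy * dy);  // radial distance to the surface
    float az = (std::atan2(dy, dx) - yaw) * 180.0f / kPi;
    const int az_bin = static_cast<int>(std::fmod(az + 720.0f, 360.0f));
    const int el_bin = std::min(
        static_cast<int>((dz - kMinZ) / (kMaxZ - kMinZ) * kElBins), kElBins - 1);
    cyl[az_bin][el_bin] = std::min(cyl[az_bin][el_bin], r);  // keep the closest point
  }
  return cyl;
}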

IV-C Sonification

We demonstrate the ability of our proposed representation to provide spatial information for cognitive sonification. An auditory framework is applied with the spatial representation to generate binaural auditory signals. A foundational aspect of the sonification process is to leverage the natural auditory perspective provided by human spatial hearing. In other words, humans naturally perceive the location of sounds around them and we design the representation to take advantage of spatial hearing perception.

Figure 2: The sound environment used for the BRIR recordings. The HATS manikin is at the lower right.

In this work, augmented or virtual spatial audio is enabled via the use of binaural room impulse responses (BRIRs) which are filters that mathematically specify the transformation of sound from a particular source location in a room to the ears of the individual listening to the map. There is a separate BRIR filter for each ear and each source location. Note that BRIR filters are unique for each individual in a perceptually relevant way because of morphological differences in ear shape. The data accumulated from the map representation has the spatial geometry of the room or environment. This information is sufficient to compute BRIRs using room simulation techniques. However, in this work, we use custom BRIRs that were prerecorded in a representative laboratory environment that includes a polypropylene carpet, a motion capture metallic frame, wooden beams with speakers mounted on them and other furniture (see Fig. 2).

The BRIR measurements were recorded using the Brüel & Kjær type 4128C Head and Torso Simulator (HATS) with two in-ear microphones together with a Brüel & Kjær type 9640 turntable. We arranged 10 small loudspeakers in a row at a height consistent with the audiovisual horizon of the HATS model. The 10 loudspeakers were positioned directly in front of the HATS model at zero degrees of azimuth, at distances varying from 0.4 m to 4 m in steps of 0.4 m. Using the turntable, the HATS model was rotated and BRIRs were recorded every 3° of azimuth from -177° to 180°. The BRIR files were saved as Spatially Oriented Format for Acoustics (SOFA) files [23]. The full loudspeaker recording positions are shown in Fig. 3.
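To illustrate how a rendered source can be matched to these recordings, the hypothetical helper below snaps a requested azimuth and distance onto the 3° by 0.4 m measurement grid; the index layout and rounding policy are our assumptions, and the selected SOFA filters would then be applied to the sound by convolution, one filter per ear.

#include <algorithm>
#include <cmath>

// Indices into the recorded BRIR grid: 120 azimuths every 3 degrees from
// -177 to 180 degrees, and 10 distances from 0.4 m to 4 m in 0.4 m steps.
struct BrirIndex { int azimuth; int distance; };

BrirIndex nearestBrir(float azimuth_deg, float distance_m) {
  float az = azimuth_deg;
  while (az <= -180.0f) az += 360.0f;  // wrap to (-180, 180]
  while (az > 180.0f) az -= 360.0f;
  // Snap to the nearest 3-degree recording position.
  int az_idx = std::clamp(static_cast<int>(std::lround((az + 177.0f) / 3.0f)), 0, 119);
  // Snap to the nearest of the 10 recorded distances.
  int d_idx = std::clamp(static_cast<int>(std::lround(distance_m / 0.4f)) - 1, 0, 9);
  return {az_idx, d_idx};
}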

Figure 3: The loudspeaker recording positions used for the BRIR measurements.

In order to demonstrate the suitability of the mapping representation for spatial sonification, we focus on the 2D circular representation. We sonify the distance to the surface for all directions around the circle counter-clockwise. This simulates a scanning “length-changing cane”: the cane automatically taps the closest surface in each discrete direction as it scans. As mentioned above, the BRIRs were recorded every 3°, whereas our circular and cylindrical representations have 1° resolution. To avoid sensory overload, we choose 10° as the resolution, separating the 360-degree range data into 36 angular sectors.

For each angular sector, the angle and range corresponding to the smallest range value are selected. The selected range data for each sector is then converted to binaural spatial audio by applying the appropriate BRIR filter to a ‘tap’ sound. Note that if a particular angular sector has not yet been scanned, a ‘woosh’ sound is played instead of a ‘tap’ sound. The range data are scaled into the 0.4 m to 4 m range of the BRIRs using a constant scaling chosen appropriately for the given environment, and distances beyond 4 m are clamped to 4 m. This sonification distance range fits well with the perceptual properties of human distance perception, which is significantly poorer than human auditory direction perception [24]; we also generally perceive distance in the near-field (< 2 m) better than in the far-field (> 2 m). To further enhance distance perception, we apply a slight pitch shift to the tap sound: ‘tap’ sounds closer than 1.5 m are shifted down in pitch by four semitones and ‘tap’ sounds further than 2.5 m are shifted up in pitch by four semitones. Our accompanying video demonstrates these effects. Listening through headphones is recommended to hear the sounds at the correct azimuth location.
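The per-sector cue selection can be sketched as follows, using the 2D circle from the previous section. This is a simplified, hypothetical C++ snippet: the structure and parameter names are ours, the clamping of the scaled distance to the lower 0.4 m bound is an assumption, and the actual rendering then applies the selected BRIR filters and pitch shift to the ‘tap’ (or ‘woosh’) sound.

#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

constexpr int kSectorDeg = 10;        // 36 sectors over 360 degrees
constexpr float kMinBrirDist = 0.4f;  // metres, closest recorded BRIR
constexpr float kMaxBrirDist = 4.0f;  // metres, farthest recorded BRIR

struct TapCue {
  bool known;        // false -> play the 'woosh' sound instead of a 'tap'
  int azimuth_deg;   // sector direction used to pick the BRIR filter pair
  float distance_m;  // scaled and clamped into the BRIR distance range
  int semitones;     // pitch shift applied to the 'tap' sound
};

// Convert one 10-degree sector of the circular range map into a sound cue.
TapCue sectorToCue(const std::vector<float>& circle, int sector, float scale) {
  TapCue cue{false, sector * kSectorDeg + kSectorDeg / 2, kMaxBrirDist, 0};
  float best = std::numeric_limits<float>::infinity();
  for (int a = sector * kSectorDeg; a < (sector + 1) * kSectorDeg; ++a)
    best = std::min(best, circle[a]);    // closest surface in the sector
  if (!std::isfinite(best)) return cue;  // sector not yet scanned
  cue.known = true;
  cue.distance_m = std::clamp(best * scale, kMinBrirDist, kMaxBrirDist);
  if (cue.distance_m < 1.5f) cue.semitones = -4;       // near: lower pitch
  else if (cue.distance_m > 2.5f) cue.semitones = +4;  // far: higher pitch
  return cue;
}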

V Evaluation

To the best of our knowledge, no open-source state-of-the-art spatial sonification frameworks are available for comparison. Therefore, we quantitatively evaluate the proposed representation for spatial sonification in terms of a) efficiency, b) representation accuracy and c) coverage with respect to using depth images; we then d) qualitatively demonstrate the representation's performance with a dynamic object and e) its suitability for sonification with respect to Euclidean distance fields. Our framework is implemented in C++ based on ROS1. All experiments were run on a 12th Gen Intel® Core™ i5-1245U with 12 cores.

The first dataset is the Cow and Lady [25], which includes fibreglass models of a large cow and a lady standing side by side in a room. The dataset consists of RGB-D point clouds collected using a Kinect 1 camera, with the sensor trajectory captured using a Vicon motion system. We use all frames of the dataset, which cover the major parts of the scene over 360 degrees. We use the ground truth map of the Cow and Lady, which allows us to perform a proper quantitative evaluation of representation accuracy. As our second dataset, we run the online SLAM in a room with a live RealSense RGB-D camera. The room contains a human as a dynamic object. The poses are computed from the VINS [4] localisation framework as described above.

Figure 4: The efficiency performance of our proposed representation for spatial sonification with respect to different voxel resolutions.
Figure 5: Quantitative comparisons of the accuracy in RMSE on the Cow and Lady dataset with varying voxel resolutions.

V-A Time Evaluation

We first evaluate the time consumption to show that our representation can provide online performance for sonification. Fig. 4 shows the statistical computational time for the Cow and Lady dataset. Note that our framework has the free-space carving methods [1] enabled to update moving objects in the scene. For simplicity, we compute the time for all the processes, including fusion, the 2D circle, and the 3D cylinder. The results illustrate that our representation remains efficient across varying resolutions, enabling online performance.

V-B Representation Accuracy

For the accuracy evaluation, we use the raw depth images in the same way as in our framework to compute the 2D circle and 3D cylinder. Note that we apply the ground truth point cloud of the Cow and Lady to calculate the ground truth circle and cylinder. We then quantitatively compare our circle and cylinder representations and the depth images, at different map resolutions, against the ground truth. As demonstrated in Fig. 5, at the beginning of the dataset all RMSEs are high due to noisy measurements when the drone is taking off. The RMSE then drops as more information is collected during steady exploration. The RMSE of the depth sensor circle and cylinder is noisy due to the lack of fusion over time. Our representation outperforms the depth representations in accuracy across different resolutions. Note that the depth sensor RMSE occasionally has lower values than ours; this is because the depth representation is only computed within the camera's field of view. Therefore, compared to the ground truth and our representation, only a few points in the depth circle and cylinder carry information and are evaluated for RMSE.
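Under the assumption that unknown bins are stored as infinity, the per-frame RMSE of a circular map against the ground-truth circle might be computed as in the sketch below (a hypothetical helper consistent with the evaluation protocol described above, where only bins carrying information are compared).

#include <cmath>
#include <vector>

// RMSE between an estimated circular map and the ground-truth circle,
// evaluated only over bins that carry finite values in both.
float circleRmse(const std::vector<float>& estimate,
                 const std::vector<float>& ground_truth) {
  double sum = 0.0;
  int n = 0;
  for (std::size_t i = 0; i < estimate.size(); ++i) {
    if (!std::isfinite(estimate[i]) || !std::isfinite(ground_truth[i])) continue;
    const double e = estimate[i] - ground_truth[i];
    sum += e * e;
    ++n;
  }
  return n > 0 ? static_cast<float>(std::sqrt(sum / n)) : 0.0f;
}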

Figure 6: Quantitative comparisons of the coverage on the Cow and Lady dataset with varying voxel resolutions.

V-C Coverage Over Frames

We again compare the coverage of our framework against the depth sensor. Because the ground truth has missing areas, we use the fully reconstructed mesh as the benchmark to compute the coverage. As demonstrated in Fig. 6, as the mapping voxel resolution varies from 5 cm to 15 cm, the coverage of the depth representations remains the same: the depth circle covers below 20%, and the cylinder covers around 10%. Our incremental representation accumulates and fuses measurements, so its coverage grows over frames. Note that the coverage occasionally fluctuates when passing through unexplored areas to gather more information. With a high resolution of 5 cm, our circle reaches 100% coverage, and our cylinder covers over 90% at the end.
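Assuming the same convention of marking unknown bins with infinity, the coverage of a circular map relative to the benchmark circle built from the fully reconstructed mesh could be computed as in this hypothetical sketch.

#include <cmath>
#include <vector>

// Percentage of benchmark bins for which the estimated circle has information.
float circleCoverage(const std::vector<float>& estimate,
                     const std::vector<float>& benchmark) {
  int covered = 0, total = 0;
  for (std::size_t i = 0; i < estimate.size(); ++i) {
    if (!std::isfinite(benchmark[i])) continue;  // benchmark has no surface here
    ++total;
    if (std::isfinite(estimate[i])) ++covered;
  }
  return total > 0 ? 100.0f * covered / total : 0.0f;
}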

Figure 7: We map the scene with a live camera online to show the ability to deal with dynamic objects (panels a-d).

V-D Dynamic Moving Object Validation

We aim to show that our representation has the ability to deal with dynamic objects, which is common in real-world scenarios. Our framework takes poses and points as raw input, so we use localisation frameworks such as VINS [4] to provide proper poses. We move a live camera to map the scene online. In Fig. 7, our 2D circle is fully reconstructed over 360 degrees, and the projection of objects onto the circle can be clearly visualised at each angle. From Fig. 7a to Fig. 7d, an object was present in the scene and later moved away. Our circle is updated with respect to the latest status of the scene.

Figure 8: a) Our circular representation encodes the radial distance at a given angle over the scan. b) EDF, however, encodes the distance and directions to the closest surfaces.

V-E Comparison with EDF

This section aims to show the suitability of our representation for sonification over the Euclidean distance field (EDF), which is better suited for path planning and navigation. An EDF [1] usually provides a distance value and the gradient of the distance given a point's location in space; it always measures the distance and direction from the point to the closest surface. As we show in Fig. 8a, our circle has bearing angles spanning 360 degrees, and for each bearing we have the closest distance along the ray (the ray arrow is shown in blue). However, as we see in Fig. 8b, querying the EDF at each point on the circle gives the distance and gradient to the closest surface. The distance gradients (shown in blue) on the circle point in varying directions, as the closest surface could lie in any direction. This is not suitable for sonification since the bearing does not vary sequentially.

Figure 9: Incrementally built scene with a live depth camera using our framework, which maps, self-localises, and maintains the proposed sonification representation.

V-F Sonification Demonstration

In our accompanying video and Fig. 9, we show an incrementally built and sonified room-sized environment using the proposed framework. Incremental localisation (using VINS [4]), mapping (with VDB-GPDF [1]), circular rasterisation and sonification are shown in real time. Listening through headphones is recommended to hear the sounds at the correct azimuth location.

VI Conclusion

We propose an online, incrementally built representation to provide spatial auditory information of unknown environments. This approach has the potential to aid individuals with visual impairments. By leveraging a sensor-centric mapping structure based on depth sensors, we effectively convert 3D environments into 360-degree auditory representations. Our approach, which includes the VDB-GPDF, plane projection, rasterisation and binaural room impulse responses, ensures fast, accurate, complete and perceptually robust mapping for sonification. We demonstrate its performance in both static and dynamic environments. The successful implementation and evaluation of this framework highlight its potential to significantly improve the quality of life for visually impaired individuals by providing an intuitive and efficient means of navigating their surroundings. In future work, we aim to perform user studies targeting a reduction of the users' cognitive load for the 3D representation.


References

  • [1] L. Wu, C. L. Gentil, and T. Vidal-Calleja, “Vdb-gpdf: Online gaussian process distance field with vdb structure,” arXiv preprint arXiv:2407.09649, 2024.
  • [2] K. Museth, “Vdb: High-resolution sparse volumes with dynamic topology,” ACM transactions on graphics (TOG), vol. 32, no. 3, pp. 1–22, 2013.
  • [3] K. Museth, J. Lait, J. Johanson, J. Budsberg, R. Henderson, M. Alden, P. Cucka, D. Hill, and A. Pearce, “Openvdb: an open-source data structure and toolkit for high-resolution volumes,” in Acm siggraph 2013 courses, 2013, pp. 1–1.
  • [4] Z. Shan, R. Li, and S. Schwertfeger, “Rgbd-inertial trajectory estimation and mapping for ground robots,” Sensors, vol. 19, no. 10, p. 2251, 2019.
  • [5] J. Ward and P. Meijer, “Visual experiences in the blind induced by an auditory sensory substitution device,” Consciousness and cognition, vol. 19, no. 1, pp. 492–500, 2010.
  • [6] K. Yang, K. Wang, S. Lin, J. Bai, L. M. Bergasa, and R. Arroyo, “Long-range traversability awareness and low-lying obstacle negotiation with realsense for the visually impaired,” in Proceedings of the 1st International Conference on Information Science and Systems, 2018, pp. 137–141.
  • [7] Z. Li, F. Song, B. C. Clark, D. R. Grooms, and C. Liu, “A wearable device for indoor imminent danger detection and avoidance with region-based ground segmentation,” IEEE Access, vol. 8, pp. 184 808–184 821, 2020.
  • [8] L. Commère and J. Rouat, “Sonified distance in sensory substitution does not always improve localization: Comparison with a 2-d and 3-d handheld device,” IEEE Transactions on Human-Machine Systems, vol. 53, no. 1, pp. 154–163, 2022.
  • [9] M. Brock and P. O. Kristensson, “Supporting blind navigation using depth sensing and sonification,” in Proceedings of the 2013 ACM conference on Pervasive and ubiquitous computing adjunct publication, 2013, pp. 255–258.
  • [10] C. Stoll, R. Palluel-Germain, V. Fristot, D. Pellerin, D. Alleysson, and C. Graff, “Navigating from a depth image converted into sound,” Applied bionics and biomechanics, vol. 2015, no. 1, p. 543492, 2015.
  • [11] R. Dagli, S. Prakash, R. Wu, and H. Khosravani, “See-2-sound: Zero-shot spatial environment-to-spatial sound,” arXiv preprint arXiv:2406.06612, 2024.
  • [12] X. Su, J. E. Froehlich, E. Koh, and C. Xiao, “Sonifyar: Context-aware sound generation in augmented reality,” arXiv preprint arXiv:2405.07089, 2024.
  • [13] Z. Gao, H. Wang, G. Feng, and H. Lv, “Exploring sonification mapping strategies for spatial auditory guidance in immersive virtual environments,” ACM Transactions on Applied Perceptions (TAP), vol. 19, no. 3, pp. 1–21, 2022.
  • [14] B. S. Schwartz, S. King, and T. Bell, “Echosee: An assistive mobile application for real-time 3d environment reconstruction and sonification supporting enhanced navigation for people with vision impairments,” Bioengineering, vol. 11, no. 8, p. 831, 2024.
  • [15] K. Peetoom, M. Lexis, M. Joore, C. Dirksen, and L. Witte, “Disability and rehabilitation: Assistive technology,” Adv. Intell. Syst. Comput, vol. 10, pp. 271–294, 2015.
  • [16] Y. Zhao, R. Huang, and B. Hu, “A multi-sensor fusion system for improving indoor mobility of the visually impaired,” in 2019 Chinese Automation Congress (CAC).   IEEE, 2019, pp. 2950–2955.
  • [17] J. Redmon, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
  • [18] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, “Octomap: an efficient probabilistic 3d mapping framework based on octrees,” Autonomous Robots, pp. 189–206, 2013.
  • [19] L. Wu, K. M. B. Lee, L. Liu, and T. Vidal-Calleja, “Faithful euclidean distance field from log-gaussian process implicit surfaces,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 2461–2468, 2021.
  • [20] L. Wu, K. M. B. Lee, C. Le Gentil, and T. Vidal-Calleja, “Log-GPIS-MOP: A Unified Representation for Mapping, Odometry, and Planning,” IEEE Transactions on Robotics, pp. 1–17, 2023.
  • [21] C. Le Gentil, O.-L. Ouabi, L. Wu, C. Pradalier, and T. Vidal-Calleja, “Accurate gaussian-process-based distance fields with applications to echolocation and mapping,” IEEE Robotics and Automation Letters, 2023.
  • [22] W. E. Lorensen and H. E. Cline, “Marching cubes: A high resolution 3d surface construction algorithm,” in Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, ser. SIGGRAPH ’87.   Association for Computing Machinery, 1987.
  • [23] AES Standards Committee, “Aes69-2022: Aes standard for file exchange - spatial acoustic data file format,” 2022.
  • [24] A. J. Kolarik, B. C. J. Moore, P. Zahorik, S. Cirstea, and S. Pardhan, “Auditory distance perception in humans: a review of cues, development, neuronal bases, and effects of sensory loss,” Attention, Perception, & Psychophysics, vol. 78, no. 2, pp. 373–395, Feb. 2016.
  • [25] H. Oleynikova, Z. Taylor, M. Fehr, R. Siegwart, and J. Nieto, “Voxblox: Incremental 3d euclidean signed distance fields for on-board mav planning,” in 2017 IEEE/RSJ IROS, 2017, pp. 1366–1373.