
Proceedings of the Institute of Acoustics

 

Normalizing flow-based Hilbert mapping for 3D reconstruction using imaging sonar

 

Sushrut Surve, Sibley School of Mechanical and Aerospace Engineering, Cornell University, Ithaca, NY, USA

Hewenxuan Li, Sibley School of Mechanical and Aerospace Engineering, Cornell University, Ithaca, NY, USA

E. Baker Herrin, Department of Mechanical and Aerospace Engineering, University of Florida, Gainesville, FL, USA

Jane Shin, Department of Mechanical and Aerospace Engineering, University of Florida, Gainesville, FL, USA

 

1 INTRODUCTION

 

3D target reconstruction refers to the task of inferring the shape of the target surface as an implicit or explicit representation from a set of sensor measurements. Various robotics applications, including manipulation and occlusion avoidance, necessitate real-time online planning strategies to create environment maps from measurements and devise efficient paths based on these maps. In underwater settings, reconstructing targets using acoustic sensors installed on autonomous underwater vehicles is crucial for tasks like search-and-rescue operations and ocean monitoring. However, target reconstruction methods are primarily developed for optical sensors, which do not perform well in conditions with poor visibility. Thus, 3D reconstruction strategies for acoustic sensors need to be developed. 

 

This work focuses on online reconstruction approaches, which process measurements as they are recorded and infer the shape representations of the target being observed. Traditional approaches in mapping and reconstruction involve storing maps as occupancy grids [1], which are discrete and are updated incrementally with consecutive measurements. However, updating these grids is resource-intensive in both computation and memory, and the grid representation assumes spatial independence between cells. Some of these issues were addressed in follow-up work [2,3,4]. However, the discretization of the workspace in these approaches prevents their usage with optimization-based planning methods, which have become prevalent in recent times. Continuous occupancy mapping frameworks were also introduced [5,6], which do not require discretization of the workspace but were too computationally inefficient to be used in real-time applications. The Hilbert mapping approach [7] was proposed to address these challenges and generate continuous occupancy maps more efficiently. Additionally, Hilbert maps can be queried faster, incrementally updated without repeating computations, and have been shown to reconstruct large-scale scenes [8].

 

The classical mapping approaches aim to solve a sensor measurement fusion problem to infer the underlying target geometry as a probabilistic occupancy map. However, in active perception tasks such as active 3D target reconstruction, online planning approaches need access to predictions of target geometries given past measurements, to decide the best future views or avoid probable collisions with the target. Several learning-based methods [9,10] have been developed to learn shape and geometric priors about certain categories of targets and predict complete shape representations given partially measured targets. However, these methods perform shape completion well only for shapes in the training dataset and do not generalize to new shapes. Thus, generative modeling-based shape generation frameworks such as DeepSDF [11] and PointFlow [12] have been proposed for shape completion. A disadvantage of directly using these methods for reconstruction is that they predict an underlying latent shape representation given a partially observed target point cloud and predict the complete shape based on this representation. However, these completed point clouds do not encode any information about the original measured partial point cloud. Thus, information about which part of the target has already been measured is lost. This work focuses on leveraging the exact likelihood inference property of normalizing flow-based generative models to model the underlying probability distribution of the shape represented by the measurements. This makes it easier to interface these models with classical probabilistic occupancy mapping approaches.

 

This paper presents a novel approach to augment Hilbert maps obtained from 3D point cloud measurements with shape predictions from generative models. Firstly, online learning of Hilbert maps with 3D point clouds is introduced in Section 3.1. A new method to estimate the predicted shape probability distribution given sensor measurements is presented in Section 3.2. Finally, a probabilistic framework to fuse predicted shape probability distributions with the continuous occupancy probability maps learned using the Hilbert mapping approach is presented in Section 3.3. Section 4 aims to demonstrate how the presented generative model-based 3D target reconstruction strategy is used with imaging sonar measurements. 

 

2 PROBLEM FORMULATION 

 

This paper considers the problem of reconstructing the 3D surface of a target of interest from sensor measurements in real time. This problem is relevant to many underwater sensing and online planning applications involving 3D targets that need to be mapped by taking multiple views from different aspect angles using acoustic sensors such as forward-looking sonar or side-scan sonar. Let the environment (sea bottom) and the target of interest be characterized by a 3D surface denoted by T ⊂ W, where W ⊂ R^3 denotes the workspace. Let FW denote the inertial frame, with origin OW, embedded in W. The position and orientation of the target with respect to FW are known a priori. This paper focuses on visual and acoustic sensors that measure 3D point clouds lying on the target surface T. These 3D point clouds can be obtained from 3D sonar scanners, spatial or temporal imaging sonar stereo, or opto-acoustic stereo imaging. The sensor geometry is denoted by A ⊂ W and the sensor field-of-view geometry by S ⊂ W. The sensor is assumed to be mounted on an unmanned underwater vehicle (UUV), which offers it mobility to move around in the workspace W. The sensor frame of reference FA is embedded in this mobile sensor platform with origin OA. The sensor records a measured point cloud Zk = {p1, ..., pm} in the sensor frame of reference FA, where m is the number of 3D points in the measured point cloud Zk and pi ∈ W is a point on the surface. Since the sensor pose s is assumed known, the measured point cloud is transformed to be represented in FW. The sensor localization and measurements are subject to noise. Thus, the measurement points in Z do not lie exactly on the target surface T. In other words, the measurement Z represents an estimate of the target geometry, denoted by T̂.

 

This paper focuses on inferring the target surface T as an occupancy field defined by an occupancy function, given Z. The occupancy function o : R^3 → {0, 1} for a target T is defined as,

 

o(x) = 1 if x ∈ T, and o(x) = 0 otherwise.

The occupancy field O(x) is a discrete-state stochastic random process defined for every x ∈ W by a probability mass function P(o(x) = 1). Thus, given a measurement Z, the objective is to estimate the conditional probability P(o(x) = 1 | Z) for every x ∈ W.
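To make the definition concrete, the short sketch below evaluates o(x) for a hypothetical spherical target; the sphere is an illustration of ours, not a shape from the paper's experiments.

```python
import numpy as np

def occupancy(x, center=np.zeros(3), radius=1.0):
    """Occupancy function o(x) for an illustrative spherical target T:
    returns 1 if the query point lies on or inside the sphere, 0 otherwise."""
    return int(np.linalg.norm(np.asarray(x, dtype=float) - center) <= radius)

print(occupancy([0.5, 0.0, 0.0]))  # point inside the target -> 1
print(occupancy([2.0, 0.0, 0.0]))  # point in free space -> 0
```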

 

A few considerations, which make estimating the occupancy field difficult, are needed while designing a solution to this problem, especially in the context of online reconstruction. The reconstruction approach must be computationally efficient, enabling real-time querying and fast updates as new measurements are recorded. Sensor measurement and localization noise need to be addressed. Since targets can be self-occluded or occluded by other objects in the workspace, measurements often record only a part of the target surface. Thus, the reconstruction approach should be augmented with predictions of unobserved regions of the target surface, especially if the resulting occupancy field is to be used for planning.

 

3 METHODOLOGY 

 

This section describes the detailed methodology for generative model-informed estimation of the random field O(x) given a measured 3D point cloud Z. Two approaches to learning these occupancy fields are used simultaneously: online-learning-based Hilbert mapping and shape distribution prediction using the pretrained normalizing-flow-based PointFlow model (PFM). First, the Hilbert mapping approach, described in Section 3.1, is used to estimate the occupancy probability P(oh(x) = 1 | Z), where oh is the occupancy function estimated by the Hilbert map. Simultaneously, the measurement Z is used by the probabilistic shape-generation PFM to predict the complete target surface, given 3D point cloud measurements of the partially observed target, as described in Section 3.2. Specifically, a pre-trained auto-encoder Q predicts the shape latent variable y using the measurement set Z, and conditioned on this y, a normalizing flow-based generative model G predicts the distribution P(og(x) = 1 | y), where og is the occupancy function as predicted by the generative model. Finally, Section 3.3 describes how these two occupancy field estimates are fused in real time by leveraging information from the Hilbert mapping approach together with shape and geometric priors learned by the generative model.

 

3.1 Hilbert Maps for 3D target reconstruction 

 

The Hilbert mapping approach estimates the occupancy field by projecting the measurements into a reproducing kernel Hilbert space (RKHS) and performing regression in this space. As a result, the occupancy probability mass function P(oh(x) = 1 | Z), capable of representing complex target geometry observed in the measurements, is learned online using simple and computationally efficient logistic regression techniques. Additionally, this probabilistic framework has been shown to be robust to measurement and localization noise [7], lending itself to various real-world applications.

 

To learn P(oh(·) | Z) incrementally using the Hilbert mapping approach, a dataset D is created at every time step using the measurement Z and the sensor field-of-view S. This dataset consists of the m 3D points in the measured point cloud Z and n non-occupied points sampled from the free space S \ T̂. These non-occupied points are sampled along the rays connecting the sensor pose s and the points in Z. Thus, D comprises n + m 3D points, also called anchor points, and an occupancy label y(t) for every anchor point t ∈ D such that

 

y(t) = +1 if t ∈ Z, and y(t) = −1 otherwise.
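The construction of D can be sketched as follows; the number of free-space samples per ray and the ±1 label convention used here are illustrative assumptions on our part, not values from the paper.

```python
import numpy as np

def build_dataset(Z, sensor_pos, n_free_per_ray=2, rng=None):
    """Build the Hilbert-map training set D: measured points labeled +1,
    free-space points sampled along each sensor ray labeled -1.
    (Illustrative sketch; sampling density is an assumption.)"""
    rng = np.random.default_rng(rng)
    pts, labels = [], []
    for p in Z:
        pts.append(p)                 # occupied anchor point from the measurement
        labels.append(+1)
        for _ in range(n_free_per_ray):
            t = rng.uniform(0.1, 0.9) # fraction of the way along the sensor ray
            pts.append(sensor_pos + t * (p - sensor_pos))  # non-occupied point
            labels.append(-1)
    return np.array(pts), np.array(labels)

Z = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # toy measurement (m = 2)
anchors, labels = build_dataset(Z, sensor_pos=np.zeros(3), n_free_per_ray=2, rng=0)
print(anchors.shape)  # -> (6, 3): 2 occupied + 4 free anchor points
```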

Given D, a logistic regression classifier [13] is used to learn a discriminative model to predict the occupancy probability P(oh(x) = 1 | Z) for a query point x ∈ W. Hence, the occupancy probability is defined as

 

P(oh(x) = 1 | Z) = 1 / (1 + exp(−f(x))),

where f ∈ H is a Hilbert function defined in W. The Hilbert function f is obtained from an inner product operation of the form:

 

f(x) = ⟨w, ϕ(x)⟩H = w^T ϕ(x),

where ⟨·, ·⟩H indicates the inner product in the RKHS and ϕ(x) denotes the feature vector of x. The weight vector w is learned online to determine the decision boundary in the space to which the feature vector maps the point x. In this paper, the radial basis function is used to construct the feature vector for a point x in the input space by correlating it with anchor points in D. For a point x, the kernel value k(x, tj) with respect to the anchor point tj is given by,

 

k(x, tj) = exp(−(1/2) dj^T Σ^{-1} dj),

where

 

Σ = diag(ℓ1^2, ..., ℓd^2)

 

is called the length-scale matrix [8], dj = x − tj = [∆xj ∆yj ∆zj]^T, and d is the input space dimensionality, which is 3 in the case of 3D point clouds. Thus, the (n + m)-dimensional feature vector for a point x in R^3 is produced by concatenating the kernel values for all n + m anchor points as follows,

 

ϕ(x) = [k(x, t1), k(x, t2), ..., k(x, tn+m)]^T.

This procedure for calculating feature vectors can become computationally expensive as the number of anchor points increases. Thus, sparsity is enforced by calculating the entries k(·, tj) only for a subset of anchor points closest to x while setting all other entries to zero. This subset of the k nearest anchor points for x is found using nearest-neighbor search structures such as KD-trees.
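A minimal sketch of this sparse feature computation is given below; it substitutes a brute-force k-nearest-neighbor search for the KD-tree and assumes an isotropic length-scale matrix Σ = ℓ²I, both simplifications on our part.

```python
import numpy as np

def sparse_features(x, anchors, k=3, length_scale=0.5):
    """(n+m)-dimensional RBF feature vector phi(x), where only the k nearest
    anchor points contribute nonzero entries (sparsity for efficiency)."""
    d = anchors - x                               # difference vectors tj - x (sign cancels in the quadratic form)
    sq = np.einsum('ij,ij->i', d, d)              # squared distances to every anchor
    phi = np.zeros(len(anchors))
    nearest = np.argpartition(sq, k - 1)[:k]      # indices of the k nearest anchors
    phi[nearest] = np.exp(-0.5 * sq[nearest] / length_scale**2)  # isotropic Sigma = l^2 I
    return phi

anchors = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 5, 5]])
phi = sparse_features(np.array([0.1, 0.0, 0.0]), anchors, k=2)
print(np.count_nonzero(phi))  # -> 2 (only the two nearest anchors contribute)
```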

 

The weight vector w is learned by minimizing the loss function J, defined as the regularized negative log-likelihood of the form

 

J(w) = Σ_{t ∈ D} log(1 + exp(−y(t) w^T ϕ(t))) + λ R(w),

where λ is a tuning constant chosen by the user and R(w) is the regularization term. Thus, using stochastic gradient descent, the weight vector w is updated at each iteration i by stepping along the negative gradient of J

 

wi+1 = wi − η Ai ∇J(wi),

where η is the learning rate, and the matrix Ai is a pre-conditioner to accelerate the convergence rate. Ai has been set to the identity matrix, and the learning rate is set to a constant user-defined value in our experiments. Using a simple logistic regression classifier to represent the occupancy probabilities enables fast, online, real-time training.
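The resulting training loop reduces to plain logistic regression with SGD, as sketched below with Ai = I as in the paper; the toy one-dimensional features, ±1 labels, and hyperparameter values are illustrative assumptions of ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_train(Phi, y, lam=1e-3, eta=0.1, epochs=200, seed=0):
    """Online logistic-regression training of the Hilbert-map weights w by SGD.
    Labels y in {-1, +1}; L2 regularization; pre-conditioner A_i = identity."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            # gradient of log(1 + exp(-y_i w^T phi_i)) + lam * ||w||^2 w.r.t. w
            g = -y[i] * Phi[i] * sigmoid(-y[i] * (w @ Phi[i])) + 2 * lam * w
            w -= eta * g
    return w

# Toy 1-feature dataset: occupied anchors have positive feature values.
Phi = np.array([[1.0], [0.8], [-0.9], [-0.7]])
y = np.array([+1, +1, -1, -1])
w = sgd_train(Phi, y)
p = sigmoid(Phi @ w)         # learned occupancy probabilities P(o_h = 1 | .)
print((p > 0.5).tolist())    # -> [True, True, False, False]
```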

 

3.2 Generative model-based inference of predicted shape distribution 

 

PointFlow is a deep generative model that learns both the distribution of shapes and the distribution of points in 3D for a shape, given a dataset of 3D shapes of a category. The PFM is composed of a variational auto-encoder (VAE) and two continuous normalizing flows (CNFs). The VAE is trained to encode a point cloud into a latent space with its implicit shape representation, i.e., a latent variable y, by learning a posterior probability distribution Qϕ(y; Z) through the VAE with model parameters ϕ. Subsequently, a CNF Gθ(Z; y) with parameters θ, conditioned on the latent variable y, is trained to learn the posterior probability P(og(x) = 1 | y). Another CNF in the original PFM models the latent-space distribution of shapes; it is not used in this paper.

 

 

Figure 1: A schematic of the shape completion and measurement likelihood evaluation using the pre-trained PFM.

 

The CNF Gθ(Z; y) transforms a point q(t0) sampled from the prior distribution q(t0) ∼ N(0, I) to a point p in Z. This is mathematically represented as

 

p = Gθ[q(t0); y] = q(t0) + ∫_{t0}^{t1} gθ(q(t), t; y) dt,

where gθ defines the continuous-time dynamics of the normalizing flow Gθ conditioned on y, Gθ[q(t0); y] is the CNF with y as an input parameter, and t0 and t1 are the lower and upper bounds of integration over time. Therefore, with a trained PFM, the CNF conditioned on y can be used to generate point clouds by sampling from a known prior distribution. Measurements obtained from a specific aspect angle around the target often capture only a part of the target geometry. The trained PFM maps the measured point cloud Z to an estimated latent variable ŷ, which conditions the CNF to generate a complete predicted shape point cloud. This is often called 3D shape completion. This shape completion operation is highlighted using red-colored lines in Fig. 1.

 

Additionally, the invertibility of normalizing flows facilitates the estimation of the occupancy probability of any point x ∈ W given the estimated latent variable ŷ obtained from the measurement Z using the VAE Qϕ. This occupancy probability distribution of the predicted shape encoded by ŷ is defined, via the change-of-variables formula, as

 

log P(og(x) = 1 | ŷ) = log P(q(t0)) − ∫_{t0}^{t1} Tr(∂gθ/∂q(t)) dt, with q(t0) = Gθ^{-1}[x; ŷ],

Since the prior probability follows a standard multivariate Gaussian distribution, the predicted shape occupancy probability of the query point x is obtained analytically via the inverse CNF with ŷ and the prior P(q(t0)). A graphical illustration of the PointFlow-based approaches for shape completion and estimation of predicted shape occupancy probability distributions is shown in Fig. 1, where the pre-trained PointFlow model takes the measurement Z and a query point x as input, and outputs an estimated complete shape and the predicted shape occupancy probability distribution P(og(x) = 1 | ŷ).
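The exact-likelihood property exploited here can be illustrated with a one-dimensional affine flow standing in for the trained CNF Gθ; the flow parameters a and b below are arbitrary choices for illustration, and the density of a transformed point follows from the same change-of-variables rule.

```python
import numpy as np

def log_prior(q):
    """log N(q; 0, 1), the standard Gaussian prior of the flow."""
    return -0.5 * (q**2 + np.log(2 * np.pi))

# A toy invertible flow G: q -> x = a*q + b (stand-in for the trained CNF).
a, b = 2.0, 1.0
G_inv = lambda x: (x - b) / a

def log_px(x):
    """Exact log-density of x under the flow, by change of variables:
    log p(x) = log p(G^{-1}(x)) + log |d G^{-1}/dx|."""
    return log_prior(G_inv(x)) + np.log(abs(1.0 / a))

# Verify against the closed form: x is distributed as N(b, a^2).
x = 1.7
closed = -0.5 * (((x - b) / a)**2 + np.log(2 * np.pi * a**2))
print(np.isclose(log_px(x), closed))  # -> True
```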

 

3.3 Fusing Hilbert maps with measurement likelihood inference 

 

The probability of occupancy of a point x ∈ W as predicted by the Hilbert map and as predicted by the generative model are denoted by P(oh(x) = 1 | Z) and P(og(x) = 1 | Z, y), respectively. The joint probability of oh and og given the measurement set Z and the latent variable y generated by the VAE Qϕ is

 

P(oh(x) = 1, og(x) = 1 | Z, y) = P(oh(x) = 1 | Z) P(og(x) = 1 | y).

The joint probability can be rewritten in this form, as a product of the occupancy probabilities estimated by the Hilbert map and predicted by the generative model, due to the conditional independence imposed by y and Z. P(oh(x) = 1, og(x) = 1 | Z, y) is also called the fused occupancy probability, which encodes both information directly from the Hilbert map and the predicted shape information from the PFM.
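Numerically, the fusion reduces to an elementwise (Hadamard) product of the two probability fields at the same query points; the probability values below are invented solely to illustrate how an overconfident Hilbert-map entry in an occluded region is suppressed.

```python
import numpy as np

# Occupancy probabilities at the same grid of query points x (made-up values):
p_h = np.array([0.9, 0.8, 0.9, 0.1])  # Hilbert map P(o_h = 1 | Z); index 2 is an occluded false positive
p_g = np.array([0.9, 0.7, 0.2, 0.1])  # PFM P(o_g = 1 | y); low where the predicted shape is absent

p_fused = p_h * p_g                   # fused occupancy: elementwise (Hadamard) product
print(np.round(p_fused, 2).tolist())  # -> [0.81, 0.56, 0.18, 0.01]
```

The overconfident entry (0.9 at index 2) drops to 0.18 because the shape prior assigns it low probability, mirroring the behavior seen in the experiments.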

 

4 RESULTS 

 

The numerical experiment setup, as shown in Fig. 2, aims to examine the benefits of fusing the occupancy probabilities from the Hilbert map and the PFM. An underwater workspace with an airplane wreck is created in the HoloOcean simulation environment, from which forward-looking sonar (FLS) images are obtained from an FLS-equipped autonomous underwater vehicle (AUV). Then, the neural implicit surface reconstruction method [14] is used to extract the airplane's surface, from which the point cloud is sampled. This underwater setup provides key acoustic measurements to map the underwater environment around the AUV. A schematic combining the idea of fusing the Hilbert map and the PointFlow-predicted shape likelihood as the occupancy probability, along with the results, is shown in Fig. 3.

 

 

Figure 2: (a) An AUV equipped with forward-looking sonar takes (b) multiple measurements recorded as sonar images. (c) A batch of these consecutive sonar images generates the measured point cloud Z using the neural implicit surface reconstruction approach.

 

Numerical experiments based on an unseen aircraft 3D point cloud are used to test the proposed occupancy probability. The ground truth point cloud is sampled from a ShapeNet airplane model, illustrated in Fig. 2 (a), and we assume that the mobile sensor A observes the aircraft from the side such that half of the airplane is measured and stored in Z, see Fig. 2 (b). The anchor points D, i.e., the measurement along with the non-occupied points, are fed into the Hilbert map to learn an occupancy probability over the workspace, P(oh(x) = 1 | Z). The same measurement set and query points x ∈ X are input to the pre-trained PFM to obtain the predicted shape likelihood P(og(x) = 1 | ŷ). The idea of fusing the occupancy probabilities obtained from the Hilbert map and the PointFlow model is shown alongside the numerical experimental results in Fig. 3. Fig. 3 (a1) illustrates the evenly sampled occupancy probability from the Hilbert map, P(oh(x) = 1 | Z); Fig. 3 (a2) demonstrates the predicted shape likelihood P(og(x) = 1 | ŷ), queried at the same sampled points in the workspace; and Fig. 3 (b) shows the fused occupancy probability, denoted by the Hadamard product P(oh(x) = 1 | Z) ⊗ P(og(x) = 1 | ŷ).

 

In Fig. 3 (c), the overlay between the occupancy probability and the ground truth point cloud of the airplane is shown. One can observe that the Hilbert map result follows the measurement (the right half of the airplane in Fig. 3 (c2)) more faithfully than the PointFlow counterpart in Fig. 3 (d); e.g., higher occupancy probability is observed around the engine under the right wing. The Hilbert occupancy map obtained from the FLS-generated point cloud in Fig. 3 (d2) illustrates that the map accurately follows the observed part of the airplane. However, the Hilbert map also models the non-observable part of the scene with high occupancy probability; see the red-colored box in Fig. 3 (c2) and (d2). On the other hand, comparing Fig. 3 (c2) and (c3), the PFM's predicted shape likelihood provides an occupancy map closer to a complete plane without weighing the non-observable scene with high occupancy probability, attributed to its reconstruction capability. Moreover, since the PFM provides an occupancy map from the predicted shape instead of the measurement itself, the occupancy probability is robust to measurement noise: as one can see from the yellow squares in Fig. 3 (d3), the PFM ensures low occupancy where the erroneously measured points reside. However, since the predicted shape and its associated predicted shape likelihood depend on the implicit shape representation inferred from the VAE, the underlying shape does not necessarily represent the true shape but the most probable shape of the airplane. Therefore, P(og(x) = 1 | ŷ) does not represent the exact occupancy of the airplane under evaluation even though it provides an occupancy estimate of the full airplane. For example, the PFM deems the fuselage and engines less likely to be occupied.

 

After fusing these two occupancy probabilities through their Hadamard product, one can observe from Fig. 3 (e) that the fused occupancy (1) cancels out the overconfidence in the occluded part of the scene compared to P(oh(x) = 1 | Z) in Fig. 3 (c), and (2) improves the P(og(x) = 1 | ŷ) from reconstruction such that the occupancy probabilities around the engines and inside the fuselage are increased. Similarly, the area at the true plane location is modeled as more likely, comparing Fig. 3 (d4) to (d3) in the green boxes. Such a comparison qualitatively shows that the fused occupancy probability combines the highly faithful occupancy probability yielded by the observation-based Hilbert map with the shape prediction and completion capability of the pre-trained generative model from partial views.

 

5 CONCLUSION AND FUTURE WORK 

 

Hilbert maps can learn the occupancy with higher accuracy from the measurement Zk but cannot properly estimate the occupancy probability for the occluded part. On the contrary, normalizing flow-based PointFlow can predict the occupancy probabilities for the unobserved part of the target owing to its probabilistic shape completion capability. However, since this method maps the measurement to an implicit shape representation, the predicted shape occupancy probability represents a complete shape and loses explicit information about the points in the measurement set. Fusing these two approaches to estimate occupancy probabilities mitigates the deficiencies of both methods while leveraging their merits to endow robustness to measurement and localization noise, prediction capability for unobserved regions of the target geometry, and reconstruction accuracy of Hilbert maps near observed regions. Future work will extend this approach to multi-view target reconstruction and examine the reconstruction performance with various other types of acoustic sensors using synthetic and real-world data. 

 

 

Figure 3: A combined schematic with numerical experiment results from (a1) occupancy probability from Hilbert map, (a2) the occupancy probability from PointFlow model, and (b) the fused occupancy probability defined as the Hadamard product of the two occupancy probabilities.

 

REFERENCES

 

  1. A. Elfes. Occupancy grids: a probabilistic framework for robot perception and navigation [Ph.D. thesis], 1989.

  2. Johannes Strom and Edwin Olson. Occupancy grid rasterization in large environments for teams of robots. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4271–4276, 2011.

  3. Sang-Il Oh and Hang-Bong Kang. Fast occupancy grid filtering using grid cell clusters from lidar and stereo vision sensor data. IEEE Sensors Journal, 16(19):7258–7266, 2016.

  4. Armin Hornung, Kai M. Wurm, Maren Bennewitz, Cyrill Stachniss, and Wolfram Burgard. OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots, 34:189–206, 2013.

  5. Stanimir Dragiev, Marc Toussaint, and Michael Gienger. Gaussian process implicit surfaces for shape estimation and grasping. In 2011 IEEE International Conference on Robotics and Automation, pages 2845–2850. IEEE, 2011.

  6. Simon T. O’Callaghan and Fabio T. Ramos. Gaussian process occupancy maps. The International Journal of Robotics Research, 31(1):42–62, 2012.

  7. Fabio Ramos and Lionel Ott. Hilbert maps: Scalable continuous occupancy mapping with stochastic gradient descent. The International Journal of Robotics Research, 35(14):1717–1730, 2016.

  8. Vitor Guizilini and Fabio Ramos. Large-scale 3D scene reconstruction with Hilbert maps. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3247–3254, 2016.

  9. Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019.

  10. Haisong Liu, Yang Chen, Haiguang Wang, Zetong Yang, Tianyu Li, Jia Zeng, Li Chen, Hongyang Li, and Limin Wang. Fully sparse 3D occupancy prediction, 2024.

  11. Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 165–174, 2019.

  12. Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. PointFlow: 3D point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4541–4550, 2019.

  13. David A. Freedman. Statistical models: theory and practice. Cambridge University Press, 2009.

  14. Mohamad Qadri, Michael Kaess, and Ioannis Gkioulekas. Neural implicit surface reconstruction using imaging sonar. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 1040–1047. IEEE, 2023.