NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction
Abstract
This paper proposes a novel and fast self-supervised solution for sparse-view Cone Beam Computed Tomography (CBCT) reconstruction that requires no external training data. Specifically, the desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network. We synthesize projections discretely and train the network by minimizing the error between real and synthesized projections. A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details. This encoder outperforms the commonly used frequency-domain encoder in both accuracy and efficiency because it exploits the smoothness and sparsity of human organs. Experiments on both human organ and phantom datasets show that the proposed method achieves state-of-the-art accuracy within reasonably short computation time. Code is available at https://github.com/Ruyi-Zha/naf_cbct.
Keywords: CBCT · Sparse View · Implicit Neural Representation
1 Introduction
Cone Beam Computed Tomography (CBCT) is an emerging medical imaging technique for examining the internal structure of a subject noninvasively. A CBCT scanner emits cone-shaped X-ray beams and captures 2D projections at equal angular intervals. Compared with conventional Fan Beam CT (FBCT), CBCT enjoys the benefits of high spatial resolution and fast scanning speed [19]. Recent years have witnessed the blossoming of low-dose CT, which delivers a significantly lower radiation dose during scanning. There are two ways to reduce the dose: decreasing the source intensity or reducing the number of projection views [8]. This paper focuses on the latter, i.e., sparse-view CBCT reconstruction.
Sparse-view CBCT reconstruction aims to retrieve a volumetric attenuation coefficient field from dozens of projections. It is a challenging task in two respects. First, insufficient views lead to notable artifacts: whereas traditional CBCT obtains hundreds of images, sparse-view CBCT has roughly 10× fewer inputs. Second, the spatial and computational complexity of CBCT reconstruction is much higher than that of FBCT reconstruction due to the dimensional increase of the inputs. CBCT relies on 2D projections to build a 3D model, while FBCT simplifies the process by stacking 2D slices restored from 1D projections (at the cost of scanning time and dose).
Existing CBCT approaches fall into three categories: analytical, iterative and learning-based methods. Analytical methods estimate attenuation coefficients by solving the Radon transform and its inverse; a typical example is the FDK algorithm [7]. It produces good results in ideal scenarios but copes poorly with ill-posed problems such as sparse views. The second family, iterative methods, formulates reconstruction as a minimization process, combining an optimization framework with regularization modules. While iterative methods perform well on ill-posed problems [2, 20], they require substantial computation time and memory. Recently, learning-based methods have become popular with the rise of AI. They use deep neural networks to 1) predict and extrapolate projections [3, 22, 24, 28], 2) regress attenuation coefficients with the help of similar data [11, 27], and 3) make the optimization process differentiable [1, 6, 10]. Most of these methods [3, 11, 22, 27] need extensive datasets for network training. Moreover, they rely on neural networks to remember what a CT looks like, so a model trained for one application is difficult to transfer to another. While there are self-supervised methods [1, 28], they operate under FBCT settings owing to network capacity and memory constraints; their performance and efficiency drop when applied to the CBCT scenario.
Apart from the aforementioned work designed for CT reconstruction, efforts have been made to tackle other ill-posed problems, such as 3D reconstruction in computer vision. Similar to CT reconstruction, 3D reconstruction uses RGB images to estimate 3D shapes, which are usually represented as discrete point clouds or meshes. Recent studies [13, 16] propose Implicit Neural Representation (INR) as an alternative to those discrete representations. INR parameterizes a bounded scene as a neural network that maps spatial coordinates to metrics such as occupancy and color. With the help of a position encoder [14, 21], INR is capable of learning high-frequency details.
This paper proposes Neural Attenuation Fields (NAF), a fast self-supervised solution for sparse-view CBCT reconstruction. Here we use ‘self-supervised’ to highlight that NAF requires no external CT scans, only the X-ray projections of the object of interest. Inspired by 3D reconstruction work [13, 16], we parameterize the attenuation coefficient field as an INR and imitate the X-ray attenuation process with a self-supervised network pipeline. Specifically, we train a Multi-Layer Perceptron (MLP) whose input is an encoded spatial coordinate and whose output is the attenuation coefficient at that location. Instead of the common frequency-domain encoding, we adopt hash encoding [14], a learning-based position encoder, to help the network quickly learn high-frequency details. Projections are synthesized by predicting the attenuation coefficients of sampled points along ray trajectories and attenuating incident beams accordingly. The network is optimized by gradient descent, minimizing the error between real and synthesized projections. We demonstrate that NAF quantitatively and qualitatively outperforms existing solutions on both human organ and phantom datasets. While most INR approaches take hours to train, our method reconstructs a detailed CT model within 10-40 minutes, which is comparable to iterative methods.
In summary, the main contributions of this work are:
• We propose a novel and fast self-supervised method for sparse-view CBCT reconstruction. Neither external datasets nor structural priors are needed, only the projections of a subject.
• The proposed method achieves state-of-the-art accuracy with relatively short computation time. The performance and efficiency of our method make it feasible for clinical CT applications.
• The code is publicly available for research purposes.

2 Method
2.1 Pipeline
The pipeline of NAF is shown in Fig. 1. During a CBCT scan, an X-ray source rotates around the object and emits cone-shaped X-ray beams, and a 2D panel captures projections at equal angular intervals. NAF then uses the scanner geometry to imitate the attenuation process discretely, learning the CT shape by comparing real and synthesized projections. After model optimization, the final CT image is generated by querying the corresponding voxels.
NAF consists of four modules: ray sampling, position encoding, attenuation coefficient prediction, and projection synthesis. First, we uniformly sample points along X-ray paths based on the scanner geometry. A position encoder network then encodes their spatial coordinates to extract valuable features. After that, an MLP network consumes the encoded information and predicts attenuation coefficients. The last step of NAF is to synthesize projections by attenuating incident X-rays according to the predicted attenuation coefficients on their paths.
2.2 Neural attenuation fields
2.2.1 Ray sampling
Each pixel value of a projection image results from an X-ray passing through a cubical space and getting attenuated by the media inside. We sample points along the segments where rays intersect the cube. A stratified sampling method [13] is adopted: we divide a ray into $N$ evenly spaced bins and uniformly sample one point within each bin. Setting $N$ greater than the desired CT size ensures that at least one sample is assigned to every grid cell that an X-ray traverses. The coordinates of the sampled points are then sent to the position encoding module.
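For illustration, a minimal PyTorch sketch of this sampling step follows; the ray origins/directions and the entry/exit distances of the bounding cube (from a standard ray-box intersection) are assumed inputs, and all names are ours rather than the released implementation's.

```python
import torch

def stratified_sample(ray_o, ray_d, near, far, n_bins):
    """Divide each ray segment inside the volume into n_bins evenly
    spaced bins and uniformly sample one point per bin.
    ray_o, ray_d: (B, 3) ray origins and unit directions.
    near, far:   (B,) distances where rays enter/exit the bounding cube."""
    edges = torch.linspace(0.0, 1.0, n_bins + 1, device=ray_o.device)
    lo = near[:, None] + (far - near)[:, None] * edges[:-1]  # bin lower edges
    hi = near[:, None] + (far - near)[:, None] * edges[1:]   # bin upper edges
    t = lo + (hi - lo) * torch.rand_like(lo)                 # one sample per bin
    pts = ray_o[:, None, :] + t[..., None] * ray_d[:, None, :]  # (B, n_bins, 3)
    return pts, t
```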
2.2.2 Position encoding
A simple MLP can theoretically approximate any function [9]. Recent studies [18, 21], however, reveal that neural networks prefer to learn low-frequency details due to “spectral bias”. To address this, a position encoder is introduced to map 3D spatial coordinates to a higher-dimensional space.
A common choice is the frequency encoder proposed by Mildenhall et al. [13]. It decomposes a spatial coordinate into sets of sinusoidal components at different frequencies. While the frequency encoder eases network training, it is quite cumbersome: in medical imaging practice [26, 28], the size of the encoder output is set to 256 or greater, and the following network must be wider and deeper to cope with the inflated inputs. As a result, it takes hours to train millions of network parameters, which is not acceptable for fast CT reconstruction.
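For reference, a minimal sketch of such a frequency encoder (following [13]; the function name and the n_freqs parameter are our own) shows why the output grows quickly: each of the 3 coordinates expands into 2 components per frequency band, so the dimension grows linearly with the number of bands and soon reaches the 256+ sizes reported in [26, 28].

```python
import math
import torch

def frequency_encode(x, n_freqs):
    """NeRF-style frequency encoding [13]: map each coordinate to
    sin/cos components at n_freqs octaves. x: (..., 3)."""
    out = [x]
    for k in range(n_freqs):
        out.append(torch.sin((2.0 ** k) * math.pi * x))
        out.append(torch.cos((2.0 ** k) * math.pi * x))
    return torch.cat(out, dim=-1)  # (..., 3 + 6 * n_freqs)
```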
Frequency-domain encoding is a dense encoder in that it utilizes the entire frequency spectrum. Such dense encoding is redundant for CBCT reconstruction for two main reasons. First, a human body usually consists of several homogeneous media, such as muscles and bones. Attenuation coefficients remain approximately uniform inside one medium and vary only between different media, so high-frequency features are unnecessary except for points near edges. Second, natural objects favor smoothness. Many organs have simple shapes, such as spindles (muscle) or cylinders (bone), whose smooth surfaces can be easily learned with low-dimensional features.
To exploit the aforementioned characteristics of the scanned objects, we use the hash encoder [14], a learning-based sparse encoding solution. Its features are indexed with the spatial hash function

$$h(\mathbf{x}) = \Big(\bigoplus_{i=1}^{3} x_i \pi_i\Big) \bmod T, \qquad (1)$$

where $\bigoplus$ denotes the bit-wise XOR, $\pi_i$ are large prime numbers, and $T$ is the size of the hash table.
The hash encoder describes a bounded space with multiresolution voxel grids. A trainable feature lookup table of size $T$ is assigned to each voxel grid. At each resolution level, we 1) find the neighbouring corners (cubes with different colors in Fig. 1(b)) of the queried point $\mathbf{x}$, 2) look up their corresponding features with the hash function [23], and 3) generate a feature vector by linear interpolation. The output of the hash encoder is the concatenation of the feature vectors at all resolution levels. More details of the hash function and its symbols can be found in [14].
Compared with the frequency encoder, the hash encoder produces much smaller outputs with competitive feature quality, for two reasons. On the one hand, the many-to-one property of the hash function conforms to the sparse nature of human organs. On the other hand, a trainable encoder can learn to focus on relevant details and select a suitable frequency spectrum [14]. Thanks to the hash encoder, the subsequent network can be more compact.
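The single-level sketch below illustrates the lookup of Eq. (1); the real encoder [14] runs many such levels with fused CUDA kernels and concatenates the results, so this is a conceptual sketch only (primes from [23]; names and shapes are our assumptions).

```python
import torch

PRIMES = (1, 2654435761, 805459861)  # spatial-hash primes from [23]

def hash_grid_features(x, table, resolution):
    """Single-level hash-encoding lookup, assuming x in [0, 1]^3.
    table: (T, F) trainable feature table; resolution: grid size of this level."""
    xg = x * resolution
    x0 = torch.floor(xg).long()   # lower corner of the enclosing cell
    w = xg - x0.float()           # trilinear interpolation weights
    feat = 0.0
    for dz in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                c = x0 + torch.tensor([dx, dy, dz], device=x.device)
                # Eq. (1): XOR of coordinate-prime products, modulo table size
                idx = (c[..., 0] * PRIMES[0]
                       ^ c[..., 1] * PRIMES[1]
                       ^ c[..., 2] * PRIMES[2]) % table.shape[0]
                wc = ((w[..., 0] if dx else 1 - w[..., 0])
                      * (w[..., 1] if dy else 1 - w[..., 1])
                      * (w[..., 2] if dz else 1 - w[..., 2]))
                feat = feat + wc[..., None] * table[idx]
    return feat  # (..., F); the full encoding concatenates all levels
```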
2.2.3 Attenuation coefficient prediction
We represent the bounded field with a simple MLP, which takes an encoded spatial coordinate as input and outputs the attenuation coefficient at that position. As illustrated in Fig. 1(c), the network is composed of 4 fully-connected layers. The first three layers are 32 channels wide with ReLU activations in between, while the last layer has one neuron followed by a sigmoid activation. A skip connection is included to concatenate the network input to the second layer’s activation. By contrast, Zang et al. [28] use a 6-layer, 256-channel MLP to learn features from a frequency encoder; our network is substantially smaller.
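A direct PyTorch transcription of this architecture might look as follows; it is a sketch, and we read the skip connection as concatenating the encoded input with the first layer's activation before the second layer.

```python
import torch
import torch.nn as nn

class AttenuationMLP(nn.Module):
    """4-layer MLP of Sec. 2.2.3: three 32-channel layers with ReLU,
    a skip connection into the second layer, and a sigmoid scalar output."""
    def __init__(self, in_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, 32)
        self.fc2 = nn.Linear(32 + in_dim, 32)  # skip: re-inject encoded input
        self.fc3 = nn.Linear(32, 32)
        self.out = nn.Linear(32, 1)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = torch.relu(self.fc2(torch.cat([h, x], dim=-1)))
        h = torch.relu(self.fc3(h))
        return torch.sigmoid(self.out(h))  # attenuation coefficient in (0, 1)
```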
2.2.4 Attenuation synthesis
According to Beer’s Law, the intensity of an X-ray traversing matter is reduced by the exponential integration of attenuation coefficients on its path. We numerically synthesize the attenuation process with:
$$I_{syn}(r) = I_0 \exp\Big(-\sum_{i=1}^{N} \mu_i \delta_i\Big), \qquad (2)$$

where $I_0$ is the initial intensity, $\mu_i$ is the predicted attenuation coefficient at the $i$-th sampled point on ray $r$, and $\delta_i$ is the distance between adjacent points.
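Numerically, Eq. (2) is a single exponential of a weighted sum; a short sketch (names ours):

```python
import torch

def synthesize_projection(mu, deltas, i0=1.0):
    """Discrete Beer's law, Eq. (2): attenuate incident intensity i0 by
    predicted coefficients mu (B, N) over segment lengths deltas (B, N)."""
    return i0 * torch.exp(-(mu * deltas).sum(dim=-1))  # (B,)
```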
2.3 Model optimization and output
NAF is updated by minimizing the L2 loss between real and synthesized projections. The loss function is defined as:
$$\mathcal{L} = \frac{1}{|R|} \sum_{r \in R} \big\| I_{real}(r) - I_{syn}(r) \big\|_2^2, \qquad (3)$$

where $R$ is a ray batch, and $I_{real}(r)$ and $I_{syn}(r)$ are the real and synthesized projections for ray $r$, respectively. We update both the hash encoder and the attenuation coefficient network during training.
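Gluing the sketches above together, one training iteration of Eq. (3) could look like this hypothetical outline (not the released code):

```python
import torch

def train_step(encoder, mlp, optimizer, rays_o, rays_d, near, far,
               i_real, n_bins):
    """One gradient step on a batch of rays, using the sketches above."""
    pts, t = stratified_sample(rays_o, rays_d, near, far, n_bins)
    deltas = t[:, 1:] - t[:, :-1]
    deltas = torch.cat([deltas, deltas[:, -1:]], dim=-1)  # pad last segment
    mu = mlp(encoder(pts)).squeeze(-1)                    # (B, n_bins)
    i_pred = synthesize_projection(mu, deltas)
    loss = torch.mean((i_real - i_pred) ** 2)             # Eq. (3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```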
The final output is a discrete 3D matrix. We build a voxel grid of the desired size and pass the voxel coordinates to the trained MLP to predict the corresponding attenuation coefficients; a CT volume is thus restored.
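A sketch of this final query step (grid assumed normalized to [0, 1]^3; chunking avoids exhausting GPU memory):

```python
import torch

@torch.no_grad()
def extract_volume(encoder, mlp, size, chunk=65536):
    """Query the trained field on a size^3 voxel grid to restore the CT."""
    axis = (torch.arange(size).float() + 0.5) / size      # voxel centers
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    coords = grid.reshape(-1, 3)
    mu = torch.cat([mlp(encoder(c)).squeeze(-1) for c in coords.split(chunk)])
    return mu.reshape(size, size, size)
```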
3 Experiments
3.1 Experimental settings
3.1.1 Data
We conduct experiments on five datasets containing human organ and phantom data. Details are listed in Table 1.
Human organ: We evaluate our method on public human organ CT datasets [4, 12], including chest, jaw, foot and abdomen. The chest data are from the LIDC-IDRI dataset [4]; the rest are from the Open Scientific Visualization Datasets [12]. Since these datasets only provide volumetric CT scans, we generate projections with the tomographic toolbox TIGRE [5], capturing 50 projections with 3% noise over a 180° range. We train our model on these projections and evaluate its performance against the raw volumetric CT data.
Phantom: We collect a phantom dataset by scanning a silicon aortic phantom with a GE C-arm Medical System. The system captures 582 fluoroscopy projections of size 500×500, with the primary angle ranging from -103° to 93° and the secondary angle fixed at 0°. A 512×512×510 CT image is also generated with inbuilt algorithms as the ground truth. We only use 50 projections for experiments.
Dataset name | CT dimension | Scanning method | Scanning range | Number of projections | Detector resolution
Chest [4] | 128×128×128 | TIGRE [5] | 180° | 50 | 256×256
Jaw [12] | 256×256×256 | TIGRE [5] | 180° | 50 | 512×512
Foot [12] | 256×256×256 | TIGRE [5] | 180° | 50 | 512×512
Abdomen [12] | 512×512×463 | TIGRE [5] | 180° | 50 | 1024×1024
Aorta | 512×512×510 | GE C-arm | -103° to 93° | 50 (582) | 500×500
3.1.2 Baselines
We compare our approach with four baseline techniques. FDK [7] is chosen as a representative of analytical methods. The second method, SART [2], is a robust iterative reconstruction algorithm. ASD-POCS [20] is another iterative method, equipped with a total-variation regularizer. Finally, we implement a CBCT variant of IntraTomo [28], named IntraTomo3D, as an example of frequency-encoding deep learning methods.
3.1.3 Implementation details
Our proposed method is implemented in PyTorch [17]. We use the Adam optimizer with a learning rate that starts high and steps down during training. The batch size is 2048 rays at each iteration. The number of points sampled along each ray depends on the size of the CT data and, following Sec. 2.2.1, is set above the grid resolution (e.g., for the 128×128×128 chest CT). We use the same hyper-parameter settings for the hash encoder as [14]. More details of the hyper-parameters can be found in the supplementary material. All experiments are conducted on a single RTX 3090 GPU. We evaluate the five methods quantitatively in terms of peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [25]. PSNR (dB) statistically assesses artifact suppression, while SSIM measures the perceptual difference between two signals. Higher PSNR/SSIM values indicate more accurate reconstruction.
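For completeness, PSNR on volumes normalized to [0, 1] reduces to a one-liner (our helper, not part of the paper's evaluation code):

```python
import torch

def psnr(pred, gt):
    """PSNR in dB between reconstruction and ground truth in [0, 1]."""
    mse = torch.mean((pred - gt) ** 2)
    return (10.0 * torch.log10(1.0 / mse)).item()
```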
3.2 Results
3.2.1 Performance
Our method produces the best quantitative results on both the human organ and phantom datasets, as listed in Table 2. Both PSNR and SSIM values are significantly higher than those of the other methods. For example, the PSNR of our method on the abdomen dataset is 3.07 dB higher than that of the second-best method, SART.
We also provide visualization results of the different methods in Fig. 2. FDK restores low-quality models with notable artifacts, as analytical methods demand large numbers of projections.
Method | Chest | Jaw | Foot | Abdomen | Aorta
FDK [7] | 22.89/.78 | 28.59/.78 | 23.92/.58 | 22.39/.59 | 12.11/.21 |
SART [2] | 32.12/.95 | 32.67/.93 | 30.13/.93 | 31.38/.92 | 27.31/.77 |
ASD-POCS [20] | 29.78/.92 | 32.78/.93 | 28.67/.89 | 30.34/.91 | 27.30/.76 |
IntraTomo3D [28] | 31.94/.95 | 31.95/.91 | 31.43/.91 | 30.43/.90 | 29.38/.82 |
NAF (Ours) | 33.05/.96 | 34.14/.94 | 31.63/.94 | 34.45/.95 | 30.34/.88 |

The iterative method SART suppresses noise at the cost of losing certain details. The reconstruction results of ASD-POCS are heavily smeared because total-variation regularization encourages removing high-frequency content, including both unwanted noise and expected tiny structures. IntraTomo3D produces clean results; however, edges between media are slightly blurred, which indicates that the frequency encoder fails to make the network focus on edges. With the help of hash encoding, the results of the proposed NAF have the most details, the clearest edges and the fewest artifacts. Fig. 3 indicates that NAF outperforms the other methods in all slices of the reconstructed CT volume.
Figure 4 shows the performance of the iterative and learning-based methods under different numbers of views. The performance clearly increases with the number of input views. Our method achieves better results than the others under most circumstances.
3.2.2 Time
We record the running time of the iterative and learning-based methods, as shown in Fig. 5. All methods use CUDA [15] to accelerate computation. Overall, the methods spend less time on datasets with small projections (chest, jaw and foot) and increasingly more time on large datasets (abdomen and aorta). IntraTomo3D requires more than one hour to train its network. Benefiting from the compact network design, NAF spends running time similar to the iterative methods and is 3× faster than the frequency-encoding deep learning method IntraTomo3D.



4 Conclusion
This paper proposes NAF, a fast self-supervised learning-based solution for sparse-view CBCT reconstruction. Our method trains a fully-connected deep neural network that consumes a 3D spatial coordinate and outputs the attenuation coefficient at that location. NAF synthesizes projections by attenuating incident X-rays according to the predicted attenuation coefficients, and the network is updated by minimizing the projection error. We show that frequency encoding is not computationally efficient for tomographic reconstruction tasks; as an alternative, a learning-based encoder termed hash encoding is adopted to extract valuable features. Experimental results on human organ and phantom datasets indicate that the proposed method achieves significantly better results than the baselines within reasonably short computation time.
References
- [1] Adler, J., Öktem, O.: Learned primal-dual reconstruction. IEEE transactions on medical imaging 37(6), 1322–1332 (2018)
- [2] Andersen, A.H., Kak, A.C.: Simultaneous algebraic reconstruction technique (sart): a superior implementation of the art algorithm. Ultrasonic imaging 6(1), 81–94 (1984)
- [3] Anirudh, R., Kim, H., Thiagarajan, J.J., Mohan, K.A., Champley, K., Bremer, T.: Lose the views: Limited angle ct reconstruction via implicit sinogram completion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6343–6352 (2018)
- [4] Armato III, S.G., McLennan, G., Bidaut, L., McNitt-Gray, M.F., Meyer, C.R., Reeves, A.P., Zhao, B., Aberle, D.R., Henschke, C.I., Hoffman, E.A., et al.: The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Medical physics 38(2), 915–931 (2011)
- [5] Biguri, A., Dosanjh, M., Hancock, S., Soleimani, M.: Tigre: a matlab-gpu toolbox for cbct image reconstruction. Biomedical Physics & Engineering Express 2(5), 055010 (2016)
- [6] Chen, H., Zhang, Y., Zhang, W., Sun, H., Liao, P., He, K., Zhou, J., Wang, G.: Learn: Learned experts’ assessment-based reconstruction network for sparse-data ct. arXiv preprint arXiv:1707.09636 (2017)
- [7] Feldkamp, L.A., Davis, L.C., Kress, J.W.: Practical cone-beam algorithm. JOSA A 1(6), 612–619 (1984)
- [8] Gao, Y., Bian, Z., Huang, J., Zhang, Y., Niu, S., Feng, Q., Chen, W., Liang, Z., Ma, J.: Low-dose x-ray computed tomography image reconstruction with a combined low-mas and sparse-view protocol. Optics express 22(12), 15190–15210 (2014)
- [9] Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural networks 2(5), 359–366 (1989)
- [10] Kang, E., Chang, W., Yoo, J., Ye, J.C.: Deep convolutional framelet denosing for low-dose ct via wavelet residual network. IEEE transactions on medical imaging 37(6), 1358–1369 (2018)
- [11] Kasten, Y., Doktofsky, D., Kovler, I.: End-to-end convolutional neural network for 3d reconstruction of knee bones from bi-planar x-ray images. In: International Workshop on Machine Learning for Medical Image Reconstruction. pp. 123–133. Springer (2020)
- [12] Klacansky, P.: Open scientific visualization datasets (2022), https://klacansky.com/open-scivis-datasets/
- [13] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: European conference on computer vision. pp. 405–421. Springer (2020)
- [14] Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. arXiv:2201.05989 (Jan 2022)
- [15] NVIDIA, Vingelmann, P., Fitzek, F.H.: Cuda, release: 10.2.89 (2020), https://developer.nvidia.com/cuda-toolkit
- [16] Park, J.J., Florence, P., Straub, J., Newcombe, R., Lovegrove, S.: Deepsdf: Learning continuous signed distance functions for shape representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 165–174 (2019)
- [17] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019), http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
- [18] Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., Courville, A.: On the spectral bias of neural networks. In: International Conference on Machine Learning. pp. 5301–5310. PMLR (2019)
- [19] Scarfe, W.C., Farman, A.G., Sukovic, P., et al.: Clinical applications of cone-beam computed tomography in dental practice. Journal-Canadian Dental Association 72(1), 75 (2006)
- [20] Sidky, E.Y., Pan, X.: Image reconstruction in circular cone-beam computed tomography by constrained, total-variation minimization. Physics in Medicine & Biology 53(17), 4777 (2008)
- [21] Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J., Ng, R.: Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems 33, 7537–7547 (2020)
- [22] Tang, C., Zhang, W., Li, Z., Cai, A., Wang, L., Li, L., Liang, N., Yan, B.: Projection super-resolution based on convolutional neural network for computed tomography. In: 15th International Meeting on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine. vol. 11072, p. 1107233. International Society for Optics and Photonics (2019)
- [23] Teschner, M., Heidelberger, B., Müller, M., Pomerantes, D., Gross, M.H.: Optimized spatial hashing for collision detection of deformable objects. In: Vmv. vol. 3, pp. 47–54 (2003)
- [24] Wang, C., Zhang, H., Li, Q., Shang, K., Lyu, Y., Dong, B., Zhou, S.K.: Improving generalizability in limited-angle ct reconstruction with sinogram extrapolation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 86–96. Springer (2021)
- [25] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)
- [26] Wu, Q., Li, Y., Xu, L., Feng, R., Wei, H., Yang, Q., Yu, B., Liu, X., Yu, J., Zhang, Y.: Irem: High-resolution magnetic resonance image reconstruction via implicit neural representation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 65–74. Springer (2021)
- [27] Ying, X., Guo, H., Ma, K., Wu, J., Weng, Z., Zheng, Y.: X2ct-gan: reconstructing ct from biplanar x-rays with generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10619–10628 (2019)
- [28] Zang, G., Idoughi, R., Li, R., Wonka, P., Heidrich, W.: Intratomo: Self-supervised learning-based tomography via sinogram synthesis and prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1960–1970 (2021)