Noisy Neonatal Chest Sound Separation for High-Quality Heart and Lung Sounds

E. GROOBY, C. SITAULA, D. FATTAHI, R. SAMENI, K. TAN, L. ZHOU, A. KING, A. RAMANATHAN, A. MALHOTRA, G.A. DUMONT, AND F. MARZBANRAD

E. Grooby acknowledges the support of the MIME-Monash Partners-CSIRO sponsored PhD research support program and Research Training Program (RTP). A. Malhotra's research is supported by the Kathleen Tinsley Trust and a Cerebral Palsy Alliance Research Grant. The study is supported by Monash Institute of Medical Engineering (MIME).

E. Grooby, C. Sitaula and F. Marzbanrad are with the Department of Electrical and Computer Systems Engineering, Monash University, Melbourne, VIC, Australia. D. Fattahi is with Shiraz University, Shiraz, Iran. R. Sameni is with the Department of Biomedical Informatics, Emory University, Atlanta, GA, USA. K. Tan, L. Zhou, A. King, A. Ramanathan and A. Malhotra are with Monash Newborn, Monash Children’s Hospital and Department of Paediatrics, Monash University, Melbourne, Australia. G. Dumont is with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada and with the BC Children’s Hospital Research Institute, Vancouver, BC, Canada. Email: ethan.grooby@monash.edu

Abstract

Stethoscope-recorded chest sounds provide the opportunity for remote cardio-respiratory health monitoring of neonates. However, reliable monitoring requires high-quality heart and lung sounds. This paper presents novel Non-negative Matrix Factorisation (NMF) and Non-negative Matrix Co-Factorisation (NMCF) methods for neonatal chest sound separation. To assess these methods and compare them with existing single-source separation methods, an artificial mixture dataset comprising heart, lung and noise sounds was generated. Signal-to-noise ratios were then calculated for these artificial mixtures. These methods were also tested on real-world noisy neonatal chest sounds and assessed based on vital sign estimation error and a signal quality score of 1-5 developed in our previous works. Additionally, the computational cost of all methods was assessed to determine their applicability for real-time processing. Overall, both the proposed NMF and NMCF methods outperform the next best existing method by 2.7 dB to 11.6 dB for the artificial dataset and by 0.40 to 1.12 signal quality improvement for the real-world dataset. The median processing time for the sound separation of a 10 s recording was found to be 28.3 s for NMCF and 342 ms for NMF. Because of their stable and robust performance, we believe that our proposed methods are useful for denoising neonatal heart and lung sounds in a real-world environment. Code for the proposed and existing methods can be found at: https://github.com/egrooby-monash/Heart-and-Lung-Sound-Separation.

Index Terms:

Single-source separation, breath sound, heart sound, lung sound, neonatal monitoring, phonocardiogram (PCG), signal quality, sound separation, telehealth.

TABLE I: Existing Blind Single-Source Separation Methods.

Method (letter reference*)	Description
Adaptive Fourier Decomposition (c)	Based on energy distribution, adaptive Fourier decomposition is used to clean identified heart sound S1 and S2 peaks [1, 2, 3]. Assigning the top 5, 10, 15, 20 and 25 decomposition levels as heart sounds was tested.
Adaptive Line Enhancement (d)	Applies adaptive filtering on the original and time-delayed recording to extract semi-periodic components (heart sounds) [4]. Time delay values of 1, 10, 40, 100, 200 and 400 samples are tested.
Empirical Mode Decomposition (e)	Decomposes signal into a sum of oscillatory functions called intrinsic mode functions (IMFs). For each IMF, S1 and S2 peaks are identified and filtered to obtain heart and lung components [5]. Empirical mode decomposition, ensemble empirical mode decomposition and complete ensemble empirical mode decomposition with adaptive noise are tested.
Filtering (f)	4th-order Butterworth bandpass filters with passband frequencies 50-250 Hz and 200-1000 Hz were used to obtain heart and lung sounds, respectively [3, 6].
Interpolation (g)	Identified heart sound S1 and S2 peaks are removed in the time-frequency domain either by a 20-300 Hz bandstop filter or by complete elimination of these sections [7, 2, 3]. Interpolation is then performed to recover the lung sounds in the removed segments [7].
Modulation Filtering (h)	Involves bandpass and bandstop filtering of temporal trajectories of short-term spectral components to obtain heart and lung sounds respectively [8]. Filter ranges tested were 1-20 Hz, 2-20 Hz, 3-20 Hz, 4-20 Hz, 5-20 Hz and 6-20 Hz.
NMF Clustering 1 and 2 (i and j)	Both methods blindly decompose the mixture into numerous sub-components and then cluster all components into either heart or lung based on spectral or temporal criteria [9, 10, 11].
Recursive Least Squares Adaptive Filtering (k)	Identified heart sound S1 and S2 peaks used to create reference heart sound [12, 2, 3]. The recursive least squares filter uses original recording and the reference heart sound to obtain clean heart and lung sounds [12].
Singular Spectrum Analysis (l)	Decomposes signal into principal components using singular value decomposition [13]. Principal components associated with top eigenvector pairs with the strongest frequency component less than 250 Hz are assigned as heart sounds, and the remaining are assigned as lung sound [13, 14].
Wavelet Transform based Stationary Non-Stationary Filter (m)	Wavelet thresholding is used to separate stationary sounds (lung sounds) from non-stationary sounds (heart sounds) using an adaptive threshold with adjusting multiplicative factor x [15, 16]. Adjusting multiplicative factors of 2, 2.5, 2.7, 3, 3.5 and 4 were tested.
Wavelet Decomposition and Singular Spectrum Analysis (n)	Wavelet decomposition is performed to obtain 0-500 Hz signal. Singular spectrum analysis is then performed to obtain just heart sounds [14].
Non-Local Means	Identifies and averages similar heart and lung sound components [17]. Due to the high computational cost of the method on our dataset, we did not implement this method in this study.
Autoencoder	Encoder decomposes signal and temporal clustering is performed to identify heart and lung components. These components are placed into the decoder to obtain heart and lung sounds [18]. This method was not implemented as the autoencoder needs to be retrained successfully and properly with a large and diverse set of neonatal chest sound recordings instead of adult chest sounds.
*Letter reference is used in Section IV, Figure 3 and Table III as shorthand for method names.

I Introduction

Accurate and timely assessment for signs of serious health problems such as cardio-respiratory diseases is an essential requirement to provide care to newborns [19]. Recording chest sounds with a stethoscope is a common and simple method to obtain such information. In recent times, the availability of digital stethoscopes for neonates has attracted several studies [20, 21, 22]. Existing digital stethoscopes can connect with smartphones for mobile health and can enable remote healthcare through telehealth. However, a higher level of noise in neonatal intensive care in comparison to adult and paediatric wards has resulted in poor quality chest sound recordings and inaccurate assessment. For instance, estimation of heart rate and breathing rate from low-quality signals is error-prone [3, 6].

Noise interference in chest sounds can be broken up into four groups. Firstly, there are external noises, such as crying, stethoscope movement, talking and general background noise in neonatal intensive care [23]. Secondly, while heart and lung sounds are both diagnostically important, they act as noise sources for one another. Thirdly, there are other internal body sounds, such as bowel sounds, gastric reflux, and air swallowing. Finally, neonates, particularly those born earlier than 32 weeks, can experience respiratory-related conditions requiring respiratory support such as high flow, ventilator or bubble Continuous Positive Airway Pressure (CPAP), which are major sources of interference. Overall, it is essential to reduce these noises and separate heart and lung sounds before any assessment and diagnosis.

Denoising and sound separation methods to obtain high-quality heart and lung sounds can be broken up into multi-channel and single-channel methods. Multi-channel methods use a reference signal, such as an electrocardiogram, a secondary microphone placed to capture external noise, and/or a secondary chest recording [24]. Overall, these methods require additional sensors, which are not always accessible or feasible to implement.

For single-source sound separation and denoising, current methods have proven only partially effective. Table I summarises existing single-source sound separation methods and Section IV compares them. Many of the methods utilise heart sound segmentation to obtain S1 and S2 peaks. As shown in our past works, heart sound segmentation accuracy drops significantly for low-quality recordings [3, 6]. Overall, this means many of these methods are only effective in low-noise scenarios. Lung sound separation has proven difficult because the wide frequency band of lung sounds overlaps with that of heart and noise sources.

Another limitation of past works is that they rely on adult-based parameters, which are not suitable for the separation of neonatal chest sounds. To address this, existing methods that rely on heart sound segmentation utilise a modified version suitable for neonates, which was developed in our past work [3]. Additionally, instead of a single-value parameter being utilised, a set of relevant parameters suitable for neonatal chest sound separation is considered and tested, as highlighted in Table I. Finally, for singular spectrum analysis (l) specifically, an additional constraint is added: only top eigenvalue pairs whose strongest frequency component is less than 250 Hz are assigned as heart components. This constraint was added to address the misclassification of lung components as heart, as lung sounds also contain semi-periodic components.

This paper presents a new approach, based on Non-negative Matrix Factorisation (NMF) and Non-negative Matrix Co-Factorisation (NMCF), to obtain high-quality heart and lung sounds. The model is trained with high-quality heart, lung and noise sound examples. In NMF this training occurs beforehand, whereas in NMCF this training occurs in parallel with separating sounds from the noisy recording into heart, lung and noise. The training set enables the utilisation of more detailed information about the frequency and temporal aspects of heart, lung and noise. For NMCF, the parallel training and separation of sounds from the noisy recording additionally enable the model to adapt more specifically to that recording. Overall, this method enables higher quality lung and heart sounds to be generated for analysis.

A preliminary version of the NMCF work has been reported, which utilised high-quality heart and lung sounds, with no reference noise sound examples [9]. Initial results on a real-world dataset showed it was superior to existing NMF methods [9].

Four key contributions are presented in this paper. First, existing single-source denoising and heart and lung sound separation methods originally developed for adults were adapted and implemented on newborn chest sound recordings. Second, new NMF and NMCF approaches focused on obtaining high-quality heart and lung sounds specifically for the newborn population are proposed. Third, we incorporate a noise component in the NMF and NMCF models, to separate the sounds into not only heart and lung sounds but also noise sounds. Finally, the methods are assessed using artificial and real-world noisy neonatal chest sounds with heart and lung signal quality, and heart and breathing rate accuracy.

The rest of this paper is organised as follows. Section II provides the background of proposed NMF and NMCF methods. Section III presents a detailed implementation of the methods and their evaluation. Results and discussion are provided in Sections IV and V, respectively. Section VI concludes the whole work.

II Non-Negative Matrix Factorisation and Non-Negative Matrix Co-Factorisation

NMF decomposes a given non-negative matrix $V\in\Re^{F\times T}_{+}$ into two non-negative matrices $W\in\Re^{F\times K}_{+}$ and $H\in\Re^{K\times T}_{+}$ (1), where $K<\min(F,T)$ and E represents the reconstruction error between V and WH.

\begin{split}V&=WH+E\\ \end{split}

(1)

In denoising and sound separation, V represents the magnitude of the time-frequency representation of the recording mixture [9].

The basis matrix W contains the basis column vectors $w_{1}$ to $w_{K}$ that represent the spectral patterns of different types of signal sources (e.g. heart, lung and noise) or their sub-components. The activation matrix H contains the temporal activation row vectors $h_{1}$ to $h_{K}$ that represent when the signal sources occur in each time frame. These sub-components can be combined such that the first set of components (1 to $b_{h}$ ), second set of components ( $b_{h}+1$ to $b_{h}+b_{l}$ ) and third set of components ( $b_{h}+b_{l}+1$ to $b_{h}+b_{l}+b_{n}$ ) represent heart ( $V_{h}=W_{h}H_{h}$ ), lung ( $V_{l}=W_{l}H_{l}$ ) and noise ( $V_{n}=W_{n}H_{n}$ ) respectively [9].
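The grouping of components into heart, lung and noise blocks can be illustrated with a minimal NumPy sketch. The matrix sizes, the random values and the helper name `split_sources` are assumptions made for illustration only and are not taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T = 6, 8                  # toy number of frequency bins and time frames
b_h, b_l, b_n = 3, 2, 2      # illustrative numbers of heart, lung and noise bases
K = b_h + b_l + b_n

W = rng.random((F, K))       # basis matrix: columns are spectral patterns
H = rng.random((K, T))       # activation matrix: rows are temporal activations

def split_sources(W, H, b_h, b_l, b_n):
    """Hypothetical helper: rebuild V_h, V_l, V_n from consecutive component blocks."""
    edges = [0, b_h, b_h + b_l, b_h + b_l + b_n]
    return tuple(W[:, a:b] @ H[a:b, :] for a, b in zip(edges[:-1], edges[1:]))

V_h, V_l, V_n = split_sources(W, H, b_h, b_l, b_n)
```

Because the three blocks partition the K components, the three reconstructions sum to the full product WH.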

For supervised and semi-supervised NMF, the basis matrix W is optimised with reference heart, lung and noise sounds during the training phase. The basis matrix W is then fixed and H is optimised with the noisy mixture recording during the test phase. In NMCF, instead of having a training and test phase, the basis matrix W is optimised simultaneously with sound separation. This method enables more efficient sound separation as the mixture recording can also contribute to the training of W [9].

However, as obtaining pure heart or lung sounds is not feasible, we propose a modified version of NMF and NMCF (equations (2) and (3), respectively) as described in Algorithm 1, and Figure 1. In this version, datasets of high-quality heart, lung, and noise sounds are used in the cost function to enable the generalisation of $W_{h}$ , $W_{l}$ and $W_{n}$ respectively for the sound separation. Note, these reference datasets are generalisable and are not obtained from the same subject as the noisy mixture recording that is being denoised.

In both the proposed NMF and NMCF equations (2) and (3), the matrices W and H are optimised by minimising the cost function D (4). $D_{\beta}$ refers to $\beta$ -divergence cost function, with the most popular values being $\beta=0,1,2$ [9]. A sparsity penalty on the activation matrix H is calculated based on the L1-norm of H, and $\mu$ controls the importance of the sparsity constraint. The sparsity penalty enables more detailed decomposition both temporally and spectrally while ensuring only a small set of meaningful basis vectors are active at a single time frame [25].

For proposed NMF (2) we have introduced a cost function that utilises datasets of clean heart, lung and noise sounds that are obtained from different subjects than the noisy mixture recording. This differs from past work which either requires simultaneous reference recordings from the same subject or relies on blind decomposition [24, 9].

\begin{split}W_{h}&=\min\limits_{W_{h},H_{h}}(\sum_{ih=1}^{eh}D(V_{h}^{(ih)}|\hat{W_{h}}H_{h}^{(ih)}))\\ W_{l}&=\min\limits_{W_{l},H_{l}}(\sum_{il=1}^{el}D(V_{l}^{(il)}|\hat{W_{l}}H_{l}^{(il)}))\\ W_{n}&=\min\limits_{W_{n},H_{n}}(\sum_{in=1}^{en}D(V_{n}^{(in)}|\hat{W_{n}}H_{n}^{(in)}))\\ H_{m}&=\min\limits_{W_{un},H_{m}}(D(V_{m}|\hat{W}H_{m}))\\ Where:&\quad H_{m}=[H_{mh};H_{ml};H_{mn};H_{mun}],\\ &\hat{W}=[\hat{W}_{h},\hat{W}_{l},\hat{W}_{n},\hat{W}_{un}]\\ \end{split}

(2)

While for proposed NMCF (3), building on our past work, we have introduced a supervised noise component [9]. The weighting factors $\lambda_{h}$ , $\lambda_{l}$ and $\lambda_{n}$ represent the level of co-factorisation and are treated as hyperparameters. Additionally, an unsupervised component $W_{un}$ is added to deal with the large variety of noises that are not covered, therefore avoiding these components being assigned to the heart or lung components.

\begin{split}W,H_{m}&=\min\limits_{W,H_{m},H_{h},H_{l},H_{n}}(\lambda_{h}D(V_{m}|\hat{W_{h}}H_{mh})\\ &+\lambda_{l}D(V_{m}|\hat{W_{l}}H_{ml})+\lambda_{n}D(V_{m}|\hat{W_{n}}H_{mn})\\ &+D(V_{m}|\hat{W_{un}}H_{mun})\\ &+\frac{1}{eh}\sum_{ih=1}^{eh}D(V_{h}^{(ih)}|\hat{W_{h}}H_{h}^{(ih)})\\ &+\frac{1}{el}\sum_{il=1}^{el}D(V_{l}^{(il)}|\hat{W_{l}}H_{l}^{(il)})\\ &+\frac{1}{en}\sum_{in=1}^{en}D(V_{n}^{(in)}|\hat{W_{n}}H_{n}^{(in)}))\\ Where:&\quad H_{m}=[H_{mh};H_{ml};H_{mn};H_{mun}],\\ &\hat{W}=[\hat{W}_{h},\hat{W}_{l},\hat{W}_{n},\hat{W}_{un}]\\ \end{split}

(3)

\begin{split}Where:&\quad D(V|\hat{W}H)=D_{\beta}(V|\hat{W}H)+\mu||H||_{1},\\ Where:&\quad\hat{W}=[\frac{w_{1}}{||w_{1}||},\frac{w_{2}}{||w_{2}||},...,\frac{w_{K}}{||w_{K}||}]\end{split}

(4)
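As a rough illustration of the cost in (4), the sketch below evaluates the $\beta$-divergence plus the L1 sparsity penalty with NumPy. It assumes the standard definitions of the $\beta$-divergence for $\beta=0$ (Itakura-Saito), $\beta=1$ (Kullback-Leibler) and the general case, and it assumes the column normalisation in (4) uses the Euclidean norm, which the text does not state. The function names and the `EPS` guard are illustrative.

```python
import numpy as np

EPS = 1e-12  # guard against log(0) and division by zero (an added safeguard)

def beta_divergence(V, V_hat, beta):
    """Sum of element-wise beta-divergences between V and its estimate V_hat."""
    V = np.maximum(V, EPS)
    V_hat = np.maximum(V_hat, EPS)
    if beta == 0:            # Itakura-Saito divergence
        r = V / V_hat
        return float(np.sum(r - np.log(r) - 1))
    if beta == 1:            # Kullback-Leibler divergence
        return float(np.sum(V * np.log(V / V_hat) - V + V_hat))
    # General case; beta = 2 gives half the squared Euclidean distance
    return float(np.sum((V ** beta + (beta - 1) * V_hat ** beta
                         - beta * V * V_hat ** (beta - 1)) / (beta * (beta - 1))))

def normalise_columns(W):
    """W_hat: each basis vector divided by its (assumed Euclidean) norm."""
    return W / np.maximum(np.linalg.norm(W, axis=0, keepdims=True), EPS)

def cost(V, W, H, beta=1, mu=0.1):
    """D(V | W_hat H) = D_beta(V | W_hat H) + mu * ||H||_1."""
    W_hat = normalise_columns(W)
    return beta_divergence(V, W_hat @ H, beta) + mu * float(np.sum(np.abs(H)))
```

With a perfect reconstruction the divergence term vanishes, so the cost reduces to the sparsity penalty alone.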

Based on the cost function in (4), the multiplicative update rules for W and H are shown in (5) and (6) respectively. Note that division and $\otimes$ refer to element-wise division and multiplication.

Figure 1: Proposed NMCF Flowchart

\begin{split}W&\leftarrow\hat{W}\otimes\\ &\frac{(\Lambda^{\beta-2}\otimes V)H^{T}+\hat{W}\otimes(11^{T}(\hat{W}\otimes(\Lambda^{\beta-1}H^{T})))}{\Lambda^{\beta-1}H^{T}+\hat{W}\otimes(11^{T}(\hat{W}\otimes((\Lambda^{\beta-2}\otimes V)H^{T})))}\\ Where:&\quad\Lambda=WH,\quad 1=\text{length F column vector of ones}\\ W&\leftarrow W\otimes\frac{W_{num}(V,W,H)}{W_{dem}(V,W,H)}\end{split}

(5)

\begin{split}H&\leftarrow H\otimes\frac{\hat{W}^{T}(V\otimes\Lambda^{\beta-2})}{\hat{W}^{T}\Lambda^{\beta-1}+\mu},\Lambda=WH\\ H&\leftarrow H\otimes\frac{H_{num}(V,W,H)}{H_{dem}(V,W,H)}\end{split}

(6)
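The activation update in (6) can be sketched in a few lines of NumPy. This is a minimal illustration with the normalised basis matrix held fixed, as in the test phase of supervised NMF. The random toy data, the matrix sizes, the `EPS` guard and the helper `kl_cost` (the $\beta=1$ case of the cost) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

EPS = 1e-12  # numerical guard, not part of the update rule itself

def update_H(V, W_hat, H, beta=1.0, mu=0.1):
    """One multiplicative update of H following (6), with W_hat held fixed."""
    Lam = np.maximum(W_hat @ H, EPS)                 # current reconstruction
    num = W_hat.T @ (V * Lam ** (beta - 2))
    den = W_hat.T @ Lam ** (beta - 1) + mu           # mu is the sparsity weight
    return H * num / np.maximum(den, EPS)

def kl_cost(V, W_hat, H, mu):
    """Cost for beta = 1: KL divergence plus L1 penalty on H."""
    Lam = np.maximum(W_hat @ H, EPS)
    return float(np.sum(V * np.log(np.maximum(V, EPS) / Lam) - V + Lam) + mu * H.sum())

rng = np.random.default_rng(0)
F, K, T = 20, 4, 30                                  # toy sizes
W_hat = rng.random((F, K))
W_hat /= np.linalg.norm(W_hat, axis=0, keepdims=True)
V = rng.random((F, T)) + 0.1                         # stand-in magnitude spectrogram
H = rng.random((K, T)) + 0.1

costs = [kl_cost(V, W_hat, H, mu=0.1)]
for _ in range(50):
    H = update_H(V, W_hat, H, beta=1.0, mu=0.1)
    costs.append(kl_cost(V, W_hat, H, mu=0.1))
```

Since every factor in the update is non-negative, H stays non-negative without any explicit projection, and the recorded cost is expected to fall over the iterations.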

Algorithm 1 Proposed NMCF

V_{m},Phase=stft(audio_{m})

V_{h}^{(ih)}=stft(audio_{h}^{(ih)})

V_{l}^{(il)}=stft(audio_{l}^{(il)})

V_{n}^{(in)}=stft(audio_{n}^{(in)})

init\quad H_{mh},H_{ml},H_{mn},H_{mun},H_{h}^{(ih)},H_{l}^{(il)},H_{n}^{(in)}

init\quad\hat{W}_{h},\hat{W}_{l},\hat{W}_{n},\hat{W}_{un}

W=[\hat{W}_{h},\hat{W}_{l},\hat{W}_{n},\hat{W}_{un}]

H_{m}=[H_{mh};H_{ml};H_{mn};H_{mun}]

for i=1:maxiter do

H_{m}\leftarrow H_{m}\otimes\frac{H_{num}(V_{m},W,H_{m})}{H_{dem}(V_{m},W,H_{m})}

H_{h}^{(ih)}\leftarrow H_{h}^{(ih)}\otimes\frac{H_{num}(V_{h}^{(ih)},W_{h},H_{h}^{(ih)})}{H_{dem}(V_{h}^{(ih)},W_{h},H_{h}^{(ih)})}

H_{l}^{(il)}\leftarrow H_{l}^{(il)}\otimes\frac{H_{num}(V_{l}^{(il)},W_{l},H_{l}^{(il)})}{H_{dem}(V_{l}^{(il)},W_{l},H_{l}^{(il)})}

H_{n}^{(in)}\leftarrow H_{n}^{(in)}\otimes\frac{H_{num}(V_{n}^{(in)},W_{n},H_{n}^{(in)})}{H_{dem}(V_{n}^{(in)},W_{n},H_{n}^{(in)})}

W_{h}\leftarrow W_{h}\otimes\frac{\lambda_{h}W_{num}(V_{m},W_{h},H_{mh})+\frac{1}{eh}\sum_{ih=1}^{eh}W_{num}(V_{h}^{(ih)},W_{h},H_{h}^{(ih)})}{\lambda_{h}W_{dem}(V_{m},W_{h},H_{mh})+\frac{1}{eh}\sum_{ih=1}^{eh}W_{dem}(V_{h}^{(ih)},W_{h},H_{h}^{(ih)})}

W_{h}=normalisation(W_{h})

W_{l}\leftarrow W_{l}\otimes\frac{\lambda_{l}W_{num}(V_{m},W_{l},H_{ml})+\frac{1}{el}\sum_{il=1}^{el}W_{num}(V_{l}^{(il)},W_{l},H_{l}^{(il)})}{\lambda_{l}W_{dem}(V_{m},W_{l},H_{ml})+\frac{1}{el}\sum_{il=1}^{el}W_{dem}(V_{l}^{(il)},W_{l},H_{l}^{(il)})}

W_{l}=normalisation(W_{l})

W_{n}\leftarrow W_{n}\otimes\frac{\lambda_{n}W_{num}(V_{m},W_{n},H_{mn})+\frac{1}{en}\sum_{in=1}^{en}W_{num}(V_{n}^{(in)},W_{n},H_{n}^{(in)})}{\lambda_{n}W_{dem}(V_{m},W_{n},H_{mn})+\frac{1}{en}\sum_{in=1}^{en}W_{dem}(V_{n}^{(in)},W_{n},H_{n}^{(in)})}

W_{n}=normalisation(W_{n})

W_{un}\leftarrow W_{un}\otimes\frac{W_{num}(V_{m},W_{un},H_{mun})}{W_{dem}(V_{m},W_{un},H_{mun})}

W_{un}=normalisation(W_{un})

end for

mask_{h}=\frac{\hat{W}_{h}H_{mh}}{WH_{m}}

mask_{l}=\frac{\hat{W}_{l}H_{ml}}{WH_{m}}

V_{h}=V_{m}\otimes mask_{h}

V_{l}=V_{m}\otimes mask_{l}

audio_{heart}=istft(V_{h},Phase)

audio_{lung}=istft(V_{l},Phase)
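The masking step at the end of Algorithm 1 can be sketched as follows. This is a minimal NumPy illustration on random stand-in matrices; the sizes, the helper name `soft_masks` and the additional noise mask (which Algorithm 1 does not compute) are assumptions made for illustration. The inverse STFT step is omitted.

```python
import numpy as np

EPS = 1e-12  # avoids division by zero in empty time-frequency bins

def soft_masks(W_parts, H_parts):
    """Ratio masks: each source reconstruction divided by the full reconstruction."""
    recon = [W @ H for W, H in zip(W_parts, H_parts)]
    total = np.maximum(sum(recon), EPS)
    return [r / total for r in recon]

rng = np.random.default_rng(1)
F, T = 5, 7                                          # toy sizes
n_bases = (3, 3, 2)                                  # heart, lung, noise (illustrative)
W_parts = [rng.random((F, k)) for k in n_bases]
H_parts = [rng.random((k, T)) for k in n_bases]
V_m = rng.random((F, T))                             # stand-in mixture magnitude

mask_h, mask_l, mask_n = soft_masks(W_parts, H_parts)
V_h = V_m * mask_h                                   # element-wise product
V_l = V_m * mask_l
```

Because the masks sum to one in every time-frequency bin, the mixture magnitude is shared between the sources rather than reconstructed from the model, which keeps the separated spectrograms bounded by the observed mixture.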

III Methods

III-A Data Acquisition and Preprocessing

The study was conducted at Monash Newborn, Monash Children’s Hospital. It was approved by the Monash Health Human Research Ethics Committee (HREA/18/MonH/471). Recordings were obtained from the right anterior chest of preterm and term newborns using a digital stethoscope [21, 22, 6]. A subset of these recordings had synchronous vital signs from electrocardiogram, providing the reference heart and breathing rates used in Section III-D2. The breakdown of the recordings is shown in Table II.

TABLE II: Data Breakdown

Type	Details
Without Respiratory Support	318 60 s recordings
With Respiratory Support	79 60 s recordings
Without Respiratory Support with Synchronous Vital Signs	22 60 s recordings
With Respiratory Support with Synchronous Vital Signs	9 60 s recordings

III-B Reference Sounds

Reference heart, lung, crying, stethoscope movement, and respiratory support sounds were required for two purposes. Firstly, several existing methods and the proposed NMF and NMCF methods required reference high-quality sounds to enable sound separation either through training or comparison during the sound separation procedure itself. Secondly, these reference sounds were used to create artificial mixtures to enable evaluation of the sound separation methods as described in Section III-D1. Example reference sounds are shown in Figure 2.

Heart and lung sounds were obtained from the recordings of newborns without respiratory support. As pure heart and lung sounds are required to construct an artificial dataset, these recordings were filtered with 4th-order Butterworth bandpass filters with passband frequencies of 50-250 Hz and 200-1000 Hz to separate heart and lung sounds, respectively [3, 24].

The filtered recordings were then annotated by 3 clinicians and 4 electrical engineers familiar with biomedical auscultation for heart and lung signal quality on a 5-level scale. The score of 1 referred to only noisy and hardly detectable heartbeats/breathing periods, and 5 referred to clear heart/lung sounds with little to no noise. Mean annotated scores of 4 and above were then assessed visually and through audio, and only the recordings with strong heart/lung sounds and little to no noise were chosen as reference signals. In total, 17 signals (9 subjects) with 7 (3 subjects) having synchronous heart rate remained for reference heart sounds and 9 signals (7 subjects) with no synchronous breathing rate remained for reference lung sounds.

To obtain the cry sounds, the cry detection algorithm implemented in previous work was used [3]. Regions containing at least 10 s of crying were then determined and extracted. These segments were then filtered with a 2nd-order Butterworth high-pass filter with a cutoff of 300 Hz to remove heart sounds. Finally, regions not containing crying, such as inhalation and other lung sounds, were replaced with zeros. In total, 42 signals (41 subjects) were obtained.

Two clinicians and 1 electrical engineer familiar with biomedical auscultation manually annotated recordings for presence and volume (low, medium or high) of stethoscope movement and respiratory support noise. Stethoscope movement noise can typically be characterised as a short (<2 s) burst of noise. These sections of stethoscope movement were isolated and classified as either disconnection or rubbing noise. Disconnection noise contains only the stethoscope movement sound and neither heart nor lung sounds, as the stethoscope has disconnected from the newborn’s chest. For rubbing noise, in contrast, the stethoscope is still in contact with the newborn’s chest and the recording can potentially still contain heart and lung sounds. In total, 18 signals (12 subjects) of varying lengths were obtained. To ensure all signals were of length 10 s, zeros were appended to the beginning and end of the signals. The split of the zeros between the beginning and end of each signal was randomly determined, to ensure that stethoscope movement occurs at a random position in the 10 s signals.
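The random zero-padding described above can be sketched with the Python standard library. The sampling rate of 4000 Hz, the toy burst values and the function name `pad_to_length` are assumptions for illustration; the text does not state how the random split was drawn, so a uniform draw is used here.

```python
import random

def pad_to_length(signal, target_len, rng=random):
    """Zero-pad `signal` to `target_len` samples, splitting the padding at random."""
    n_pad = target_len - len(signal)
    if n_pad < 0:
        raise ValueError("signal is longer than the target length")
    n_before = rng.randint(0, n_pad)        # randint is inclusive on both ends
    return [0.0] * n_before + list(signal) + [0.0] * (n_pad - n_before)

fs = 4000                                   # assumed sampling rate in Hz
burst = [0.5, -0.3, 0.8, -0.1]              # stand-in for a short movement burst
padded = pad_to_length(burst, 10 * fs, random.Random(0))
```

Passing a seeded `random.Random` instance keeps the placement reproducible, which is convenient when regenerating the same set of artificial mixtures.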

Recordings annotated with a high level of respiratory support noise and next to no presence of other sounds were chosen. From this, 2 signals (1 subject) of bubble CPAP and 5 signals (1 subject) of ventilator CPAP were obtained. Given the difficulty of obtaining pure support sounds from chest sound recordings, pure respiratory support sounds were also collected by placing the digital stethoscope on the respiratory support machine or tubing. An additional 3 signals for both bubble and ventilator CPAP were obtained from the respiratory support machine itself.

III-C Implementation

For both proposed NMCF and NMF, 20 bases each are used for heart, lung and all noise sounds. For real-world sounds, an unsupervised noise component with 10 bases is also added. At most 10 examples of heart, lung and noise sounds were used, with 100 iterations of the multiplicative updates of the activation and basis matrices.

The following parameters for the proposed NMCF and NMF were tested:

• Time-Frequency Representation:
  – Gammatone filterbank [26] with a frequency range of 50-2000 Hz and frequency bin size of either 128, 256, 512 or 1028.
  – Q-transform (log-frequency) with 64 frequency bins per octave.
  – Short-Time Fourier Transform (STFT) with Fast Fourier Transform (FFT) Size = 2048 samples, Window Size = 512 samples and Hop Size = 256 samples similar to previous work [9], or, FFT Size = 1024 samples, Window Size = 512 samples and Hop Size = 256 samples.
• Sparsity penalty ( $\mu$ ): 0, 0.01 and 0.1
• Beta loss ( $\beta$ ): 0, 1 and 2
• NMCF $\lambda_{h}$ , $\lambda_{l}$ and $\lambda_{n}$ : 0, 0.25, 0.5, 0.75 and 1

For the artificial dataset, subject-wise cross-validation was used to generate 7 folds of artificial mixtures. One fold was used for hyperparameter optimisation for the proposed NMCF and NMF methods and existing sound separation methods. The remaining six folds were used for evaluation of the sound separation, with results shown in Figures 3(a) and 3(b).
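A subject-wise split keeps all recordings from one subject in the same fold, so that no subject contributes to both tuning and evaluation. The sketch below shows one simple way to do this with a round-robin assignment. The text does not describe how subjects were allocated to folds, so the assignment scheme, the toy recording list and the function name `subject_wise_folds` are all assumptions.

```python
from collections import defaultdict

def subject_wise_folds(recordings, n_folds):
    """Assign whole subjects to folds round-robin; returns a list of folds."""
    by_subject = defaultdict(list)
    for rec_id, subject in recordings:
        by_subject[subject].append(rec_id)
    folds = [[] for _ in range(n_folds)]
    for i, subject in enumerate(sorted(by_subject)):
        folds[i % n_folds].extend(by_subject[subject])
    return folds

# Toy data: 17 recordings spread over 9 hypothetical subjects
recordings = [(f"rec{i}", f"subj{i % 9}") for i in range(17)]
folds = subject_wise_folds(recordings, n_folds=7)

tuning_fold = folds[0]            # used for hyperparameter optimisation
evaluation_folds = folds[1:]      # remaining six folds used for evaluation
```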

Based on the hyperparameter optimisation results, STFT with FFT size=1024 samples, window size=512 samples and hop size=256 samples, sparsity penalty=0.1 and beta loss=1 were used for both proposed NMCF and NMF methods to separate the real-world dataset recordings. Additionally, $\lambda_{h}$ =0, $\lambda_{l}$ =0 and $\lambda_{n}$ =0.25 (Stethoscope Movement Noise, Ventilator CPAP Noise) and $\lambda_{n}$ =0.75 (Cry Noise, Bubble CPAP Noise) were used for the proposed NMCF method.

III-D Performance Evaluation

III-D1 Artificial Mixtures Evaluation

Using the reference sounds obtained in Section III-B, artificial mixtures are generated. The most common method in literature for generating chest sound mixtures is the simple addition of all reference sounds, referred to as an instantaneous mixture (7). However, as chest sounds are not simple instantaneous mixtures of heart, lung, and noise sounds, a convolutive mixture model was also adopted (8) [13]. To achieve convolutive mixing, three random finite-impulse response (FIR) filters of length 4 were generated to mix heart ( $a_{heart}$ ), lung ( $a_{lung}$ ) and noise ( $a_{noise}$ ). The FIR filter coefficients $a_{heart}$ , $a_{lung}$ and $a_{noise}$ are scaled to have a magnitude of 1.

s_{mixture}(t)=s_{heart}(t)+s_{lung}(t)+s_{noise}(t)\\

(7)

\begin{split}s_{mixture}(t)=\sum_{k=0}^{3}(&a_{heart}(k)s_{heart}(t-k)+\\ &a_{lung}(k)s_{lung}(t-k)+\\ &a_{noise}(k)s_{noise}(t-k))\\ \end{split}

(8)
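The convolutive mixture in (8) can be sketched in plain Python as follows. The sketch assumes that "magnitude of 1" means unit Euclidean norm of each filter, that coefficients are drawn uniformly from [-1, 1], and that samples before the start of a signal are zero; none of these details are stated in the text, and the function names and toy signals are illustrative.

```python
import math
import random

def unit_norm_fir(length, rng):
    """Random FIR coefficients scaled to unit Euclidean norm (assumed meaning)."""
    a = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    norm = math.sqrt(sum(c * c for c in a))
    return [c / norm for c in a]

def fir_filter(a, s):
    """y(t) = sum_k a(k) s(t-k), taking s(t) = 0 for t < 0."""
    return [sum(a[k] * s[t - k] for k in range(len(a)) if t - k >= 0)
            for t in range(len(s))]

def convolutive_mixture(sources, filters):
    """Filter each source with its own FIR filter and add the results."""
    filtered = [fir_filter(a, s) for a, s in zip(filters, sources)]
    return [sum(vals) for vals in zip(*filtered)]

rng = random.Random(0)
heart = [1.0, 0.0, 0.0, 0.0, 0.0]           # toy impulse signals
lung = [0.0, 0.0, 1.0, 0.0, 0.0]
noise = [0.0, 0.0, 0.0, 0.0, 0.0]
filters = [unit_norm_fir(4, rng) for _ in range(3)]
mixture = convolutive_mixture([heart, lung, noise], filters)
```

With the filters set to [1, 0, 0, 0], the same function reduces to the instantaneous mixture in (7).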

Before mixing heart, lung and noise sounds, they were scaled to achieve the desired signal-to-noise ratio. Heart sounds were first scaled to achieve a heart-to-lung sounds ratio of -10, -5, 0, 5, 10, 15 and 20 dB (9). Once scaled, heart and lung sounds were combined and then scaled to achieve a chest-to-noise sounds ratio of -10, -5, 0, 5 and 10 dB (10).

s_{chest}(t)=10^{factor/20}*s_{heart}(t)+s_{lung}(t)\\

(9)

s_{mixture}(t)=10^{factor/20}*s_{chest}(t)+s_{noise}(t)\\

(10)
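The scaling in (9) and (10) can be illustrated with a short sketch. It assumes that the ratio is a power ratio in dB and that the scaling factor is the difference between the target ratio and the current ratio of the unscaled signals; the text does not spell out how the factor is computed. The toy signals and function names are illustrative.

```python
import math

def power(s):
    """Mean power of a signal."""
    return sum(x * x for x in s) / len(s)

def ratio_db(a, b):
    """Power ratio of signal a to signal b in dB."""
    return 10.0 * math.log10(power(a) / power(b))

def scale_to_ratio(a, b, target_db):
    """Scale `a` so that its power ratio to `b` equals `target_db`."""
    factor = target_db - ratio_db(a, b)      # dB gain still needed (assumption)
    gain = 10.0 ** (factor / 20.0)
    return [gain * x for x in a]

heart = [0.2, -0.1, 0.4, -0.3]               # toy signals
lung = [0.05, 0.02, -0.04, 0.01]
noise = [0.3, -0.2, 0.1, -0.4]

heart_scaled = scale_to_ratio(heart, lung, target_db=5.0)       # as in (9)
chest = [h + l for h, l in zip(heart_scaled, lung)]
chest_scaled = scale_to_ratio(chest, noise, target_db=-5.0)     # as in (10)
mixture = [c + n for c, n in zip(chest_scaled, noise)]
```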

Using the generated artificial mixtures and the reference sounds used to create these mixtures, several signal quality metrics can be calculated using the blind source separation evaluation toolbox [27, 28]. With this toolbox, estimated heart, lung and noise sounds are decomposed into 4 components, namely: true reference sound ( $s_{target}$ ), interference noise ( $e_{inter}$ ), additive noise ( $e_{noise}$ ) and algorithmic artifact noise ( $e_{artif}$ ). Once decomposed, signal-to-distortion ratio (SDR (11)), signal-to-interference ratio (SIR (12)) and scale-invariant signal-to-distortion ratio (SI-SDR (11)) are calculated. Both SDR and SI-SDR are overall metrics of signal quality. The difference between SDR and SI-SDR is that SDR uses a full 512-tap FIR filter whereas SI-SDR uses a single coefficient to account for allowable scaling discrepancies between estimated separated sounds and reference sounds [27, 29]. Therefore, unlike SDR, SI-SDR harshly penalises temporal distortions and is only suitable for the evaluation of instantaneous mixtures.

SDR=10log_{10}\frac{||s_{target}(t)||^{2}}{||e_{inter}(t)+e_{noise}(t)+e_{artif}(t)||^{2}}\\

(11)

SIR=10\log_{10}\frac{||s_{target}(t)||^{2}}{||e_{inter}(t)||^{2}}\\

(12)
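Once an estimate has been decomposed into its four components, (11) and (12) reduce to simple energy ratios. The sketch below evaluates them on hand-picked toy vectors. The decomposition itself is performed by the evaluation toolbox and is not reproduced here, and the function names are illustrative.

```python
import math

def energy(s):
    """Squared Euclidean norm of a signal."""
    return sum(x * x for x in s)

def sdr_db(s_target, e_inter, e_noise, e_artif):
    """Signal-to-distortion ratio as in (11)."""
    err = [a + b + c for a, b, c in zip(e_inter, e_noise, e_artif)]
    return 10.0 * math.log10(energy(s_target) / energy(err))

def sir_db(s_target, e_inter):
    """Signal-to-interference ratio as in (12)."""
    return 10.0 * math.log10(energy(s_target) / energy(e_inter))

# Toy components: target energy 25, interference energy 1, total error energy 5
s_target = [3.0, 4.0]
e_inter = [1.0, 0.0]
e_noise = [0.0, 0.0]
e_artif = [0.0, 2.0]

sdr = sdr_db(s_target, e_inter, e_noise, e_artif)   # 10*log10(25/5), about 6.99 dB
sir = sir_db(s_target, e_inter)                     # 10*log10(25/1), about 13.98 dB
```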

III-D2 Heart Rate and Breathing Rate Error

A goal of obtaining high-quality heart and lung sounds is to achieve accurate heart and breathing rate estimates. These vital sign estimates are essential in cardio-respiratory health assessment, to enable proper clinical care to be determined and provided [30, 31].

For the heart audio recordings, heart rate in beats per minute was estimated for each second with a sliding window of 3 s. Heart rate was calculated using the modified version of the method by Springer et al. [2] as proposed in our past work [6].

For lung audio recordings, breathing rate in breaths per minute was estimated every second with a sliding window of 6 s. For breathing rate estimation, power spectral envelope is calculated for the frequency range 300-450 Hz and then peak detection is performed [3].

III-D3 Signal Quality Assessment

An automated signal quality assessment method to classify real-world heart and lung signal quality on a 5-level scale was developed in our previous works [3, 6]. A score of 1 referred to only noisy and hardly detectable heartbeats/breathing periods, and 5 referred to clear heart/lung sounds with little to no noise.

To calculate the signal quality, up to 15 features for heart signal quality, and up to 20 features for lung signal quality, were extracted from chest sound recordings and used as input into the regression classifier. Features extracted included statistical features (variance, skewness, and kurtosis), predictive fitting coefficients, heart and lung segmentation quality and agreement, Mel-frequency coefficients, wavelet, entropy and power.

The dataset used for training was a subset of the no respiratory support recordings. In total the regression classifier was trained on 206 recordings from 97 subjects for lung sound quality estimation and 223 recordings from 92 subjects for heart sound quality estimation. Reference quality annotations were provided by 3 clinicians and 4 electrical engineers familiar with biomedical auscultation [3, 6]. Note that these recordings are from different subjects than the reference sounds used for training the sound separation methods and creation of the artificial dataset in Section III-B.

III-D4 Real-Time Analysis

The median time for chest sound separation of an example 10 s recording was calculated using MATLAB 2021a on a MacBook Pro with a 2.3 GHz 8-Core Intel i9 CPU.

Computation cost per 10 s recording is shown in Table III. The proposed NMCF method takes a median of 28.2 s and 28.3 s with and without supervised decomposition components respectively. These computational times make the proposed NMCF method not suitable for real-time processing. For the proposed NMF method, the median computational time of 275 ms and 342 ms with and without supervised decomposition components are observed. As the computational times are less than 400 ms, the proposed NMF method is suitable for real-time processing using the stated laptop specifications [9]. For existing methods, adaptive Fourier decomposition (c), adaptive line enhancement (d), filtering (f), interpolation (g) and modulation filtering (h) are suitable for real-time processing using the stated laptop specifications.

III-D5 Statistical Analysis

Statistical tests were performed to determine whether the proposed NMF and NMCF methods significantly outperform existing methods. Using the Jarque-Bera test, the artificial dataset signal quality improvement results were found not to be normally distributed. Therefore, median values are reported and a one-sided Wilcoxon signed-rank test was used to test significance in Section IV. Similarly, vital sign estimation error results were not normally distributed, and thus a one-sided Wilcoxon signed-rank test was used to test significance. However, as median results were predominantly zero, mean and standard deviation results are shown in Table III as they are more informative. In contrast, signal quality improvement values for the real-world dataset were normally distributed. Therefore, mean and standard deviation values are reported in Table III and a one-sided t-test was used to test significance in Section IV.
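The test-selection logic described above can be sketched as follows. This is an illustrative Python sketch using SciPy, not the authors' analysis code. It assumes that results are paired per recording, that normality is checked on the paired differences, and that the t-test is the paired variant; the function name `one_sided_comparison` and the 0.05 significance level are likewise assumptions.

```python
import numpy as np
from scipy import stats


def one_sided_comparison(proposed, existing, alpha=0.05):
    """Test whether `proposed` scores are greater than paired `existing` scores.

    Assumption: normality is assessed on the paired differences with the
    Jarque-Bera test; the text does not state which samples were tested.
    """
    proposed = np.asarray(proposed, dtype=float)
    existing = np.asarray(existing, dtype=float)
    diff = proposed - existing

    _, p_normal = stats.jarque_bera(diff)
    if p_normal < alpha:
        # Normality rejected: report the median, use the signed-rank test.
        summary = float(np.median(diff))
        p_value = stats.wilcoxon(proposed, existing, alternative="greater").pvalue
        test = "wilcoxon"
    else:
        # Normality not rejected: report the mean, use a paired t-test.
        summary = float(np.mean(diff))
        p_value = stats.ttest_rel(proposed, existing, alternative="greater").pvalue
        test = "t-test"
    return {"test": test, "summary": summary, "p_value": float(p_value)}
```

Here `alternative="greater"` makes both tests one-sided, so a small p-value supports the claim that the proposed method yields larger improvements than the comparison method.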

IV Results

Figure 3 and Table III show the artificial and real-world chest sound separation results respectively. Methods (a), (a un), (b), (b un) are proposed NMCF, NMCF with Unsupervised Components, NMF and NMF with Unsupervised components methods respectively. Methods (c) to (n) are existing methods, as specified in Table I.

Figures 3(a) and 3(b) show the artificial mixture sound separation results for heart and lung sounds, as detailed in Section III-D1. Overall for heart and lung sound separation, both the proposed NMCF (a) and NMF (b) methods significantly outperform existing methods in all situations, with median SDR improvement ranging from 2.7 dB for respiratory support noise to 11.6 dB for general noise compared to the next best existing method.

For heart sound separation, in the no noise case, the proposed NMCF (a) and NMF (b) methods outperform all existing methods except adaptive line enhancement (d), which has a median SDR improvement of 1.5 dB and 0.8 dB over the proposed methods respectively. However, adaptive line enhancement (d) produces minor temporal distortions in the separated heart sound, resulting in SI-SDR values that are significantly lower than those of the proposed NMCF (a) and NMF (b) methods, by 18.1 dB and 17.9 dB respectively. For the general noise case, the proposed NMCF (a) and NMF (b) methods outperform the next best existing method in median SDR improvement by 4.3 dB and 4.0 dB for cry noise and by 1.7 dB and 2.1 dB for stethoscope movement noise. Similarly, for the respiratory support noise case, the proposed NMCF (a) and NMF (b) methods outperform the next best existing method by 1.8 dB and 1.4 dB for bubble CPAP noise and by 1.8 dB and 1.0 dB for ventilator CPAP noise.

For lung sound separation, in the no noise case, the proposed NMCF (a) and NMF (b) methods outperformed the next best existing method by 0.9 dB and 0.8 dB (not significant, p-value = 0.055). This increases to 3.1 dB and 1.4 dB for hard-to-separate situations (lung SNR less than -10 dB). For the general noise case, the proposed NMCF (a) and NMF (b) methods outperform the next best existing method by 8.4 dB and 6.9 dB for cry noise and by 0.3 dB and 0.7 dB for stethoscope movement noise. For high stethoscope movement noise (SNR greater than 0 dB), median SDR improvement increases to 0.5 dB and 0.9 dB. For the respiratory support noise case, the proposed NMCF (a) and NMF (b) methods outperform the next best existing method by 0.6 dB and 0.0 dB (not significant, p-value = 0.29) for bubble CPAP noise and by 1.0 dB and 0.4 dB for ventilator CPAP noise. For high respiratory support noise (SNR greater than 0 dB), median SDR improvement increases to 1 dB and 0.1 dB (not significant, p-value = 0.25) for bubble CPAP noise and to 1.2 dB and 0.9 dB for ventilator CPAP noise.

As neither the proposed NMCF (a) nor NMF (b) method produces temporal distortions, their SI-SDR results are comparable with their SDR results. Both methods significantly outperform existing methods in all situations, with median SI-SDR improvement ranging from 5.2 dB for respiratory support noise to 9.9 dB for general noise compared to the next best existing method. Considering heart sound separation and lung sound separation individually, both proposed methods again significantly outperformed all existing methods in all situations for SI-SDR improvement.
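For readers unfamiliar with the metric, SI-SDR as defined in [29] allows only a single gain to be applied to the reference before the residual is measured, which is why temporal distortions in an estimate lower it. The following Python sketch follows that definition; treating the "improvement" as the difference between the metric of the separated sound and that of the unprocessed mixture is an assumption made here for illustration, and the function names are hypothetical.

```python
import numpy as np


def si_sdr(reference, estimate, eps=1e-12):
    """Scale-invariant SDR in dB, following the definition in [29].

    The reference is rescaled by the optimal gain `alpha`; everything in
    the estimate that this scaled reference cannot explain is residual.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    )


def si_sdr_improvement(reference, mixture, estimate):
    """SI-SDR of the separated sound minus SI-SDR of the raw mixture.

    Assumption: improvement is measured relative to the unprocessed mixture.
    """
    return si_sdr(reference, estimate) - si_sdr(reference, mixture)
```

Because the gain is optimised away, rescaling an estimate leaves its SI-SDR unchanged, whereas a time shift or filtering of the estimate cannot be compensated and appears as residual energy.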

For SIR results, both proposed methods significantly outperform existing methods in all situations, with median SIR improvement ranging from 3.8 dB for respiratory support noise to 8.2 dB for general noise compared to the next best existing method. For heart sound separation, empirical mode decomposition (e) significantly outperformed both proposed methods in the no noise situation and was comparable for the general noise case. For lung sound separation, the filtering (f) method was comparable to the proposed NMF method for no noise case, but was significantly inferior for hard to separate situations (lung SNR less than -10 dB). For all other scenarios, both proposed methods significantly outperformed existing methods for heart and lung sound separation.

For existing methods, the best parameters based on artificial dataset SDR values are as follows:

• Adaptive Fourier Decomposition (c): Decomposition level of 5 (Cry Noise) and 10 (No Noise, Stethoscope Movement Noise, Respiratory Support Noise)

• Adaptive Line Enhancement (d): Delay of 1 (No Noise, Respiratory Support Noise), 10 (Stethoscope Movement Noise) and 50 (Cry Noise)

• Empirical Mode Decomposition (e): Decomposition using ensemble empirical mode decomposition (All Cases)

• Interpolation (g): Remove entire segments containing heart sounds and interpolate (All Cases)

• Modulation Filtering (h): Bandstop of 3-20 Hz (All Cases) for heart sounds and bandpass of 4-20 Hz (No Noise, Ventilator CPAP and General Noise) and 6-20 Hz (Bubble CPAP Noise) for lung sounds

• Wavelet Transform Based Stationary Non-Stationary Filter (m): Adaptive threshold of 3 (No Noise, Respiratory Support Noise, Cry Noise) and 3.5 (Stethoscope Movement Noise)

For proposed NMCF (a) and NMF (b) methods, the best parameters based on artificial dataset SDR values are as follows:

• Time-Frequency Representation = STFT, FFT Size = 1024 samples, Window Size = 512 samples and Hop Size = 256 samples (NMCF All Cases, NMF No Noise, General Noise, Bubble CPAP Noise). Time-Frequency Representation = STFT, FFT Size = 2048 samples, Window Size = 2048 samples and Hop Size = 512 samples (NMF Ventilator CPAP Noise)

• Beta loss = 1 (All Cases)

• Sparsity penalty = 0.1 (NMCF All Cases, NMF Stethoscope Movement Noise and Respiratory Support Noise), 0 (NMF No Noise and Cry Noise)

• NMCF $\lambda_{h}$ = 0.25 (No Noise), 0 (General Noise, Respiratory Support Noise)

• NMCF $\lambda_{l}$ = 0 (All Cases)

• NMCF $\lambda_{n}$ = 0.25 (Stethoscope Movement Noise, Ventilator CPAP Noise), 0.75 (Cry Noise, Bubble CPAP Noise)
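To make the parameter list above concrete, the following Python sketch shows how an STFT with these settings and a fixed-basis NMF with beta loss = 1 (Kullback-Leibler divergence) and a sparsity penalty could be set up. It is a simplified illustration under stated assumptions, not the released MATLAB implementation: the sampling rate of 4000 Hz, the Hann window, the plain L1-penalised multiplicative update and all function names are assumptions.

```python
import numpy as np
from scipy.signal import stft


def magnitude_stft(x, fs=4000, n_fft=1024, win=512, hop=256):
    """Magnitude spectrogram with FFT size 1024, window 512 and hop 256.

    Assumptions: fs = 4000 Hz and a Hann window (neither is stated above).
    """
    _, _, Z = stft(x, fs=fs, window="hann", nperseg=win,
                   noverlap=win - hop, nfft=n_fft)
    return np.abs(Z)


def kl_activations(V, W, sparsity=0.1, n_iter=100, eps=1e-12, seed=0):
    """Estimate activations H >= 0 such that V ~ W @ H, with W held fixed.

    Uses the standard multiplicative update for the KL divergence
    (beta = 1) with an L1 penalty `sparsity` on H. This is a generic
    textbook update, assumed here for illustration.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    denom = W.sum(axis=0)[:, None] + sparsity
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / denom
    return H


def source_masks(W, H, groups, eps=1e-12):
    """Soft masks per source; `groups` maps a name to a slice of components."""
    WH = W @ H + eps
    return {name: (W[:, idx] @ H[idx]) / WH for name, idx in groups.items()}
```

In use, each mask would be multiplied with the complex mixture STFT and inverted to obtain the corresponding time-domain sound, for example with `groups = {"heart": slice(0, k), "lung": slice(k, 2 * k), "noise": slice(2 * k, 3 * k)}` for a hypothetical number of bases `k` per source.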

Table III shows the real-world chest sound separation results, as detailed in Sections III-D2 and III-D3. Overall for heart and lung sound separation, all proposed methods significantly outperform existing methods in all situations. Mean signal quality improvement ranged from 0.40-0.63 for no respiratory support case and 0.74-1.12 for respiratory support case compared to the next best existing method.

For heart sound separation, all proposed methods outperform existing methods. Mean signal quality improvement compared to the next best existing method ranged between 0.10-0.17 for no respiratory support case and 0.22-0.38 for the respiratory support case. All signal quality improvement values were significant except the proposed NMCF (a) with the no respiratory support case (p-value=0.06).

For lung sound separation, all proposed methods significantly outperformed existing methods except filtering (f). Mean signal quality improvement compared to the filtering method (f) ranged from -0.04 to 0.14 for the no respiratory support case and from 0.00 to 0.27 for the respiratory support case. For the no respiratory support case, only the proposed NMF method (b) was significantly better than the filtering method (f). For the respiratory support case, both the proposed NMCF (a) and NMCF with unsupervised components (a un) methods significantly outperformed the filtering method (f).

As seen in Table III, vital sign improvement results were highly variable. In general, all methods had a median vital sign error improvement of 0, but were positively skewed, as seen in the mean values. Due to the large variation in results, many of the comparisons between methods were not significant. For no respiratory support, both NMCF methods, (a) and (a un), performed best for heart rate estimation (not significant for NMF clustering 2 (j) and filtering (f)), whereas NMF with unsupervised components (b un) performed best for breathing rate estimation (not significant for filtering (f) and modulation filtering (h)). In the presence of respiratory support noise, the existing filtering method (f) performed significantly better than all proposed methods for heart rate estimation, and the wavelet transform based stationary non-stationary filter (m) performed significantly better than the proposed NMCF methods, (a) and (a un).

TABLE III: Real Noisy Neonatal Chest Sound Results. Mean (standard deviation in brackets) is shown for signal quality improvement (SQI), heart rate (HR) error improvement and breathing rate (BR) error improvement, as calculated in Sections III-D3 and III-D2, for heart and lung sounds from recordings with and without respiratory support. Median processing time per 10 s recording, as calculated in Section III-D4, is also shown. Results in bold refer to the best performing sound separation methods. Methods a, a un, b, b un (in blue) are the proposed NMCF, NMCF with Unsupervised Components, NMF and NMF with Unsupervised Components methods respectively. Methods c to n (in black) are existing methods, as specified in Table I.

Method	Heart				Lung				Processing Time
	No Respiratory Support		Respiratory Support		No Respiratory Support		Respiratory Support
	SQI*	HR**	SQI	HR	SQI	BR***	SQI	BR
a	0.85 (0.94)	3.52 (25.4)	1.28 (0.61)	3.32 (24.5)	0.58 (0.85)	-0.19 (10.3)	0.56 (0.69)	-0.01 (10.3)	28.2 s
a un	0.92 (1.05)	3.64 (25.4)	1.34 (0.60)	1.90 (26.9)	0.62 (0.76)	-0.65 (10.9)	0.65 (0.80)	0.44 (10.5)	28.3 s
b	0.90 (0.92)	1.39 (25.6)	1.19 (0.59)	1.03 (25.1)	0.76 (0.93)	0.99 (10.6)	0.42 (0.57)	0.55 (9.1)	275 ms
b un	0.87 (0.95)	0.72 (25.3)	1.26 (0.57)	-0.62 (27.2)	0.67 (0.88)	0.12 (10.8)	0.38 (0.62)	1.18 (10.8)	342 ms
c	0.48 (1.09)	-8.67 (44.9)	0.92 (0.84)	1.18 (32.5)	-0.01 (0.69)	-1.22 (11.7)	-0.11 (0.49)	1.67 (11.9)	1.30 ms
d	0.10 (0.83)	-0.17 (23.5)	0.40 (0.74)	2.94 (28.7)	-0.01 (0.33)	0.20 (4.0)	-0.03 (0.35)	0.06 (3.8)	353 ms
e	0.18 (0.90)	1.06 (25.6)	0.76 (0.87)	0.98 (28.2)	-0.03 (0.68)	-0.53 (7.5)	0.11 (0.54)	0.47 (7.4)	5.53 s
f	0.41 (0.90)	1.59 (23.3)	0.18 (0.62)	7.54 (27.4)	0.62 (0.90)	0.30 (5.5)	0.38 (0.64)	0.02 (6.4)	2.90 ms
g	0.24 (0.98)	0.57 (27.32)	0.73 (0.76)	5.21 (34.2)	-0.18 (0.66)	-1.17 (5.2)	-0.09 (0.53)	0.84 (19.2)	310 ms
h	0.12 (0.81)	-2.85 (25.7)	0.47 (0.65)	0.77 (33.7)	0.05 (0.65)	0.33 (8.5)	-0.06 (0.44)	-0.40 (9.3)	67.5 ms
i	0.55 (0.86)	-1.88 (30.7)	1.02 (0.57)	-2.49 (32.7)	0.06 (0.60)	0.21 (8.3)	0.00 (0.48)	0.38 (7.09)	876 ms
j	0.21 (0.74)	2.46 (23.6)	0.64 (0.67)	1.77 (28.7)	0.37 (0.61)	-1.02 (9.6)	0.15 (0.48)	-0.97 (9.5)	5.26 s
k	0.25 (1.00)	0.48 (24.3)	0.70 (0.71)	3.01 (32.9)	0.01 (0.55)	-0.07 (6.3)	0.12 (0.53)	0.37 (6.3)	6.51 s
l	0.75 (0.94)	-11.03 (42.2)	0.75 (0.78)	2.48 (41.0)	-0.20 (0.71)	-4.2 (14.7)	0.09 (0.50)	0.07 (8.5)	402 ms
m	0.20 (0.90)	-5.14 (25.2)	0.88 (0.66)	-5.11 (35.3)	-0.39 (0.62)	-0.92 (15.4)	-0.17 (0.60)	3.46 (15.1)	36.9 s
n	0.38 (1.01)	-6.92 (37.8)	0.96 (0.92)	4.13 (33.1)	NA	NA	NA	NA	531 ms
*SQI = Signal Quality Improvement (Section III-D3), **HR = Heart Rate Error Improvement (Section III-D2), ***BR = Breathing Rate Error Improvement (Section III-D2)

V Discussion

Overall, both proposed NMCF and NMF methods performed well, producing comparable or superior results to existing methods. In the artificial dataset, NMCF had a median SDR improvement of 0.00, 1.71, -0.55, 1.17 and 0.92 dB over NMF for no noise, cry noise, stethoscope movement noise, bubble CPAP noise and ventilator CPAP noise, respectively. In the real-world dataset, NMF outperformed NMCF in the no respiratory support case, and NMCF outperformed NMF in the respiratory support case. As seen in the results, in the presence of hard-to-separate noise such as respiratory support noise, co-factorisation in the NMCF method provides better adaptation for effective sound separation than the NMF method. In the general case of heart-lung sound separation alone, the NMF method is sufficient for successful sound separation. As the proposed NMF method can be implemented in real-time, it is recommended to use this method initially. If the signal quality is still poor after applying NMF due to hard-to-separate noise, then the proposed NMCF method should be applied afterwards. NMCF allows the basis vectors to adapt to the recording, as opposed to the fixed basis vectors of the NMF method, which improves its ability to segregate noise.
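The recommended order of use, fast NMF first and NMCF only when quality remains poor, can be expressed as a simple fallback. The Python sketch below is illustrative only: the separator and quality-scoring callables are placeholders for the methods described in this paper, and the minimum acceptable quality of 3 on the 1-5 signal quality scale is a hypothetical threshold, not a value from this study.

```python
def separate_with_fallback(recording, nmf_separate, nmcf_separate,
                           quality_score, min_quality=3.0):
    """Try the real-time NMF separator first; fall back to NMCF if needed.

    All three callables are supplied by the caller. `min_quality` is a
    hypothetical cut-off on the 1-5 signal quality scale.
    """
    sounds = nmf_separate(recording)
    if quality_score(sounds) >= min_quality:
        return sounds, "NMF"
    # Quality still poor, e.g. due to hard-to-separate noise: use the
    # slower co-factorisation method on the original recording.
    return nmcf_separate(recording), "NMCF"
```

With this arrangement, the slower NMCF processing cost is only incurred for recordings where the NMF output does not reach the chosen quality level.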

Vital sign error results were less conclusive and less promising for the proposed methods. With regards to median vital sign error improvement being 0, the vital sign estimation methods are, in general, designed to be robust to noise to some extent. In particular, the heart rate estimation method was designed specifically to deal with noisy heart sounds [2]. As a result, the sound separation and denoising methods mainly improved results only when there was a major vital sign estimation error in the raw recordings, explaining the large standard deviation and positive mean seen in the results. For sound separation in the presence of respiratory support noise, the proposed NMF and NMCF methods were inferior to existing methods. One possible explanation is that after separation, with respiratory support noise partially removed, there were cases where heart and lung sounds were not present, resulting in an underestimation of vital signs. In contrast, for other methods that do not explicitly remove the respiratory support sounds, these sounds were miscounted as heartbeats or breathing periods, avoiding the underestimation of vital signs. This explanation is further supported by the observation that existing methods that poorly estimated vital signs in the no respiratory support case showed improvement in the more difficult respiratory support case. In particular for lung sounds, all methods struggled to obtain clean lung sounds for the respiratory support case, suggesting all breathing rate error improvements could be inaccurate. Future work is required to determine whether removing these respiratory support sounds also unintentionally removes heartbeats or breathing periods, or whether there is a subset of recordings in which heart and lung sounds cannot be recovered.

As seen in Table III, the introduction of the unsupervised noise component led to an improvement in signal quality results for the NMCF method and a decrease in signal quality for the NMF method. For NMCF, the benefit of the unsupervised noise component can be explained by the fact that it prevents the co-factorisation and the learning of the basis vectors and temporal activations from being heavily affected by noise that was not present in the reference sound set. Instead, other noise sources are allocated to the unsupervised component as they are not related to any of the reference sounds. For NMF, the basis matrix has already been pre-computed and fixed for the sound separation phase. Hence, this method is not as affected by the presence of other sources of noise, especially if they have fairly distinct frequency components. Instead, relevant heart and lung sound components can be misplaced into the unsupervised components, due to the unsupervised nature of calculating these basis vectors, resulting in a decrease in performance. This is a key limitation of introducing unsupervised components: whilst they can remove unknown noise sources, they can also remove some of the heart and lung sound components.

Whilst our proposed sound separation methods performed better than or comparably to existing methods for respiratory support noise, there is still considerable room for improvement. In both the artificial dataset and real-world recordings, respiratory support noise was still present in the separated sounds. This can be explained by the fact that respiratory support noise has a large frequency overlap with the desired heart and lung sounds. This large frequency overlap results in the frequency basis vectors being quite similar for the respiratory support noise and the heart and lung sounds. Possible ways to improve sound separation in the presence of respiratory support noise are to provide additional reference sounds and to add a temporal constraint or decomposition. Currently, only 5 and 8 reference sounds were provided for bubble and ventilator CPAP respectively, which may not cover the full variety of flow rates and sounds that can occur. With regards to temporal constraint or decomposition, respiratory support noise tends to be present for the entire recording, whereas heart and lung sounds occur intermittently. Therefore, there is potential for reassignment of sub-components after NMF or NMCF based on temporal activation, similar to past works [18, 32]. Additionally, instead of reassignment after decomposition, a temporal constraint or cost function could be added, which would enable the decomposition of the components based on temporal activation. However, future work would be required to develop this temporal decomposition method.
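As an illustration of the reassignment idea above, components whose temporal activations are "on" for nearly the whole recording could be flagged as candidates for continuous noise such as respiratory support. The Python sketch below is a hypothetical heuristic written for this purpose; it is not the method of [18, 32], and the relative threshold of 0.1 and minimum activity ratio of 0.9 are arbitrary illustrative values.

```python
import numpy as np


def activity_ratio(H, rel_threshold=0.1):
    """Fraction of frames in which each component is active.

    A component (row of the activation matrix H) counts as active in a
    frame when its activation exceeds `rel_threshold` times its own peak.
    """
    H = np.asarray(H, dtype=float)
    peak = H.max(axis=1, keepdims=True)
    active = H > rel_threshold * np.maximum(peak, 1e-12)
    return active.mean(axis=1)


def flag_continuous_components(H, min_ratio=0.9, rel_threshold=0.1):
    """Indices of components active in at least `min_ratio` of the frames."""
    return np.flatnonzero(activity_ratio(H, rel_threshold) >= min_ratio)
```

Components flagged in this way could then be reassigned to the noise estimate, while intermittently active components would remain with the heart or lung sound estimates.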

With regards to computational cost, the proposed NMCF method is not suitable for real-time processing, whereas the proposed NMF method is. Given that, for the most part, the proposed NMF method produces superior results to all existing methods and comparable results to the proposed NMCF method, this may be sufficient. There are several ways to improve the computation speed of the proposed NMCF method. Firstly, as the optimal heart and lung sound co-factorisation weights were zero (i.e., no co-factorisation with the mixture chest sound recording), a fusion of NMF and NMCF could be employed. For these zero co-factorisation weights, rather than calculating the associated basis vectors during the separation of the mixture chest sound, the basis matrices of heart and lung sounds can be pre-computed and fixed, as in NMF. Accordingly, the computation time could potentially be reduced by up to 36%. Secondly, the pre-computed basis matrices could be used as the initialisation of the basis matrices for the proposed NMCF method. This initialisation would place the basis matrices closer to the optimal result, meaning fewer iterations of the update algorithm would be required. Thirdly, there is still room for optimisation of the number of reference examples, the number of bases for each sound component, the FFT size and the hop size, all of which could reduce the size of the basis and activation matrices and thereby the computation time. Finally, the multiplicative update algorithm could be optimised, for instance by using C code MEX functions as opposed to MATLAB.

All of these conclusions on computational time with regards to real-time processing are specific to the laptop specifications used in this study. Similar computational times would be expected for high-end laptops, general desktops, or a general device connected to cloud computing. However, computational times would be expected to increase if these sound separation methods were run on a smartphone using its onboard processing power.

An artificial dataset was proposed to quantitatively assess the effectiveness of sound separation with heart, lung and various noise sounds. However, there are several limitations with this dataset. Firstly, to obtain clean reference sounds, bandpass filtering of 50-250 Hz and 200-1000 Hz was applied to heart and lung sounds respectively. Similarly, for cry noise, highpass filtering at 300 Hz was applied. Whilst these frequency bands contain the majority of the heart, lung and cry sounds, there can still be prominent sounds outside these frequency bands. Therefore, to more accurately replicate the larger frequency overlap between these sounds and assess the sound separation methods, pure and unfiltered reference sounds would be required. However, these are quite difficult to obtain from newborns. For example, clean reference heart sounds are typically acquired from adults by asking them to hold their breath, which is not feasible with neonates. However, neonates, in particular preterm neonates, have sporadic breathing patterns of fast periodic breathing followed by a period of no breathing. Therefore, it may be possible to obtain short segments of clean heart sounds. Obtaining clean reference lung sounds is more difficult as heart sound noise will always be present. Future collection of chest sounds with the digital stethoscope placed on the chest but further away from the heart, or on the back, may minimise heart sounds and enable clearer lung sounds. Another option to obtain clean reference sounds with a full frequency range is to artificially generate the sounds with the desired properties. Past work on modelling adult heart sounds could be transferable to neonates. Similarly, a model of neonatal lung sounds and various noise sources could be created.
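The reference-sound filtering described above can be sketched in Python with SciPy as follows. The cut-off frequencies are those stated in the text; the sampling rate of 4000 Hz, the 4th-order Butterworth design, zero-phase filtering and the function names are assumptions made for illustration and may differ from the filters actually used.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 4000  # assumed sampling rate in Hz (must exceed 2000 Hz for the lung band)


def band_filter(x, fs, low=None, high=None, order=4):
    """Zero-phase Butterworth bandpass (low and high) or highpass (low only)."""
    if low is None:
        raise ValueError("a lower cut-off frequency is required")
    if high is not None:
        sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    else:
        sos = butter(order, low, btype="highpass", fs=fs, output="sos")
    return sosfiltfilt(sos, np.asarray(x, dtype=float))


def clean_heart(x, fs=FS):
    return band_filter(x, fs, low=50, high=250)    # heart band from the text


def clean_lung(x, fs=FS):
    return band_filter(x, fs, low=200, high=1000)  # lung band from the text


def clean_cry(x, fs=FS):
    return band_filter(x, fs, low=300)             # cry highpass from the text
```

The 200-250 Hz overlap between the heart and lung bands, and the overlap of the cry band with the lung band, illustrate why prominent sound energy outside or shared between these bands limits how realistic the filtered references can be.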

The second limitation of the artificial dataset is how cry noise was mixed with lung sounds. As inspiration and expiration cannot occur independently of crying while a newborn is crying, this should be taken into account. For stethoscope disconnect noise, this was resolved by zeroing out the section of the heart-lung sound mixture into which the noise was to be inserted, but this is not appropriate for cry noise. For a more appropriate lung sound and cry noise mixture, the periods and order of inspiration and expiration should be taken into account and the cry noise inserted accordingly. However, as lung sounds and breathing rate change while a baby is crying, it may not be useful or necessary to try to represent this mixture accurately.

The third limitation of the artificial dataset is the relatively small evaluation dataset, in particular for respiratory support noise and for heart and lung sounds. For respiratory support noise, pure respiratory support noise will be collected in future by placing the stethoscope on the breathing tube. For heart and lung sounds, the limitation is not the number of chest sound recordings available, but the difficulty in obtaining clean sounds. Again, one potential solution is to artificially create heart and lung sounds.

Furthermore, signal quality estimation of real chest sounds was proposed to quantitatively assess the effectiveness of sound separation on real-world data. This is an essential assessment as it overcomes many of the limitations of artificial datasets and assesses the sound separation methods on the actual data they would be used for. However, there are also several limitations with the proposed signal quality estimation method. Firstly, while the method attempts to objectively assess heart and lung sound quality, it is not 100% accurate [3, 6]. Therefore, comparing methods on a single recording may not be appropriate due to misclassification errors in the signal quality estimation model, whereas assessing overall trends in signal quality improvement across methods is more appropriate. Secondly, the method assesses the overall quality of the heart and lung sounds. This overall assessment does not directly take into account whether significant portions of the heart or lung sound components are also being removed to obtain the “noise-free” sounds. These limitations can also partially explain why the sound separation results differ between the artificial and real-world datasets.

VI Conclusion

This paper has reviewed, adapted and tested a wide variety of existing methods for neonatal chest sound separation to obtain clean heart and lung sounds. Additionally, two methods were proposed, namely NMF and NMCF with reference datasets of heart, lung and noise sounds. The evaluation results show that our proposed methods outperformed existing methods in both the artificial and real-world datasets, with regards to SDR, SIR and SI-SDR, and signal quality, respectively. However, further work is still required for sound separation in the presence of respiratory support noise, as can be seen in the inferior results compared to existing methods for vital sign estimation. Possible directions are obtaining more clean samples, either in the real-world setting or artificially, and investigating deep learning methods.

Acknowledgment

E. Grooby thanks and acknowledges the support from the Monash Newborn team at Monash Children’s Hospital, Australia for data collection and Jinyuan He for the analysis of the newborn data in previous works.

References

[1] Z. Wang, J. N. da Cruz, and F. Wan, “Adaptive Fourier decomposition approach for lung-heart sound separation,” in 2015 IEEE International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications (CIVEMSA). IEEE, 2015, pp. 1–5.
[2] D. B. Springer, L. Tarassenko, and G. D. Clifford, “Logistic regression-HSMM-based heart sound segmentation,” IEEE Transactions on Biomedical Engineering, vol. 63, no. 4, pp. 822–832, 2015.
[3] E. Grooby, J. He, J. Kiewsky, D. Fattahi, L. Zhou, A. King, A. Ramanathan, A. Malhotra, G. A. Dumont, and F. Marzbanrad, “Neonatal Heart and Lung Sound Quality Assessment for Robust Heart and Breathing Rate Estimation for Telehealth Applications,” IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 12, pp. 4255–4266, 2021.
[4] T. Tsalaile and S. Sanei, “Separation of heart sound signal from lung sound signal by adaptive line enhancement,” in 2007 15th European Signal Processing Conference. IEEE, 2007, pp. 1231–1235.
[5] A. Mondal, P. Bhattacharya, and G. Saha, “Reduction of heart sound interference from lung sound signals using empirical mode decomposition technique,” Journal of Medical Engineering & Technology, vol. 35, no. 6-7, pp. 344–353, 2011.
[6] E. Grooby, C. Sitaula, D. Fattahi, R. Sameni, K. Tan, L. Zhou, A. King, A. Ramanathan, A. Malhotra, G. A. Dumont et al., “Real-Time Multi-Level Neonatal Heart and Lung Sound Quality Assessment for Telehealth Applications,” arXiv preprint arXiv:2109.15127, 2021.
[7] M. Pourazad, Z. Moussavi, and G. Thomas, “Heart sound cancellation from lung sound recordings using time-frequency filtering,” Medical and Biological Engineering and Computing, vol. 44, no. 3, pp. 216–225, 2006.
[8] T. H. Falk and W.-Y. Chan, “Modulation filtering for heart and lung sound separation from breath sound recordings,” in 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE, 2008, pp. 1859–1862.
[9] E. Grooby, J. He, D. Fattahi, L. Zhou, A. King, A. Ramanathan, A. Malhotra, G. A. Dumont, and F. Marzbanrad, “A New Non-Negative Matrix Co-Factorisation Approach for Noisy Neonatal Chest Sound Separation,” in 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 2021, pp. 5668–5673.
[10] G. Shah, P. Koch, and C. B. Papadias, “On the blind recovery of cardiac and respiratory sounds,” IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 1, pp. 151–157, 2014.
[11] F. Canadas-Quesada, N. Ruiz-Reyes, J. Carabias-Orti, P. Vera-Candeas, and J. Fuertes-Garcia, “A non-negative matrix factorization approach based on spectro-temporal clustering to extract heart sounds,” Applied Acoustics, vol. 125, pp. 7–19, 2017.
[12] J. Gnitecki, Z. Moussavi, and H. Pasterkamp, “Recursive least squares adaptive noise cancellation filtering for heart sound reduction in lung sounds recordings,” in Proceedings of the 25th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (IEEE Cat. No. 03CH37439), vol. 3. IEEE, 2003, pp. 2416–2419.
[13] F. Ghaderi, H. R. Mohseni, and S. Sanei, “Localizing heart sounds in respiratory signals using singular spectrum analysis,” IEEE Transactions on Biomedical Engineering, vol. 58, no. 12, pp. 3360–3367, 2011.
[14] A. Mondal, I. Saxena, H. Tang, and P. Banerjee, “A noise reduction technique based on nonlinear kernel function for heart sound analysis,” IEEE Journal of Biomedical and Health Informatics, vol. 22, no. 3, pp. 775–784, 2017.
[15] L. J. Hadjileontiadis and S. M. Panas, “A wavelet-based reduction of heart sound noise from lung sounds,” International Journal of Medical Informatics, vol. 52, no. 1-3, pp. 183–190, 1998.
[16] I. Hossain and Z. Moussavi, “An overview of heart-noise reduction of lung sound using wavelet transform based filter,” in Proceedings of the 25th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (IEEE Cat. No. 03CH37439), vol. 1. IEEE, 2003, pp. 458–461.
[17] A. Rudnitskii, “Using nonlocal means to separate cardiac and respiration sounds,” Acoustical Physics, vol. 60, no. 6, pp. 719–726, 2014.
[18] K.-H. Tsai, W.-C. Wang, C.-H. Cheng, C.-Y. Tsai, J.-K. Wang, T.-H. Lin, S.-H. Fang, L.-C. Chen, and Y. Tsao, “Blind monaural source separation on heart and lung sounds based on periodic-coded deep autoencoder,” IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 11, pp. 3203–3214, 2020.
[19] “Newborns: reducing mortality,” https://www.who.int/news-room/fact-sheets/detail/newborns-reducing-mortality, (Accessed on 05/10/2019).
[20] A. Ramanathan, L. Zhou, F. Marzbanrad, R. Roseby, K. Tan, A. Kevat, and A. Malhotra, “Digital stethoscopes in paediatric medicine,” Acta Paediatrica, vol. 108, no. 5, pp. 814–822, 2019.
[21] L. Zhou, F. Marzbanrad, A. Ramanathan, D. Fattahi, P. Pharande, and A. Malhotra, “Acoustic analysis of neonatal breath sounds using digital stethoscope technology,” Pediatric Pulmonology, vol. 55, no. 3, pp. 624–630, 2020.
[22] A. Ramanathan, F. Marzbanrad, K. Tan, F.-T. Zohra, M. Acchiardi, R. Roseby, A. Kevat, and A. Malhotra, “Assessment of breath sounds at birth using digital stethoscope technology,” European Journal of Pediatrics, pp. 1–9, 2020.
[23] A. Lahav, “Questionable sound exposure outside of the womb: frequency analysis of environmental noise in the neonatal intensive care unit,” Acta Paediatrica, vol. 104, no. 1, pp. e14–e19, 2015.
[24] R. Nersisson and M. M. Noel, “Heart sound and lung sound separation algorithms: a review,” Journal of Medical Engineering & Technology, vol. 41, no. 1, pp. 13–21, 2017.
[25] J. Le Roux, F. J. Weninger, and J. R. Hershey, “Sparse NMF–half-baked or well done?” Mitsubishi Electric Research Labs (MERL), Cambridge, MA, USA, Tech. Rep., no. TR2015-023, vol. 11, pp. 13–15, 2015.
[26] B. Gao, W. L. Woo, and S. S. Dlay, “Unsupervised single-channel separation of nonstationary signals using gammatone filterbank and Itakura–Saito nonnegative matrix two-dimensional factorizations,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 60, no. 3, pp. 662–675, 2012.
[27] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[28] C. Févotte, R. Gribonval, and E. Vincent, “BSS_EVAL toolbox user guide–Revision 2.0,” 2005.
[29] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR–half-baked or well done?” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630.
[30] A. King, D. Blank, R. Bhatia, F. Marzbanrad, and A. Malhotra, “Tools to assess lung aeration in neonates with respiratory distress syndrome,” Acta Paediatrica, vol. 109, no. 4, pp. 667–678, 2020.
[31] D. Jain and E. Bancalari, “Neonatal monitoring during delivery room emergencies,” in Seminars in Fetal and Neonatal Medicine, vol. 24, no. 6. Elsevier, 2019, p. 101040.
[32] T.-H. Lin, S.-H. Fang, and Y. Tsao, “Improving biodiversity assessment via unsupervised separation of biological sounds from long-duration recordings,” Scientific Reports, vol. 7, no. 1, pp. 1–10, 2017.