
2022

Reformulating van Rijsbergen’s $F_{\beta}$ metric for weighted binary cross-entropy

Satesh Ramdhani (satesh.ramdhani@gmail.com)
Abstract

The separation of performance metrics from gradient-based loss functions may not always give optimal results and may miss vital aggregate information. This paper investigates incorporating a performance metric alongside differentiable loss functions to inform training outcomes. The goal is to guide model performance and interpretation by assuming statistical distributions on this performance metric for dynamic weighting. The focus is on van Rijsbergen’s $F_{\beta}$ metric – a popular choice for gauging classification performance. Through distributional assumptions on $F_{\beta}$, an intermediary link can be established to the standard binary cross-entropy via dynamic penalty weights. First, the $F_{\beta}$ metric is reformulated to facilitate assuming statistical distributions, with accompanying proofs for the cumulative distribution function. These probabilities are used within a knee curve algorithm to find an optimal $\beta$, or $\beta_{opt}$. This $\beta_{opt}$ is used as a weight or penalty in the proposed weighted binary cross-entropy. Experimentation on publicly available data, along with benchmark analysis, mostly yields better and more interpretable results than the baseline for both imbalanced and balanced classes. For example, for the IMDB text data with known labeling errors, a 14% boost in $F_{1}$ score is shown. The results also reveal commonalities between the penalty model families derived in this paper and the suitability of recall-centric or precision-centric parameters used in the optimization. The flexibility of this methodology can enhance interpretation.

keywords:
Performance metrics, Metrics, F-Beta Metric, Penalty Optimization, C.J. van Rijsbergen, Information Retrieval, Weighted Cross-Entropy, Binary Cross-Entropy, Text Retrieval

1 Acronym List

$F_{\beta}$: F-Beta Metric

$\beta_{opt}$: Optimal $\beta$ from Algorithm 1

$M_{1}^{\beta}$: Model 1: U & IU from (5.1)

$M_{2}^{(\lambda,\sigma^{2})}$: Model 2: Ga & IE from (5.2)

PV: Pressure Vessel Design

$T_{s}$: Thickness of the pressure vessel shell

$T_{h}$: Thickness of the pressure vessel head

UST: Underground Storage Tank

$f_{C}$: Equation for the volume of a cylindrical UST (12)

$f_{CH}$: Equation for the volume of a cylindrical UST with hemispherical endcaps (13)

$f_{ED}$: Equation for the volume of an ellipsoidal UST (14)

$f_{EDH}$: Equation for the volume of an ellipsoidal UST with hemi-ellipsoidal end-caps (15)

CvE: Cylindrical UST versus Ellipsoidal UST

CHvEH: Cylindrical UST with hemispherical endcaps versus ellipsoidal UST with hemi-ellipsoidal end-caps

UCI: UCI Machine Learning Repository

2 Introduction

Data imbalance is a known and widespread real-world issue that affects performance metrics for a variety of learning problems (e.g., image detection and segmentation, text categorization and classification). Approaches to mitigate this issue generally fall into three categories: adjusting the neural network architecture (including multiple models or ensembles, as in Fujino et al. 2008), adjusting the loss function used for training, or adjusting the data (e.g., collecting more data, or leveraging sampling techniques as in Chawla et al. 2002 and Hasanin et al. 2019). This research looks at adjusting the loss function, with a focus on incorporating the $F_{\beta}$ performance metric. The interconnection between performance metric and loss function is crucial for understanding both model behavior and the inherent nature of the specific dataset. This connection has already been approached from the angle of thresholding (a post-model step), as in Lipton et al. (2014), or of developing a problem-specific metric, as Ho and Wookey (2019), Li et al. (2019), and Oksuz et al. (2018) did for real-world mislabeling costs, dynamic weighting for easy negative samples, and object detection, respectively. This paper takes a different and novel approach where statistical distributions act as an intermediary connecting the $F_{\beta}$ metric to the binary cross-entropy through dynamic penalty weights.

First, the derivation of the $F_{\beta}$ metric from van Rijsbergen’s effectiveness score, $E$, is revisited to prove a limiting case of $F_{1}$ in section 4. This result supports the default case for the main algorithm in section 6.

Second, the $F_{\beta}$ metric is reformulated into a multiplicative form by assuming two independent random variables. Then parametric statistical distributions are assumed for these random variables. In particular, the Uniform and Inverse Uniform (U & IU) case and the Gaussian and Inverse Exponential (Ga & IE) case are proposed. The idea behind U & IU is that no prior insight is assumed on the surface of the $F_{\beta}$ cumulative distribution function (CDF), while Ga & IE gives the practitioner more flexibility to encode insight about this CDF surface. This leads to a more interpretable performance metric that is configurable to the data without having to create a new problem-specific metric (or loss function).

Third, for both distributional cases, the CDF, or $\text{Pr}(F_{\beta})$, shown in section 5 facilitates finding an optimal $\beta$ through a knee curve algorithm in section 6.1. This algorithm obtains the best $\beta$ from a monotonic knee curve given precision and recall: it is the value where the curve levels off. The $\beta_{opt}$ surface for different parameter settings, shown in section 6.3, suggests a slightly more recall-centric penalty. This is discussed further in section 7.

Finally, a weighted binary cross-entropy loss function based on $\beta_{opt}$ is proposed in section 6.2. This loss methodology is applied to three data categories: image, text, and tabular/structured data. For contextual data (i.e., image and text), model performance for $F_{1}$ improves, and the best result occurs for the text data that contains (known) labeling errors. The structured/tabular or non-contextual data does not show significant $F_{1}$ improvement, but provides an important result: when considering neural embedding architectures for training, the type (or category) of data matters.

3 Related Work

Logistic regression models are among the most fundamental statistically based classifiers. Jansche (2005) provides a training procedure that uses a sigmoid approximation to maximize the $F_{\beta}$ on this class of classifiers. When comparing the surface plots of the likelihood from Jansche and that from section 5 – a similar but not equivalent comparison – a comparable rate of change can be seen for both surfaces with respect to their respective parameters. This is an important similarity because this paper’s procedure applies distributional assumptions to provide dynamic penalties to a well-known binary cross-entropy loss. Also, implementation of this paper’s methodology is straightforward because it avoids the need to provide updated partial derivatives for the loss function. Furthermore, Jansche alludes to (future work that considers) a general method to optimize several operating points simultaneously, which is a fundamental and indirect assertion in this paper. The sigmoid approximation is also used by Fujino et al. (2008) in the multi-label setting for text categorization. In their framework, multiple binary classifiers are trained per category and combined with weights estimated to maximize micro- or macro-averaged $F_{1}$ scores.

Similarly, Aurelio et al. (2022) propose a methodology for performance metric learning that uses a metric approximation (e.g., AUC, $F_{1}$) derived from the confusion matrix. The back-propagation error term involves the first derivative, followed by the application of gradient descent. This method provides an alternative means of integrating performance metrics with gradient-based learning. However, there are cases where the back-propagation term proposed by Aurelio et al. may pose issues. For instance, when considering equation 13 from Aurelio et al. in conjunction with batch training and severe imbalance, there could be a division-by-zero error if a batch containing only the zero label appears. Moreover, Aurelio et al. test several metrics for their method: $F_{1}$, G-mean, AG-mean, and AUC. But the G-mean, AG-mean, and AUC, based on the confusion matrix approximation, can be derived as functions of $F_{1}$. This suggests that $F_{\beta}$ is more flexible than G-mean, AG-mean, and AUC. In other words, $\beta$ is unique to $F_{\beta}$ yet generalizes across the other metrics when equal to 1. In fact, for class imbalance, the AUC metric – an average over many thresholds – and the G-mean – a geometric mean – are less stringent and more generous in accuracy reporting compared to the $F_{1}$. This is the reason all results in this paper are reported using the $F_{1}$ score.

Surrogate loss functions, which attempt to mimic certain aspects of the $F_{\beta}$, are another related area. For example, sigmoidF1 from Bénédict et al. (2021) creates smooth versions of the entries of the confusion matrix, which are used to create a differentiable loss function that imitates the $F_{1}$. This smooth differentiability is another application of a sigmoid approximation similar to Jansche. Lee et al. (2021) formulate a surrogate loss by adjusting the cross-entropy loss such that its gradient matches the gradient of a smooth version of the $F_{\beta}$.

In terms of metric creation or variation on the $F_{\beta}$, Ho and Wookey (2019), Li et al. (2019), Oksuz et al. (2020), and Yan et al. (2022) are highlighted. The Real World Weight Cross Entropy (RWWCE) loss function from Ho and Wookey is similar in spirit to Oksuz et al. The idea is to set (not train or tune) cost-related weights based on the dataset and the main problem, by introducing costs (e.g., financial costs) that reflect the real world. RWWCE affects both the positive and negative labels by tying each to its own real-world cost implication. The dice loss from Li et al. proposes a dynamic weight adjustment to address the dominating effect of easy-negative examples. The formulation is based on the $F_{\beta}$ using a smoothing parameter and a focal adaptation from Lin et al. (2017). A ranking loss based on the Localisation Recall Precision (LRP) metric of Oksuz et al. (2018) is developed by Oksuz et al. (2020) for object detection. They propose an averaged LRP alongside a ranking loss function for not only classification but also localisation of objects in images. This provides a balance between both positive and negative samples. Along a similar theme, Yan et al. (2022) explore a discriminative loss function that aims to maximize the expected $F_{\beta}$ directly for speech mispronunciation detection. Their loss function is based on the $F_{\beta}$ (comparing human assessors and the model prediction) weighted by a probability distribution (e.g., a normal distribution) for that score. The final objective function is a weighted average between their loss function and the ordinary cross-entropy.

When considering the components of performance metrics, precision and recall are often the primary focus. Mohit et al. (2012) and Tian et al. (2022) propose two different loss functions that are both recall oriented. Mohit et al. adjust the hinge loss by adding a recall-based cost (and penalty) into the objective function. As they note, favoring recall over precision results in a substantial boost to recall and $F_{1}$. By leveraging the concept of inverse frequency weighting (i.e., a sampling-based technique), Tian et al. adjust the cross-entropy to reflect an inverse weighting on false negatives per class. They state that their loss function sits between regular and inverse frequency weighted cross-entropy by balancing the excessive false positives introduced by constantly up-weighting minority classes. When they consider a similar loss function using precision, it shows irregular behavior. These findings are insightful because this paper’s $\beta_{opt}$ surface, as seen in section 6.3, is more recall centric with the added benefit of being able to incorporate precision weighting through the assumed probability surface.

4 Background

The $F_{\beta}$ measure comes directly from van Rijsbergen’s effectiveness score, $E$, for information retrieval (chapter 7 in Rijsbergen 1979). For the theory on the six conditions supporting $E$ as a measure, refer to Rijsbergen. This paper highlights two of these conditions. First, $E$ guides the practitioner’s ability to quantify effectiveness at any point ($r$, $p$) – where $r$ and $p$ are recall and precision – as compared to some other point. Second, precision and recall contribute effects independently to $E$. As stated by Rijsbergen, for a constant $r$ (or $p$), the difference in $E$ from any set of varying points of $p$ (or $r$) cannot be removed by changing the constant. These conditions suggest equivalence relations and imply a common effectiveness (CE) curve based on precision and recall (definition 3 in Rijsbergen 1979). They also motivate the rationale for using statistical distributions to understand the CE curve. Van Rijsbergen’s effectiveness measure is given in (1).

$E = 1 - \dfrac{1}{\alpha\frac{1}{p} + (1-\alpha)\frac{1}{r}}$ (1)

where $\alpha = \frac{1}{\beta^{2}+1}$. Sasaki (2007) gives the details on deriving $F_{\beta} = \frac{(\beta^{2}+1)pr}{\beta^{2}p+r}$ from (1) with $\beta = \frac{r}{p}$ and by solving $\frac{\partial E}{\partial r} = \frac{\partial E}{\partial p}$. The $\beta$ parameter is intended to give the practitioner control by assigning $\beta$ times more importance to recall than precision. Using the derivation steps from Sasaki, a general form of $F_{\beta}$ for any derivative can be shown as (2),

$F_{\beta}^{n}(p,r) = \dfrac{\left(\beta^{\frac{-2}{n-2}}+1\right)pr}{\beta^{\frac{-2}{n-2}}p+r},$ (2)

where $n$ pertains to $\frac{\partial^{n}E}{\partial r^{n}} = \frac{\partial^{n}E}{\partial p^{n}}$, resulting in $\alpha_{n} = \frac{1}{\beta^{\frac{-2}{n-2}}+1}$. Note that $n > 0$ and $n \neq 2$. The proof is found in Appendix A. For $n = 2$ the equation reduces to the equality $p = r$, implying $\beta = 1$. Using (2), it can be seen that $\lim_{n\to\infty} F_{\beta}^{n} = F_{1}^{1} = \frac{2pr}{p+r}$, which is the form most commonly used in the literature. The reason for showing this limiting case is to provide a justification for fixing $\beta = 1$ (instead of claiming equal importance for $r$ and $p$) in the default case of any algorithm – in particular the algorithm in section 6.
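This limit is quick to verify numerically. The following sketch (illustrative code, not part of the paper's experiments; the helper name `f_beta_n` is ours) evaluates (2) for a large $n$, where $\beta^{\frac{-2}{n-2}} \to 1$ for any $\beta > 0$:

```python
# Numerically check that F_beta^n from eqn (2) approaches F_1 = 2pr/(p+r)
# as n grows, since beta^{-2/(n-2)} -> 1 for any beta > 0.
def f_beta_n(p, r, beta, n):
    """General F-beta form from eqn (2); requires n > 0 and n != 2."""
    b = beta ** (-2.0 / (n - 2))
    return (b + 1) * p * r / (b * p + r)

p, r = 0.6, 0.8
f1 = 2 * p * r / (p + r)
print(abs(f_beta_n(p, r, beta=5.0, n=10_000) - f1) < 1e-3)  # True
```

The same function also covers the $n = 1$ case used in section 5, where the exponent $\frac{-2}{n-2}$ reduces to $\beta^{2}$.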

5 Reformulating the F-Beta to leverage statistical distributions

CE for neural networks is seen when different network weights give different precision and recall yet result in similar performance scores. CE also provides a basis for this paper’s use of $\beta$ from the $F_{\beta}$ measure to guide training through penalties, in lieu of an explicit loss (or surrogate loss) function. In fact, Vashishtha et al. (2022) use the $F$-score as part of a preprocessing step for feature selection prior to their ensemble model (EM-PCA then ELM) for fault diagnosis. They show significant performance improvement with their approach, which adds supporting evidence for this paper’s use of the $F_{\beta}$ as a loss penalty for feature selection via gradient-based learning.

The first step is to reformulate (2) for $n = 1$. This makes assuming statistical distributions easier. Consider the following reformulation through multiplicative decomposition in (3), which assumes $X_{1}$ and $X_{2}$ to be independent random variables.

$F_{\beta} = X_{1}X_{2},$ (3)

where $X_{1} = r^{\prime} + \beta^{\prime}$ and $X_{2} = (\beta^{\prime\prime} + r)^{-1}$, with $r^{\prime} = pr$, $\beta^{\prime} = \beta^{2}pr$, and $\beta^{\prime\prime} = \beta^{2}p$. $X_{1}$ indirectly captures imbalance in the model prediction from the underlying data. If precision and recall are on opposite ends of the $[0,1]$ scale, then $X_{1}$ will reflect this, while maintaining continuity when precision and recall are directionally consistent. $X_{2}$ can be thought of as a weighting scheme that appears recall centric with a precision-based penalty. For instance, for both high (or both low) precision and recall, the weighting is consistent with intuition. However, when precision and recall are on opposite ends of the $[0,1]$ scale, the weighting is swayed by the aggregate with the lower score. Two use cases are considered for (3): $X_{1}$ and $X_{2}$ follow U & IU, respectively, and $X_{1}$ and $X_{2}$ follow Ga & IE, respectively.
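A small sanity check (illustrative code of ours, not from the paper) confirms that the decomposition in (3) recovers the standard $F_{\beta}$ formula:

```python
# Verify that X1 * X2 = (pr + beta^2 pr) / (beta^2 p + r) equals the usual
# F_beta = (beta^2 + 1) p r / (beta^2 p + r).
def f_beta(p, r, beta):
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

def f_beta_decomposed(p, r, beta):
    r_prime = p * r                # r'  = pr
    beta_prime = beta**2 * p * r   # b'  = beta^2 pr
    beta_dprime = beta**2 * p      # b'' = beta^2 p
    x1 = r_prime + beta_prime
    x2 = 1.0 / (beta_dprime + r)
    return x1 * x2

p, r, beta = 0.7, 0.5, 2.0
print(abs(f_beta(p, r, beta) - f_beta_decomposed(p, r, beta)) < 1e-12)  # True
```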

5.1 Case 1: Uniform and Inverse Uniform

The thought behind U & IU is to apply a (flat) equal distribution for both $X_{1}$ and $X_{2}$. These assumed distributions are applied to $\beta^{\prime}$ and $\beta^{\prime\prime}$ as follows:

Let $\beta^{\prime} \sim U(0,\beta^{*})$ and $\beta^{\prime\prime} \sim U(0,\beta^{*})$;
then $X_{1} \sim U\left(r^{\prime},\, r^{\prime}+\beta^{*}\right)$ (4)
and $X_{2} \sim IU\left(\frac{1}{r+\beta^{*}},\frac{1}{r}\right)$, (5)

where $r$ and $r^{\prime} \in [0,1]$ and $\beta^{*} > 0$. Note that for both distributions there is only one $\beta^{*}$ chosen, and this value replaces the need to have an explicit form that includes $\beta$ as a parameter. This is for convenience, and also reflects that $\beta^{\prime}$ and $\beta^{\prime\prime}$ differ only by a factor of $r$. So allowing $\beta^{*}$ to vary broadly (which is the $\beta_{max}$ in section 6) is enough to balance this convenience tradeoff. Next, the joint distribution used in section 6 is derived. It can be shown (the proof is in Appendix B) that the joint distribution is:

$$
\begin{aligned}
\text{Pr}\left(F_{\beta}\right) = \text{Pr}\left(F_{\beta}\leq z\right)
&= 1_{\{z\leq p \,\&\, \frac{(r+\beta^{*})z}{r^{\prime}+\beta^{*}}\leq 1\}}\times\dfrac{\frac{z}{2}\left(r+\beta^{*}-\frac{r^{\prime}}{z}\right)^{2}}{(\beta^{*})^{2}} \\
&\quad+1_{\{z>p \,\&\, \frac{(r+\beta^{*})z}{r^{\prime}+\beta^{*}}>1\}}\times\left(\dfrac{rz-r^{\prime}}{\beta^{*}}+\dfrac{1}{\left(\beta^{*}\right)^{2}}\left(\left(r+\beta^{*}+\frac{r^{\prime}+\beta^{*}}{2z}\right)\left(r^{\prime}+\beta^{*}-rz\right)+\dfrac{r\left(rz-\left(r^{\prime}+\beta^{*}\right)\right)}{2}\right)\right) \\
&\quad+1_{\{z>p \,\&\, \frac{(r+\beta^{*})z}{r^{\prime}+\beta^{*}}\leq 1\}}\times\left[\left(\dfrac{rz-r^{\prime}}{\beta^{*}}+\dfrac{1}{\left(\beta^{*}\right)^{2}}\left(\left(r+\beta^{*}+\frac{r^{\prime}+\beta^{*}}{2z}\right)\left(r^{\prime}+\beta^{*}-rz\right)+\dfrac{r\left(rz-\left(r^{\prime}+\beta^{*}\right)\right)}{2}\right)\right)\right. \\
&\qquad\left.-\dfrac{1}{\left(\beta^{*}\right)^{2}}\left((r+\beta^{*})(r^{\prime}+\beta^{*})-\frac{(r^{\prime}+\beta^{*})^{2}}{2z}-\frac{(r+\beta^{*})^{2}z}{2}\right)\right]
\end{aligned}
$$
Figure 1: Probability Mass Surface: U & IU for precision versus recall, and $\beta^{*} \in [8,16]$. The cumulative probability is computed for low ($0.4$) and high ($0.8$) values. (Panels: (a) $P(F_{\beta} < 0.4)$; (b) $P(F_{\beta} < 0.8)$.)

To understand this flat mixture, consider Figure 1 – the CDF surface for a grid of precision and recall where $\beta^{*} \in [8,16]$. (Note: the blue and red heat coloring is from the CDF and highlights curvature and/or rate of change.) For a lower $z$ value of $0.4$, Figure 1(a) shows that $\beta^{*} = 8$ has a faster rate of change as compared to $\beta^{*} = 16$. The same conclusion is apparent in Figure 1(b), which is for a higher $z$ value of $0.8$. For both figures, more curvature is seen for lower $\beta^{*}$ values. This suggests that a larger $\beta^{*}$ value smooths the surface and is a better candidate for $\beta_{max}$ in the algorithm in Section 6.
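For intuition, the same surface can be approximated by Monte Carlo instead of the closed form above. The sketch below is our reading of (4) and (5), not the paper's code; the function name and sample counts are illustrative:

```python
# Monte Carlo estimate of Pr(F_beta <= z) under U & IU: draw
# beta' ~ U(0, beta*), beta'' ~ U(0, beta*), form X1 = r' + beta' and
# X2 = 1/(beta'' + r), and count how often X1 * X2 falls below z.
import random

def mc_cdf_u_iu(p, r, beta_star, z, n_samples=100_000, seed=0):
    rng = random.Random(seed)
    r_prime = p * r
    hits = 0
    for _ in range(n_samples):
        x1 = r_prime + rng.uniform(0.0, beta_star)
        x2 = 1.0 / (rng.uniform(0.0, beta_star) + r)
        if x1 * x2 <= z:
            hits += 1
    return hits / n_samples

# A larger beta* spreads the mass and smooths the surface, echoing Figure 1.
print(mc_cdf_u_iu(p=0.7, r=0.6, beta_star=8.0, z=0.4))
```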

5.2 Case 2: Gaussian and Inverse Exponential

A more informed distributional approach for $X_{1}$ and $X_{2}$ considers Ga & IE, respectively. The reason to use a Gaussian distribution for $X_{1}$ is to allow bell-shaped variability around a fixed $r^{\prime}$ that is based on $\beta^{\prime}$ and ultimately $\beta$. The weighting of $X_{1}$ by $X_{2}$ uses the Inverse Exponential distribution because, with suitable selections of the rate parameter $\lambda$, the distribution can shift mass from left to right as well as appear uniformly distributed around $r$. This provides practitioners enough flexibility to experiment with different weights. The following shows the assumptions for $\beta^{\prime}$ and $\beta^{\prime\prime}$:

Let $\beta^{\prime} \sim \text{Ga}(0,\sigma^{2})$ and $\beta^{\prime\prime} \sim \text{Exponential}(\lambda)$;
then $X_{1} \sim \text{Ga}(r^{\prime},\sigma^{2})$ (7)
and $X_{2} \sim \text{IE}(\lambda;\, r)$, (8)

where $r$ in (8) is the location shift by recall from the definition of $X_{2}$ in (3), and $\sigma^{2}$ is the variability captured by $\beta^{\prime}$. Using both (7) and (8), the distribution for (3) is now split around $z = 0$ as follows:

$$
\begin{aligned}
\text{Pr}\left(F_{\beta}\right) = \text{Pr}\left(F_{\beta}\leq z\right)
&= 1_{z>0}\times\left[\Phi(rz;\, r^{\prime},\sigma^{2})+\exp\left(\lambda r+\dfrac{\left(\frac{\lambda\sigma^{2}}{z}\right)^{2}-\frac{2r^{\prime}\lambda\sigma^{2}}{z}}{2\sigma^{2}}\right)\times\left(1-\Phi\left(rz;\, r^{\prime}-\frac{\lambda\sigma^{2}}{z},\sigma^{2}\right)\right)\right] \\
&\quad+1_{z=0}\times\Phi(0;\, r^{\prime},\sigma^{2}) \\
&\quad+1_{z<0}\times\left[\Phi(rz;\, r^{\prime},\sigma^{2})-\exp\left(\lambda r+\dfrac{\left(\frac{\lambda\sigma^{2}}{z}\right)^{2}-\frac{2r^{\prime}\lambda\sigma^{2}}{z}}{2\sigma^{2}}\right)\times\Phi\left(rz;\, r^{\prime}-\frac{\lambda\sigma^{2}}{z},\sigma^{2}\right)\right],
\end{aligned}
\quad (9)
$$

where $\Phi(x;\,\mu,\sigma^{2})$ denotes the Gaussian (normal) cumulative distribution function evaluated at $x$ for a mean $\mu$ and variance $\sigma^{2}$. (Refer to Appendix C for the proof.) Similar to before, the focus is on the indicator $1_{z\geq 0}$ as defined in (9).

Figure 2: Probability Mass Surface: Ga & IE for precision versus recall, $\beta=16$, $\lambda \in [0.5,2.0]$, and $\sigma^{2} \in [0.5,2.0]$. The cumulative probability is computed for low ($0.4$) and high ($0.8$) values. (Panels: (a) $P(F_{\beta} < 0.4)$, $\lambda=0.5$; (b) $P(F_{\beta} < 0.8)$, $\lambda=0.5$; (c) $P(F_{\beta} < 0.4)$, $\lambda=2.0$; (d) $P(F_{\beta} < 0.8)$, $\lambda=2.0$.)

Since this distributional mixture has more flexibility due to its additional parameters, Figure 2 highlights this when $\lambda \in [0.5,2.0]$ and $\sigma^{2} \in [0.5,2.0]$. The probabilities are again computed at a lower $z$ value, $0.4$, and at a higher $z$ value, $0.8$, for comparison. For a fixed $\lambda$, varying $\sigma^{2}$ impacts the curvature of the surface, with higher $\sigma^{2}$ values producing a flattening effect. Figure 2(a) shows this distinctly. Conversely, as $\lambda$ increases with a fixed $\sigma^{2}$, the rate at which the surface changes is very apparent. This can be seen by juxtaposing Figure 2(c) with 2(a), or Figure 2(d) with 2(b), and noticing that the increase in $\lambda$ produces a clear increase in the rate of change. These observations match the intuition that $\sigma^{2}$ is linked to the shape of the bell curve, and $\lambda$ is linked to a rate of change. They also serve as a basis for the intuition behind the algorithm in Section 6. That is, a faster rate of change along with a curved (and/or smoother) surface would provide loss penalties that adapt quickly per batch using the aggregated information from precision and recall.
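As with the U & IU case, this surface can be approximated by Monte Carlo. The sketch below is our reading of (7) and (8), not the paper's code; names and parameter values are illustrative:

```python
# Monte Carlo estimate of Pr(F_beta <= z) under Ga & IE:
# beta' ~ N(0, sigma^2) gives X1 = r' + beta' ~ N(r', sigma^2), and
# beta'' ~ Exponential(lambda) gives X2 = 1/(beta'' + r).
import random

def mc_cdf_ga_ie(p, r, sigma2, lam, z, n_samples=100_000, seed=0):
    rng = random.Random(seed)
    r_prime = p * r
    sigma = sigma2 ** 0.5
    hits = 0
    for _ in range(n_samples):
        x1 = rng.gauss(r_prime, sigma)         # Gaussian around r'
        x2 = 1.0 / (rng.expovariate(lam) + r)  # inverse of shifted exponential
        if x1 * x2 <= z:
            hits += 1
    return hits / n_samples

# A larger rate lambda concentrates beta'' near zero, raising X2 and
# shifting the CDF surface, consistent with the trend in Figure 2.
print(mc_cdf_ga_ie(p=0.7, r=0.6, sigma2=0.5, lam=0.5, z=0.4))
```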

6 Knee algorithm and Weighted Cross Entropy

6.1 Knee algorithm to find optimal β\beta values

Now that the probabilities $\text{Pr}(F_{\beta} \leq z)$ for some $z \in [0,1]$ are established in sections 5.1 and 5.2, the goal is to use them to obtain an optimal $\beta$ value, $\beta_{opt}$. There are a couple of things to consider. First, because $\beta$ is grouped into $\beta^{\prime}$ and $\beta^{\prime\prime}$ with distributional assumptions, using maximum likelihood estimation (MLE) is not particularly suitable here. Also, $\beta_{max}$, $\sigma^{2}$, and $\lambda$ from (4), (5), and (8) are set in advance and do not need to be estimated. Second, the observed data is only one data point per training batch, namely precision and recall. Given this and the natural bend of the $F_{\beta}$ function, a knee algorithm is applicable. From Satopaa et al. (2011), the knee of a curve is associated with good operator points in a system right before the performance levels off. This removes the need for complex system-specific analysis. Furthermore, they provide a definition of curvature that supports their method being application independent – an important property for this paper. Algorithm 1 implements (and slightly alters) the Kneedle algorithm from Satopaa et al. to detect the knee in the $F_{\beta}$ curve. Refer to Algorithm 1 for the formal pseudocode.

Algorithm 1 Calculate $\beta_{opt}$
1: Require: $n > 0 \wedge \beta_{max} > 0$
2: Ensure: $\beta_{opt} > 0$
3: Compute $p$ and $r$ from the training batch.
4: Initialize $b_{sn}, b_{d}, b_{lmx}, p_{sn}, p_{d}, p_{lmx}, p_{s}$ to empty arrays.
5: $b_{s} \Leftarrow [b_{s_{1}}, ..., b_{s_{n}}]$ where $b_{s_{n}} = \beta_{max}$
6: for $i = 0$ to $n$ do
7:     $z \Leftarrow F_{\beta=b_{s_{i}}}^{n=1}(p, r)$ using eqn (2)
8:     $p_{s_{i}} \Leftarrow \text{Pr}(F_{\beta} \leq z \,|\, \beta = b_{s_{i}}, p = p, r = r)$ using section 5.1 or 5.2
9: end for
10: if $r < p$ then
11:     $p_{max} = \max(p_{s})$
12:     for $i = 0$ to $n$ do
13:         $p_{s_{i}} \Leftarrow p_{max} - p_{s_{i}}$
14:     end for
15: end if
16: $b_{max} = \max(b_{s})$, $p_{max} = \max(p_{s})$, $b_{min} = \min(b_{s})$, $p_{min} = \min(p_{s})$
17: for $i = 0$ to $n$ do
18:     $b_{sn_{i}} \Leftarrow \dfrac{b_{s_{i}} - b_{min}}{b_{max} - b_{min}}$
19:     $p_{sn_{i}} \Leftarrow \dfrac{p_{s_{i}} - p_{min}}{p_{max} - p_{min}}$
20:     $b_{d_{i}} \Leftarrow b_{sn_{i}}$
21:     $p_{d_{i}} \Leftarrow p_{sn_{i}} - b_{sn_{i}}$
22:     if $(i \geq 1) \wedge (i < n)$ then
23:         if $(p_{d_{i-1}} < p_{d_{i}}) \wedge (p_{d_{i+1}} < p_{d_{i}})$ then
24:             $p_{lmx_{i}} \Leftarrow p_{d_{i}}$
25:             $b_{lmx_{i}} \Leftarrow b_{d_{i}}$
26:         end if
27:     end if
28: end for
29: if $p_{lmx}$ is a non-empty array then
30:     $\beta_{opt} = \text{mean}(p_{lmx})$
31: else
32:     $\beta_{opt} = 1$ as per section 4
33: end if

A brief explanation in plain words is as follows:

  1. For any training batch, compute precision ($p$) and recall ($r$). Then, with a predefined $\beta_{max}$ value, set $n$ equally spaced values $b_{s_{i}}$ up to $\beta_{max}$, and use section 5 to compute $p_{s_{i}} = \text{Pr}(F_{\beta} \leq z \,|\, \beta = b_{s_{i}}, p = p, r = r)$. (This replaces step 1 from Satopaa et al.) Let $D_{s}$ represent this smooth curve as $D_{s} = \{(b_{s_{i}}, p_{s_{i}}) \in \mathbb{R}^{2} \,|\, b_{s_{i}}, p_{s_{i}} \geq 0\}$ for $i = 1, ..., n$.

  2. When $r < p$, convert to a knee by taking the difference of the probabilities from the maximum. That is, $p_{s_{i}} = \max(p_{s}) - p_{s_{i}}$ for $i = 1, ..., n$. This is necessary because of the formulation of the $F_{\beta}$ metric.

  3. Normalize the points to a unit square and call these $b_{sn}$ and $p_{sn}$.

  4. Take the difference of the points and label these $b_{d}$ and $p_{d}$.

  5. Find the candidate knee points by collecting all local maxima, labeled $b_{lmx}$ and $p_{lmx}$.

  6. Take the average of $p_{lmx}$, and this will be $\beta_{opt}$. (This simplifies Satopaa et al.)
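The steps above can be transcribed into a short Python sketch (an illustrative transcription of ours, not the authors' code). The CDF is passed in as a callable so either the section 5.1 or 5.2 form can be plugged in; the toy CDF used in the tests is only for demonstration:

```python
# Sketch of Algorithm 1: find beta_opt from the knee of the Pr(F_beta <= z)
# curve over a grid of beta values. cdf(p, r, beta, z) is pluggable.
def beta_opt(p, r, cdf, n=300, beta_max=16.0):
    # Step 1: n equally spaced beta values up to beta_max, with probabilities.
    bs = [beta_max * (i + 1) / n for i in range(n)]
    def f_beta(beta):
        return (beta**2 + 1) * p * r / (beta**2 * p + r)
    ps = [cdf(p, r, b, f_beta(b)) for b in bs]
    # Step 2: when r < p, flip into a knee by differencing from the maximum.
    if r < p:
        m = max(ps)
        ps = [m - v for v in ps]
    # Step 3: normalize both axes to the unit square.
    b_lo, b_hi, p_lo, p_hi = min(bs), max(bs), min(ps), max(ps)
    if p_hi == p_lo:  # flat curve: fall back to the default
        return 1.0
    bsn = [(b - b_lo) / (b_hi - b_lo) for b in bs]
    psn = [(v - p_lo) / (p_hi - p_lo) for v in ps]
    # Step 4: difference curve p_d = p_sn - b_sn.
    pd = [v - b for v, b in zip(psn, bsn)]
    # Step 5: local maxima of the difference curve are the knee candidates.
    plmx = [pd[i] for i in range(1, n - 1) if pd[i - 1] < pd[i] > pd[i + 1]]
    # Step 6: average the p_lmx candidates (as written in Algorithm 1);
    # default to beta = 1 per section 4 when no local maximum exists.
    return sum(plmx) / len(plmx) if plmx else 1.0
```

The flat-curve guard and grid spacing are our additions for robustness; everything else follows the numbered steps directly.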

6.2 Proposed Weighted Binary Cross-Entropy

The weighted binary cross-entropy loss is primarily focused on the imbalanced use case where a minority class exists. This paper posits that, from the shuffling of data observations frequently done while training, relevant aggregate information is available from each batch. For instance, suppose a fixed minority-class observation, $y$, is grouped among different batches of the majority class. The interaction effect of $y$ among these randomly varying training batches is often overlooked. It is this interaction that can be inferred through the precision and recall aggregates, then transferred as a penalty to the loss function via $\beta_{opt}$ in a probabilistic way. Using Algorithm 1 to obtain $\beta_{opt}$, the proposed loss is,

$$
\begin{aligned}
L(f(\textbf{x};\theta)\,|\,\beta^{2}_{opt},\textbf{x})
&= -\sum_{i}\left\{y_{i}\log\left(f_{i}(\textbf{x};\theta)\right)+\left(1-y_{i}\right)\times\log\left(1-f_{i}(\textbf{x};\theta)\right)\right. \\
&\quad\left.\times\left(\frac{1_{\{[1-f_{i}(\textbf{x};\theta)]\leq 0.5\}}}{1+\beta^{2}_{opt}}+(1+\beta^{2}_{opt})\times 1_{\{[1-f_{i}(\textbf{x};\theta)]>0.5\}}\right)\right\},
\end{aligned}
\quad (10)
$$

where the function $f_{i}(\textbf{x};\theta)$ is the $i$-th element of the prediction of a neural network using the inputs $\textbf{x}$ and training weights $\theta$, and $y_{i}$ is the $i$-th element of the true target label. When considering the majority class, or $y_{i} = 0$ for $i = \{1,...,m\}$, the loss is weighted by $(1+\beta^{2}_{opt})$. Therefore, for correctly predicted observations, the loss has a reduction by $(1+\beta^{2}_{opt})$. When incorrectly predicted, the loss is magnified by the same amount. For the minority class, or $y_{i} = 1$ for $i = \{1,...,n-m\}$, the loss is unchanged. This is intentional because under imbalanced data there are far fewer such observations, and computing precision and recall on them leads to numerical instability or frequent edge cases for Algorithm 1.
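A plain-Python sketch may make the weighting concrete. This follows the indicator terms of (10) as written; it is our illustrative reading with function names of our choosing, not the paper's implementation:

```python
import math

def weighted_bce(y_true, y_pred, beta_opt, eps=1e-7):
    """Weighted binary cross-entropy per the indicators in eqn (10):
    the (1 - y) term is scaled by 1/(1 + beta_opt^2) when 1 - f <= 0.5
    and by (1 + beta_opt^2) when 1 - f > 0.5; the y = 1 term is unchanged."""
    w = 1.0 + beta_opt ** 2
    total = 0.0
    for y, f in zip(y_true, y_pred):
        f = min(max(f, eps), 1 - eps)  # clip to avoid log(0)
        neg_w = (1.0 / w) if (1 - f) <= 0.5 else w
        total += y * math.log(f) + (1 - y) * math.log(1 - f) * neg_w
    return -total

# With beta_opt = 0 both weights collapse to 1 (plain binary cross-entropy).
print(round(weighted_bce([1, 0, 0, 0], [0.8, 0.1, 0.6, 0.3], beta_opt=0.0), 4))
```

In practice $\beta_{opt}$ would be recomputed per batch via Algorithm 1 before evaluating this loss.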

6.3 Understanding the βopt\beta_{opt} Surface and Weighted Cross-Entropy

Figure 3: $\beta_{opt}$ Surface: (a) U & IU where $\beta_{max} \in [8,16]$; (b) and (c) Ga & IE for fixed $\beta_{max} = 16$, with $\lambda = 0.5$ and $\lambda = 2.0$ respectively. Note that for Algorithm 1, $n = 300$ equally spaced points are used.

Figure 3 highlights the surface generated from Algorithm 1 leveraging the probabilities from Section 5. First, the U & IU mixture, or Figure 3(a), suggests that the shape of the surface remains relatively similar even when doubling $\beta_{max}$. This is an important point toward fixing $\beta_{max} = 16$ for the Ga & IE. Based on Figure 3(a), the U & IU mixture penalizes more on the outskirts of recall, while the immediate penalties arise on the diagonal of the unit square. This suggests that precision and recall estimates from training cause immediate penalties when they are on opposite ends of the $[0,1]$ range, as well as on the diagonal when these values start to even out. For the Ga & IE mixture, Figures 3(b) and 3(c) show some conclusions similar to the U & IU mixture, along with additional insights. For a lower rate of $\lambda = 0.5$, or Figure 3(b), diagonal spikes for $\sigma^{2} \in [0.5,2.0]$ as well as a precision-centric penalty for higher $\sigma^{2}$ (e.g., $\sigma^{2} = 2$) are seen. For a higher rate of $\lambda = 2.0$, or Figure 3(c), a similar diagonal is retained for $\sigma^{2} \in [0.5,2.0]$ as in Figure 3(b). Furthermore, for increasing values of $\sigma^{2}$, the penalty evolves from precision centric to a vertical separation on the unit grid at around a precision of $0.4$. The overall interpretation is the following: for a lower $\lambda$, increasing $\sigma^{2}$ creates a slightly more precision-based penalty, while for a higher $\lambda$, increasing $\sigma^{2}$ causes the penalty to become more balanced between recall and precision. The choice of these parameters is problem specific, but provides the practitioner flexibility in determining the best selection for their use case. On a separate note, a spiky surface is obvious in Figure 3, which is partially explained by having a default setting in the algorithm. This is a strong sign of immediate and configurable penalties.

7 Datasets and Experimentation

7.1 Datasets

The origins of the FβF_{\beta} metric come from text retrieval, so it is important to verify this method across different categories of data. In particular, image data from CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html), text data from IMDB movie sentiments Maas et al. (2011) and structured/tabular data from the Census Income Dataset Dua and Graff (2019) are tested. For each experiment, the primary label (i.e., label 1) is either imbalanced or forced to be imbalanced to reflect real-world scenarios. Because CIFAR-10 contains multiple image labels, the airplane label is the primary label and all others are combined. This yields a 10% class imbalance. IMDB movie sentiment reviews (positive/negative text) are not imbalanced; the positive sentiments in the training data are reduced to 1K randomly sampled sentiments, yielding a 7.4% imbalance (Table 1). The Census Income tabular data contains 14 input features (i.e., age, work class, education, occupation, etc.) with 5 numerical and 9 categorical features. The binary labels are greater than 50K salary (label 1) and less than 50K salary (label 0). By default, greater than 50K salary is already imbalanced at 6.2%. The training and validation dataset sizes for each data category are as follows: for CIFAR-10, 50K training and 10K validation; for IMDB, 13.5K training and 25K validation; and for the Census data, 200K training and 100K validation. In terms of class imbalance, this paper considers a proportion of label 1 under 10% to be significantly imbalanced and between 10% and 25% to be moderately imbalanced. As heuristic rationale for the 10% threshold: even a model with perfect recall (100%) would require a precision of p=13p=\frac{1}{3} to reach F1=0.5F_{1}=0.5. In practical examples, this scenario can occur with weakly discriminative features. Therefore, this paper seeks to test this algorithm in scenarios that would need an improved precision.
A variety of imbalanced and balanced scenarios will be tested in this paper.
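To make the heuristic above concrete, the standard FβF_{\beta} formula can be evaluated directly; with perfect recall and a precision of 1/3, F1 is exactly 0.5:

```python
# F-beta as defined by van Rijsbergen:
#   F_beta = (1 + beta^2) * p * r / (beta^2 * p + r)
def f_beta(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Perfect recall with precision 1/3 yields F1 = 0.5, the heuristic used
# above to motivate the 10% imbalance threshold.
print(f_beta(1 / 3, 1.0))  # 0.5
```

Values of β above 1 weight recall more heavily, which is why β acts as the recall/precision trade-off knob throughout this paper.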

Two real-life use cases related to cylindrical tanks are also considered, providing a physical domain to test Algorithm 1. Chauhan et al. (2022) developed an arithmetic optimizer hybridized with the slime mould algorithm, and Chauhan et al. (2023) developed an evolutionary algorithm combined with the slime mould algorithm; both focus on global parameter optimization. They tested these algorithms on several benchmark problems, one of which is the pressure vessel design. The problem is a constrained parameter optimization (i.e., material thickness and cylinder dimensions) for minimizing a cost function. This paper uses the HAOASMA algorithm by Chauhan et al. in a simulation that converts the problem into a binary classification. The second use case is derived from Underground Storage Tanks (UST) and is also inspired by Chauhan et al.’s pressure vessel problem. The physical (cylindrical) shape of USTs is similar to the pressure vessel design. USTs are used to store petroleum products, chemicals, and other hazardous materials underground. These structures could deform underground and possibly explain a false positive leak. Ramdhani (2016) and Ramdhani et al. (2018) explored parameter optimization of UST dimensions changing from cylindrical to ellipsoidal. The observed data are vertical (underground) height measurements, which can contain uniformly distributed error. Ramdhani et al. used these measurements and the volumetric equations (12), (13), (14), and (15) - derived from a cross-sectional view - to develop a methodology to estimate tank dimensions and test whether the shape has deformed. The cross-sectional view can be seen in Figure 5.

The conversion of both of these real-life use cases into a classification involves establishing a baseline set of parameters to simulate data for label 0. Varying these parameters allows simulation of data for label 1. For the pressure vessel design, the baseline parameters from HAOASMA are Ts=1.8048T_{s}=1.8048, Th=0.0939T_{h}=0.0939, R=13.8360R=13.8360, and L=123.2019L=123.2019. To convert this to a classification, the parameters for thickness TsT_{s} and ThT_{h} are changed from the baseline while the RR and LL dimensions are drawn from a normal distribution. Using the values for TsT_{s}, ThT_{h}, RR and LL, the cost function is computed via (11). These cost values, concatenated with the RR and LL arrays, serve as the input to a neural network classifier. Label 1 reflects simulated data using TsT_{s} and ThT_{h} that are changed from the baseline. These variations are Ts=1.7887T_{s}=1.7887 and Th{0.0313,0.2817}T_{h}\in\{0.0313,0.2817\}. Label 0 reflects the HAOASMA baseline values Ts=1.8048T_{s}=1.8048 and Th=0.0939T_{h}=0.0939. Appendix D provides the equations, the distributional plots seen in Figure 4, and a detailed explanation of the simulation procedure (Algorithm 2). For the UST problem, Ramdhani used a measurement error model with an error on the height measurement and another on the volume computation. The same model is used to simulate data in this paper. The baseline is a cylinder, and variations to the vertical and horizontal axes aa and bb represent a cylinder deformed to an ellipse. Using rr, LL, hh, aa and bb along with (12) and (14) or (13) and (15), the volume is computed. These volumes, concatenated with noisy height measurements, are the inputs to a neural network classifier. Label 1 reflects simulated data using the variations to the baseline cylinder. These variations are a{3.2,3.8}a\in\{3.2,3.8\} and b{5.0,4.2105}b\in\{5.0,4.2105\}. Label 0 is the baseline cylinder with radius r=4r=4 and length L=32L=32.
Refer to Appendix E for detailed explanation of the simulation Algorithm 3 along with comparison plots and volume equations.
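The general recipe above, baseline parameters for label 0 and varied parameters for label 1, can be sketched as follows. The cost function here is a hypothetical placeholder, not the paper's equation (11) (which, like the volume equations (12)-(15), is given in the appendices); the dimension noise level is also an assumption.

```python
# Hedged sketch of the simulation-to-classification recipe: label 0 rows
# use the baseline parameters, label 1 rows use the varied parameters.
# `cost` is a placeholder, NOT the paper's equation (11); Appendices D
# and E give the real cost and volume equations.
import random

def cost(Ts, Th, R, L):  # hypothetical stand-in for equation (11)
    return Ts * R * L + Th * R * R

def simulate(n, Ts, Th, R_mean=13.8360, L_mean=123.2019, label=0, seed=0):
    rng = random.Random(seed + label)
    rows = []
    for _ in range(n):
        R = rng.gauss(R_mean, 0.5)  # dimensions drawn from a normal
        L = rng.gauss(L_mean, 0.5)  # (0.5 std is an assumed noise level)
        rows.append(([cost(Ts, Th, R, L), R, L], label))
    return rows

# Baseline thickness (label 0) vs varied thickness (label 1), mirroring
# the PV setup: 1200 rows total with a 25% label 1 proportion.
data = simulate(900, 1.8048, 0.0939, label=0) + \
       simulate(300, 1.7887, 0.0313, label=1)
```

Each row pairs the feature vector (cost concatenated with the dimensions) with its binary label, matching the classifier inputs described above.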

7.2 Model Networks

7.2.1 Image Network

For the CIFAR-10 image dataset, ResNet (He et al. 2016) version 1 is applied. The number of layers for ResNet is 20, which initial experimentation showed is adequate for speed and generalization in this case. The Adam optimizer is used with a learning rate of 1e31e^{-3} for a total of 30 epochs. No learning rate schedule is applied because of the intentionally low number of epochs, chosen to validate faster training via the proposed loss algorithm. The training batch size is 32. Modest data augmentation is done – random horizontal and vertical shifts of 10%, and horizontal and vertical flips.

7.2.2 Text Network

For the IMDB movie sentiments, a Transformer block (which applies self-attention, Vaswani et al. 2017) is used. The token embedding size is 32, and the transformer has 2 attention heads and a hidden layer size of 32, with dropout rates of 10%. A pooling layer and the two layers that follow – a dense ReLU-activated layer of size 20 and a dense sigmoid layer of size 1 – give the final output probability. As for preprocessing, a vocabulary size of 20K and a maximum sequence length of 200 are implemented. The training batch size is 32.
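As a small illustration of the preprocessing step (the exact tokenizer is an implementation detail not fixed by the paper), sequences of token ids can be truncated or padded to the maximum length of 200:

```python
# Hedged sketch: pad or truncate token-id sequences to a fixed length,
# matching the maximum-sequence-length-200 preprocessing described above.
# The pad id of 0 is an assumed convention.
MAX_LEN = 200
PAD_ID = 0

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    if len(token_ids) >= max_len:
        return token_ids[:max_len]  # truncate long reviews
    return token_ids + [pad_id] * (max_len - len(token_ids))  # pad short ones

short = pad_or_truncate([5, 17, 42])       # padded up to length 200
long = pad_or_truncate(list(range(300)))   # truncated down to length 200
```

Fixed-length inputs let the embedding and attention layers operate on uniform batches of shape (batch, 200, 32).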

7.2.3 Structured/Tabular Network

For the Census Income Dataset, a standard encoder embedding paradigm Schmidhuber (2015) is used. Specifically, all categorical features with an embedding size of 64 are concatenated, then the numerical features are concatenated to this embedding vector. Afterwards, a 25% dropout layer and the two layers that follow – a fully connected dense layer with GELU activation of size 64 and a sigmoid-activated layer of size 1 – provide the final output probability. The training batch size is 256.
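The embedding-then-concatenate step can be sketched in plain Python. The feature names and vocabularies below are illustrative samples from the Census schema, and the tables are randomly initialized stand-ins for what would be learned parameters in the actual network:

```python
# Hedged sketch of the encoder-embedding paradigm described above: each
# categorical feature is looked up in its own 64-dim embedding table and
# the resulting vectors are concatenated with the numerical features.
# Tables are randomly initialized here; in training they are learned.
import random

EMB_DIM = 64
rng = random.Random(0)

def make_table(vocab):
    return {v: [rng.uniform(-0.05, 0.05) for _ in range(EMB_DIM)] for v in vocab}

tables = {  # two illustrative categorical features (of the 9 in the data)
    "workclass": make_table(["Private", "Self-emp", "Gov"]),
    "education": make_table(["HS-grad", "Bachelors", "Masters"]),
}

def encode(categoricals, numericals):
    vec = []
    for name, value in categoricals.items():
        vec.extend(tables[name][value])  # 64-dim embedding per feature
    vec.extend(numericals)               # numeric features appended last
    return vec

x = encode({"workclass": "Private", "education": "Masters"}, [39.0, 40.0])
```

The concatenated vector (here of length 2×64 + 2) is what the dropout and dense GELU/sigmoid layers would consume.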

7.2.4 UST/Vessel Network

For the real-life use cases on simulated data, the model network is simple because of the minimal number of features. The network is a sequential set of dense layers of sizes 20, 10 and 1. The last layer of size 1 has a sigmoid activation to give the final output probability. Additionally, a dropout of 10% is added after both middle layers. The training batch size is 128.

7.3 Experimental Results

The results in Table 1 compare the use of the loss function (10) by different models based on U & IU and Ga & IE to a baseline case of ordinary cross-entropy. All results shown in this table are computed on the validation datasets for each data category (sizes given in Section 7.1). For ease of presentation, M1βM_{1}^{\beta} is Model 1: U & IU from (5.1). M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} is Model 2: Ga & IE from (5.2). The superscripts β\beta and (λ,σ2)(\lambda,\sigma^{2}) are the parameters being explored. MBM_{B} is the baseline, i.e., the same model network trained using ordinary cross-entropy.
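For orientation, a common form of penalty-weighted binary cross-entropy can be sketched as below, with a weight w playing the role of the dynamically chosen βopt\beta_{opt}. The placement of the weight on the positive-class term is an illustrative assumption; the exact form of the paper's equation (10) is defined earlier.

```python
# Hedged sketch: weighted binary cross-entropy averaged over a batch,
# with weight w (standing in for beta_opt) on the positive-class term.
# The paper's actual loss is equation (10); this shows the standard
# weighted-BCE shape such a penalty plugs into.
import math

def weighted_bce(y_true, y_pred, w=1.0, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip for numerical safety
        total += -(w * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# w = 1 recovers the ordinary (baseline) cross-entropy; w > 1 penalizes
# errors on the minority (label 1) class more heavily.
base = weighted_bce([1, 0, 1], [0.9, 0.2, 0.6])
penalized = weighted_bce([1, 0, 1], [0.9, 0.2, 0.6], w=4.0)
```

Because the weight is recomputed from the βopt\beta_{opt} selection each time, the penalty adapts dynamically during training rather than being fixed up front.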

7.3.1 Image Results

For the image network, Table 1 shows modest improvement over the baseline under M1βM_{1}^{\beta} for a moderately sized β=8\beta=8. This suggests that image data trains better under constant penalties on the outskirts of the unit square toward the imbalance of high precision and low recall. High precision and low recall imply image confusion between classes in the feature embedding space. In fact, this can have large implications, as in Grush (2015). Algorithms like DeepInspect Tian et al. (2020) help to detect confusion and bias errors and isolate misclassified images, leading to repair-based training algorithms such as Tian (2020) and Zhang et al. (2021). But Qian et al. (2021) empirically show that such repair or de-biasing algorithms can be inaccurate with one fixed-seed training run. The importance of the M1βM_{1}^{\beta} result is now evident: M1βM_{1}^{\beta} quickly penalizes the network in a way that inherently mirrors the confusion/bias detection of algorithms like DeepInspect without the need for repair algorithms.

7.3.2 Text Results

The training results for the text network show by far the most improvement, with a nearly 14% boost in the F1F_{1} score over the baseline for the M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} model. Not only is the performance notable, the model parameter selections are consistent – the parameters move in the same direction. In other words, given the parameters λ=0.5\lambda=0.5 and σ2=0.5\sigma^{2}=0.5, the training shows improvement over the baseline, and this improvement continues in the same direction when λ=0.01\lambda=0.01 and σ2=0.01\sigma^{2}=0.01. This is similar to Section 7.3.1 because, first, the architecture is generalizing better (seen by the F1F_{1} score) for label confusion (i.e., language context) and, second, it adjusts for the intentionally configured imbalance and incorrect labeling (a known issue for this dataset). The incorrect labeling in the IMDB dataset is shown to be non-negligible – upwards of 2-3% – by Klie et al. (2022) and Northcutt et al. (2021). In particular, Northcutt et al. show that small increases in label errors often cause a destabilizing effect on machine learning models, for which confident learning methodology is developed to detect them. Klie et al. analyze 18 methods (including confident learning) for Automated Error Detection (AED) and show the importance of AED for data cleaning. In close proximity to the AED methodology, another paradigm is Robust Training with Label Noise. Song et al. (2022) provide an exhaustive survey ranging from robust architectures (i.e., noise adaptation layers) and robust regularization (i.e., explicit and implicit regularization) to robust loss (i.e., loss correction, re-weighting, etc.) and sample selection. It is in this context that the M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} framework sits between AED and Robust Training with Label Noise on this IMDB dataset, which is known to have errors. 
M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} serves two purposes: (1) as a robust loss through the βopt\beta_{opt} re-weighting on the batch and (2) as a means to detect and down-weight possible label errors.

7.3.3 Structured/Tabular Results

The results for the structured/tabular network do not show any F1F_{1} improvement over the baseline nor any indication of possible improvement through the extra parameter variations. From Table 1, the best performing model for this dataset (excluding the baseline) is M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} where λ=2.0\lambda=2.0 and σ2=0.5\sigma^{2}=0.5. The interpretation of this parameter configuration suggests that training tabular data is very susceptible to both low precision and low recall, hence the high penalty in that area of the unit square in Figure 3. Despite embedding categories and numeric features into a richer vector space, the non-contextual nature of tabular data may not necessarily be best trained through these architectures. Furthermore, Sun et al. (2019) apply a two-dimensional embedding (i.e., simulating an image) to this Census dataset, and the results show that a decision tree (i.e., xgboost) would perform similarly. It is worth mentioning that Sun et al. present these results with an accuracy measure (not F1F_{1}), which is misleading since the data is naturally imbalanced. However, a similar general conclusion is given by Borisov et al. (2021) for tabular data – decision trees have faster training time and generally comparable accuracy as compared with embedding-based architectures. These results are unsurprising because, as stated by Wen et al. (2022), tabular data is not contextually driven data like images or languages, which contain position-related correlations. It is heartening to notice that, after Wen et al. apply a causally aware GAN to the census data, the resulting F1F_{1} score (0.5090.509) is similar to the baseline result in Table 1 (0.51930.5193). Because of these results, there is an important finding: the type of data, in particular contextual data which is the basis for the creation of the FβF_{\beta} metric, plays a significant role when using the metric alongside a loss function. 
This hypothesis is studied further in the benchmark data in Section 7.4.

7.3.4 UST/Pressure Vessel Results

The results for the simulation of real-life use cases can be found in Table 2. In the UST case, it is evident that this methodology outperforms the baseline cross-entropy in determining a shape change from a cylinder to an ellipse. For example, in the easier scenario for CvE (a=3.2a=3.2), the M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} model family appears to be better. However, in cases of the extra variations, M18M_{1}^{8} and M2(0.01,0.01)M_{2}^{(0.01,0.01)} perform the same. This trend is also observed in the results for the Image and Text data presented in Table 1. The interpretation is that a slightly more recall-centric penalty may be optimal for this scenario. Interestingly, for the easier CHvEH scenario (a=3.2a=3.2), the M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} model family also appears to be better, and the extra variations M132M_{1}^{32} and M2(5.0,5.0)M_{2}^{(5.0,5.0)} perform the same. These variations mirror CvE but in the other direction, suggesting that a balanced or slightly more precision-centric penalty is optimal. In the difficult scenario (a=3.8a=3.8), both CvE and CHvEH are closely aligned with the M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} model family. For CHvEH the best performer is the M132M_{1}^{32} variation. Overall, there is a 12% to 28% improvement over the baseline or standard cross-entropy for this simulation. Regarding the PV data, for the easier scenario (th=0.0313t_{h}=0.0313) the M1βM_{1}^{\beta} family appears to be better, with the M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} model family not far behind. In the difficult scenario (th=0.2817t_{h}=0.2817) there is no improvement over the baseline cross-entropy, but the best performing model family is M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})}. The reason is likely the significant overlap in distribution seen in Figure 4. These results are impactful because the commonality between model families begins to surface. 
For the easier scenario, a more recall-centric penalty turns out to be better, while in the difficult scenario, a balanced or slightly precision-centric penalty is more effective. This finding is intuitive.

\sidewaystablefn
Table 1: Best F1 Score for M1βM_{1}^{\beta} and M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} over 3030 Epochs
Baseline Parameter Variations Section 6.3 Extra Variations
Dataset MBM_{B} M116M_{1}^{16} M2(0.5,0.5)M_{2}^{(0.5,0.5)} M2(0.5,2.0)M_{2}^{(0.5,2.0)} M2(2.0,0.5)M_{2}^{(2.0,0.5)} M2(2.0,2.0)M_{2}^{(2.0,2.0)} M18M_{1}^{8} M132M_{1}^{32} M2(0.01,0.01)M_{2}^{(0.01,0.01)} M2(5.0,5.0)M_{2}^{(5.0,5.0)}
Image111The image dataset is the CIFAR10. The airplane label versus the remaining labels is the binary label basis. It gives a training data imbalance of 10%. Training data size is 50K and validation is 10K. 0.8161 0.8261 0.8085 0.8193 0.8232 0.8257 0.8266 0.8068 0.8087 0.8178
Text222The text dataset for NLP is the IMDB movie sentiment with binary label of positive/negative sentiment. The vocabulary size is 20K and the maximum review length is 200. The training set is imbalanced by choosing only 1K positive sentiments which yields an imbalance of 7.4%. The training data size is 13.5K and validation is 25K. 0.6749 0.6393 0.7175 0.6170 0.6547 0.6673 0.7236 0.5460 0.7666 0.7364
Structured333The structured or tabular data set is the Census Income Dataset from UCI repository. The labels are greater than or less than 50K salary. The data is already imbalanced with a rate of 6.2% for >>50K. The training data size is 200K and the validation is 100K. 0.5193 0.4170 0.3917 0.4126 0.4635 0.3930 0.3824 0.3511 0.3890 0.4516
\botrule
\sidewaystablefn
Table 2: Best F1 Score for M1βM_{1}^{\beta} and M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} over 3030 Epochs
Baseline Parameter Variations Section 6.3 Extra Variations
Dataset MBM_{B} M116M_{1}^{16} M2(0.5,0.5)M_{2}^{(0.5,0.5)} M2(0.5,2.0)M_{2}^{(0.5,2.0)} M2(2.0,0.5)M_{2}^{(2.0,0.5)} M2(2.0,2.0)M_{2}^{(2.0,2.0)} M18M_{1}^{8} M132M_{1}^{32} M2(0.01,0.01)M_{2}^{(0.01,0.01)} M2(5.0,5.0)M_{2}^{(5.0,5.0)}
CvE111The simulation has label 0 with r=4r=4 and L=32L=32 versus label 1 of a=3.2a=3.2 and b=5.0b=5.0. 0.9691 0.9228 0.9983 0.9915 0.9898 0.9565 0.9966 0.9915 0.9966 0.9673
CvE222The simulation has label 0 with r=4r=4 and L=32L=32 versus label 1 of a=3.8a=3.8 and b=4.2105b=4.2105. 0.3169 0.3147 0.3351 0.3469 0.3296 0.3333 0.3224 0.3401 0.3362 0.3573
CHvEH333The simulation has label 0 with r=4r=4 and L=32L=32 versus label 1 of a=3.2a=3.2 and b=5.0b=5.0. 0.9831 0.9813 0.9813 0.9898 0.9915 0.9831 0.9726 0.9882 0.9831 0.9882
CHvEH444The simulation has label 0 with r=4r=4 and L=32L=32 versus label 1 of a=3.8a=3.8 and b=4.2105b=4.2105. 0.2891 0.3345 0.3427 0.3515 0.3262 0.3159 0.3636 0.3701 0.3395 0.3425
PV555The simulation has label 0 with ts=1.8048t_{s}=1.8048 and th=0.0939t_{h}=0.0939 versus label 1 of ts=1.7887t_{s}=1.7887 and th=0.0313t_{h}=0.0313. 0.9967 0.9992 0.9983 0.9483 0.9831 0.9967 0.9967 0.9891 0.9727 0.9958
PV666The simulation has label 0 with ts=1.8048t_{s}=1.8048 and th=0.0939t_{h}=0.0939 versus label 1 of ts=1.7887t_{s}=1.7887 and th=0.2817t_{h}=0.2817. 0.7515 0.4552 0.5057 0.4893 0.4722 0.5248 0.4934 0.4861 0.4675 0.5161
\botrule
Note: The simulations for UST (Underground Storage Tanks) are for the cylinder versus ellipse (CvE) or cylinder with hemispherical end-caps versus ellipsoidal with hemi-ellipsoidal end-caps (CHvEH). For the PV or pressure vessel, the simulation is between varying thickness of the surface and head. Refer to Appendices D and E for details. The label 1 proportion is 25% and total training and testing size is 1200.

7.4 Further Experimentation: Benchmark Analysis

Following the benchmark analysis from Aurelio et al., a similar approach is taken for the Image, Text, and Tabular data. This expands the analysis from Table 1 to provide a more detailed and comprehensive view across various well-known datasets. The results can be found in Tables 3, 4, and 5. The footnotes in these tables give the breakdown of train and test data sizes, the proportion of label 1, the labeling convention for label 1 versus label 0 (if multiple labels exist), and the location of the data, if necessary. For example, the labeling 9 vs all means label 9 is label 1 and everything else is marked as label 0. Detailed explanations, links, and training details for all the datasets are provided in the footnotes of each table. At a high level, for images, CIFAR-10, CIFAR-100 and Fashion MNIST are analyzed. For text, AG’s News Corpus, Reuters Corpus Volume 1, Hate Speech and Stanford Sentiment Treebank are analyzed. For the tabular data, 10 classical datasets from the UCI repository are analyzed. Finally, the same model networks from Section 7.3 are used.
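The one-vs-all labeling conventions used throughout Tables 3-5 amount to a simple relabeling, as sketched below for the 9 vs all case:

```python
# One-vs-all binarization used throughout the benchmark tables:
# the chosen class becomes label 1 and every other class label 0.
def one_vs_all(labels, positive_class):
    return [1 if y == positive_class else 0 for y in labels]

binary = one_vs_all([3, 9, 1, 9, 0], positive_class=9)  # [0, 1, 0, 1, 0]
```

With fine-grained label sets such as CIFAR-100, this binarization is also what produces the strong 1% imbalance reported in Table 3.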

7.4.1 Image Results

Comparing the CIFAR-10 result in Table 1 versus Table 3, the model family changes from M1βM_{1}^{\beta} to M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})}. The interpretation remains consistent: a recall-centric penalty is favored. The CIFAR-100 examples, with an imbalance of 1%, follow a similar recall-centric penalty for M116M_{1}^{16} under the label convention 9 vs all. However, under the labeling 39 vs all, a more precision-centric penalty is preferred. This illustrates the problem-specific nature of selecting a model family and parameters, showcasing the flexibility of this paper’s methodology. Notably, there is a 14% increase in the F1F_{1} score for CIFAR-100 under the 39 vs all label convention. Fashion MNIST favors M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} with a more precision-centric penalty. The most intriguing result is that, across all the extra variations, M2(5.0,5.0)M_{2}^{(5.0,5.0)} is the most frequent top performer, which corresponds to a more balanced penalty. This suggests that M2(5.0,5.0)M_{2}^{(5.0,5.0)} could be a starting point of exploration given the balanced nature of the penalty distribution.

\sidewaystablefn
Table 3: Best F1 Score for M1βM_{1}^{\beta} and M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} over 3030 Epochs
Baseline Parameter Variations Section 6.3 Extra Variations
Dataset MBM_{B} M116M_{1}^{16} M2(0.5,0.5)M_{2}^{(0.5,0.5)} M2(0.5,2.0)M_{2}^{(0.5,2.0)} M2(2.0,0.5)M_{2}^{(2.0,0.5)} M2(2.0,2.0)M_{2}^{(2.0,2.0)} M18M_{1}^{8} M132M_{1}^{32} M2(0.01,0.01)M_{2}^{(0.01,0.01)} M2(5.0,5.0)M_{2}^{(5.0,5.0)}
CIFAR-10111Train/test 50K/10K, label 1 10%, labeling is 1 vs all. 0.9216 0.9088 0.9204 0.9196 0.9263 0.9122 0.9194 0.9119 0.9268 0.9173
CIFAR-100222Train/test 50K/10K & 50K/10K, label 1 1% & 1%, labeling is 9 vs all & 39 vs all. 0.7345 0.7804 0.7273 0.6941 0.7594 0.7692 0.7501 0.7167 0.7314 0.7683
CIFAR-100222Train/test 50K/10K & 50K/10K, label 1 1% & 1%, labeling is 9 vs all & 39 vs all. 0.6021 0.6592 0.6778 0.6871 0.6381 0.6818 0.6509 0.6351 0.6702 0.6704
Fashion MNIST333Train/test 50K/10K & 50K/10K, label 1 10% & 10%, labeling is 0 vs all & 9 vs all. 0.8651 0.8638 0.8663 0.8462 0.8651 0.8593 0.8544 0.8558 0.8638 0.8672
Fashion MNIST333Train/test 50K/10K & 50K/10K, label 1 10% & 10%, labeling is 0 vs all & 9 vs all. 0.9627 0.9621 0.9656 0.9656 0.9675 0.9648 0.9615 0.9648 0.9641 0.9681
\botrule
Note: All the datasets are easily found in the Keras repository: https://keras.io/api/datasets/.

7.4.2 Text Results

Referring to Table 4, for the AG’s News Corpus and Reuters Corpus Volume 1 under the labeling crude vs all, the M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} model family, particularly M2(0.5,2.0)M_{2}^{(0.5,2.0)}, is preferred. These parameter selections suggest a slightly more precision-centric penalty. When considering Reuters Corpus Volume 1 with the labeling trade vs all and the Stanford Sentiment Treebank, there is no observed improvement. In the case of the Hate Speech Data, a more distinctive context, there is roughly a 4% boost under the M2(2.0,2.0)M_{2}^{(2.0,2.0)} model. This parameter selection is also a balanced penalty between recall and precision. Overall, similar to the Image benchmark conclusion, M2(5.0,5.0)M_{2}^{(5.0,5.0)} is a frequent top performer within the extra-variation set of parameters. This insight of balanced penalty selection also holds for contextual text data.

\sidewaystablefn
Table 4: Best F1 Score for M1βM_{1}^{\beta} and M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} over 3030 Epochs
Baseline Parameter Variations Section 6.3 Extra Variations
Dataset MBM_{B} M116M_{1}^{16} M2(0.5,0.5)M_{2}^{(0.5,0.5)} M2(0.5,2.0)M_{2}^{(0.5,2.0)} M2(2.0,0.5)M_{2}^{(2.0,0.5)} M2(2.0,2.0)M_{2}^{(2.0,2.0)} M18M_{1}^{8} M132M_{1}^{32} M2(0.01,0.01)M_{2}^{(0.01,0.01)} M2(5.0,5.0)M_{2}^{(5.0,5.0)}
ag_news111Train/test 90K/30K, label 1 25%, labeling is 3 vs all, AG’s News Corpus Data found here base-url/ag_news. 0.9632 0.9474 0.9632 0.9639 0.9626 0.9624 0.9553 0.9404 0.9634 0.9655
rcv1222Train/test 5485/2189 & 5485/2189, label 1 4.61% & 4.57%, labeling is crude vs all & trade vs all, Reuters Corpus Volume 1 Data found here base-url/yangwang825/reuters-21578. 0.9333 0.9298 0.9396 0.9461 0.9211 0.9451 0.9316 0.927 0.9356 0.9501
rcv1222Train/test 5485/2189 & 5485/2189, label 1 4.61% & 4.57%, labeling is crude vs all & trade vs all, Reuters Corpus Volume 1 Data found here base-url/yangwang825/reuters-21578. 0.9324 0.92 0.9251 0.9189 0.9178 0.9189 0.9127 0.9139 0.9251 0.9054
hate333Train/test 8027/2676, label 1 11%, labeling is 1 vs 0, Hate Speech Data found here base-url/hate_speech18. 0.8671 0.8304 0.8741 0.9045 0.8621 0.9046 0.8383 0.7669 0.8655 0.8868
sst444Train/test 67K/872, label 1 55%, labeling is 1 vs 0, Stanford Sentiment Treebank found here base-url/sst2. 0.8175 0.7619 0.7955 0.7909 0.8071 0.8001 0.7727 0.7494 0.8018 0.8004
\botrule
Note: Datasets are found in the Hugging Face repository. The base-url is https://huggingface.co/datasets.

7.4.3 Structured/Tabular Results

The tabular or structured benchmark results in Table 5 show that this paper’s methodology outperforms the baseline for all but one dataset (the breast cancer dataset). A key insight is that, for the parameter variations from Section 6.3 and the extra variations, a more recall-centric penalty is preferred. In particular, the M1βM_{1}^{\beta} and M2(2.0,0.5)M_{2}^{(2.0,0.5)} model families are favored for the datasets iono, pima, vehicle, glass, vowel, yeast and abalone. The remaining datasets - seg and sat - show modest improvement for the balanced penalty or M2(0.5,2.0)M_{2}^{(0.5,2.0)} model. Compared to the Census results in Table 1, it appears that feature distinctiveness plays a major part for tabular data. This paper defines feature distinctiveness as a neural network learning better discriminative features with respect to the dependent variable. This conclusion arises from the more recall-centric penalty showing up in the results, suggesting that for tabular or structured data, the network should focus on learning strong discriminative features to enhance recall. This result underscores the hypothesis of this paper that the type of data, particularly contextual data, matters for a metric-based penalty, and it further supports the flexibility of this FβF_{\beta} penalty methodology.

\sidewaystablefn
Table 5: Best F1 Score for M1βM_{1}^{\beta} and M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} over 3030 Epochs
Baseline Parameter Variations Section 6.3 Extra Variations
Dataset MBM_{B} M116M_{1}^{16} M2(0.5,0.5)M_{2}^{(0.5,0.5)} M2(0.5,2.0)M_{2}^{(0.5,2.0)} M2(2.0,0.5)M_{2}^{(2.0,0.5)} M2(2.0,2.0)M_{2}^{(2.0,2.0)} M18M_{1}^{8} M132M_{1}^{32} M2(0.01,0.01)M_{2}^{(0.01,0.01)} M2(5.0,5.0)M_{2}^{(5.0,5.0)}
iono111Train/test 235/116, label 1 34%, Ionosphere Data found in UCI-url. 0.7845 0.8364 0.8068 0.8161 0.8092 0.8256 0.8205 0.8742 0.8114 0.7845
pima222Train/test 514/254, label 1 35%, Pima Indians Diabetes Data found in R-url. 0.5253 0.4407 0.4109 0.3645 0.5454 0.5088 0.5124 0.2711 0.5058 0.5208
breast333Train/test 381/188, label 1 38%, Breast Cancer Wisconsin Data found in UCI-url. 0.9416 0.7985 0.6464 0.8633 0.7934 0.9387 0.7832 0.8239 0.7589 0.7832
vehicle444Train/test 566/280, label 1 27%, labeling is opel vs all, Vehicle Data found in R-url. 0.3942 0.4423 0.4000 0.3363 0.4507 0.3470 0.2105 0.3247 0.3103 0.3333
seg555Train/test 210/2100, label 1 14%, labeling is brickface vs all, Segmentation Data found in UCI-url. 0.6798 0.5099 0.6078 0.6987 0.5571 0.5295 0.3915 0.3130 0.6645 0.3247
glass666Train/test 143/71, label 1 13%, labeling is 7 vs all, Glass Data found in R-url. 0.8695 0.7200 0.7407 0.7826 0.9473 0.6250 0.9473 0.7000 0.8333 0.9523
sat777Train/test 4308/1004, label 1 9%, labeling is 4 vs all, Satellite Data found in UCI-url. 0.5511 0.1674 0.3274 0.5571 0.4963 0.1313 0.2375 0.1714 0.5849 0.3779
vowel888Train/test 663/327, label 1 9%, labeling is hYd vs all, Vowel Data found in R-url. 0.2752 0.3439 0.3076 0.2926 0.3103 0.2434 0.2464 0.1851 0.1647 0.2979
yeast999Train/test 344/170 label 1 9%, labeling is CYT vs ME2, Yeast Data found in UCI-url. 0.5491 0.8717 0.7500 0.5079 0.2185 0.2010 0.6046 0.5084 0.6857 0.2105
abalone101010Train/test 489/242, label 1 6%, labeling is 18 vs 9, Abalone Data found in UCI-url. 0.9723 0.9723 0.9723 0.9723 0.9723 0.9723 0.9723 0.9765 0.9723 0.9723
\botrule

8 Conclusion

This paper proposes a weighted cross-entropy based on van Rijsbergen’s FβF_{\beta} measure. By assuming statistical distributions as an intermediary, an optimal β\beta can be found, which is then used as a penalty weighting in the loss function. This approach is convenient since van Rijsbergen defines β\beta to be a weighting parameter between recall and precision. Training guided by the FβF_{\beta} hypothesizes that the interaction of the many combinations between the minority and majority classes carries information that can help in three ways. First, as in Vashishtha et al., it can improve feature selection. Second, model training can generalize better. Lastly, overall performance may improve. Results from Table 1 show that this methodology helps in achieving better F1F_{1} scores in some cases, with the added benefit of parameter interpretation from M1βM_{1}^{\beta} and M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})}. Furthermore, when considering results from real-life use cases as in Table 2, commonalities between model families start to surface. Parameter selections that yield recall-centric penalties for both M1βM_{1}^{\beta} and M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} can be observed. 
The analyses from this paper provide the following insights: (1) the balanced penalty distribution is a good starting point for the M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} model family, (2) feature distinctiveness impacts parameter selections between both model families, (3) non-contextual data, such as tabular or structured data, seem to benefit from a recall-centric penalty, (4) M1βM_{1}^{\beta} may be better for image data, and M2(λ,σ2)M_{2}^{(\lambda,\sigma^{2})} for text, and (5) contextual-based data are better positioned for embedding architectures than non-contextual data - except when the tabular data can be mapped to contextual data or the features are discriminative. These points show that FβF_{\beta} as a performance metric can be integrated alongside a loss function through penalty weights by using statistical distributions.

References

  • Aurelio et al. (2022) Aurelio, Y. S., de Almeida, G. M., de Castro, C. L., and Braga, A. P. (2022). Cost-Sensitive Learning based on Performance Metric for Imbalanced Data. Neural Processing Letters, 54(4), 3097-3114.
  • Chauhan et al. (2022) Chauhan, S., Vashishtha, G., and Kumar, A. (2022). A symbiosis of arithmetic optimizer with slime mould algorithm for improving global optimization and conventional design problem. The Journal of Supercomputing, 78(5), 6234-6274.
  • Chauhan et al. (2023) Chauhan, S., and Vashishtha, G. (2023). A synergy of an evolutionary algorithm with slime mould algorithm through series and parallel construction for improving global optimization and conventional design problem. Engineering Applications of Artificial Intelligence, 118, 105650.
  • Chawla et al. (2002) Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
  • Fujino et al. (2008) Fujino, A., Isozaki, H., and Suzuki, J. (2008). Multi-label text categorization with model combination based on f1-score maximization. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II.
  • Hasanin et al. (2019) Hasanin, T., Khoshgoftaar, T. M., Leevy, J. L., and Seliya, N. (2019). Examining characteristics of predictive models with imbalanced big data. Journal of Big Data, 6(1), 1-21.
  • Oksuz et al. (2018) Oksuz, K., Cam, B. C., Akbas, E., and Kalkan, S. (2018). Localization recall precision (LRP): A new performance metric for object detection. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 504-519).
  • Li et al. (2019) Li, X., Sun, X., Meng, Y., Liang, J., Wu, F., and Li, J. (2019). Dice loss for data-imbalanced NLP tasks. arXiv preprint arXiv:1911.02855.
  • Ho and Wookey (2019) Ho, Y., and Wookey, S. (2019). The real-world-weight cross-entropy loss function: Modeling the costs of mislabeling. IEEE Access, 8, 4806-4813.
  • Lipton et al. (2014) Lipton, Z. C., Elkan, C., and Narayanaswamy, B. (2014). Thresholding classifiers to maximize F1 score. arXiv preprint arXiv:1402.1892.
  • Bénédict et al. (2021) Bénédict, G., Koops, V., Odijk, D., and de Rijke, M. (2021). sigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. arXiv preprint arXiv:2108.10566.
  • Borisov et al. (2021) Borisov, V., Leemann, T., Seßler, K., Haug, J., Pawelczyk, M., and Kasneci, G. (2021). Deep neural networks and tabular data: A survey. arXiv preprint arXiv:2110.01889.
  • Dua and Graff (2019) Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  • Dudewicz and Mishra (1988) Dudewicz, E. J., and Mishra, S. (1988). Modern mathematical statistics. John Wiley & Sons, Inc.
  • Grush (2015) Grush, L. (2015). Google engineer apologizes after Photos app tags two black people as gorillas. The Verge, 1.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
  • Hogg and Craig (1995) Hogg, R. V., and Craig, A. T. (1995). Introduction to Mathematical Statistics (5th ed.). Prentice Hall, Englewood Cliffs, New Jersey.
  • Jansche (2005) Jansche, M. (2005, October). Maximum expected F-measure training of logistic regression models. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (pp. 692-699).
  • Klie et al. (2022) Klie, J. C., Webber, B., and Gurevych, I. (2022). Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future. arXiv preprint arXiv:2206.02280.
  • Lee et al. (2021) Lee, N., Yang, H., and Yoo, H. (2021). A surrogate loss function for optimization of $F_{\beta}$ score in binary classification with imbalanced data. arXiv preprint arXiv:2104.01459.
  • Lin et al. (2017) Lin, T. Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980-2988).
  • Maas et al. (2011) Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C., (June 2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (pp. 142-150).
  • Mohit et al. (2012) Mohit, B., Schneider, N., Bhowmick, R., Oflazer, K., and Smith, N. A. (2012, April). Recall-oriented learning of named entities in Arabic Wikipedia. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (pp. 162-173).
  • Northcutt et al. (2021) Northcutt, C. G., Athalye, A., and Mueller, J. (2021). Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749.
  • Oksuz et al. (2020) Oksuz, K., Cam, B. C., Akbas, E., and Kalkan, S. (2020). A ranking-based, balanced loss function unifying classification and localisation in object detection. Advances in Neural Information Processing Systems, 33, 15534-15545.
  • Qian et al. (2021) Qian, S., Pham, V. H., Lutellier, T., Hu, Z., Kim, J., Tan, L., … and Shah, S. (2021). Are my deep learning systems fair? An empirical study of fixed-seed training. Advances in Neural Information Processing Systems, 34, 30211-30227.
  • Ramdhani (2016) Ramdhani, S. (2016). Some contributions to underground storage tank calibration models, leak detection and shape deformation (Doctoral dissertation, The University of Texas at San Antonio).
  • Ramdhani et al. (2018) Ramdhani, S., Tripathi, R., Keating, J., and Balakrishnan, N. (2018). Underground storage tanks (UST): A closer investigation statistical implications to changing the shape of a UST. Communications in Statistics-Simulation and Computation, 47(9), 2612-2623.
  • Sandler et al. (2018) Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L. C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4510-4520).
  • Sasaki (2007) Sasaki, Y. (2007). The Truth of the F-Measure. University of Manchester Technical Report.
  • Satopaa et al. (2011) Satopaa, V., Albrecht, J., Irwin, D., and Raghavan, B. (2011). Finding a kneedle in a haystack: Detecting knee points in system behavior. In 2011 31st International Conference on Distributed Computing Systems Workshops (pp. 166-171). IEEE.
  • Schmidhuber (2015) Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural networks, 61, 85-117.
  • Song et al. (2022) Song, H., Kim, M., Park, D., Shin, Y., and Lee, J. G. (2022). Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on Neural Networks and Learning Systems.
  • Sun et al. (2019) Sun, B., Yang, L., Zhang, W., Lin, M., Dong, P., Young, C., and Dong, J. (2019). Supertml: Two-dimensional word embedding for the precognition on structured tabular data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 0-0).
  • Tian et al. (2022) Tian, J., Mithun, N. C., Seymour, Z., Chiu, H. P., and Kira, Z. (2022, May). Striking the Right Balance: Recall Loss for Semantic Segmentation. In 2022 International Conference on Robotics and Automation (ICRA) (pp. 5063-5069). IEEE.
  • Tian et al. (2020) Tian, Y., Zhong, Z., Ordonez, V., Kaiser, G., and Ray, B. (2020, June). Testing dnn image classifiers for confusion & bias errors. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (pp. 1122-1134).
  • Tian (2020) Tian, Y. (2020, November). Repairing confusion and bias errors for DNN-based image classifiers. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 1699-1700).
  • Rijsbergen (1979) Van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). Butterworth-Heinemann, Newton, MA.
  • Vashishtha et al. (2022) Vashishtha, G., and Kumar, R. (2022). Pelton wheel bucket fault diagnosis using improved shannon entropy and expectation maximization principal component analysis. Journal of Vibration Engineering & Technologies, 1-15.
  • Vashishtha et al. (2022) Vashishtha, G., and Kumar, R. (2022). Unsupervised learning model of sparse filtering enhanced using wasserstein distance for intelligent fault diagnosis. Journal of Vibration Engineering & Technologies, 1-18.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
  • Yan et al. (2022) Yan, B. C., Wang, H. W., Jiang, S. W. F., Chao, F. A., and Chen, B. (2022, July). Maximum f1-score training for end-to-end mispronunciation detection and diagnosis of L2 English speech. In 2022 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1-5). IEEE.
  • Zhang et al. (2021) Zhang, X., Zhai, J., Ma, S., and Shen, C. (2021, May). AUTOTRAINER: An Automatic DNN Training Problem Detection and Repair System. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (pp. 359-371). IEEE.
  • Wen et al. (2022) Wen, B., Cao, Y., Yang, F., Subbalakshmi, K., and Chandramouli, R. (2022, March). Causal-TGAN: Modeling Tabular Data Using Causally-Aware GAN. In ICLR Workshop on Deep Generative Models for Highly Structured Data.

Appendix A General form of F-Beta: n-th derivative

The derivation pattern following the steps of Sasaki (2007) is straightforward for any partial derivative after the first. To set the stage, a few equations are listed.

  • From (1), it can easily be shown that $\frac{1}{\alpha\frac{1}{p}+(1-\alpha)\frac{1}{r}}=\frac{pr}{\alpha r+(1-\alpha)p}$.

  • Keeping the notation similar to Sasaki (2007), let $g=\alpha r+(1-\alpha)p$; then $\frac{\partial g}{\partial r}=\alpha$ and $\frac{\partial g}{\partial p}=1-\alpha$.

  • Taking the first derivative of (1) via the chain rule yields the following: $\frac{\partial E}{\partial r}=\frac{-pg+pr\frac{\partial g}{\partial r}}{g^{2}}$ and $\frac{\partial E}{\partial p}=\frac{-rg+pr\frac{\partial g}{\partial p}}{g^{2}}$.

  • After simplifying, $\frac{\partial E}{\partial r}=\frac{-(1-\alpha)p^{2}}{g^{2}}$ and $\frac{\partial E}{\partial p}=\frac{-\alpha r^{2}}{g^{2}}$.

After setting $\frac{\partial^{n}E}{\partial r^{n}}=\frac{\partial^{n}E}{\partial p^{n}}$ for $n=1$, it is easy to see that $(1-\alpha)p^{2}=\alpha r^{2}$, and using $\beta=\frac{r}{p}$ yields the $\alpha$ that pertains to the original $F_{\beta}$ measure, or (2) with $n=1$. With the same steps, for $n=2$ the equality becomes $2(1-\alpha)p^{2}\alpha=2\alpha r^{2}(1-\alpha)$, or $p=r$, implying $\beta=1$. With each successive differentiation where $n>2$, the pattern is as follows: $c\alpha^{n-2}p^{2}=c(1-\alpha)^{n-2}r^{2}$, where $c$ is the same constant on both sides. Using $r=\beta p$ then gives the generalized equality $\alpha_{n}=\frac{1}{\beta^{\frac{-2}{n-2}}+1}$.
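The generalized equality is easy to check numerically: substituting $\alpha_{n}$ back into $\alpha^{n-2}p^{2}=(1-\alpha)^{n-2}r^{2}$ with $r=\beta p$ should close the gap to floating-point precision, and $n=1$ should recover van Rijsbergen's $\alpha=\frac{1}{\beta^{2}+1}$. The sketch below uses illustrative helper names and is not code from the paper:

```python
# Numerical check of alpha_n = 1 / (beta**(-2/(n-2)) + 1) from Appendix A.

def alpha_n(beta: float, n: int) -> float:
    """Optimal alpha from the n-th derivative equality (n != 2)."""
    return 1.0 / (beta ** (-2.0 / (n - 2)) + 1.0)

def equality_gap(beta: float, p: float, n: int) -> float:
    """|alpha^(n-2) p^2 - (1 - alpha)^(n-2) r^2| evaluated at r = beta * p."""
    a = alpha_n(beta, n)
    r = beta * p
    return abs(a ** (n - 2) * p ** 2 - (1.0 - a) ** (n - 2) * r ** 2)

if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        for beta in (0.5, 1.0, 2.0):
            assert equality_gap(beta, p=0.7, n=n) < 1e-12
    # n = 1 recovers the original F-beta weighting alpha = 1/(beta^2 + 1)
    assert abs(alpha_n(2.0, 1) - 1.0 / (2.0 ** 2 + 1.0)) < 1e-12
```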

Appendix B Case 1: Joint Probability Distribution for U and IU

To prove (6a) it is sufficient to set up both integrals and explain the bounds; the computation itself is straightforward. From the probability $\text{Pr}\left(F_{\beta}\right)=\text{Pr}\left(F_{\beta}\leq z\right)=\text{Pr}\left(X_{1}X_{2}\leq z\right)$, it is clear that the domain is $\left[\frac{r^{\prime}}{r+\beta^{*}},\frac{r^{\prime}+\beta^{*}}{r}\right]$ based on (4) and (5). With a slight rearrangement, we can say the following:

\[
\text{Pr}\left(F_{\beta}\right)=\text{Pr}\left(X_{1}X_{2}\leq z\right)=\int_{r^{\prime}}^{r^{\prime}+\beta^{*}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)f_{x_{1}}\,dx_{1},
\]

where $f_{x_{1}}$ is the probability density of $X_{1}$ and $\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)$ is the cumulative distribution of $X_{2}$. These are quite common and can be found in Dudewicz and Mishra (1988) or Hogg and Craig (1995). The bounds come from (4). Using these bounds, notice that for the event $x_{2}\leq\frac{z}{x_{1}}$ to have probability strictly between 0 and 1, we need $\frac{z}{x_{1}}\geq\frac{1}{r+\beta^{*}}$ and $\frac{z}{x_{1}}\leq\frac{1}{r}$. This results in the range $rz\leq x_{1}\leq(r+\beta^{*})z$. Recall that $x_{1}$ lies in the range $r^{\prime}\leq x_{1}\leq r^{\prime}+\beta^{*}$. From both intervals on $x_{1}$, define condition 1 as $rz\leq r^{\prime}$, or $z\leq p$, and condition 2 as $(r+\beta^{*})z\leq r^{\prime}+\beta^{*}$, or $\frac{(r+\beta^{*})z}{r^{\prime}+\beta^{*}}\leq 1$. We need to consider separately the following scenarios: conditions 1 and 2 both true, conditions 1 and 2 both false, and condition 1 false with condition 2 true. The scenario of condition 1 being true and condition 2 being false does not occur.

Proof: For $z\leq p$ and $\frac{(r+\beta^{*})z}{r^{\prime}+\beta^{*}}\leq 1$ we get the following:

\begin{align*}
\int_{r^{\prime}}^{r^{\prime}+\beta^{*}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)f_{x_{1}}\,dx_{1}&=\dfrac{1}{\beta^{*}}\int_{r^{\prime}}^{(r+\beta^{*})z}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)dx_{1}\\
&=\dfrac{1}{(\beta^{*})^{2}}\int_{r^{\prime}}^{(r+\beta^{*})z}\left(\left(r+\beta^{*}\right)-\frac{x_{1}}{z}\right)dx_{1}\\
&=\dfrac{\frac{z}{2}\left(r+\beta^{*}-\frac{r^{\prime}}{z}\right)^{2}}{(\beta^{*})^{2}}.
\end{align*}

For $z>p$ and $\frac{(r+\beta^{*})z}{r^{\prime}+\beta^{*}}>1$ we get the following:

\begin{align*}
\int_{r^{\prime}}^{r^{\prime}+\beta^{*}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)f_{x_{1}}\,dx_{1}&=\dfrac{1}{\beta^{*}}\int_{r^{\prime}}^{rz}dx_{1}+\dfrac{1}{\beta^{*}}\int_{rz}^{r^{\prime}+\beta^{*}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)dx_{1}\\
&=\dfrac{rz-r^{\prime}}{\beta^{*}}+\dfrac{1}{\left(\beta^{*}\right)^{2}}\left(\left(r+\beta^{*}-\frac{r^{\prime}+\beta^{*}}{2z}\right)\left(r^{\prime}+\beta^{*}-rz\right)+\dfrac{r\left(rz-\left(r^{\prime}+\beta^{*}\right)\right)}{2}\right).
\end{align*}

For $z>p$ and $\frac{(r+\beta^{*})z}{r^{\prime}+\beta^{*}}\leq 1$ we get the following:

\begin{align*}
\int_{r^{\prime}}^{r^{\prime}+\beta^{*}}&\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)f_{x_{1}}\,dx_{1}\\
&=\dfrac{1}{\beta^{*}}\int_{r^{\prime}}^{rz}dx_{1}+\dfrac{1}{\beta^{*}}\int_{rz}^{r^{\prime}+\beta^{*}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)dx_{1}-\dfrac{1}{\beta^{*}}\int_{(r+\beta^{*})z}^{r^{\prime}+\beta^{*}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)dx_{1}\\
&=\dfrac{rz-r^{\prime}}{\beta^{*}}+\dfrac{1}{\left(\beta^{*}\right)^{2}}\left(\left(r+\beta^{*}-\frac{r^{\prime}+\beta^{*}}{2z}\right)\left(r^{\prime}+\beta^{*}-rz\right)+\dfrac{r\left(rz-\left(r^{\prime}+\beta^{*}\right)\right)}{2}\right)\\
&\quad-\dfrac{1}{\left(\beta^{*}\right)^{2}}\left((r+\beta^{*})(r^{\prime}+\beta^{*})-\frac{(r^{\prime}+\beta^{*})^{2}}{2z}-\frac{(r+\beta^{*})^{2}z}{2}\right).
\end{align*}

For the scenario $z\leq p$ and $\frac{(r+\beta^{*})z}{r^{\prime}+\beta^{*}}>1$, we need to show that it never occurs. By rearranging condition 2 and recalling $r^{\prime}=pr$, we get $r(z-p)>\beta^{*}(1-z)$. If $z\leq 1$, then $r(z-p)\leq 0$ because $z\leq p$, while $\beta^{*}(1-z)\geq 0$ since $\beta^{*}>0$, so the inequality never holds. If $z>1$, then $z\leq p$ implies $p>1$, which never occurs since $p\in[0,1]$.
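As a sanity check on these closed forms, the inner uniform CDF can be clipped to $[0,1]$ and integrated once, which collapses the three scenarios into a single expression; the result is then compared against Monte Carlo draws of $X_{1}X_{2}$ with $X_{1}\sim\text{Uniform}(r^{\prime},r^{\prime}+\beta^{*})$ and $X_{2}=1/U$, $U\sim\text{Uniform}(r,r+\beta^{*})$, as assumed in Case 1. Function and parameter names below are illustrative:

```python
import random

def case1_cdf(z, r, p, beta):
    """CDF of F_beta = X1 * X2 for Case 1 (Uniform x Inverse-Uniform).

    X1 ~ Uniform(r', r' + beta*) with r' = p * r, and X2 = 1/U with
    U ~ Uniform(r, r + beta*), so Pr(x2 <= y) = (r + beta* - 1/y) / beta*
    clipped to [0, 1].  Integrating the clipped CDF over x1 collapses the
    three scenarios of the proof into one expression.
    """
    rp = p * r
    lo, hi = rp / (r + beta), (rp + beta) / r    # support from (4) and (5)
    if z <= lo:
        return 0.0
    if z >= hi:
        return 1.0
    # Inner CDF is 1 for x1 <= r z, 0 for x1 >= (r + beta) z, linear between.
    a = min(max(r * z, rp), rp + beta)
    b = min(max((r + beta) * z, rp), rp + beta)
    full = a - rp                                # region where the inner CDF is 1
    mid = (r + beta) * (b - a) - (b * b - a * a) / (2.0 * z)
    return (full + mid / beta) / beta

if __name__ == "__main__":
    random.seed(0)
    r, p, beta = 0.6, 0.8, 0.5
    rp = p * r
    n = 200_000
    draws = [(rp + beta * random.random()) / (r + beta * random.random())
             for _ in range(n)]
    for z in (0.5, 0.8, 1.2):
        emp = sum(d <= z for d in draws) / n
        assert abs(case1_cdf(z, r, p, beta) - emp) < 0.01
```

For $z\leq p$ the clipped integral reduces to the first displayed result, $\frac{z}{2}\left(r+\beta^{*}-\frac{r^{\prime}}{z}\right)^{2}/(\beta^{*})^{2}$, which the assertions confirm against simulation.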

Appendix C Case 2: Joint Probability Distribution for Ga and IE

The derivation of (9) is similar to Case 1 in that the integral is broken into pieces and a probability distribution proof is used again. Using the same rearrangement as before, we can say the following:

\[
\text{Pr}\left(F_{\beta}\right)=\text{Pr}\left(X_{1}X_{2}\leq z\right)=\int_{-\infty}^{+\infty}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)f_{x_{1}}\,dx_{1}.
\]

Before moving forward, $X_{2}$'s marginal distribution, or (8), is derived.

\begin{align*}
&\text{If }\beta^{\prime\prime}\sim\text{Exponential}(\lambda),\text{ then the CDF of }X=\beta^{\prime\prime}+r\text{ is}\\
&F(x)=\text{Pr}(X\leq x)=\text{Pr}(\beta^{\prime\prime}+r\leq x)=\text{Pr}(\beta^{\prime\prime}\leq x-r)\\
&F(x)=1-\exp\left(-\lambda(x-r)\right),\quad\forall x\geq r.\\
&\text{With the transformation }Y=g(X)=\frac{1}{X},\text{ we have }g^{-1}(y)=\frac{1}{y}\text{ and }\beta^{\prime\prime}=g^{-1}(y)-r=\frac{1-ry}{y}.\\
&\text{Then, since }g\text{ is strictly decreasing, }F_{Y}(y)=1-F_{X}\left(g^{-1}(y)\right)=\exp\left(-\lambda\left(\frac{1-ry}{y}\right)\right).
\end{align*}

Now, we can see that $X_{2}$ has the distribution $F(x_{2})=\exp\left(-\lambda\left(\frac{1-rx_{2}}{x_{2}}\right)\right)$, where $x_{2}\in\left[0,\frac{1}{r}\right]$. Using this property we complete the proof.
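This transformed CDF is easy to verify by simulation: draw $\beta^{\prime\prime}\sim\text{Exponential}(\lambda)$, form $X_{2}=1/(\beta^{\prime\prime}+r)$, and compare the empirical CDF with $\exp\left(-\lambda\frac{1-ry}{y}\right)$. The names and parameter values below are illustrative:

```python
import math
import random

def inv_exp_cdf(y, r, lam):
    """CDF of X2 = 1/(beta'' + r) with beta'' ~ Exponential(lam), 0 < y <= 1/r."""
    return math.exp(-lam * (1.0 - r * y) / y)

if __name__ == "__main__":
    random.seed(1)
    r, lam = 0.7, 2.0
    n = 200_000
    draws = [1.0 / (r + random.expovariate(lam)) for _ in range(n)]
    for y in (0.4, 0.8, 1.2):
        emp = sum(d <= y for d in draws) / n
        assert abs(inv_exp_cdf(y, r, lam) - emp) < 0.01
```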

Proof: For $z>0$:

\begin{align*}
&\int_{-\infty}^{rz}f_{x_{1}}\,dx_{1}+\int_{rz}^{+\infty}\exp\left\{-\lambda\left(\frac{1-r\frac{z}{x_{1}}}{\frac{z}{x_{1}}}\right)\right\}f_{x_{1}}\,dx_{1}\\
&=\Phi(rz;r^{\prime},\sigma^{2})+\dfrac{1}{\sqrt{2\pi\sigma^{2}}}\int_{rz}^{+\infty}\exp\left\{-\lambda\left(\frac{1-r\frac{z}{x_{1}}}{\frac{z}{x_{1}}}\right)\right\}\exp\left\{-\frac{1}{2\sigma^{2}}\left(x_{1}-r^{\prime}\right)^{2}\right\}dx_{1}\\
&=\Phi(rz;r^{\prime},\sigma^{2})+\dfrac{1}{\sqrt{2\pi\sigma^{2}}}\int_{rz}^{+\infty}\exp\left\{-\frac{x_{1}^{2}-2r^{\prime}x_{1}+(r^{\prime})^{2}}{2\sigma^{2}}-\frac{\lambda x_{1}}{z}+\lambda r\right\}dx_{1}\\
&=\Phi(rz;r^{\prime},\sigma^{2})+\dfrac{1}{\sqrt{2\pi\sigma^{2}}}\int_{rz}^{+\infty}\exp\left\{-\frac{\left(x_{1}-\left(r^{\prime}-\frac{\lambda\sigma^{2}}{z}\right)\right)^{2}}{2\sigma^{2}}+\frac{\left(\frac{\lambda\sigma^{2}}{z}\right)^{2}-\frac{2r^{\prime}\lambda\sigma^{2}}{z}}{2\sigma^{2}}+\lambda r\right\}dx_{1}\\
&=\Phi(rz;r^{\prime},\sigma^{2})+\exp\left(\lambda r+\dfrac{\left(\frac{\lambda\sigma^{2}}{z}\right)^{2}-\frac{2r^{\prime}\lambda\sigma^{2}}{z}}{2\sigma^{2}}\right)\times\left(1-\Phi\left(rz;r^{\prime}-\frac{\lambda\sigma^{2}}{z},\sigma^{2}\right)\right).
\end{align*}

To be clear, the bound $rz$ on the integrals arises from the direction of the inequality on $\frac{z}{x_{1}}$; in particular, $\frac{z}{x_{1}}>\frac{1}{r}$ exactly when $x_{1}<rz$.

For $z=0$: it can be seen that the entire mass is summarized by the Gaussian distribution, or $X_{1}$, since $X_{2}$ is non-negative:

\[
\int_{-\infty}^{0}f_{x_{1}}\,dx_{1}=\Phi(0;r^{\prime},\sigma^{2}).
\]

For $z<0$: this case is a bit different because, although the Inverse Exponential redistributes the Gaussian mass as before, its non-negativity must be adjusted for. Consider the interval $[-\infty,0]$ that represents the domain of this probability mass, and let $p_{z}$ be the probability mass of interest. Next, define $b_{1}$ as the mass over $[-\infty,rz]$, $b_{2}$ as the mass over $[rz,0]$, and $b_{3}$ as the mass over $[-\infty,0]$. Notice that for $b_{1}$ and $b_{2}$ the separation of the integral is similar to before, but with different bounds. So we have the following:

\begin{align*}
b_{1}&=\int_{-\infty}^{rz}\exp\left\{-\lambda\left(\frac{1-r\frac{z}{x_{1}}}{\frac{z}{x_{1}}}\right)\right\}f_{x_{1}}\,dx_{1}\\
b_{2}&=\int_{rz}^{0}f_{x_{1}}\,dx_{1}\\
b_{3}&=\Phi(0;r^{\prime},\sigma^{2}).
\end{align*}

By $X_{2}$'s redistribution, the mass over the negative values is $p_{z}=b_{3}-b_{1}-b_{2}$ for $z<0$. The proof now reduces to evaluating the $p_{z}$ expression; reusing some of the results from the $z>0$ case, we have the following:

\begin{align*}
&\Phi(0;r^{\prime},\sigma^{2})-\int_{-\infty}^{rz}\exp\left\{-\lambda\left(\frac{1-r\frac{z}{x_{1}}}{\frac{z}{x_{1}}}\right)\right\}f_{x_{1}}\,dx_{1}-\int_{rz}^{0}f_{x_{1}}\,dx_{1}\\
&=\Phi(0;r^{\prime},\sigma^{2})-\exp\left(\lambda r+\dfrac{\left(\frac{\lambda\sigma^{2}}{z}\right)^{2}-\frac{2r^{\prime}\lambda\sigma^{2}}{z}}{2\sigma^{2}}\right)\times\Phi\left(rz;r^{\prime}-\frac{\lambda\sigma^{2}}{z},\sigma^{2}\right)-\left(\Phi\left(0;r^{\prime},\sigma^{2}\right)-\Phi\left(rz;r^{\prime},\sigma^{2}\right)\right)\\
&=\Phi(rz;r^{\prime},\sigma^{2})-\exp\left(\lambda r+\dfrac{\left(\frac{\lambda\sigma^{2}}{z}\right)^{2}-\frac{2r^{\prime}\lambda\sigma^{2}}{z}}{2\sigma^{2}}\right)\times\Phi\left(rz;r^{\prime}-\frac{\lambda\sigma^{2}}{z},\sigma^{2}\right).
\end{align*}
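The closed forms for $z>0$, $z=0$, and $z<0$ can be combined into one function and checked against Monte Carlo draws of $X_{1}X_{2}$, with $X_{1}\sim N(r^{\prime},\sigma^{2})$ and $X_{2}=1/(\beta^{\prime\prime}+r)$, $\beta^{\prime\prime}\sim\text{Exponential}(\lambda)$; `erfc` is used for the Gaussian tails to keep the $z<0$ cancellation accurate. Names and test parameters are illustrative:

```python
import math
import random

def norm_cdf(x, mu, var):
    """Phi(x; mu, sigma^2), computed via erfc for lower-tail accuracy."""
    return 0.5 * math.erfc((mu - x) / math.sqrt(2.0 * var))

def norm_sf(x, mu, var):
    """1 - Phi(x; mu, sigma^2), accurate in the upper tail."""
    return 0.5 * math.erfc((x - mu) / math.sqrt(2.0 * var))

def case2_cdf(z, r, p, lam, var):
    """CDF of F_beta = X1 * X2 for Case 2 (Gaussian x Inverse-Exponential).

    X1 ~ N(r', var) with r' = p * r, and X2 = 1/(beta'' + r) with
    beta'' ~ Exponential(lam), following the z > 0, z = 0, and z < 0 cases.
    """
    rp = p * r
    if z == 0:
        return norm_cdf(0.0, rp, var)
    a = lam * var / z
    scale = math.exp(lam * r + (a * a - 2.0 * rp * a) / (2.0 * var))
    if z > 0:
        return norm_cdf(r * z, rp, var) + scale * norm_sf(r * z, rp - a, var)
    return norm_cdf(r * z, rp, var) - scale * norm_cdf(r * z, rp - a, var)

if __name__ == "__main__":
    random.seed(2)
    r, p, lam, var = 0.7, 0.8, 2.0, 0.09
    n = 200_000
    draws = [random.gauss(p * r, math.sqrt(var)) / (r + random.expovariate(lam))
             for _ in range(n)]
    for z in (-0.1, 0.3, 0.8, 1.2):
        emp = sum(d <= z for d in draws) / n
        assert abs(case2_cdf(z, r, p, lam, var) - emp) < 0.01
```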

Appendix D Pressure Vessel Design

This appendix is borrowed from Chauhan et al. (2022); see figure 7 of Chauhan et al. for the structural design of the pressure vessel, which looks similar to the Underground Storage Tanks discussed earlier.

D.1 Problem Statement

The pressure vessel design objective is to minimize total cost, which includes material, forming, and welding. The design variables are the thickness of the shell ($T_{s}$), the thickness of the head ($T_{h}$), the inner radius ($R$), and the length of the cylinder ($L$). The mathematical formulation is given in (11).

\begin{align}
X&=[x_{1},x_{2},x_{3},x_{4}]=[T_{s},T_{h},R,L]\nonumber\\
f(X)&=0.6224x_{1}x_{3}x_{4}+1.7781x_{2}x_{3}^{2}+3.1661x_{1}^{2}x_{4}+19.84x_{1}^{2}x_{3}\tag{11}
\end{align}

For this paper, the HAOASMA algorithm results of Chauhan et al. (2022) will be used as the baseline parameters; these serve as the best known minimum. To be specific, $T_{s}=1.8048$, $T_{h}=0.0939$, $R=13.8360$, and $L=123.2019$. The next section provides a couple of variations to convert this problem into a classification problem.
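Objective (11) is straightforward to evaluate directly at the baseline; `vessel_cost` is an illustrative name, and the numeric bounds in the assertion are loose sanity bounds rather than values reported by Chauhan et al.:

```python
def vessel_cost(ts, th, rr, ll):
    """Total cost f(X) from (11): material, forming, and welding terms."""
    return (0.6224 * ts * rr * ll + 1.7781 * th * rr ** 2
            + 3.1661 * ts ** 2 * ll + 19.84 * ts ** 2 * rr)

if __name__ == "__main__":
    # HAOASMA baseline parameters from Chauhan et al. (2022)
    base = vessel_cost(1.8048, 0.0939, 13.8360, 123.2019)
    assert 4000.0 < base < 4250.0          # loose sanity bounds
    # the cost grows with shell thickness, all else fixed
    assert vessel_cost(2.0, 0.0939, 13.8360, 123.2019) > base
```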

D.2 Varying Design Parameter Plots

The simulation can be carried out using Algorithm 2. Two simulated realizations from this algorithm can be seen in Figure 4. The left figure has values $T_{s}^{v}=1.7887$ and $T_{h}^{v}=0.0313$, and the right figure has $T_{s}^{v}=1.7887$ and $T_{h}^{v}=0.2817$, where the superscript $v$ stands for variation. The variations are intended to reflect two scenarios: the first has a clear separation between distributions (the left figure), hence an easier classification; the second has significant overlap (the right figure), a tougher classification.

Algorithm 2 Simulation of Pressure Vessel Data for Classification
1: Require $s>0$, $i\in(0,1)$, $T_{s}^{v}>0$, and $T_{h}^{v}>0$, where $i$ is the imbalance, $s$ is the size of the data set, and $T_{s}^{v}$ and $T_{h}^{v}$ are parameter variations from the HAOASMA baseline.
2: Set $T_{s}^{b}=1.8048$, $T_{h}^{b}=0.0939$, $R^{b}=13.8360$, and $L^{b}=123.2019$, where the superscript $b$ denotes the baseline.
3: Compute data label sizes based on the imbalance $i$ as $s_{0}=\lfloor s\times(1-i)\rfloor$ and $s_{1}=s-s_{0}$, where the subscripts $0$ and $1$ denote the data sizes for label 0 and label 1.
4: Initialize $t_{s}^{b}$ and $t_{h}^{b}$ as arrays of size $s_{0}$ with the values $T_{s}^{b}$ and $T_{h}^{b}$, respectively.
5: Draw $r^{b}$ and $l^{b}$ arrays of size $s_{0}$ from the normal distributions $N(\mu=R^{b},\sigma^{2}=1)$ and $N(\mu=L^{b},\sigma^{2}=1)$, respectively.
6: Concatenate the column vectors $t_{s}^{b}$, $t_{h}^{b}$, $r^{b}$, and $l^{b}$ and assign this array to the variable $X_{0}$.
7: Apply (11) to each row of $X_{0}$ to yield a cost value and assign this array to the variable $Y_{0}$.
8: Initialize $l_{0}$ as a label array of size $s_{0}$ with the value 0.
9: Concatenate the column vectors $r^{b}$, $l^{b}$, $Y_{0}$, and $l_{0}$ and assign this array to the variable $F_{0}$.
10: Initialize $t_{s}^{v}$ and $t_{h}^{v}$ as arrays of size $s_{1}$ with the values $T_{s}^{v}$ and $T_{h}^{v}$, respectively.
11: Draw $r^{b}$ and $l^{b}$ arrays of size $s_{1}$ from the normal distributions $N(\mu=R^{b},\sigma^{2}=1)$ and $N(\mu=L^{b},\sigma^{2}=1)$, respectively.
12: Concatenate the column vectors $t_{s}^{v}$, $t_{h}^{v}$, $r^{b}$, and $l^{b}$ and assign this array to the variable $X_{1}$.
13: Apply (11) to each row of $X_{1}$ to yield a cost value and assign this array to the variable $Y_{1}$.
14: Initialize $l_{1}$ as a label array of size $s_{1}$ with the value 1.
15: Concatenate the column vectors $r^{b}$, $l^{b}$, $Y_{1}$, and $l_{1}$ and assign this array to the variable $F_{1}$.
16: Stack the arrays $F_{0}$ and $F_{1}$, which yields a total of $s$ rows with the last column being the labels for classification.
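The steps above can be sketched in Python as follows; `simulate_vessel_data` and the row layout `(R, L, cost, label)` are illustrative names (a minimal sketch, not the experimental code used in the paper):

```python
import math
import random

def vessel_cost(ts, th, rr, ll):
    """Objective (11): total material, forming, and welding cost."""
    return (0.6224 * ts * rr * ll + 1.7781 * th * rr ** 2
            + 3.1661 * ts ** 2 * ll + 19.84 * ts ** 2 * rr)

def simulate_vessel_data(s, i, ts_v, th_v, seed=0):
    """Algorithm 2: rows are (R draw, L draw, cost, label)."""
    rng = random.Random(seed)
    ts_b, th_b, r_b, l_b = 1.8048, 0.0939, 13.8360, 123.2019  # HAOASMA baseline
    s0 = math.floor(s * (1 - i))                              # label-0 size
    rows = []
    for size, ts, th, label in ((s0, ts_b, th_b, 0), (s - s0, ts_v, th_v, 1)):
        for _ in range(size):
            r = rng.gauss(r_b, 1.0)   # R ~ N(R^b, 1), steps 5 and 11
            l = rng.gauss(l_b, 1.0)   # L ~ N(L^b, 1)
            rows.append((r, l, vessel_cost(ts, th, r, l), label))
    return rows

if __name__ == "__main__":
    data = simulate_vessel_data(s=1000, i=0.2, ts_v=1.7887, th_v=0.0313)
    assert len(data) == 1000
    assert sum(row[-1] for row in data) == 200   # label-1 share equals i
```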
Figure 4: Comparison of simulated realizations using Algorithm 2 with variations to the thickness parameters. Left: $T_{s}^{v}=1.7887$ and $T_{h}^{v}=0.0313$. Right: $T_{s}^{v}=1.7887$ and $T_{h}^{v}=0.2817$.

Appendix E Underground Storage Tank (UST)

This section is borrowed from Ramdhani (2016), where all equations, derivations, and further explorations can be found.

E.1 Problem Statement

The UST problem deals with estimating tank dimensions by using only vertical height measurements. It is also possible that this cylindrical UST has hemispherical endcaps appended on the ends, which also contain volume. The equations for the volume based on cross-sectional measurements for tanks with cylindrical, cylindrical with hemispherical endcaps, ellipsoidal, and ellipsoidal with hemi-ellipsoidal endcap shapes are given in (12), (13), (14), and (15), respectively.

The equation for the cylindrical shape is:

\[
f_{C}(r,L,h)=L\left\{r^{2}\cos^{-1}\left(\frac{r-h}{r}\right)-(r-h)\sqrt{2rh-h^{2}}\right\}\tag{12}
\]

If one were to add hemispherical endcaps to the cylinder ends, the subsequent volume would be:

\[
f_{CH}(r,L,h)=L\left\{r^{2}\cos^{-1}\left(\frac{r-h}{r}\right)-(r-h)\sqrt{2rh-h^{2}}\right\}+\frac{\pi h^{2}}{3}(3r-h)\tag{13}
\]

The equation for the elliptical shape of a deformed cylinder is:

\begin{align}
f_{ED}(a,b,L,h)&=L\left\{ab\cos^{-1}\left(\frac{a-h}{\sqrt{a^{2}+(h^{2}-2ha)\left(1-\frac{b^{2}}{a^{2}}\right)}}\right)\right.\nonumber\\
&\quad\left.-b(a-h)\sqrt{1-\left(1-\frac{h}{a}\right)^{2}}\right\}\tag{14}
\end{align}

If one were to add hemispherical endcaps to the cylinder, which deform to hemi-ellipsoidal endcaps, the subsequent volume would be:

\begin{align}
f_{EDH}(a,b,L,h)&=L\left\{ab\cos^{-1}\left(\frac{a-h}{\sqrt{a^{2}+(h^{2}-2ha)\left(1-\frac{b^{2}}{a^{2}}\right)}}\right)\right.\nonumber\\
&\quad\left.-b(a-h)\sqrt{1-\left(1-\frac{h}{a}\right)^{2}}\right\}+\dfrac{2\pi a^{3}+\pi(a-h)\left(\frac{hb^{2}}{a}\right)\left(\frac{h-2a}{a}\right)}{3}\nonumber\\
&\quad-\dfrac{2\pi a^{3}(a-h)}{3\sqrt{a^{2}+(h^{2}-2ha)\left(1-\frac{b^{2}}{a^{2}}\right)}}\tag{15}
\end{align}

E.2 Varying Tank Dimension

The parameters used in Algorithm 3 are borrowed from Ramdhani (2016). The baseline is the cylindrical case with radius $r=4$ and length $L=32$, and the parameter variations are on $a$ and $b$ for an ellipse. Ramdhani used a measurement-error-based model for simulation, which is also used here; the measurement errors are on the heights $h$. Similar to the pressure vessel, we consider an easy and a tough simulation scenario for classification. This is seen in Figure 6, where the left figure makes it easier to distinguish between cylinder and ellipse than the right figure. The same interpretation applies to the endcap-based equations in Figure 7.

Algorithm 3 Simulation of Tank Dimension Data for Classification
1: Require $s>0$, $i\in(0,1)$, $a>0$, and $b>0$, where $i$ is the imbalance, $s$ is the size of the data set, and $a$ and $b$ are the parameters for the vertical and horizontal axes of an ellipse.
2: Set $r=4$ and $L=32$ as the baseline parameters.
3: Compute data label sizes based on the imbalance $i$ as $s_{0}=\lfloor s\times(1-i)\rfloor$ and $s_{1}=s-s_{0}$, where the subscripts $0$ and $1$ denote the data sizes for label 0 and label 1.
4: Initialize noise arrays $\epsilon_{0}\sim N(0,2)$ and $\gamma_{0}\sim U(-0.05,0.05)$, both of size $s_{0}$.
5: Draw an array of heights $h_{0}\sim U(1,2\times r-1)$ of size $s_{0}$ for the vertical height of a cylinder.
6: Compute the variable $h_{0}^{\prime}=h_{0}+\gamma_{0}$.
7: Initialize $l_{0}$ as a label array of size $s_{0}$ with the value 0.
8: Compute the volume from either (12) or (13) using $h_{0}$, $r$, and $L$, and assign this to the variable $X_{0}$.
9: Assign the variable $Y_{0}=X_{0}+\epsilon_{0}$.
10: Concatenate the column arrays $Y_{0}$, $h_{0}^{\prime}$, and $l_{0}$ and assign this to $F_{0}$.
11: Initialize noise arrays $\epsilon_{1}\sim N(0,2)$ and $\gamma_{1}\sim U(-0.05,0.05)$, both of size $s_{1}$.
12: Draw an array of heights $h_{1}\sim U(1,2\times a-1)$ of size $s_{1}$ for the vertical height of an ellipse.
13: Compute the variable $h_{1}^{\prime}=h_{1}+\gamma_{1}$.
14: Initialize $l_{1}$ as a label array of size $s_{1}$ with the value 1.
15: Compute the volume from either (14) or (15) using $h_{1}$, $a$, $b$, and $L$, and assign this to the variable $X_{1}$.
16: Assign the variable $Y_{1}=X_{1}+\epsilon_{1}$.
17: Concatenate the column arrays $Y_{1}$, $h_{1}^{\prime}$, and $l_{1}$ and assign this to $F_{1}$.
18: Stack the arrays $F_{0}$ and $F_{1}$, which yields a total of $s$ rows with the last column being the labels for classification.
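The steps above can be sketched in Python using the non-endcap pair (12) versus (14); extending to the endcap equations (13) and (15) is analogous. Names are illustrative, and the noise $\epsilon\sim N(0,2)$ is interpreted as variance 2 (an assumption):

```python
import math
import random

def cyl_volume(r, L, h):
    """Equation (12): partial volume of a cylindrical tank at liquid height h."""
    return L * (r * r * math.acos((r - h) / r) - (r - h) * math.sqrt(2 * r * h - h * h))

def ell_volume(a, b, L, h):
    """Equation (14): partial volume of the elliptically deformed cylinder."""
    denom = math.sqrt(a * a + (h * h - 2 * h * a) * (1 - b * b / (a * a)))
    return L * (a * b * math.acos((a - h) / denom)
                - b * (a - h) * math.sqrt(1 - (1 - h / a) ** 2))

def simulate_tank_data(s, i, a, b, seed=0):
    """Algorithm 3 using (12) vs (14): rows are (noisy volume, noisy height, label)."""
    rng = random.Random(seed)
    r, L = 4.0, 32.0                                   # cylindrical baseline
    s0 = math.floor(s * (1 - i))                       # label-0 size; s1 = s - s0
    rows = []
    specs = ((s0, lambda h: cyl_volume(r, L, h), 2 * r - 1, 0),
             (s - s0, lambda h: ell_volume(a, b, L, h), 2 * a - 1, 1))
    for size, volume, top, label in specs:
        for _ in range(size):
            h = rng.uniform(1.0, top)                  # true height
            eps = rng.gauss(0.0, math.sqrt(2.0))       # volume noise, variance 2 assumed
            gamma = rng.uniform(-0.05, 0.05)           # height measurement noise
            rows.append((volume(h) + eps, h + gamma, label))
    return rows

if __name__ == "__main__":
    data = simulate_tank_data(s=800, i=0.25, a=3.2, b=5.0)
    assert len(data) == 800
    assert sum(row[-1] for row in data) == 200
    # a full cylinder (h = 2r) holds pi * r^2 * L
    assert abs(cyl_volume(4.0, 32.0, 8.0) - math.pi * 16.0 * 32.0) < 1e-9
```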
Figure 5: Left: cylindrical UST with a radius (r) of 4 and total max height of 8. Right: elliptical UST (deformed) with vertical axis (a) of 3.6 and max height of 7.2.
Figure 6: Comparison of one simulated realization of volume versus height measurements using Algorithm 3 for equation (12) versus (14). Left: $a=3.2$, $b=5.0$ versus $r=4$. Right: $a=3.8$, $b=4.2105$ versus $r=4$.
Figure 7: Comparison of one simulated realization of volume versus height measurements using Algorithm 3 for equation (13) versus (15). Left: $a=3.2$, $b=5.0$ versus $r=4$. Right: $a=3.8$, $b=4.2105$ versus $r=4$.