\jyear

2022

Reformulating van Rijsbergen’s $F_{\beta}$ metric for weighted binary cross-entropy

\fnmSatesh Ramdhani satesh.ramdhani@gmail.com

Abstract

The separation of performance metrics from gradient based loss functions may not always give optimal results and may miss vital aggregate information. This paper investigates incorporating a performance metric alongside differentiable loss functions to inform training outcomes. The goal is to guide model performance and interpretation by assuming statistical distributions on this performance metric for dynamic weighting. The focus is on van Rijsbergen’s $\text{F}_{\beta}$ metric – a popular choice for gauging classification performance. Through distributional assumptions on the $\text{F}_{\beta}$ , an intermediary link can be established to the standard binary cross-entropy via dynamic penalty weights. First, the $\text{F}_{\beta}$ metric is reformulated to facilitate assuming statistical distributions with accompanying proofs for the cumulative density function. These probabilities are used within a knee curve algorithm to find an optimal $\beta$ or $\beta_{opt}$ . This $\beta_{opt}$ is used as a weight or penalty in the proposed weighted binary cross-entropy. Experimentation on publicly available data along with benchmark analysis mostly yields better and interpretable results as compared to the baseline for both imbalanced and balanced classes. For example, for the IMDB text data with known labeling errors, a 14% boost in $F_{1}$ score is shown. The results also reveal commonalities between the penalty model families derived in this paper and the suitability of recall-centric or precision-centric parameters used in the optimization. The flexibility of this methodology can enhance interpretation.

keywords:

Performance metrics, Metrics, F-Beta Metric, Penalty Optimization, C.J. van Rijsbergen, Information Retrieval, Weighted Cross-Entropy, Binary Cross-Entropy, Text Retrieval

1 Acronym List

$F_{\beta}$: F-Beta Metric
$\beta_{opt}$: Optimal $\beta$ from Algorithm 1
$M_{1}^{\beta}$: Model 1: U & IU from (5.1)
$M_{2}^{(\lambda,\sigma^{2})}$: Model 2: Ga & IE from (5.2)
PV: Pressure Vessel Design
$T_{s}$: Thickness of the pressure vessel shell
$T_{h}$: Thickness of the pressure vessel head
UST: Underground Storage Tank
$f_{C}$: Equation for the volume a cylindrical UST (12)
$f_{CH}$: Equation for the volume of a cylindrical UST with hemispherical endcaps (13)
$f_{ED}$: Equation for the volume of an ellipsoidal UST (14)
$f_{EDH}$: Equation for the volume of an ellipsoidal UST with hemi-ellipsoidal end-caps (15)
CvE: Cylindrical UST versus Ellipsoidal UST
CHvEH: Cylindrical UST with hemispherical endcaps versus ellipsoidal UST with hemi-ellipsoidal end-caps
UCI: UCI Machine Learning Repository

2 Introduction

Data imbalance is a known, and widespread real world issue that affects performance metrics for a variety of learning algorithm problems (i.e., image detection and segmentation, text categorization and classification). Approaches to mitigate this issue generally fall into three categories: adjusting the neural network architecture (including multiple models or ensembles like Fujino et al. 2008), adjusting the loss function used for training, or adjusting the data (i.e., collecting more data, or leveraging sampling techniques like Chawla et al. 2002 and Hasanin et al. 2019). This research looks at adjusting the loss function with a focus on incorporating the $\text{F}_{\beta}$ performance metric. The interconnection between performance metric and loss function is crucial for understanding both model behavior and the inherent nature of that specific dataset. This connection has already been approached from the angle of thresholding (a post model step) as in Lipton et al. (2014) or developing a problem specific metric, as Ho and Wookey (2019), Li et al. (2019), and Oksuz et al. (2018) did for real world mislabeling costs, dynamic weighting for easy negative samples, and object detection, respectively. This paper takes a uniquely different and novel approach where statistical distributions act as an intermediary to connect the $\text{F}_{\beta}$ metric to the binary cross-entropy through dynamic penalty weights.

First, the derivation of the $F_{\beta}$ metric from van Rijsbergen’s effectiveness score, $E$ , is revisited to prove a limiting case of $F_{1}$ in section 4. This result supports the default case for the main algorithm in section 6.

Second, the $F_{\beta}$ metric is reformulated into a multiplicative form by assuming two independent random variables. Then parametric statistical distributions are assumed for these random variables. In particular, the Uniform and Inverse Uniform (U & IU) case and the Gaussian and Inverse Exponential (Ga & IE) case are proposed. The idea behind U & IU is that no known insight is assumed on the $F_{\beta}$ cumulative density function’s (CDF) surface. But the Ga & IE provides the practitioner more flexibility in setting some insight to this CDF surface. This leads to a more interpretable performance metric that is configurable to the data without having to create a new problem specific metric (or loss function).

Third, for both distributional cases, the CDF or $\text{Pr}\left(F_{\beta}\right)$ shown in section 5 facilitates finding an optimal $\beta$ through a knee curve algorithm in section 6.1. This algorithm gets the best $\beta$ from a monotonic knee curve given precision and recall. It is the value when the curve levels off. The $\beta_{opt}$ surface for different parameter settings found in section 6.3 suggests a slightly more recall centric penalty. This is discussed further in section 7.

Finally, a weighted binary cross-entropy loss function based on $\beta_{opt}$ is proposed in section 6.2. This loss methodology is applied to three data categories: image, text and tabular/structured data. For contextual data (i.e., image and text), model performance for $F_{1}$ improves, and the best result occurs for the text data that contains (known) labeling errors. The structured/tabular or non-contextual data does not show significant $F_{1}$ improvement, but provides an important result: when considering neural embedding architectures for training, the type (or category) of data matters.

3 Related Work

Logistic regression models are one of the most fundamental statistically based classifier. Jansche (2005) provides a training procedure that uses a sigmoid approximation to maximize the $F_{\beta}$ on this class of classifiers. When comparing the surface plots of the likelihood from Jansche and that from section 5 – a similar but not an equivalent comparison – a comparable rate of change can be seen for both surfaces with respect to their respective parameters. This is an important similarity because this paper’s procedure applies distributional assumptions to provide dynamic penalties to a well-known binary cross-entropy loss. Also, implementation of this paper’s methodology is straightforward because it avoids the need to provide updated partial derivatives for the loss function. Furthermore, Jansche alludes to (future work that considers) a general method to optimize several operating points simultaneously, which is a fundamental and indirect assertion in this paper. The sigmoid approximation is also used by Fujino et al. (2008) in the multi-label setting for text categorization. In their framework, multiple binary classifiers are trained per category and combined with weights estimated to maximize micro- or macro-averaged $F_{1}$ scores.

Similarly, Aurelio et al. (2022) propose a methodology for performance metric learning that uses a metric approximation (i.e., AUC, $F_{1}$ ) derived from the confusion matrix. The back-propagation error term involves the first derivative, followed by the application of gradient descent. This method provides an alternative means of integrating performance metrics with gradient-based learning. However, there are cases where the back-propagation term proposed by Aurelio et al. may pose issues. For instance, when considering equation 13 from Aurelio et al. in conjunction with batch training and severe imbalance, there could be a division by zero error if a batch with only the zero label appears. Moreover, Aurelio et al. test several metrics: $F_{1}$ , G-mean, AG-mean and AUC for their method. But the G-mean, AG-mean and AUC, based on the confusion matrix approximation, can be derived as functions of $F_{1}$ . This suggests that $F_{\beta}$ is more flexible than G-mean, AG-mean and AUC. In other words, $\beta$ is unique to $F_{\beta}$ yet generalized across other metrics when equal to 1. In fact, for class imbalance, the AUC metric - an average over many thresholds, and G-mean - a geometric mean, is less stringent and more generous in accuracy reporting compared to the $F_{1}$ . This is the reason all results in this paper are reported using the $F_{1}$ score.

Surrogate loss functions attempt to mimic certain aspects of the $F_{\beta}$ and is another related area. For example, sigmoidF1 from Bénédict et al. (2021) creates smooth versions for the entries of the confusion matrix, which is used to create a differentiable loss function that imitates the $F_{1}$ . This smooth differentiability is another application of a sigmoid approximation similar to Jansche. Lee et al. (2021) formulates a surrogate loss by adjusting the cross-entropy loss such that its gradient matches the gradient of a smooth version of the $F_{\beta}$ .

In terms of metric creation or variation to the $F_{\beta}$ , Ho and Wookey (2019), Li et al. (2019), Oksuz et al. (2020) and Yan et al. (2022) are highlighted. The Real World Weight Cross Entropy (RWWCE) loss function from Ho and Wookey is a metric similar in spirit to Oksuz et al. The idea is to set (not train or tune) cost related weights based on the dataset and the main problem, by introducing costs (i.e., financial costs) that reflect the real world. RWWCE affects both the positive and negative labels by tying each to its own real world cost implication. The dice loss from Li et al. propose a dynamic weight adjustment to address the dominating effect of easy-negative examples. The formulation is based on the $F_{\beta}$ using a smoothing parameter and a focal adaptation from Lin et al. (2017). A ranking loss based on the Localisation Recall Precision (LRP) metric Oksuz et al. (2018) is developed by Oksuz et al. (2020) for object detection. They propose an averaged LRP alongside a ranking loss function for not only classification but also localisation of objects in images. This provides a balance between both positive and negative samples. Along a similar theme, Yan et al. (2022) explores a discriminative loss function that aims to maximize the expected $F_{\beta}$ directly for speech mispronunciation. Their loss function is based on the $F_{\beta}$ (comparing human assessors and the model prediction) weighted by a probability distribution (i.e., normal distribution) for that score. The final objective function is a weighted average between their loss function and the ordinary cross-entropy.

When considering the components of performance metrics, precision and recall are often the primary focus. Mohit et al. (2012) and Tian et al. (2022) propose two different loss functions that are both recall oriented. Mohit et al. adjust the hinge loss by adding a recall based cost (and penalty) into the objective function. As they said, by favoring recall over precision it results in a substantial boost to recall and $F_{1}$ . By leveraging the concept of inverse frequency weighting (i.e., a sampling based technique), Tian et al. adjust the cross-entropy to reflect an inverse weighting on false negatives per class. They state that their loss function sits between regular and inverse frequency weighted cross-entropy by balancing the excessive false positives introduced by constantly up-weighting minority classes. When they consider a similar loss function using precision, this loss function shows irregular behavior. These findings are insightful because this paper’s $\beta_{opt}$ surface as seen in section 6.3 is more recall centric with the added benefit of being able to incorporate precision weighting through the assumed probability surface.

4 Background

The $F_{\beta}$ measure comes directly from van Rijsbergen’s effectiveness score, $E$ , for information retrieval (chapter 7 in Rijsbergen 1979). For the theory on the six conditions supporting $E$ as a measure, refer to Rijsbergen. This paper highlights two of these conditions. First, $E$ guides the practitioner’s ability to quantify effectiveness given any point ( $r$ , $p$ ) – where $r$ and $p$ are recall and precision – as compared to some other point. Second, precision and recall contribute effects independently of $E$ . As said by Rijsbergen, for a constant $r$ (or $p$ ) the difference in $E$ from any set of varying points of $p$ (or $r$ ) can not be removed by changing the constant. These conditions suggest equivalence relations and imply a common effectiveness (CE) curve based on precision and recall (definition 3 in Rijsbergen 1979). They also motivate the rationale on using statistical distributions to understand the CE curve. The van Rijsbergen’s effectiveness measure is given in (1).

E=1-\dfrac{1}{\alpha\frac{1}{p}+(1-\alpha)\frac{1}{r}}

(1)

where, $\alpha=\frac{1}{\beta^{2}+1}$ . Sasaki (2007) gives the details on deriving $F_{\beta}=\frac{(\beta^{2}+1)pr}{\beta^{2}p+r}$ from (1) with $\beta=\frac{r}{p}$ and by solving $\frac{\partial E}{\partial r}$ = $\frac{\partial E}{\partial p}$ . The $\beta$ parameter is intended to allow the practitioner control by giving $\beta$ times more importance to recall than precision. Using the derivation steps from Sasaki, a general form of $F_{\beta}$ for any derivative can be shown as (2),

F_{\beta}^{n}(p,r)=\frac{(\beta^{\frac{-2}{n-2}}+1)pr}{\beta^{\frac{-2}{n-2}}p+r},

(2)

where $n$ pertains to $\frac{\partial^{n}E}{\partial r^{n}}$ = $\frac{\partial^{n}E}{\partial p^{n}}$ resulting in $\alpha_{n}=\frac{1}{\beta^{\frac{-2}{n-2}}+1}$ . Note that $n>0$ , and $n\neq 2$ . The proof is found in Appendix A. For $n=2$ the equation reduces to the equality $p=r$ implying $\beta=1$ . Using (2), it can be seen that the $\displaystyle{\lim_{n\to\infty}F_{\beta}^{n}=F_{1}^{1}=\frac{2pr}{p+r}}$ , which is most commonly used in the literature. The reason for showing this limiting case is to provide a justification on fixing $\beta=1$ (instead of claiming equal importance for $r$ and $p$ ) in the default case of any algorithm – in particular the algorithm in section 6.

5 Reformulating the F-Beta to leverage statistical distributions

CE for neural networks is seen when different network weights give different precision and recall yet resulting in similar performance scores. CE also provides a basis for this paper’s use of $\beta$ from the $F_{\beta}$ measure to guide training through penalties, in lieu of an explicit loss (or surrogate loss) function. In fact, Vashishtha et al. (2022) uses the $F$ -score as part of a preprocessing step for feature selection prior to their ensemble model (EM-PCA then ELM) for fault diagnosis. They show significant performance improvement in their approach which adds supporting evidence to this paper’s use of the $F_{\beta}$ as a loss penalty for feature selection via gradient based learning.

The first step is to reformulate (2) for $n=1$ . This makes assuming statistical distributions easier. Consider the following reformulation through multiplicative decomposition in (3) which assumes $X_{1}$ and $X_{2}$ to be independent random variables.

F_{\beta}=X_{1}X_{2},

(3)

where $X_{1}=r^{\prime}+\beta^{\prime}$ , $X_{2}=(\beta^{\prime\prime}+r)^{-1}$ with $r^{\prime}=pr$ , $\beta^{\prime}=\beta^{2}pr$ and $\beta^{\prime\prime}=\beta^{2}p$ . $X_{1}$ indirectly captures imbalance in the model prediction from the underlying data. If precision and recall are on opposite ends of the $[0,1]$ scale, then $X_{1}$ will reflect this, while maintaining continuity when precision and recall are directionally consistent. $X_{2}$ can be thought of as a weighting scheme that appears recall centric with a precision based penalty. For instance, for both high (or both low) precision and recall, the weighting is consistent with intuition. However, when precision and recall are on opposite ends of the $[0,1]$ scale, the weighting sways by the aggregate with the lower score. Two use cases are considered for (3): $X_{1}$ and $X_{2}$ follow U & IU, respectively, and $X_{1}$ and $X_{2}$ follow Ga & IE, respectively.

5.1 Case 1: Uniform and Inverse Uniform

The thought behind U & IU is to apply (flat) equal distribution for both $X_{1}$ and $X_{2}$ . These assumed distributions are applied to $\beta^{\prime}$ and $\beta^{\prime\prime}$ as follows:

Let	$\displaystyle\beta^{\prime}\sim U(0,\beta^{*})$
	$\displaystyle\beta^{\prime\prime}\sim U(0,\beta^{*})$
then	$\displaystyle X_{1}\sim U\left(r^{\prime},r^{\prime}+\beta^{*}\right)$	(4)
and	$\displaystyle X_{2}\sim IU\left(\frac{1}{r+\beta^{*}},\frac{1}{r}\right),$	(5)

where $r\text{ and }r^{\prime}\in[0,1]$ and $\beta^{*}>0$ . Note that for both distributions there is only one $\beta^{*}$ chosen and this value replaces the need to have an explicit form that includes $\beta$ as a parameter. This is for convenience, as well as noticing that both $\beta^{\prime}$ and $\beta^{\prime\prime}$ differ by a factor of $r$ . So allowing $\beta^{*}$ to vary broadly (which is the $\beta_{max}$ in section 6) would be enough to balance this convenience tradeoff. Next is to derive the joint distribution which would be used in section 6. It can be shown (the proof is in Appendix B) that the joint distribution is:

		$\displaystyle\text{Pr}\left(F_{\beta}\right)=\text{Pr}\left(F_{\beta}\leq z\right)=1_{\{z\leq p\&\frac{(r+\beta^{})z}{r^{\prime}+\beta^{}}\leq 1\}}\times\dfrac{\frac{z}{2}\left(r+\beta^{}-\frac{r^{\prime}}{z}\right)^{2}}{(\beta^{})^{2}}$
		$\displaystyle+1_{\{z>p\&\frac{(r+\beta^{})z}{r^{\prime}+\beta^{}}>1\}}\times\left(\dfrac{rz-r^{\prime}}{\beta^{}}+\dfrac{1}{\left(\beta^{}\right)^{2}}\left(\left(r+\beta^{}+\frac{r^{\prime}+\beta^{}}{2z}\right)\left(r^{\prime}+\beta^{*}-rz\right)\right.\right.$
		$\displaystyle+\left.\left.\dfrac{r\left(rz-\left(r^{\prime}+\beta^{*}\right)\right)}{2}\right)\right)$
		$\displaystyle+1_{\{z>p\&\frac{(r+\beta^{})z}{r^{\prime}+\beta^{}}\leq 1\}}\times\left[\left(\dfrac{rz-r^{\prime}}{\beta^{}}+\dfrac{1}{\left(\beta^{}\right)^{2}}\left(\left(r+\beta^{}+\frac{r^{\prime}+\beta^{}}{2z}\right)\left(r^{\prime}+\beta^{*}-rz\right)\right.\right.\right.$
		$\displaystyle+\left.\left.\left.\dfrac{r\left(rz-\left(r^{\prime}+\beta^{}\right)\right)}{2}\right)\right)-\dfrac{1}{\left(\beta^{}\right)^{2}}\left((r+\beta^{})(r^{\prime}+\beta^{})-\frac{(r^{\prime}+\beta^{})^{2}}{2z}-\frac{(r+\beta^{})^{2}z}{2}\right)\right]$

Refer to caption — (a) $P(F_{\beta}<0.4)$

To understand this flat mixture, consider Figure 1 - the CDF surface for a grid of precision and recall where $\beta^{*}\in[8,16]$ . (Note: the blue and red heat coloring is from the CDF and highlights curvature and/or rate of change). For a lower z value of $0.4$ , Figure 1(a) shows that $\beta^{*}=8$ has a faster rate of change as compared to $\beta^{*}=16$ . The same conclusion is apparent in Figure 1(b), which is for a higher z value of $0.8$ . For both figures more curvature is seen for lower $\beta^{*}$ values. This suggests that a larger $\beta^{*}$ value smooths the surface and is a better candidate for $\beta_{max}$ in the algorithm in Section 6.

5.2 Case 2: Gaussian and Inverse Exponential

A more informed distributional approach for $X_{1}$ and $X_{2}$ considers Ga & IE, respectively. The reason to use Gaussian distribution for $X_{1}$ is to allow a bell-shaped variability around a fixed $r^{\prime}$ that is based on $\beta^{\prime}$ and ultimately $\beta$ . The weighting of $X_{1}$ by $X_{2}$ uses the Inverse Exponential distribution because with selections of the rate parameter $\lambda$ the distribution can shift mass from left to right as well as appear uniformly distributed around $r$ . This provides practitioners enough flexibility on experimenting with different weights. The following shows the assumptions for $\beta^{\prime}$ and $\beta^{\prime\prime}$ :

Let	$\displaystyle\beta^{\prime}\sim\text{Ga}(0,\sigma^{2})$
	$\displaystyle\beta^{\prime\prime}\sim\text{Exponential}(\lambda)$
then	$\displaystyle X_{1}\sim\text{Ga}(r^{\prime},\sigma^{2})$	(7)
and	$\displaystyle X_{2}\sim\text{IE}(\lambda\mathchar 24635\relax\;r),$	(8)

where $r$ in (8) is the location shift by recall from the definition of $X_{2}$ in (3) and $\sigma^{2}$ is the variability captured by $\beta^{\prime}$ . Using both (7) and (8), the distribution for (3) is now split around $z=0$ as follows:

$\displaystyle\text{Pr}\left(F_{\beta}\right)$	$\displaystyle=\text{Pr}\left(F_{\beta}\leq z\right)=1_{z>0}\times\left[\Phi(rz\mathchar 24635\relax\;r^{\prime},\sigma^{2})+\exp\left(\lambda r+\dfrac{\left(\frac{\lambda\sigma^{2}}{z}\right)^{2}-\frac{2r^{\prime}\lambda\sigma^{2}}{z}}{2\sigma^{2}}\right)\right.$
	$\displaystyle\left.\times\left(1-\Phi\left(rz\mathchar 24635\relax\;\left(r^{\prime}-\frac{\lambda\sigma^{2}}{z}\right),\sigma^{2}\right)\right)\right]+1_{z=0}\times\Phi(0\mathchar 24635\relax\;r^{\prime},\sigma^{2})$
	$\displaystyle+1_{z<0}\times\left[\Phi(rz\mathchar 24635\relax\;r^{\prime},\sigma^{2})-\exp\left(\lambda r+\dfrac{\left(\frac{\lambda\sigma^{2}}{z}\right)^{2}-\frac{2r^{\prime}\lambda\sigma^{2}}{z}}{2\sigma^{2}}\right)\right.$
	$\displaystyle\left.\times\Phi\left(rz\mathchar 24635\relax\;\left(r^{\prime}-\frac{\lambda\sigma^{2}}{z}\right),\sigma^{2}\right)\right],$	(9)

where $\Phi(x\mathchar 24635\relax\;\mu,\sigma^{2})$ denotes the standard normal or gaussian distribution at the value $x$ for a mean, $\mu$ and variance, $\sigma^{2}$ . (Refer to Appendix C for the proof). Similar to before the focus is on the indicator $1_{z\geq 0}$ as defined in (9).

Since this distributional mixture has more flexibility due to more parameters, Figure 2 highlights this when $\lambda\in[0.5,2.0]$ , and $\sigma^{2}\in[0.5,2.0]$ . The probabilities are computed again at a lower z value, $0.4$ , and at a higher z value, $0.8$ for comparison. For a fixed $\lambda$ , varying $\sigma^{2}$ impacts the curvature of the surface with higher $\sigma^{2}$ values producing a flattening effect. Figure 2(a) shows this distinctly. Conversely, as $\lambda$ increases with a fixed $\sigma^{2}$ , the rate at which the surface changes is very apparent. This can be seen by juxtaposing Figure 2(c) and 2(a) or Figure 2(d) and 2(b) and noticing that the increase of $\lambda$ produces a clear increase in the rate of change. These observations match the intuition that $\sigma^{2}$ is linked to the shape of the bell curve, and $\lambda$ is linked to a rate of change. It also serves as a basis of intuition behind the algorithm in Section 6. That is, a faster rate of change along with a curved (and/or smoother) surface would provide loss penalties that adapt quickly per batch using the aggregated information from precision and recall.

6 Knee algorithm and Weighted Cross Entropy

6.1 Knee algorithm to find optimal $\beta$ values

Now that probabilities, or $\text{Pr}(F_{\beta}\leq z)$ for some $z\in[0,1]$ are established in section 5.1 and 5.2, the goal is to use them to get an optimal $\beta$ value, $\beta_{opt}$ . There are a couple things to consider. First, because $\beta$ is grouped into $\beta^{\prime}$ and $\beta^{\prime\prime}$ with distributional assumptions, using maximum likelihood estimation (MLE) is not particularly suitable here. Also, $\beta_{max}$ , $\sigma^{2}$ , and $\lambda$ from (4), (5), and (8) are set in advance and do not need to be estimated. Second, our observed data is only one data point per training batch, namely precision and recall. Given this and the natural bend of the $F_{\beta}$ function, a knee algorithm is applicable. From Satopaa et al. (2011), the knee of a curve is associated with good operator points in a system right before the performance levels off. This removes the need for complex system-specific analysis. Furthermore, they have provided a definition of curvature that supports their method being application independent – an important property for this paper. Algorithm 1 implements (and slightly alters) Kneedles algorithm from Satopaa et al. to detect the knee in the $F_{\beta}$ curve. Refer to Algorithm 1 for the formal pseudo code.

Algorithm 1 Calculate

\beta_{opt}

n>0\wedge\beta_{max}>0

\beta_{opt}>0

3:Compute

p

, and

r

from the training batch.

4:Initialize

b_{sn},b_{d},b_{lmx},p_{sn},p_{d},p_{lmx},p_{s}

to empty arrays.

b_{s}\Leftarrow[b_{s_{1}},...,b_{s_{n}}]

where

b_{s_{n}}=\beta_{max}

6:for (

i=0

n

) do

z\Leftarrow F_{\beta=b_{s_{i}}}^{n=1}(p,r)

using eqn (2)

p_{s_{i}}\Leftarrow

\text{Pr}(F_{\beta}\leq z|\beta=b_{s_{i}},p=p,r=r)

using section 5.1 or 5.2

9:end for

10:if

r<p

then

11:

p_{max}=\max(p_{s})

12: for (

i=0

n

) do

13:

p_{s_{i}}\Leftarrow

p_{max}-p_{s_{i}}

14: end for

15:end if

16:

b_{max}=\max(b_{s})

p_{max}=\max(p_{s})

b_{min}=\min(b_{s})

p_{min}=\min(p_{s})

17:for (

i=0

n

) do

18:

b_{sn_{i}}\Leftarrow\dfrac{b_{s_{i}}-b_{min}}{b_{max}-b_{min}}

19:

p_{sn_{i}}\Leftarrow\dfrac{p_{s_{i}}-p_{min}}{p_{max}-p_{min}}

20:

b_{d_{i}}\Leftarrow b_{sn_{i}}

21:

p_{d_{i}}\Leftarrow p_{sn_{i}}-b_{sn_{i}}

22: if

(i\geq 1)\wedge(i<n)

then

23: if

(p_{d_{i-1}}<p_{d_{i}})\wedge(p_{d_{i+1}}<p_{d_{i}})

then

24:

p_{lmx_{i}}\Leftarrow p_{d_{i}}

25:

b_{lmx_{i}}\Leftarrow b_{d_{i}}

26: end if

27: end if

28:end for

29:if

p_{lmx}

is a non-empty array then

30:

\beta_{opt}=\text{ mean}(p_{lmx})

31:else

32:

\beta_{opt}=1

as per section 4

33:end if

A brief explanation in plain words is as follows:

1.

For any training batch, compute precision (p) and recall (r). Then with a predefined $\beta_{max}$ value, set $n$ equally spaced values, $b_{s_{i}}$ , up to $\beta_{max}$ , and use section 5 to compute $p_{s_{i}}=\text{Pr}(F_{\beta}\leq z|\beta=b_{s_{i}},p=p,r=r)$ . (This replaces step 1 from Satopaa et al.). Let $D_{s}$ represent this smooth curve as $D_{s}=\{(b_{s_{i}},p_{s_{i}})\in\mathbb{R}^{2}|b_{s_{i}},p_{s_{i}}\geq 0\}$ for $i=1,...n$ .
2.

When $r<p$ , convert to a knee by taking the difference of the probabilities from the maximum. That is, $p_{s_{i}}=\max(p_{s})-p_{s_{i}}$ for $i=1,...n$ . This is necessary because of the formulation of the $F_{\beta}$ metric.
3.

Normalize the points to a unit square and call these $b_{sn}$ and $p_{sn}$ .
4.

Take the difference of points and label that $b_{d}$ and $p_{d}$ .
5.

Find the candidate knee points by getting all local maxima’s, label that $b_{lmx}$ and $p_{lmx}$ .
6.

Take the average of $p_{lmx}$ and this will be $\beta_{opt}$ . (This simplifies Satopaa et al.).

6.2 Proposal Weighted Binary Cross-Entropy

The weighted binary cross-entropy loss is primarily focused around the imbalanced use case where a minority class exists. This paper posits that from the shuffling of data observations, as is frequently done while training, relevant aggregate information is available to use from the batch. For instance, say for a fixed minority class observation, $y$ , it is grouped among different batches of the majority class. The interaction effect of $y$ among these randomly varying training batches is often overlooked. It is this interaction that can be inferred through the precision and recall aggregates, then transferred as a penalty to the loss function via $\beta_{opt}$ in a probabilistic way. By using Algorithm 1 to get $\beta_{opt}$ , the proposed loss is,

	$\displaystyle L(\textbf{f(x; $\theta$)}\|\beta^{2}_{opt},\textbf{x})$	$\displaystyle=-\sum_{i}\left\{y_{i}\log\left(f_{i}(\textbf{x}\mathchar 24635\relax\;\theta)\right)+\left(1-y_{i}\right)\times\log\left(1-f_{i}(\textbf{x}\mathchar 24635\relax\;\theta)\right)\right.$
		$\displaystyle\left.\times\left(\frac{1_{\{[1-f_{i}(\textbf{x}\mathchar 24635\relax\;\theta)]\leq 0.5}\}}{1+\beta^{2}_{opt}}+(1+\beta^{2}_{opt})\times 1_{\{[1-f_{i}(\textbf{x}\mathchar 24635\relax\;\theta)]>0.5\}}\right)\right\},$		(10)

where the function $f_{i}(\textbf{x}\mathchar 24635\relax\;\theta)$ is the $i$ -th element from the prediction of a neural network using the inputs x and training weights, $\theta$ ; and $y_{i}$ is the $i$ -th element of the true target label. When considering the majority class, or $y_{i}=0$ for $i=\{1,...,m\}$ , the loss is weighted by $(1+\beta^{2}_{opt})$ . Therefore, for correctly predicted observations, the loss has a reduction by $(1+\beta^{2}_{opt})$ . When incorrectly predicted, the loss is magnified by the same amount. For the minority class, or $y_{i}=1$ for $i=\{1,...,n-m\}$ , the loss is unchanged. This is intentional because under imbalanced data there are far less observations, and computing precision and recall lead to numerical instability or frequent edge cases for Algorithm 1.

6.3 Understanding the $\beta_{opt}$ Surface and Weighted Cross-Entropy

Figure 3 highlights the surface generated from Algorithm 1 leveraging probabilities from Section 5. First, the U & IU mixture or Figure 3(a) suggests that the shape of the surface remains relatively similar even when doubling $\beta_{max}$ . This is an important point toward fixing $\beta_{max}=16$ for the Ga & IE. Based on Figure 3(a), the U & IU mixture penalizes more on the outskirts of recall, while the immediate penalties arise on the diagonal of the unit square. This suggests that precision and recall estimates from training cause immediate penalties when they are on opposite ends of the $[0,1]$ range as well as on the diagonal when these values start to even out. For Ga & IE mixture, Figures 3(b) and 3(c) show some similar conclusions as U & IU mixture along with additional insights. For a lower rate of $\lambda=0.5$ , or Figure 3(b), diagonal spikes for $\sigma^{2}\in[0.5,2.0]$ as well as a precision centric penalty for higher $\sigma^{2}$ (i.e., $\sigma^{2}=2$ ) are seen. For a higher rate of $\lambda=2.0$ , or Figure 3(c), a similar diagonal is retained for $\sigma^{2}\in[0.5,2.0]$ as in Figure 3(b). Furthermore, for increasing values of $\sigma^{2}$ , the penalty evolves from precision centric to a vertical separation on the unit grid at around a precision of $0.4$ . The overall interpretation is the following: for a lower $\lambda$ , increasing $\sigma^{2}$ creates a slightly more precision based penalty; while for a higher $\lambda$ , increasing $\sigma^{2}$ causes the penalty to become more balanced between recall and precision. The choice of these parameters are problem specific, but provides the practitioner flexibility in determining the best selection for their use case. On a separate note, from Figure 3, a spiky surface is obvious, which is partially explained by having a default setting in the algorithm. This is a strong sign of immediate and configurable penalties.

7 Datasets and Experimentation

7.1 Datasets

The origins of the $F_{\beta}$ metric come from text retrieval, so it is important to verify this method across different categories of data. In particular, image data from CIFAR-10¹¹1https://www.cs.toronto.edu/ kriz/cifar.html, text data from IMDB movie sentiments Maas et al. (2011) and structured/tabular data from the Census Income Dataset Dua and Graff (2019) are tested. For each experiment, the primary label (i.e., label 1) is either imbalanced or forced to be imbalanced to reflect real world scenarios. Because CIFAR-10 contains multiple image labels, the airplane label is the primary label and all others are combined. This yields a 10% class imbalance. IMDB movie sentiment reviews (positive/negative text) are not imbalanced. The positive sentiments in the training data are reduced to 1K randomly sampled sentiments yielding a 7.4% imbalance (Table 1). The Census Income Tabular data contains 14 input features (i.e., age, work class, education, occupation, etc) with 5 numerical and 9 categorical features. The binary labels are greater than 50K salary (label 1) and less than 50K salary (label 0). By default, greater than 50K salary is already imbalanced at 6.2%. The training and validation dataset sizes for each data category are as follows: for CIFAR-10, 50K training and 10K validation, for IMDB, 13.5K training and 25K validation, and for the Census data, 200K training and 100K validation. In terms of class imbalance, this paper considers an imbalance or proportion of label 1 under 10% to be significantly imbalanced and between 10% to 25% to be moderately imbalanced. Some heuristic rationale for a 10% imbalance, a model that has perfect recall or 100%, a precision of $p=\frac{1}{3}$ would be required to get $F_{1}=0.5$ . In practical examples, this scenario can occur with weakly discriminative features. Therefore, this paper seeks to test this algorithm in scenarios that would need an improved precision. A variety of imbalanced and balanced scenarios will be tested in this paper.

Two real-life use cases related to cylindrical tanks are also considered, providing a physical domain to test Algorithm 1. Chauhan et al. (2022) developed an arithmetic optimizer with slime mould algorithm and Chauhan et al. (2023) developed an evolutionary based algorithm with slime mould algorithm; both algorithms focus on global parameter optimization. They tested these algorithms on several benchmark problems, one of which is called the pressure vessel design. The problem is a constrained parameter optimization (i.e., material thickness, and cylinder dimensions) for minimizing a cost function. This paper focuses on using the HAOASMA algorithm by Chauhan et al. in a simulation to convert the problem into a binary classification. The second use case is derived from Underground Storage Tanks (UST) and is also inspired by Chauhan et al.’s pressure vessel problem. The physical shape (i.e., the cylindrical shape) of USTs is similar to the pressure vessel design. USTs are used to store petroleum products, chemicals, and other hazardous materials underground. These structures could deform underground and possibly explain a false positive leak. Ramdhani (2016) and Ramdhani et al. (2018) explored parameter optimization of UST dimensions changing from cylindrical to ellipsoidal. The observed data are vertical (underground) height measurements, which can contain uniformly distributed error. Ramdhani et al. used these measurements and the volumetric equations (12), (13), (14), and (15) - derived from a cross-sectional view - to develop a methodology to estimate tank dimensions and test if the shape has deformed. The cross-sectional view can be seen in Figure 5.

The conversion of both of these real-life use cases into a classification involves establishing a baseline set of parameters to simulate data for label 0. Varying these parameters will allow simulation of data for label 1. For the pressure vessel design, the baseline parameters from HAOASMA are $T_{s}=1.8048$ , $T_{h}=0.0939$ , $R=13.8360$ , and $L=123.2019$ . To convert this to a classification, the parameters for thickness $T_{s}$ and $T_{h}$ are changed from the baseline while $R$ and $L$ dimensions are drawn from a normal distribution. By using values for $T_{s}$ , $T_{h}$ , $R$ and $L$ , the cost function is computed using 11. These cost values concatenated with $R$ and $L$ arrays serve as the input to a neural network classifier. The label 1 reflect simulated data using $T_{s}$ and $T_{h}$ that are changed from the baseline. These variations are $T_{s}=1.7887$ and $T_{h}\in\{0.0313,0.2817\}$ . Label 0 is HAOASMA baseline values $T_{s}=1.8048$ and $T_{h}=0.0939$ . Appendix D provides the equations, the distributional plots seen in Figure 4, and a detailed explanation of the simulation procedure (Algorithm 2). For the UST problem, Ramdhani used a measurement error model with an error on the height measurement and another on the volume computation. The same model is used to simulate data in this paper. The baseline is a cylinder and the variations to vertical and horizontal axes $a$ and $b$ represent a deformed cylinder to an ellipse. By using $r$ , $L$ , $h$ , $a$ and $b$ along with (12) and (14) or (13) and (15) the volume is computed. These volumes concatenated with noisy height measurements are the inputs to a neural network classifier. The label 1 reflect simulated data using the variations to the cylinder or the baseline. These variations are $a\in\{3.2,3.8\}$ and $b\in\{5.0,4.2105\}$ . Label 0 will be the baseline cylinder with radius $r=4$ and the length of $L=32$ . Refer to Appendix E for detailed explanation of the simulation Algorithm 3 along with comparison plots and volume equations.

7.2 Model Networks

7.2.1 Image Network

For the CIFAR-10 image dataset, ResNet (He et al. 2016) version 1 is applied. The number of layers for ResNet is 20, which upon initial experimentation is adequate for speed and generalization in this case. Adam optimizer was implemented with a learning rate of $1e^{-3}$ with total epochs of 30. No learning rate schedule because of the intentional lower number of epoch in order to validate faster training via this proposed loss algorithm. The training batch size is 32. Modest data augmentation is done – random horizontal and vertical shifts of 10%, and horizontal and vertical flips.

7.2.2 Text Network

For the IMDB movie sentiments, a Transformer block (which applies self-attention Vaswani et al. 2017) is used. The token embedding size is 32, and the transformer has 2 attention heads and a hidden layer size of 32 including dropout rates of 10%. A pooling layer and the two layers that follow – a dense RELU activated layer of size 20 and a dense sigmoid layer of size 1 – give the final output probability. As for preprocessing, a vocabulary size of 20K and maximum sequence length of 200 is implemented. The training batch size is 32.

7.2.3 Structured/Tabular Network

For the Census Income Dataset, a standard encoder embedding paradigm Schmidhuber (2015) is used. Specifically, all categorical features with an embedding size of 64 are concatenated, then numerical features are concatenated to this embedding vector. Afterwards, a 25% dropout layer and the two layers that follow – a fully connected dense layer with GELU activation of size 64 and a sigmoid activated layer of size 1 – provide the final output probability. The training batch size is 256.

7.2.4 UST/Vessel Network

For the real life use cases on simulated data the model network is simple because of the minimal amount of features. The network is a sequential set of dense layers of sizes 20, 10 and 1. The last layer of size 1 has a sigmoid activation to give the final output probability. Additionally, a drop out of 10% is added after both middle layers. The training batch size is 128.

7.3 Experimental Results

The results in Table 1 compare the use of the loss function (10) by different models based on U & IU and Ga & IE to a baseline case of ordinary cross-entropy. All results shown in this table are computed on the validation datasets for each data category above (see above for the dataset sizes). For ease of presentation, $M_{1}^{\beta}$ is Model 1: U & IU from (5.1). $M_{2}^{(\lambda,\sigma^{2})}$ is Model 2: Ga & IE from (5.2). The superscripts $\beta$ and $(\lambda,\sigma^{2})$ are the parameters being explored. $M_{B}$ is the baseline or the same model network that is trained using ordinary cross-entropy.

7.3.1 Image Results

For the image network, Table 1 shows modest improvement over the baseline under the $M_{1}^{\beta}$ for a moderately sized $\beta=8$ . This suggests that image data trains better under constant penalties on the outskirts of the unit square toward the imbalance of high precision and low recall. High precision and low recall imply image confusion between classes in the feature embedding space. In fact, this can lead to large implications as in Grush (2015). Algorithms like DeepInspect Tian et al. (2020) help to detect confusion and bias errors to isolate misclassified images leading to repair based training algorithms such as Tian (2020) and Zhang et al. (2021). But Qian et al. (2021) empirically shows that such repair or de-biasing algorithms can be inaccurate with one fixed-seed training run. The importance of the $M_{1}^{\beta}$ result is now evident because $M_{1}^{\beta}$ quickly penalizes the network in a way that inherently mirrors algorithms like DeepInspect’s confusion/bias detection without the need for repair algorithms.

7.3.2 Text Results

The training results for the text network by far show the most improvement with a nearly 14% boost in the $F_{1}$ score over the baseline for the $M_{2}^{(\lambda,\sigma^{2})}$ model. Not only is the performance notable, the model parameter selections are consistent – the parameters move in the same direction. In other words, given the parameters $\lambda=0.5$ and $\sigma^{2}=0.5$ , the training shows improvement over the baseline and this improvement continues in the same direction when $\lambda=0.01$ and $\sigma^{2}=0.01$ . This is similar to section 7.3.1 because first, the architecture is generalizing better (seen by the $F_{1}$ score) for label confusion (i.e., language context) and second, it adjusts for intentionally configured imbalance and incorrect labeling (a known issue for this dataset). The incorrect labeling in the IMDB dataset is shown to be non-negligible – upwards of 2-3% – by Klie et al. (2022) and Northcutt et al. (2021). In particular, Northcutt et al. show that small increases in label errors often cause a destabilizing effect on machine learning models for which confident learning methodology is developed to detect them. Klie et al. analyze 18 methods (including confident learning) for Automated Error Detections (AED) and shows the importance of AED for data cleaning. In close proximity to the AED methodology, another paradigm is Robust Training with Label Noise. Song et al. (2022) provides an exhaustive survey ranging from robust architectures (i.e., noise adaptation layers) and robust regularization (i.e., explicit and implicit regularization) to robust loss (i.e., loss correction, re-weighting, etc.) and sample selection. It is in this context that the $M_{2}^{(\lambda,\sigma^{2})}$ framework sits between AED and Robust Training with Label Noise on this IMDB dataset which is known to have errors. $M_{2}^{(\lambda,\sigma^{2})}$ serves two purposes: (1) as a robust loss through the $\beta_{opt}$ re-weighting on the batch and, (2) as a means to detect and down weight possible label errors.

7.3.3 Structured/Tabular Results

The results for the structured/tabular network do not show any $F_{1}$ improvement over the baseline nor any indication of possible improvement through the extra parameter variations. From Table 1, the best performing model for this dataset (not the baseline) is $M_{2}^{(\lambda,\sigma^{2})}$ where $\lambda=2.0$ and $\sigma^{2}=0.5$ . The interpretation of this parameter configuration suggests that training tabular data is very susceptible to both low precision and recall, hence the high penalty in that area of the unit square in Figure 3. Despite embedding categories and numeric features into a richer vector space, the non-contextual nature of tabular data may not necessarily be best trained through these architectures. Furthermore, Sun et al. (2019) applies a two dimensional embedding (i.e., simulating an image) to this Census dataset and the results show that a decision tree (i.e., xgboost) would perform similarly. It is worth mentioning that Sun et al. present these results with an accuracy measure (not $F_{1}$ ) which is misleading since the data is naturally imbalanced. However, a similar general conclusion is given by Borisov et al. (2021) for tabular data – decision trees have faster training time and generally comparable accuracy as compared with embedding based architectures. These results are unsurprising because as stated by Wen et al. (2022) tabular data is not contextually driven data like images or languages which contain position-related correlations. It is heartening to notice, that after Wen et al. apply a casually aware GAN to the census data, the resulting $F_{1}$ score ( $0.509$ ) is similar to the baseline result in Table 1 ( $0.5193$ ). Because of these results, there is an important finding: the type of data, in particular contextual data which is the basis for the creation of the $F_{\beta}$ metric, plays a significant role when using the metric alongside a loss function. This hypothesis is studied further in the benchmark data in Section 7.4.

7.3.4 UST/Pressure Vessel Results

The results for the simulation of real life use cases can be found in Table 2. In the UST case, it is evident that this methodology outperforms the baseline cross-entropy in determining a shape change from a cylinder to an ellipse. For example, in the easier scenario for CvE ( $a=3.2$ ) the $M_{2}^{(\lambda,\sigma^{2})}$ model family appears to be better. However, in cases of the extra variations, the $M_{1}^{8}$ and $M_{2}^{(0.01,0.01)}$ perform the same. This trend is also observed in the results for the Image and Text data presented in Table 1. The interpretation is that a slightly more recall-centric penalty may be optimal for this scenario. Interestingly, for the easier CHvEH scenario ( $a=3.2$ ), the $M_{2}^{(\lambda,\sigma^{2})}$ model family also appears to be better, and the extra variations for $M_{1}^{32}$ and $M_{2}^{(5.0,5.0)}$ perform the same. These variations mirror CvE but in the other direction, suggesting that a balanced or slightly more precision-centric penalty is optimal. In the difficult scenario ( $a=3.8$ ), both CvE and CHvEH are closely aligned with $M_{2}^{(\lambda,\sigma^{2})}$ model family. For CHvEH the best performer is the $M_{1}^{32}$ variation. Overall, there is between 12% to 28% improvement over the baseline or standard cross-entropy for this simulation. Regarding the PV data, for the easier scenario ( $t_{h}=0.0313$ ) the $M_{1}^{\beta}$ family appears to be better, with the $M_{2}^{(\lambda,\sigma^{2})}$ model family not far behind. In the difficult scenario ( $t_{h}=0.2817$ ) there is no improvement over the baseline cross-entropy but the best performing model family is $M_{2}^{(\lambda,\sigma^{2})}$ . The reason is likely due to the significant overlap in distribution seen in Figure 4. These results are impactful because the commonality between model families begins to surface. For the easier scenario, a more recall-centric penalty turns out to be better, while in the difficult scenario, a balanced or slightly precision-centric penalty is more effective. This finding is intuitive.This finding is intuitive.

\sidewaystablefn

Table 1: Best F1 Score for

M_{1}^{\beta}

and

M_{2}^{(\lambda,\sigma^{2})}

over

30

Epochs

	Baseline	Parameter Variations Section 6.3					Extra Variations
Dataset	$M_{B}$	$M_{1}^{16}$	$M_{2}^{(0.5,0.5)}$	$M_{2}^{(0.5,2.0)}$	$M_{2}^{(2.0,0.5)}$	$M_{2}^{(2.0,2.0)}$	$M_{1}^{8}$	$M_{1}^{32}$	$M_{2}^{(0.01,0.01)}$	$M_{2}^{(5.0,5.0)}$
Image¹¹1The image dataset is the CIFAR10. The airplane label versus the remaining labels is the binary label basis. It gives a training data imbalance of 10%. Training data size is 50K and validation is 10K.	0.8161	0.8261	0.8085	0.8193	0.8232	0.8257	0.8266	0.8068	0.8087	0.8178
Text²²2The text dataset for NLP is the IMDB movie sentiment with binary label of positive/negative sentiment. The vocabulary size is 20K and the maximum review length is 200. The training set is imbalanced by choosing only 1K positive sentiments which yields an imbalance of 7.4%. The training data size is 13.5K and validation is 25K.	0.6749	0.6393	0.7175	0.6170	0.6547	0.6673	0.7236	0.5460	0.7666	0.7364
Structured³³3The structured or tabular data set is the Census Income Dataset from UCI repository. The labels are greater than or less than 50K salary. The data is already imbalanced with a rate of 6.2% for $>$ 50K. The training data size is 200K and the validation is 100K.	0.5193	0.4170	0.3917	0.4126	0.4635	0.3930	0.3824	0.3511	0.3890	0.4516
\botrule

\sidewaystablefn

Table 2: Best F1 Score for

M_{1}^{\beta}

and

M_{2}^{(\lambda,\sigma^{2})}

over

30

Epochs

	Baseline	Parameter Variations Section 6.3					Extra Variations
Dataset⁰⁰footnotemark: 0	$M_{B}$	$M_{1}^{16}$	$M_{2}^{(0.5,0.5)}$	$M_{2}^{(0.5,2.0)}$	$M_{2}^{(2.0,0.5)}$	$M_{2}^{(2.0,2.0)}$	$M_{1}^{8}$	$M_{1}^{32}$	$M_{2}^{(0.01,0.01)}$	$M_{2}^{(5.0,5.0)}$
CvE¹¹1The simulation has label 0 with $r=4$ and $L=32$ versus label 1 of $a=3.2$ and $b=5.0$ .	0.9691	0.9228	0.9983	0.9915	0.9898	0.9565	0.9966	0.9915	0.9966	0.9673
CvE²²2The simulation has label 0 with $r=4$ and $L=32$ versus label 1 of $a=3.8$ and $b=4.2105$ .	0.3169	0.3147	0.3351	0.3469	0.3296	0.3333	0.3224	0.3401	0.3362	0.3573
CHvEH³³3The simulation has label 0 with $r=4$ and $L=32$ versus label 1 of $a=3.2$ and $b=5.0$ .	0.9831	0.9813	0.9813	0.9898	0.9915	0.9831	0.9726	0.9882	0.9831	0.9882
CHvEH⁴⁴4The simulation has label 0 with $r=4$ and $L=32$ versus label 1 of $a=3.8$ and $b=4.2105$ .	0.2891	0.3345	0.3427	0.3515	0.3262	0.3159	0.3636	0.3701	0.3395	0.3425
PV⁵⁵5The simulation has label 0 with $t_{s}=1.8048$ and $t_{h}=0.0939$ versus label 1 of $t_{s}=1.7887$ and $t_{h}=0.0313$ .	0.9967	0.9992	0.9983	0.9483	0.9831	0.9967	0.9967	0.9891	0.9727	0.9958
PV⁶⁶6The simulation has label 0 with $t_{s}=1.8048$ and $t_{h}=0.0939$ versus label 1 of $t_{s}=1.7887$ and $t_{h}=0.2817$ .	0.7515	0.4552	0.5057	0.4893	0.4722	0.5248	0.4934	0.4861	0.4675	0.5161
\botrule

⁰⁰footnotetext: The simulations for UST (Underground Storage Tanks) are for the cylinder versus ellipse (CvE) or cylinder with hemispherical end-caps versus ellipsoidal with hemi-ellipsoidal end-caps (CHvEH). For the PV or pressure vessel, the simulation is between varying thickness of the surface and head. Refer to Appendix D and E for details. The label 1 proportion is 25% and total training and testing size is 1200.

7.4 Further Experimentation: Benchmark Analysis

Following the benchmark analysis from Aurelio et al., a similar approach is done for the Image, Text, and Tabular data. This expands the analysis from Table 1 to provide a more detailed and comprehensive view across various well-known datasets. The results can be found in Table 3, 4, and 5. The footnotes in these tables are explained as follows: the breakdown of train and test data sizes, the proportion of label 1, the labeling convention for label 1 versus label 0 (if multiple labels exist), and the location of the data, if necessary. For example, label 9 vs all means the label 9 is the label 1 and everything else is marked as label 0. Detailed explanations, links, and training details for all the datasets are provided in the footnotes for each table. At a high level, for images CIFAR-10, CIFAR-100 and Fashion MNIST are analyzed. For text, AG’s News Corpus, Reuters Corpus Volume 1, Hate Speech and Stanford Sentiment Treebank are analyzed. For the tabular data, 10 classical datasets from UCI repository are analyzed. Finally, the same model networks from Section 7.3 will be used.

7.4.1 Image Results

Comparing the CIFAR-10 result in Table 1 versus 3, the model family changes from $M_{1}^{\beta}$ to $M_{2}^{(\lambda,\sigma^{2})}$ . The interpretation remains the consistent: a recall centric based penalty is favored. The CIFAR-100 examples, with an imbalance of 1%, follow a similar recall centric penalty for $M_{1}^{16}$ under the label convention 9 vs all. However, under the labeling 39 versus all, a more precision centric penalty is preferred. This illustrates the problem-specific nature of selecting a model family and parameter, showcasing the flexibility of this paper’s methodology. Notably, there is a 14% increase in the $F_{1}$ score for CIFAR-100 under the 39 versus all label convention. Fashion MNIST favors the $M_{2}^{(\lambda,\sigma^{2})}$ with a more precision centric penalty. The most intriguing result is that, for all the extra variations, $M_{2}^{(5.0,5.0)}$ is the most frequent performer, which is a more balanced penalty. This suggests that $M_{2}^{(5.0,5.0)}$ could be a starting point of exploration given the balanced nature of the penalty distribution.

\sidewaystablefn

Table 3: Best F1 Score for

M_{1}^{\beta}

and

M_{2}^{(\lambda,\sigma^{2})}

over

30

Epochs

	Baseline	Parameter Variations Section 6.3					Extra Variations
Dataset⁰⁰footnotemark: 0	$M_{B}$	$M_{1}^{16}$	$M_{2}^{(0.5,0.5)}$	$M_{2}^{(0.5,2.0)}$	$M_{2}^{(2.0,0.5)}$	$M_{2}^{(2.0,2.0)}$	$M_{1}^{8}$	$M_{1}^{32}$	$M_{2}^{(0.01,0.01)}$	$M_{2}^{(5.0,5.0)}$
CIFAR-10¹¹1Train/test 50K/10K, label 1 10%, labeling is 1 vs all.	0.9216	0.9088	0.9204	0.9196	0.9263	0.9122	0.9194	0.9119	0.9268	0.9173
CIFAR-100²²2Train/test 50K/10K & 50K/10K, label 1 1% & 1%, labeling is 9 vs all & 39 vs all.	0.7345	0.7804	0.7273	0.6941	0.7594	0.7692	0.7501	0.7167	0.7314	0.7683
CIFAR-100²²2Train/test 50K/10K & 50K/10K, label 1 1% & 1%, labeling is 9 vs all & 39 vs all.	0.6021	0.6592	0.6778	0.6871	0.6381	0.6818	0.6509	0.6351	0.6702	0.6704
Fashion MNIST³³3Train/test 50K/10K & 50K/10K, label 1 10% & 10%, labeling is 0 vs all & 9 vs all.	0.8651	0.8638	0.8663	0.8462	0.8651	0.8593	0.8544	0.8558	0.8638	0.8672
Fashion MNIST³³3Train/test 50K/10K & 50K/10K, label 1 10% & 10%, labeling is 0 vs all & 9 vs all.	0.9627	0.9621	0.9656	0.9656	0.9675	0.9648	0.9615	0.9648	0.9641	0.9681
\botrule

⁰⁰footnotetext: All the datasets are easily found in Keras repository. The link is : https://keras.io/api/datasets/.

7.4.2 Text Results

Referring to Table 4, for the AG’s News Corpus and Reuters Corpus Volume 1 under the labeling crude vs all, the $M_{2}^{(\lambda,\sigma^{2})}$ model family, particularly $M_{2}^{(0.5,2.0)}$ is preferred. These parameter selections suggest a slightly more precision centric penalty. When considering Reuters Corpus Volume 1 with the labeling crude vs all and the Stanford Sentiment Treebank, there is no observed improvement. In the case of the Hate Speech Data, a more distinctive context, there is roughly a 4% boost under the $M_{2}^{(2.0,2.0)}$ model. This parameter selection is also a balanced penalty between recall and precision. Overall, similar to the Image benchmark conclusion, the $M_{2}^{(5.0,5.0)}$ is a frequent performer in the extra variation set of parameters. This insight of balanced penalty selection also holds for contextual text data.

\sidewaystablefn

Table 4: Best F1 Score for

M_{1}^{\beta}

and

M_{2}^{(\lambda,\sigma^{2})}

over

30

Epochs

	Baseline	Parameter Variations Section 6.3					Extra Variations
Dataset⁰⁰footnotemark: 0	$M_{B}$	$M_{1}^{16}$	$M_{2}^{(0.5,0.5)}$	$M_{2}^{(0.5,2.0)}$	$M_{2}^{(2.0,0.5)}$	$M_{2}^{(2.0,2.0)}$	$M_{1}^{8}$	$M_{1}^{32}$	$M_{2}^{(0.01,0.01)}$	$M_{2}^{(5.0,5.0)}$
ag_news¹¹1Train/test 90K/30K, label 1 25%, labeling is 3 vs all, AG’s News Corpus Data found here base-url/ag_news.	0.9632	0.9474	0.9632	0.9639	0.9626	0.9624	0.9553	0.9404	0.9634	0.9655
rcv1²²2Train/test 5485/2189 & 5485/2189, label 1 4.61% & 4.57%, labeling is crude vs all & trade vs all, Reuters Corpus Volume 1 Data found here base-url/yangwang825/reuters-21578.	0.9333	0.9298	0.9396	0.9461	0.9211	0.9451	0.9316	0.927	0.9356	0.9501
rcv1²²2Train/test 5485/2189 & 5485/2189, label 1 4.61% & 4.57%, labeling is crude vs all & trade vs all, Reuters Corpus Volume 1 Data found here base-url/yangwang825/reuters-21578.	0.9324	0.92	0.9251	0.9189	0.9178	0.9189	0.9127	0.9139	0.9251	0.9054
hate³³3Train/test 8027/2676, label 1 11%, labeling is 1 vs 0, Hate Speech Data found here base-url/hate_speech18.	0.8671	0.8304	0.8741	0.9045	0.8621	0.9046	0.8383	0.7669	0.8655	0.8868
sst⁴⁴4Train/test 67K/872, label 1 55%, labeling is 1 vs 0, Stanford Sentiment Treebank found here base-url/sst2.	0.8175	0.7619	0.7955	0.7909	0.8071	0.8001	0.7727	0.7494	0.8018	0.8004
\botrule

⁰⁰footnotetext: Datasets are found in Hugging Face repository. The base-url is https://huggingface.co/datasets.

7.4.3 Structured/Tabular Results

The tabular or structured benchmark results in Table 5 show that this paper’s methodology outperforms the baseline for all but one dataset (the breast cancer dataset). A key insight is that, for the parameter variations from section 6.3 and the extra variations, a more recall centric penalty is preferred. In particular, the $M_{1}^{\beta}$ and $M_{2}^{(2.0,0.5)}$ model families for the datasets iono, pima, vehicle, glass, vowel, yeast and abalone are favored. The remaining datasets - seg and sat - show modest improvement for the balanced penalty or $M_{2}^{(0.5,2.0)}$ model. Compared to the Census results in Table 1, it appears that feature distinctiveness plays a major part for tabular data. This paper defines feature distinctiveness as a neural network learning better discriminative features with respect to the dependent variable. This conclusion arises from the more recall centric penalty showing up in the result, suggesting that for tabular or structured data, the network should focus on learning strong discriminative features to enhance recall. This result underscores the hypothesis of this paper that the type of data, particularly contextual data, matters for a metric-based penalty and further supports the flexibility of this $F_{\beta}$ penalty methodology.

\sidewaystablefn

Table 5: Best F1 Score for

M_{1}^{\beta}

and

M_{2}^{(\lambda,\sigma^{2})}

over

30

Epochs

	Baseline	Parameter Variations Section 6.3					Extra Variations
Dataset⁰⁰footnotemark: 0	$M_{B}$	$M_{1}^{16}$	$M_{2}^{(0.5,0.5)}$	$M_{2}^{(0.5,2.0)}$	$M_{2}^{(2.0,0.5)}$	$M_{2}^{(2.0,2.0)}$	$M_{1}^{8}$	$M_{1}^{32}$	$M_{2}^{(0.01,0.01)}$	$M_{2}^{(5.0,5.0)}$
iono¹¹1Train/test 235/116, label 1 34%, Ionosphere Data found in UCI-url.	0.7845	0.8364	0.8068	0.8161	0.8092	0.8256	0.8205	0.8742	0.8114	0.7845
pima²²2Train/test 514/254, label 1 35%, Pima Indians Diabetes Data found in R-url.	0.5253	0.4407	0.4109	0.3645	0.5454	0.5088	0.5124	0.2711	0.5058	0.5208
breast³³3Train/test 381/188, label 1 38%, Breast Cancer Wisconsin Data found in UCI-url.	0.9416	0.7985	0.6464	0.8633	0.7934	0.9387	0.7832	0.8239	0.7589	0.7832
vehicle⁴⁴4Train/test 566/280, label 1 27%, labeling is opel vs all, Vehicle Data found in R-url.	0.3942	0.4423	0.4000	0.3363	0.4507	0.3470	0.2105	0.3247	0.3103	0.3333
seg⁵⁵5Train/test 210/2100, label 1 14%, labeling is brickface vs all, Segmentation Data found in UCI-url.	0.6798	0.5099	0.6078	0.6987	0.5571	0.5295	0.3915	0.3130	0.6645	0.3247
glass⁶⁶6Train/test 143/71, label 1 13%, labeling is 7 vs all, Glass Data found in R-url.	0.8695	0.7200	0.7407	0.7826	0.9473	0.6250	0.9473	0.7000	0.8333	0.9523
sat⁷⁷7Train/test 4308/1004, label 1 9%, labeling is 4 vs all, Satellite Data found in UCI-url.	0.5511	0.1674	0.3274	0.5571	0.4963	0.1313	0.2375	0.1714	0.5849	0.3779
vowel⁸⁸8Train/test 663/327, label 1 9%, labeling is hYd vs all, Vowel Data found in R-url.	0.2752	0.3439	0.3076	0.2926	0.3103	0.2434	0.2464	0.1851	0.1647	0.2979
yeast⁹⁹9Train/test 344/170 label 1 9%, labeling is CYT vs ME2, Yeast Data found in UCI-url.	0.5491	0.8717	0.7500	0.5079	0.2185	0.2010	0.6046	0.5084	0.6857	0.2105
abalone¹⁰¹⁰10Train/test 489/242, label 1 6%, labeling is 18 vs 9, Abalone Data found in UCI-url.	0.9723	0.9723	0.9723	0.9723	0.9723	0.9723	0.9723	0.9765	0.9723	0.9723
\botrule

⁰⁰footnotetext: Data urls: UCI-url https://archive.ics.uci.edu/datasets/ or R-url https://github.com/cran/mlbench/tree/master/data/.

8 Conclusion

This paper proposes a weighted cross-entropy based on van Rijsbergen $F_{\beta}$ measure. By assuming statistical distributions as an intermediary, an optimal $\beta$ can be found, which is then used as a penalty weighting in the loss function. This approach is convenient since van Rijsbergen defines $\beta$ to be a weighting parameter between recall and precision. Guided training by the $F_{\beta}$ hypothesizes that the interaction of the many combinations between the minority and majority classes has information that can help in three ways. First, as in Vashishtha et al. it can improve feature selection. Second, model training can generalize better. Lastly, overall performance may improve. Results from Table 1 show that this methodology helps in achieving better $F_{1}$ scores in some cases, with the added benefit of parameter interpretation from $M_{1}^{\beta}$ and $M_{2}^{(\lambda,\sigma^{2})}$ . Furthermore, when considering results from real-life use cases as in Table 2, commonalities between model families start to surface. Parameter selections that yield recall-centric penalties for both $M_{1}^{\beta}$ and $M_{2}^{(\lambda,\sigma^{2})}$ can be observed. The analyses from this paper provide the following insights: (1) the balanced penalty distribution is a good starting point for $M_{2}^{(\lambda,\sigma^{2})}$ model family, (2) feature distinctiveness impacts parameter selections between both model families, (3) non-contextual data such as tabular or structured data, seem to benefit from a recall centric penalty, (4) $M_{1}^{\beta}$ may be better for image data, and $M_{2}^{(\lambda,\sigma^{2})}$ for text, and (5) contextual-based data are better positioned for embedding architectures than non-contextual data - except when the tabular data can be mapped to contextual data or the features are discriminative.These points show that $F_{\beta}$ as a performance metric can be integrated alongside a loss function through penalty weights by using statistical distributions.

References

Aurelio et al. (2022) Aurelio, Y. S., de Almeida, G. M., de Castro, C. L., and Braga, A. P. (2022). Cost-Sensitive Learning based on Performance Metric for Imbalanced Data. Neural Processing Letters, 54(4), 3097-3114.
Chauhan et al. (2022) Chauhan, S., Vashishtha, G., and Kumar, A. (2022). A symbiosis of arithmetic optimizer with slime mould algorithm for improving global optimization and conventional design problem. The Journal of Supercomputing, 78(5), 6234-6274.
Chauhan et al. (2023) Chauhan, S., and Vashishtha, G. (2023). A synergy of an evolutionary algorithm with slime mould algorithm through series and parallel construction for improving global optimization and conventional design problem. Engineering Applications of Artificial Intelligence, 118, 105650.
Chawla et al. (2002) Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
Fujino et al. (2008) Fujino, A., Isozaki, H., and Suzuki, J. (2008). Multi-label text categorization with model combination based on f1-score maximization. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II.
Hasanin et al. (2019) Hasanin, T., Khoshgoftaar, T. M., Leevy, J. L., and Seliya, N. (2019). Examining characteristics of predictive models with imbalanced big data. Journal of Big Data, 6(1), 1-21.
Oksuz et al. (2018) Oksuz, K., Cam, B. C., Akbas, E., and Kalkan, S. (2018). Localization recall precision (LRP): A new performance metric for object detection. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 504-519).
Li et al. (2019) Li, X., Sun, X., Meng, Y., Liang, J., Wu, F., and Li, J. (2019). Dice loss for data-imbalanced NLP tasks. arXiv preprint arXiv:1911.02855.
Ho and Wookey (2019) Ho, Y., and Wookey, S. (2019). The real-world-weight cross-entropy loss function: Modeling the costs of mislabeling. IEEE Access, 8, 4806-4813.
Lipton et al. (2014) Lipton, Z. C., Elkan, C., and Narayanaswamy, B. (2014). Thresholding classifiers to maximize F1 score. arXiv preprint arXiv:1402.1892.
Bénédict et al. (2021) Bénédict, G., Koops, V., Odijk, D., and de Rijke, M. (2021). sigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. arXiv preprint arXiv:2108.10566.
Borisov et al. (2021) Borisov, V., Leemann, T., Seßler, K., Haug, J., Pawelczyk, M., and Kasneci, G. (2021). Deep neural networks and tabular data: A survey. arXiv preprint arXiv:2110.01889.
Dua and Graff (2019) Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Dudewicz and Mishra (1988) Dudewicz, E. J., and Mishra, S. (1988). Modern mathematical statistics. John Wiley & Sons, Inc.
Grush (2015) Grush, L. (2015). Google engineer apologizes after Photos app tags two black people as gorillas. The Verge, 1.
He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
Hogg and Craig (1995) Hogg, R. V., and Craig, A. T. (1995). Introduction to mathematical statistics.(5”” edition). Englewood Hills, New Jersey.
Jansche (2005) Jansche, M. (2005, October). Maximum expected F-measure training of logistic regression models. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (pp. 692-699).
Klie et al. (2022) Klie, J. C., Webber, B., and Gurevych, I. (2022). Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future. arXiv preprint arXiv:2206.02280.
Lee et al. (2021) Lee, N., Yang, H., and Yoo, H., A surrogate loss function for optimization of $F_{\beta}$ score in binary classification with imbalanced data, arXiv preprint arXiv:2104.01459, 2021.
Lin et al. (2017) Lin, T. Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980-2988).
Maas et al. (2011) Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C., (June 2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (pp. 142-150).
Mohit et al. (2012) Mohit, B., Schneider, N., Bhowmick, R., Oflazer, K., and Smith, N. A. (2012, April). Recall-oriented learning of named entities in Arabic Wikipedia. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (pp. 162-173).
Northcutt et al. (2021) Northcutt, C. G., Athalye, A., and Mueller, J. (2021). Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749.
Oksuz et al. (2020) Oksuz, K., Cam, B. C., Akbas, E., and Kalkan, S. (2020). A ranking-based, balanced loss function unifying classification and localisation in object detection. Advances in Neural Information Processing Systems, 33, 15534-15545.
Qian et al. (2021) Qian, S., Pham, V. H., Lutellier, T., Hu, Z., Kim, J., Tan, L., … and Shah, S. (2021). Are my deep learning systems fair? An empirical study of fixed-seed training. Advances in Neural Information Processing Systems, 34, 30211-30227.
Ramdhani (2016) Ramdhani, S. (2016). Some contributions to underground storage tank calibration models, leak detection and shape deformation (Doctoral dissertation, The University of Texas at San Antonio).
Ramdhani et al. (2018) Ramdhani, S., Tripathi, R., Keating, J., and Balakrishnan, N. (2018). Underground storage tanks (UST): A closer investigation statistical implications to changing the shape of a UST. Communications in Statistics-Simulation and Computation, 47(9), 2612-2623.
Sandler et al. (2018) Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L. C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4510-4520).
Sasaki (2007) Sasaki, Y., The Truth of the F-Measure, University of Manchester Technical Report, 2007.
Satopaa et al. (2011) Satopaa, V., Albrecht, J., Irwin, D., and Raghavan, B., Finding a kneedle in a haystack: Detecting knee points in system behavior, In 2011 31st international conference on distributed computing systems workshops (pp. 166-171). IEEE
Schmidhuber (2015) Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural networks, 61, 85-117.
Song et al. (2022) Song, H., Kim, M., Park, D., Shin, Y., and Lee, J. G. (2022). Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on Neural Networks and Learning Systems.
Sun et al. (2019) Sun, B., Yang, L., Zhang, W., Lin, M., Dong, P., Young, C., and Dong, J. (2019). Supertml: Two-dimensional word embedding for the precognition on structured tabular data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 0-0).
Tian et al. (2022) Tian, J., Mithun, N. C., Seymour, Z., Chiu, H. P., and Kira, Z. (2022, May). Striking the Right Balance: Recall Loss for Semantic Segmentation. In 2022 International Conference on Robotics and Automation (ICRA) (pp. 5063-5069). IEEE.
Tian et al. (2020) Tian, Y., Zhong, Z., Ordonez, V., Kaiser, G., and Ray, B. (2020, June). Testing dnn image classifiers for confusion & bias errors. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (pp. 1122-1134).
Tian (2020) Tian, Y. (2020, November). Repairing confusion and bias errors for DNN-based image classifiers. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 1699-1700).
Rijsbergen (1979) Van Rijsbergen, C. J., Information retrieval 2nd, Newton MA, 1979.
Vashishtha et al. (2022) Vashishtha, G., and Kumar, R. (2022). Pelton wheel bucket fault diagnosis using improved shannon entropy and expectation maximization principal component analysis. Journal of Vibration Engineering & Technologies, 1-15.
Vashishtha et al. (2022) Vashishtha, G., and Kumar, R. (2022). Unsupervised learning model of sparse filtering enhanced using wasserstein distance for intelligent fault diagnosis. Journal of Vibration Engineering & Technologies, 1-18.
Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Yan et al. (2022) Yan, B. C., Wang, H. W., Jiang, S. W. F., Chao, F. A., and Chen, B. (2022, July). Maximum f1-score training for end-to-end mispronunciation detection and diagnosis of L2 English speech. In 2022 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1-5). IEEE.
Zhang et al. (2021) Zhang, X., Zhai, J., Ma, S., and Shen, C. (2021, May). AUTOTRAINER: An Automatic DNN Training Problem Detection and Repair System. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (pp. 359-371). IEEE.
Wen et al. (2022) Wen, B., Cao, Y., Yang, F., Subbalakshmi, K., and Chandramouli, R. (2022, March). Causal-TGAN: Modeling Tabular Data Using Causally-Aware GAN. In ICLR Workshop on Deep Generative Models for Highly Structured Data.

Appendix A General form of F-Beta: n-th derivative

The derivation pattern using Sasaki’s Sasaki (2007) steps are straightforward for any partial derivative after the first derivative. To set the stage, a few equations are listed.

–

From (1), it can easily be shown that $\frac{1}{\alpha\frac{1}{p}+(1-\alpha)\frac{1}{r}}=\frac{pr}{\alpha r+(1-\alpha)p}$ .
–

Keeping the notation similar to Sasaki (2007), let $g=\alpha r+(1-\alpha)p$ then $\frac{\partial g}{\partial r}=\alpha$ and $\frac{\partial g}{\partial p}=1-\alpha$ .
–

Taking the first derivative of (1) via the chain rule yields the following: $\frac{\partial E}{\partial r}=\frac{-pg+pr\frac{\partial g}{\partial r}}{g^{2}}$ and $\frac{\partial E}{\partial p}=\frac{-pg+pr\frac{\partial g}{\partial p}}{g^{2}}$ .
–

After simplifying, $\frac{\partial E}{\partial r}=\frac{-(1-\alpha)p^{2}}{g^{2}}$ and $\frac{\partial E}{\partial p}=\frac{-\alpha r^{2}}{g^{2}}$ .

After setting $\frac{\partial^{n}E}{\partial r^{n}}$ = $\frac{\partial^{n}E}{\partial p^{n}}$ for $n=1$ it’s easy to see that $(1-\alpha)p^{2}=\alpha r^{2}$ and using $\beta=\frac{r}{p}$ yields $\alpha$ that pertains to the original $F_{\beta}$ measure or (2) with $n=1$ . With the same steps, for $n=2$ the equality becomes $2(1-\alpha)p^{2}\alpha=2\alpha r^{2}(1-\alpha)$ or $p=r$ implying $\beta=1$ . With each successive differentiation where $n>2$ , the pattern is as follows: $c\alpha^{n-2}p^{2}=c(1-\alpha)^{n-2}r^{2}$ , where $c$ is the same constant on both sides. Using $r=\beta p$ will then give the generalized equality $\alpha_{n}=\frac{1}{\beta^{\frac{-2}{n-2}}+1}$ .

Appendix B Case 1: Joint Probability Distribution for U and IU

To prove (LABEL:eq6a) it is sufficient to set up both integrals and explain the bounds. The computation itself is straightforward. From the following probability $\text{Pr}\left(F_{\beta}\right)=\text{Pr}\left(F_{\beta}\leq z\right)=\text{Pr}\left(X_{1}X_{2}\leq z\right)$ it is clear that the domain is in $[\frac{r^{\prime}}{r+\beta^{*}},\frac{r^{\prime}+\beta^{*}}{r}]$ based on (4) and (5). With a slight rearrangement, we can say the following:

	$\displaystyle\text{Pr}\left(F_{\beta}\right)$	$\displaystyle=\text{Pr}\left(X_{1}X_{2}\leq z\right)$
		$\displaystyle=\int_{r^{\prime}}^{r^{\prime}+\beta^{*}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)f_{x1}dx_{1},$

where $f_{x_{1}}$ is the probability density of $X_{1}$ and $\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)$ will be the cumulative distribution for $X_{2}$ . These are quite common and can be found online, or in Dudewicz and Mishra (1988) or Hogg and Craig (1995). The bounds come from (4). Using these bounds, notice that for $x_{2}\leq\frac{z}{x_{1}}$ to exist, then $\frac{z}{x_{1}}\geq\frac{1}{r+\beta^{*}}$ and $\frac{z}{x_{1}}\leq\frac{1}{r}$ . This results in the range $rz\leq x_{1}\leq(r+\beta^{*})z$ . Recall, that the existence of $x_{1}$ is in the range $r^{\prime}\leq x_{1}\leq r^{\prime}+\beta^{*}$ . From both intervals on $x_{1}$ , define condition 1 as $rz\leq r^{\prime}$ or $z\leq p$ and condition 2 as $(r+\beta^{*})z\leq r^{\prime}+\beta^{*}$ or $\frac{(r+\beta^{*})z}{r^{\prime}+\beta^{*}}\leq 1$ . We need to consider separately the following scenarios: condition 1 and 2 are true, condition 1 and 2 are both false, and condition 1 is false and condition 2 is true. The scenario of condition 1 being true and condition 2 being false does not occur.

Proof: [Proof] For $z\leq p$ & $\frac{(r+\beta^{*})z}{r^{\prime}+\beta^{*}}\leq 1$ we get the following:

	$\displaystyle\int_{r^{\prime}}^{r^{\prime}+\beta^{}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)f_{x1}dx_{1}=\dfrac{1}{\beta^{}}\int_{r^{\prime}}^{(r+\beta^{*})z}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)dx_{1},$
	$\displaystyle=\dfrac{1}{(\beta^{})^{2}}\int_{r^{\prime}}^{(r+\beta^{})z}\left(\left(r+\beta^{*}\right)-\frac{x_{1}}{z}\right)dx_{1}$
	$\displaystyle=\dfrac{\frac{z}{2}\left(r+\beta^{}-\frac{r^{\prime}}{z}\right)^{2}}{(\beta^{})^{2}}.$

For $z>p$ & $\frac{(r+\beta^{*})z}{r^{\prime}+\beta^{*}}>1$ we get the following:

	$\displaystyle\int_{r^{\prime}}^{r^{\prime}+\beta^{*}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)f_{x1}dx_{1}$
	$\displaystyle=\dfrac{1}{\beta^{}}\int_{r^{\prime}}^{rz}dx_{1}+\dfrac{1}{\beta^{}}\int_{rz}^{r^{\prime}+\beta^{*}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)dx_{1},$
	$\displaystyle=\left(\dfrac{rz-r^{\prime}}{\beta^{}}+\dfrac{1}{\left(\beta^{}\right)^{2}}\left(\left(r+\beta^{}+\frac{r^{\prime}+\beta^{}}{2z}\right)\left(r^{\prime}+\beta^{*}-rz\right)\right.\right.$
	$\displaystyle+\left.\left.\dfrac{r\left(rz-\left(r^{\prime}+\beta^{*}\right)\right)}{2}\right)\right)$

For $z>p$ & $\frac{(r+\beta^{*})z}{r^{\prime}+\beta^{*}}\leq 1$ we get the following:

	$\displaystyle\int_{r^{\prime}}^{r^{\prime}+\beta^{*}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)f_{x1}dx_{1}$
	$\displaystyle=\dfrac{1}{\beta^{}}\int_{r^{\prime}}^{rz}dx_{1}+\dfrac{1}{\beta^{}}\int_{rz}^{r^{\prime}+\beta^{}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)dx_{1}-\dfrac{1}{\beta^{}}\int_{(r+\beta^{})z}^{r^{\prime}+\beta^{}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)dx_{1},$
	$\displaystyle=\dfrac{rz-r^{\prime}}{\beta^{}}+\dfrac{1}{\left(\beta^{}\right)^{2}}\left(\left(r+\beta^{}+\frac{r^{\prime}+\beta^{}}{2z}\right)\left(r^{\prime}+\beta^{*}-rz\right)\right.$
	$\displaystyle+\left.\dfrac{r\left(rz-\left(r^{\prime}+\beta^{}\right)\right)}{2}\right)-\dfrac{1}{\left(\beta^{}\right)^{2}}\left((r+\beta^{})(r^{\prime}+\beta^{})-\frac{(r^{\prime}+\beta^{})^{2}}{2z}-\frac{(r+\beta^{})^{2}z}{2}\right)$

For the scenario $z\leq p$ & $\frac{(r+\beta^{*})z}{r^{\prime}+\beta^{*}}>1$ , we need to show that it never occurs. By rearranging condition 2, and recalling $r^{\prime}=pr$ , we can get $r(z-p)>\beta^{*}(1-z)$ . Then, if $z\leq 1$ then $r(z-p)\leq 0$ by $z\leq p$ . Since $\beta^{*}>0$ , $r(z-p)>\beta^{*}(1-z)$ never occurs. If, $z>1$ then it implies $p>1$ since $z\leq p$ . This never occurs since $p\in[0,1]$ .

Appendix C Case 2: Joint Probability Distribution for Ga and IE

The derivation of (9) is similar to Case 1 in that the integral will be broken into pieces and probability distribution proof will be used again. As before using the same rearrangement, we can say the following:

	$\displaystyle\text{Pr}\left(F_{\beta}\right)$	$\displaystyle=\text{Pr}\left(X_{1}X_{2}\leq z\right)$
		$\displaystyle=\int_{-\infty}^{+\infty}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)f_{x1}dx_{1},$

Before moving forward, $X_{2}$ ’s marginal distribution or (8) will be given.

	$\displaystyle\text{If, }\beta^{\prime\prime}\sim\text{Exponential}(\lambda),\text{ then the CDF for }X=\beta^{\prime\prime}+r\text{ is: }$
	$\displaystyle F(x)=\text{Pr}(X\leq x)=\text{Pr}(\beta^{\prime\prime}+r\leq x^{\prime}_{2})=\text{Pr}(\beta^{\prime\prime}\leq x-r)$
	$\displaystyle F(x)=1-\exp\left(-\lambda(x-r)\right),\forall x\geq r.$
	$\displaystyle\text{With transformation: }Y=g(X)=\frac{1}{X}\text{ so, }g^{-1}(y)=\beta^{\prime\prime}=\frac{1-ry}{y}$
	$\displaystyle\text{Then, }F_{y}(y)=F_{x}(g^{-1}(y))=\exp\left(-\lambda\left(\frac{1-ry}{y}\right)\right)\text{ since $g$ is a}$
	strictly decreasing function.

Now, we can see that $X_{2}$ has the distribution $F(X_{2})=\exp\left(-\lambda\left(\frac{1-rx_{2}}{x_{2}}\right)\right)$ where $x_{2}$ $\in$ [0, $\frac{1}{r}$ ]. Using this property we complete the proof.

Proof: [Proof] For z $>$ 0:

	$\displaystyle\int_{-\infty}^{rz}f_{x_{1}}d_{x_{1}}+\int_{rz}^{+\infty}\exp\left\{-\lambda\left(\frac{1-r\frac{z}{x_{1}}}{\frac{z}{x_{1}}}\right)\right\}f_{x_{1}}d_{x_{1}}$
	$\displaystyle=\Phi(rz\mathchar 24635\relax\;r^{\prime},\sigma^{2})+\dfrac{1}{\sqrt{2\pi\sigma^{2}}}\int_{rz}^{+\infty}\exp\left\{-\lambda\left(\frac{1-r\frac{z}{x_{1}}}{\frac{z}{x_{1}}}\right)\right\}\exp\left\{-\frac{1}{2\sigma^{2}}\left(x_{1}-r^{\prime}\right)^{2}\right\}$
	$\displaystyle=\Phi(rz\mathchar 24635\relax\;r^{\prime},\sigma^{2})+\dfrac{1}{\sqrt{2\pi\sigma^{2}}}\int_{rz}^{+\infty}\exp\left\{-\frac{x_{1}^{2}-2r^{\prime}x_{1}+(r^{\prime})^{2}}{2\sigma^{2}}-\frac{\lambda x_{1}}{z}+\lambda r\right\}dx_{1}$
	$\displaystyle=\Phi(rz\mathchar 24635\relax\;r^{\prime},\sigma^{2})+\dfrac{1}{\sqrt{2\pi\sigma^{2}}}\int_{rz}^{+\infty}\exp\left\{-\frac{\left(x_{1}-\left(r^{\prime}-\frac{\lambda\sigma^{2}}{z}\right)\right)^{2}}{2\sigma^{2}}\right.$
	$\displaystyle\left.+\frac{\frac{2r^{\prime}\lambda\sigma^{2}}{z}-\left(\frac{\lambda\sigma^{2}}{z}\right)^{2}+\lambda r}{2\sigma^{2}}\right\}dx_{1}$
	$\displaystyle=\left[\Phi(rz\mathchar 24635\relax\;r^{\prime},\sigma^{2})+\exp\left(\lambda r+\dfrac{\left(\frac{\lambda\sigma^{2}}{z}\right)^{2}-\frac{2r^{\prime}\lambda\sigma^{2}}{z}}{2\sigma^{2}}\right)\right.$
	$\displaystyle\left.\times\left(1-\Phi\left(rz\mathchar 24635\relax\;\left(r^{\prime}-\frac{\lambda\sigma^{2}}{z}\right),\sigma^{2}\right)\right)\right].$

To be clear, the bounds of the integral around $rz$ arise from the directionally based inequality on $\frac{z}{x_{1}}$ , in particular $\frac{z}{x_{1}}>\frac{1}{r}$ .

For z=0: it can be seen that the entire mass is summarized by the Gaussian distribution or $X_{1}$ since $X_{2}$ is non-negative.

\displaystyle\int_{-\infty}^{0}f_{x_{1}}d_{x_{1}}=\Phi(0\mathchar 24635\relax\;r^{\prime},\sigma^{2}).

For z $<$ 0: this is a bit different because though the Inverse Exponential redistributes the mass for the Gaussian as before, it being non-negative needs to be adjusted for. Consider the interval $[-\infty,0]$ that represents the domain for this (probability) mass. Let $p_{z}$ be the probability mass of interest. Next, define $b_{1}$ to be the mass from $[-\infty,rz]$ , $b_{2}$ to be the mass from $[rz,0]$ , and $b_{3}$ as the mass for $[-\infty,0]$ . Notice for $b_{1}$ and $b_{2}$ , the separation of the integral is similar as before but with different bounds. So we have the following:

	$\displaystyle b_{1}=\int_{-\infty}^{rz}\exp\left\{-\lambda\left(\frac{1-r\frac{z}{x_{1}}}{\frac{z}{x_{1}}}\right)\right\}f_{x_{1}}d_{x_{1}}$
	$\displaystyle b_{2}=\int_{rz}^{0}f_{x_{1}}d_{x_{1}}$
	$\displaystyle b_{3}=\Phi(0\mathchar 24635\relax\;r^{\prime},\sigma^{2}).$

By $X_{2}$ ’s redistribution, the mass for the negative values is $p_{z}=b_{3}-b_{1}-b_{2}$ for $z<0$ . The proof is now simplified to solving the $p_{z}$ expression and by using some of the results from the $z>0$ case we have the following:

	$\displaystyle\Phi(0\mathchar 24635\relax\;r^{\prime},\sigma^{2})-\int_{-\infty}^{rz}\exp\left\{-\lambda\left(\frac{1-r\frac{z}{x_{1}}}{\frac{z}{x_{1}}}\right)\right\}f_{x_{1}}d_{x_{1}}-\int_{rz}^{0}f_{x_{1}}d_{x_{1}}$
	$\displaystyle=\Phi(0\mathchar 24635\relax\;r^{\prime},\sigma^{2})-\exp\left(\lambda r+\dfrac{\left(\frac{\lambda\sigma^{2}}{z}\right)^{2}-\frac{2r^{\prime}\lambda\sigma^{2}}{z}}{2\sigma^{2}}\right)\times\Phi\left(rz\mathchar 24635\relax\;\left(r^{\prime}-\frac{\lambda\sigma^{2}}{z}\right),\sigma^{2}\right)$
	$\displaystyle-\left(\Phi\left(0\mathchar 24635\relax\;r^{\prime},\sigma^{2}\right)-\Phi\left(rz\mathchar 24635\relax\;r^{\prime},\sigma^{2}\right)\right)$
	$\displaystyle=\Phi(rz\mathchar 24635\relax\;r^{\prime},\sigma^{2})-\exp\left(\lambda r+\dfrac{\left(\frac{\lambda\sigma^{2}}{z}\right)^{2}-\frac{2r^{\prime}\lambda\sigma^{2}}{z}}{2\sigma^{2}}\right)\times\Phi\left(rz\mathchar 24635\relax\;\left(r^{\prime}-\frac{\lambda\sigma^{2}}{z}\right),\sigma^{2}\right)$

Appendix D Pressure Vessel Design

This is borrowed from Chauhan et al. (2022). See figure 7 from Chauhan et al. to see the structure design of pressure vessel, which looks similar to Underground Storage Tanks discussed earlier.

D.1 Problem Statement

The pressure vessel design objective is to minimize total cost, which includes material, forming and welding. The design variables are thickness of the shell ( $T_{s}$ ), thickness of the head ( $T_{h}$ ), the inner radius ( $R$ ) and the length of the cylinder ( $L$ ). The mathematical formulation is found in 11.

	$\displaystyle X$	$\displaystyle=[x_{1},x_{2},x_{3},x_{4}]=[T_{s},T_{h},R,L]$
	$\displaystyle f(X)$	$\displaystyle=0.6224x_{1}x_{3}x_{4}+1.7781x_{2}x_{3}^{2}+3.1661x_{1}^{2}x_{4}+19.84x_{1}^{2}x_{3}$		(11)

For this paper, Chauhan et al. HAOASMA algorithm results will be used as the baseline for the parameters. These parameter results serve as the best minimum result. To be specific, $T_{s}=1.8048$ , $T_{h}=0.0939$ , $R=13.8360$ , and $L=123.2019$ . The next section will provide a couple variations to convert this problem to a classification problem.

D.2 Varying Design Parameter Plots

The simulation carried out can be done using Algorithm 2. Two simulated realizations from this algorithm can be seen in Figure 4. The left figure has values $T_{s}^{v}=1.7887$ and $T_{h}^{v}=0.0313$ and the right figure has $T_{s}^{v}=1.7887$ and $T_{h}^{v}=0.2817$ . The superscript $v$ stands for variation. The variations are intended to be reflect two scenarios: the first is a clear separation between distribution (the left figure), hence an easier classification. The second has significant overlap (the right figure) or a tougher classification.

Algorithm 2 Simulation of Pressure Vessel Data for Classification

s>0

\wedge

i\in(0,1)

and

T_{s}^{v}>0

and

T_{h}^{v}>0

where

i

is imbalance,

s

is the size of the data set and

T_{s}^{v}

and

T_{h}^{v}

are parameter variations from the HAOASMA baseline.

2:Set

T_{s}^{b}=1.8048

T_{h}^{b}=0.0939

R^{b}=13.8360

, and

L^{b}=123.2019

where superscript

b

is baseline.

3:Compute data label sizes based on the imbalance

i

s_{0}=\lfloor s\times(1-i)\rfloor

and

s_{1}=s-s_{0}

where the subscript

0

and

1

stand for data sizes for label 0 and label 1.

4:Initialize

t_{s}^{b}

and

t_{h}^{b}

s_{0}

size arrays with values

T_{s}^{b}

and

T_{h}^{b}

respectively.

5:Draw

r^{b}

and

l^{b}

arrays of size

s_{0}

from the normal distributions:

N(\mu=R^{b},\sigma^{2}=1)

and

N(\mu=L^{b},\sigma^{2}=1)

respectively.

6:Concatenate the column vectors

t_{s}^{b}

t_{h}^{b}

r^{b}

and

l^{b}

together and assign this array to variable

X_{0}

7:Apply 11 to each row of

X_{0}

to yield a cost value and assign this array to variable

Y_{0}

8:Initialize

l_{0}

as a label 0 array of size

s_{0}

with the value

0

9:Concatenate the column vectors

r^{b}

l^{b}

Y_{0}

, and

l_{0}

together and assign this array to variable

F_{0}

10:Initialize

t_{s}^{v}

and

t_{h}^{v}

s_{1}

size arrays with values

T_{s}^{v}

and

T_{h}^{v}

respectively.

11:Draw

r^{b}

and

l^{b}

arrays of size

s_{1}

from the normal distributions:

N(\mu=R^{b},\sigma^{2}=1)

and

N(\mu=L^{b},\sigma^{2}=1)

respectively.

12:Concatenate the column vectors

t_{s}^{v}

t_{h}^{v}

r^{b}

and

l^{b}

together and assign this array to variable

X_{1}

13:Apply 11 to each row of

X_{1}

to yield a cost value and assign this array to variable

Y_{1}

14:Initialize

l_{1}

as a label 1 array of size

s_{1}

with the value

1

15:Concatenate the column vectors

r^{b}

l^{b}

Y_{1}

, and

l_{1}

together and assign this array to variable

F_{1}

16:Concatenate or stack the arrays

F_{0}

and

F_{1}

which yields a total of

s

rows with the last column being the labels for classification.

Appendix E Underground Storage Tank (UST)

This is section is borrowed from Ramdhani (2016) and all equations and derivations and further explorations can be found there.

E.1 Problem Statement

The UST problem deals with estimating tank dimensions by using only vertical height measurements. It is also possible that this cylindrical UST has hemispherical endcaps appended on the ends which will also contain volume. The equations for the volume based on cross-sectional measurements for the tanks with the cylindrical, cylindrical with hemispherical endcaps, ellipsoidal, and ellipsoidal with hemi-ellipsoidal endcaps shapes, are given in (12), (13), (14), and (15), respectively.

The equation for the Cylindrical shape is:

\displaystyle f_{C}(r,L,h)=L\left\{r^{2}cos^{-1}\left(\frac{r-h}{r}\right)-(r-h)\sqrt{2rh-h^{2}}\right\}

(12)

If one were to add hemispherical endcaps to the cylinder ends the subsequent volume would be:

\displaystyle f_{CH}(r,L,h)=L\left\{r^{2}cos^{-1}\left(\frac{r-h}{r}\right)-(r-h)\sqrt{2rh-h^{2}}\right\}+\frac{\pi h^{2}}{3}(3r-h)

(13)

The equation for the Elliptical shape for a deformed Cylinder is:

	$\displaystyle f_{ED}(a,b,L,h)$	$\displaystyle=L\left\{(ab)cos^{-1}\left(\frac{a-h}{\sqrt{a^{2}+(h^{2}-2ha)\left(1-\frac{b^{2}}{a^{2}}\right)}}\right)\right.$
		$\displaystyle\left.-b(a-h)\sqrt{1-\left(1-\frac{h}{a}\right)^{2}}\right\}$		(14)

If one were to add hemispherical endcaps to the cylinder which deforms to hemi-ellipsoidal endcaps the subsequent volume would be:

$\displaystyle f_{EDH}(a,b,L,h)$	$\displaystyle=L\left\{(ab)cos^{-1}\left(\frac{a-h}{\sqrt{a^{2}+(h^{2}-2ha)\left(1-\frac{b^{2}}{a^{2}}\right)}}\right)\right.$
	$\displaystyle\left.-b(a-h)\sqrt{1-\left(1-\frac{h}{a}\right)^{2}}\right\}+\dfrac{2\pi a^{3}+\pi(a-h)\left(\frac{hb^{2}}{a}\right)\left(\frac{h-2a}{a}\right)}{3}$
	$\displaystyle-\dfrac{2\pi a^{3}(a-h)}{3\sqrt{a^{2}+(h^{2}-2ha)\left(1-\frac{b^{2}}{a^{2}}\right)}}$	(15)

E.2 Varying Tank Dimension

The parameters used in the Algorithm 3 are borrowed from Ramdhani (2016) and the baseline will be the cylindrical case with radius, $r=4$ and length, $L=32$ and the parameter variations will be on the $a$ and $b$ for an ellipse. Ramdhani used a measurement error based model for simulation, which will also be used here. The measurement errors will be on the heights $h$ . Similar to the pressure vessel we consider an easy and a tough simulation scenario for classification. This is seen in Figure 6 where the left figure is easier to distinguish between cylinder and ellipse versus the right figure. The same interpretation can be seen for the end-cap based equations or Figure 7.

Algorithm 3 Simulation of Tank Dimension Data for Classification

s>0

\wedge

i\in(0,1)

and

a>0

and

b>0

where

i

is imbalance,

s

is the size of the data set and

a

and

b

are the parameters for the vertical and horizontal axis of an ellipse.

2:Set

r=4

L=32

as the baseline parameters.

3:Compute data label sizes based on imbalance

i

s_{0}=\lfloor s\times(1-i)\rfloor

and

s_{1}=s-s_{0}

where the subscript

0

and

1

stand for data sizes for label 0 and label 1.

4:Initialize noise arrays

\epsilon_{0}\sim N(0,2)

and

\gamma_{0}\sim U(-0.05,0.05)

both of size

s_{0}

5:Draw an array of heights

h_{0}\sim U(1,2\times r-1)

of size

s_{0}

for the vertical height of a cylinder.

6:Compute variable

h_{0}^{\prime}=h_{0}+\gamma_{0}

7:Initialize

l_{0}

as a label 0 array of size

s_{0}

with the value

0

8:Compute the volume from either (12) or (13) using

h_{0}

r

L

and assign this to variable

X_{0}

9:Assign variable

Y_{0}=X_{0}+\epsilon_{0}

10:Concatenate the column arrays

Y_{0}

h_{0}^{\prime}

, and

l_{0}

and assign this to

F_{0}

11:Initialize noise arrays

\epsilon_{1}\sim N(0,2)

and

\gamma_{1}\sim U(-0.05,0.05)

both of size

s_{1}

12:Draw an array of heights

h_{1}\sim U(1,2\times a-1)

of size

s_{1}

for the vertical height of an ellipse.

13:Compute variable

h_{1}^{\prime}=h_{1}+\gamma_{1}

14:Initialize

l_{1}

as a label 1 array of size

s_{1}

with the value

1

15:Compute the volume from either (14) or (15) using

h_{1}

a

b

L

and assign this to variable

X_{1}

16:Assign variable

Y_{1}=X_{1}+\epsilon_{1}

17:Concatenate the column arrays

Y_{1}

h_{1}^{\prime}

, and

l_{1}

and assign this to

F_{1}

18:Concatenate or stack the arrays

F_{0}

and

F_{1}

which yields a total of

s

rows with the last column being the labels for classification.

		$\displaystyle\text{Pr}\left(F_{\beta}\right)=\text{Pr}\left(F_{\beta}\leq z\right)=1_{\{z\leq p\&\frac{(r+\beta^{})z}{r^{\prime}+\beta^{}}\leq 1\}}\times\dfrac{\frac{z}{2}\left(r+\beta^{}-\frac{r^{\prime}}{z}\right)^{2}}{(\beta^{})^{2}}$
		$\displaystyle+1_{\{z>p\&\frac{(r+\beta^{})z}{r^{\prime}+\beta^{}}>1\}}\times\left(\dfrac{rz-r^{\prime}}{\beta^{}}+\dfrac{1}{\left(\beta^{}\right)^{2}}\left(\left(r+\beta^{}+\frac{r^{\prime}+\beta^{}}{2z}\right)\left(r^{\prime}+\beta^{*}-rz\right)\right.\right.$
		$\displaystyle+\left.\left.\dfrac{r\left(rz-\left(r^{\prime}+\beta^{*}\right)\right)}{2}\right)\right)$
		$\displaystyle+1_{\{z>p\&\frac{(r+\beta^{})z}{r^{\prime}+\beta^{}}\leq 1\}}\times\left[\left(\dfrac{rz-r^{\prime}}{\beta^{}}+\dfrac{1}{\left(\beta^{}\right)^{2}}\left(\left(r+\beta^{}+\frac{r^{\prime}+\beta^{}}{2z}\right)\left(r^{\prime}+\beta^{*}-rz\right)\right.\right.\right.$
		$\displaystyle+\left.\left.\left.\dfrac{r\left(rz-\left(r^{\prime}+\beta^{}\right)\right)}{2}\right)\right)-\dfrac{1}{\left(\beta^{}\right)^{2}}\left((r+\beta^{})(r^{\prime}+\beta^{})-\frac{(r^{\prime}+\beta^{})^{2}}{2z}-\frac{(r+\beta^{})^{2}z}{2}\right)\right]$

	$\displaystyle\int_{r^{\prime}}^{r^{\prime}+\beta^{*}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)f_{x1}dx_{1}$
	$\displaystyle=\dfrac{1}{\beta^{}}\int_{r^{\prime}}^{rz}dx_{1}+\dfrac{1}{\beta^{}}\int_{rz}^{r^{\prime}+\beta^{}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)dx_{1}-\dfrac{1}{\beta^{}}\int_{(r+\beta^{})z}^{r^{\prime}+\beta^{}}\text{Pr}\left(x_{2}\leq\frac{z}{x_{1}}\right)dx_{1},$
	$\displaystyle=\dfrac{rz-r^{\prime}}{\beta^{}}+\dfrac{1}{\left(\beta^{}\right)^{2}}\left(\left(r+\beta^{}+\frac{r^{\prime}+\beta^{}}{2z}\right)\left(r^{\prime}+\beta^{*}-rz\right)\right.$
	$\displaystyle+\left.\dfrac{r\left(rz-\left(r^{\prime}+\beta^{}\right)\right)}{2}\right)-\dfrac{1}{\left(\beta^{}\right)^{2}}\left((r+\beta^{})(r^{\prime}+\beta^{})-\frac{(r^{\prime}+\beta^{})^{2}}{2z}-\frac{(r+\beta^{})^{2}z}{2}\right)$

Reformulating van Rijsbergen’s FβF_{\beta} metric for weighted binary cross-entropy

Abstract

keywords:

1 Acronym List

2 Introduction

3 Related Work

4 Background

5 Reformulating the F-Beta to leverage statistical distributions

5.1 Case 1: Uniform and Inverse Uniform

5.2 Case 2: Gaussian and Inverse Exponential

6 Knee algorithm and Weighted Cross Entropy

6.1 Knee algorithm to find optimal β\beta values

6.2 Proposal Weighted Binary Cross-Entropy

6.3 Understanding the βo​p​t\beta_{opt} Surface and Weighted Cross-Entropy

7 Datasets and Experimentation

7.1 Datasets

7.2 Model Networks

7.2.1 Image Network

7.2.2 Text Network

7.2.3 Structured/Tabular Network

7.2.4 UST/Vessel Network

7.3 Experimental Results

7.3.1 Image Results

7.3.2 Text Results

7.3.3 Structured/Tabular Results

7.3.4 UST/Pressure Vessel Results

7.4 Further Experimentation: Benchmark Analysis

7.4.1 Image Results

7.4.2 Text Results

7.4.3 Structured/Tabular Results

8 Conclusion

References

Appendix A General form of F-Beta: n-th derivative

Appendix B Case 1: Joint Probability Distribution for U and IU

Appendix C Case 2: Joint Probability Distribution for Ga and IE

Appendix D Pressure Vessel Design

D.1 Problem Statement

D.2 Varying Design Parameter Plots

Appendix E Underground Storage Tank (UST)

E.1 Problem Statement

E.2 Varying Tank Dimension

Reformulating van Rijsbergen’s $F_{\beta}$ metric for weighted binary cross-entropy

6.1 Knee algorithm to find optimal $\beta$ values

6.3 Understanding the $\beta_{opt}$ Surface and Weighted Cross-Entropy