2022
Reformulating van Rijsbergen’s metric for weighted binary cross-entropy
Abstract
The separation of performance metrics from gradient-based loss functions may not always give optimal results and may miss vital aggregate information. This paper investigates incorporating a performance metric alongside differentiable loss functions to inform training outcomes. The goal is to guide model performance and interpretation by assuming statistical distributions on this performance metric for dynamic weighting. The focus is on van Rijsbergen's $F_\beta$ metric – a popular choice for gauging classification performance. Through distributional assumptions on the $F_\beta$, an intermediary link can be established to the standard binary cross-entropy via dynamic penalty weights. First, the $F_\beta$ metric is reformulated to facilitate assuming statistical distributions, with accompanying proofs for the cumulative distribution function. These probabilities are used within a knee curve algorithm to find an optimal $\beta$, or $\beta^*$. This $\beta^*$ is used as a weight or penalty in the proposed weighted binary cross-entropy. Experimentation on publicly available data along with benchmark analysis mostly yields better and interpretable results as compared to the baseline for both imbalanced and balanced classes. For example, for the IMDB text data with known labeling errors, a 14% boost in the $F_1$ score is shown. The results also reveal commonalities between the penalty model families derived in this paper and the suitability of recall-centric or precision-centric parameters used in the optimization. The flexibility of this methodology can enhance interpretation.
keywords:
Performance Metrics, $F_\beta$ Metrics, F-Beta Metric, Penalty Optimization, C.J. van Rijsbergen, Information Retrieval, Weighted Cross-Entropy, Binary Cross-Entropy, Text Retrieval
1 Acronym List
- $F_\beta$ : F-Beta Metric
- $\beta^*$ : Optimal $\beta$ from Algorithm 1
- $M_1$ : Model 1: U & IU from (5.1)
- $M_2$ : Model 2: Ga & IE from (5.2)
- PV : Pressure Vessel Design
- $T_s$ : Thickness of the pressure vessel shell
- $T_h$ : Thickness of the pressure vessel head
- UST : Underground Storage Tank
- $V_C$ : Equation for the volume of a cylindrical UST (12)
- $V_{CH}$ : Equation for the volume of a cylindrical UST with hemispherical end-caps (13)
- $V_E$ : Equation for the volume of an ellipsoidal UST (14)
- $V_{EH}$ : Equation for the volume of an ellipsoidal UST with hemi-ellipsoidal end-caps (15)
- CvE : Cylindrical UST versus ellipsoidal UST
- CHvEH : Cylindrical UST with hemispherical end-caps versus ellipsoidal UST with hemi-ellipsoidal end-caps
- UCI : UCI Machine Learning Repository
2 Introduction
Data imbalance is a known and widespread real-world issue that affects performance metrics for a variety of learning problems (e.g., image detection and segmentation, text categorization and classification). Approaches to mitigate this issue generally fall into three categories: adjusting the neural network architecture (including multiple models or ensembles like Fujino et al. 2008), adjusting the loss function used for training, or adjusting the data (e.g., collecting more data, or leveraging sampling techniques like Chawla et al. 2002 and Hasanin et al. 2019). This research looks at adjusting the loss function with a focus on incorporating the performance metric. The interconnection between performance metric and loss function is crucial for understanding both model behavior and the inherent nature of a specific dataset. This connection has already been approached from the angle of thresholding (a post-model step) as in Lipton et al. (2014), or of developing a problem-specific metric, as Ho and Wookey (2019), Li et al. (2019), and Oksuz et al. (2018) did for real-world mislabeling costs, dynamic weighting for easy negative samples, and object detection, respectively. This paper takes a uniquely different and novel approach where statistical distributions act as an intermediary to connect the metric to the binary cross-entropy through dynamic penalty weights.
First, the derivation of the $F_\beta$ metric from van Rijsbergen's effectiveness score, $E$, is revisited to prove a limiting case of $F_\beta$ in section 4. This result supports the default case for the main algorithm in section 6.
Second, the $F_\beta$ metric is reformulated into a multiplicative form by assuming two independent random variables. Then parametric statistical distributions are assumed for these random variables. In particular, the Uniform and Inverse Uniform (U & IU) case and the Gaussian and Inverse Exponential (Ga & IE) case are proposed. The idea behind U & IU is that no prior insight is assumed on the cumulative distribution function's (CDF) surface, whereas Ga & IE provides the practitioner more flexibility to encode some insight into this CDF surface. This leads to a more interpretable performance metric that is configurable to the data without having to create a new problem-specific metric (or loss function).
Third, for both distributional cases, the CDF shown in section 5 facilitates finding an optimal $\beta^*$ through a knee curve algorithm in section 6.1. This algorithm gets the best $\beta$ from a monotonic knee curve given precision and recall: it is the value where the curve levels off. The $\beta^*$ surface for different parameter settings found in section 6.3 suggests a slightly more recall-centric penalty. This is discussed further in section 7.
Finally, a weighted binary cross-entropy loss function based on $\beta^*$ is proposed in section 6.2. This loss methodology is applied to three data categories: image, text and tabular/structured data. For contextual data (i.e., image and text), model performance measured by $F_1$ improves, and the best result occurs for the text data that contains (known) labeling errors. The structured/tabular or non-contextual data does not show significant improvement, but provides an important result: when considering neural embedding architectures for training, the type (or category) of data matters.
3 Related Work
Logistic regression models are among the most fundamental statistically based classifiers. Jansche (2005) provides a training procedure that uses a sigmoid approximation to maximize the expected $F_\beta$ on this class of classifiers. When comparing the surface plots of the likelihood from Jansche and those from section 5 – a similar but not equivalent comparison – a comparable rate of change can be seen for both surfaces with respect to their respective parameters. This is an important similarity because this paper's procedure applies distributional assumptions to provide dynamic penalties to a well-known binary cross-entropy loss. Also, implementation of this paper's methodology is straightforward because it avoids the need to provide updated partial derivatives for the loss function. Furthermore, Jansche alludes to (future work that considers) a general method to optimize several operating points simultaneously, which is a fundamental and indirect assertion in this paper. The sigmoid approximation is also used by Fujino et al. (2008) in the multi-label setting for text categorization. In their framework, multiple binary classifiers are trained per category and combined with weights estimated to maximize micro- or macro-averaged $F_1$ scores.
Similarly, Aurelio et al. (2022) propose a methodology for performance metric learning that uses a metric approximation (i.e., AUC, $F_1$) derived from the confusion matrix. The back-propagation error term involves the first derivative, followed by the application of gradient descent. This method provides an alternative means of integrating performance metrics with gradient-based learning. However, there are cases where the back-propagation term proposed by Aurelio et al. may pose issues. For instance, when considering equation 13 from Aurelio et al. in conjunction with batch training and severe imbalance, there could be a division-by-zero error if a batch containing only the zero label appears. Moreover, Aurelio et al. test several metrics for their method: $F_1$, G-mean, AG-mean and AUC. But the G-mean, AG-mean and AUC, based on the confusion matrix approximation, can be derived as functions of $F_\beta$. This suggests that $F_\beta$ is more flexible than G-mean, AG-mean and AUC. In other words, $F_\beta$ is unique to each $\beta$ yet generalized across the other metrics when $\beta$ is equal to 1. In fact, for class imbalance, the AUC metric – an average over many thresholds – and the G-mean – a geometric mean – are less stringent and more generous in accuracy reporting compared to the $F_1$. This is the reason all results in this paper are reported using the $F_1$ score.
Surrogate loss functions, which attempt to mimic certain aspects of the $F_\beta$, are another related area. For example, sigmoidF1 from Bénédict et al. (2021) creates smooth versions of the entries of the confusion matrix, which are used to create a differentiable loss function that imitates the $F_1$. This smooth differentiability is another application of a sigmoid approximation similar to Jansche. Lee et al. (2021) formulate a surrogate loss by adjusting the cross-entropy loss such that its gradient matches the gradient of a smooth version of the $F_\beta$.
In terms of metric creation or variation to the $F_\beta$, Ho and Wookey (2019), Li et al. (2019), Oksuz et al. (2020) and Yan et al. (2022) are highlighted. The Real World Weight Cross Entropy (RWWCE) loss function from Ho and Wookey is similar in spirit to Oksuz et al. The idea is to set (not train or tune) cost-related weights based on the dataset and the main problem, by introducing costs (i.e., financial costs) that reflect the real world. RWWCE affects both the positive and negative labels by tying each to its own real-world cost implication. The dice loss from Li et al. proposes a dynamic weight adjustment to address the dominating effect of easy-negative examples. The formulation is based on the $F_1$, using a smoothing parameter and a focal adaptation from Lin et al. (2017). A ranking loss based on the Localisation Recall Precision (LRP) metric of Oksuz et al. (2018) is developed by Oksuz et al. (2020) for object detection. They propose an averaged LRP alongside a ranking loss function for not only classification but also localisation of objects in images. This provides a balance between both positive and negative samples. Along a similar theme, Yan et al. (2022) explore a discriminative loss function that aims to maximize the expected $F_1$ directly for speech mispronunciation. Their loss function is based on the $F_1$ (comparing human assessors and the model prediction) weighted by a probability distribution (i.e., a normal distribution) for that score. The final objective function is a weighted average between their loss function and the ordinary cross-entropy.
When considering the components of performance metrics, precision and recall are often the primary focus. Mohit et al. (2012) and Tian et al. (2022) propose two different loss functions that are both recall oriented. Mohit et al. adjust the hinge loss by adding a recall-based cost (and penalty) into the objective function. As they state, favoring recall over precision results in a substantial boost to recall and $F_1$. By leveraging the concept of inverse frequency weighting (i.e., a sampling-based technique), Tian et al. adjust the cross-entropy to reflect an inverse weighting on false negatives per class. They state that their loss function sits between regular and inverse frequency weighted cross-entropy by balancing the excessive false positives introduced by constantly up-weighting minority classes. When they consider a similar loss function using precision, that loss function shows irregular behavior. These findings are insightful because this paper's $\beta^*$ surface, as seen in section 6.3, is more recall-centric with the added benefit of being able to incorporate precision weighting through the assumed probability surface.
4 Background
The $F_\beta$ measure comes directly from van Rijsbergen's effectiveness score, $E$, for information retrieval (chapter 7 in Rijsbergen 1979). For the theory on the six conditions supporting $E$ as a measure, refer to Rijsbergen. This paper highlights two of these conditions. First, $E$ guides the practitioner's ability to quantify effectiveness given any point ($R$, $P$) – where $R$ and $P$ are recall and precision – as compared to some other point. Second, precision and recall contribute effects to $E$ independently of each other. As said by Rijsbergen, for a constant $R$ (or $P$), the difference in $E$ from any set of varying points of $P$ (or $R$) cannot be removed by changing the constant. These conditions suggest equivalence relations and imply a common effectiveness (CE) curve based on precision and recall (definition 3 in Rijsbergen 1979). They also motivate the rationale for using statistical distributions to understand the CE curve. Van Rijsbergen's effectiveness measure is given in (1).
$$E = 1 - \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}} \tag{1}$$
where $0 \le \alpha \le 1$. Sasaki (2007) gives the details on deriving $F_\beta$ from (1) with $F_\beta = 1 - E$ and $\alpha = \frac{1}{1+\beta^2}$ by solving $\frac{\partial E}{\partial P} = \frac{\partial E}{\partial R}$ at $R/P = \beta$. The parameter $\beta$ is intended to allow the practitioner control by giving $\beta$ times more importance to recall than precision. Using the derivation steps from Sasaki, a general form of $F_\beta$ for any derivative order can be shown as (2),
$$F_\beta = \frac{\left(1+\beta^{\frac{2}{2-n}}\right)PR}{\beta^{\frac{2}{2-n}}P + R} \tag{2}$$
where $n$ pertains to $\frac{\partial^n E}{\partial P^n} = \frac{\partial^n E}{\partial R^n}$, resulting in $\alpha = \frac{1}{1+\beta^{2/(2-n)}}$. Note that $F_\beta = 1 - E$, and $0 \le \alpha \le 1$. The proof is found in Appendix A. For $n \to \infty$ the exponent $\frac{2}{2-n} \to 0$, so the equation reduces to the equality $\alpha = 1 - \alpha$, implying the weighting of recall and precision evens out. Using (2), it can be seen that $\lim_{n\to\infty} F_\beta = \frac{2PR}{P+R} = F_1$, which is most commonly used in the literature. The reason for showing this limiting case is to provide a justification for fixing $\beta = 1$ (instead of claiming equal importance for $P$ and $R$) in the default case of any algorithm – in particular the algorithm in section 6.
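As a quick numerical check of (2), consider the following minimal sketch; the function name and example values are illustrative only, and the general exponent form follows the reconstruction above:

```python
def f_beta(p, r, beta=1.0, n=1):
    """F-beta per (2); n is the derivative order, and n=1 gives the usual form.
    n=2 is degenerate (the condition forces beta=1), treated here as exponent 1."""
    t = beta ** (2.0 / (2.0 - n)) if n != 2 else 1.0
    return (1.0 + t) * p * r / (t * p + r)

print(f_beta(0.6, 0.9, beta=2.0))        # recall weighted more heavily: ~0.818
print(f_beta(0.6, 0.9))                  # F1 = 2pr/(p+r) = 0.72
print(f_beta(0.6, 0.9, beta=2.0, n=50))  # large n approaches F1 regardless of beta
```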
5 Reformulating the F-Beta to leverage statistical distributions
CE for neural networks is seen when different network weights give different precision and recall yet result in similar performance scores. CE also provides a basis for this paper's use of $\beta$ from the $F_\beta$ measure to guide training through penalties, in lieu of an explicit loss (or surrogate loss) function. In fact, Vashishtha et al. (2022) use the $F$-score as part of a preprocessing step for feature selection prior to their ensemble model (EM-PCA then ELM) for fault diagnosis. They show significant performance improvement with their approach, which adds supporting evidence to this paper's use of the $F_\beta$ as a loss penalty for feature selection via gradient-based learning.
The first step is to reformulate (2) for the $n = 1$ case. This makes assuming statistical distributions easier. Consider the following reformulation through multiplicative decomposition in (3), which assumes $X$ and $Y$ to be independent random variables.
$$F_\beta = Z = X \cdot Y \tag{3}$$
where $Z = XY$, with the factors $X$ and $Y$ induced by precision, recall, and $\beta$. $Z$ indirectly captures imbalance in the model prediction from the underlying data. If precision and recall are on opposite ends of the scale, then $Z$ will reflect this, while maintaining continuity when precision and recall are directionally consistent. $Z$ can be thought of as a weighting scheme that appears recall-centric with a precision-based penalty. For instance, for both high (or both low) precision and recall, the weighting is consistent with intuition. However, when precision and recall are on opposite ends of the scale, the weighting sways toward the aggregate with the lower score. Two use cases are considered for (3): $X$ and $Y$ follow U & IU, respectively, and $X$ and $Y$ follow Ga & IE, respectively.
5.1 Case 1: Uniform and Inverse Uniform
The thought behind U & IU is to apply a (flat) equal distribution for both $X$ and $Y$. These assumed distributions are applied to $X$ and $Y$ as follows:
Let $X$ follow a Uniform distribution and $Y$ an Inverse Uniform distribution; then the respective densities follow as (4) and (5),
where both densities share a single parameter $\theta$. Note that for both distributions there is only one chosen $\theta$, and this value replaces the need to have an explicit form that includes $\beta$ as a parameter. This is for convenience, as well as noticing that the two densities differ only by a factor. So allowing $\theta$ to vary broadly (which is the case in section 6) is enough to balance this convenience tradeoff. Next is to derive the joint distribution, which is used in section 6. It can be shown (the proof is in Appendix B) that the joint distribution is given in (6).
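Although the exact form of (6) depends on the U & IU assumptions in (4) and (5), the route to it is the standard product-distribution identity for independent random variables (see, e.g., Hogg and Craig 1995): for independent $X$ and $Y$ with $Y > 0$,

$$P(XY \le z) = \int_{0}^{\infty} F_X\!\left(\frac{z}{y}\right) f_Y(y)\,dy,$$

which is then evaluated under the assumed densities.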
[Figure 1: CDF surface $P(Z \le z)$ over a precision–recall grid for the U & IU mixture; panel (a) shows a lower $z$ value and panel (b) a higher $z$ value.]
To understand this flat mixture, consider Figure 1 – the CDF surface for a grid of precision and recall. (Note: the blue and red heat coloring is from the CDF and highlights curvature and/or rate of change.) For a lower $z$ value, Figure 1(a) shows that the smaller $\theta$ setting has a faster rate of change as compared to the larger one. The same conclusion is apparent in Figure 1(b), which is for a higher $z$ value. For both figures, more curvature is seen for lower $\theta$ values. This suggests that a larger $\theta$ value smooths the surface and is a better candidate for the default in the algorithm in Section 6.
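Where a closed form such as (6) is not at hand, the CDF surface can be approximated by simulation. A minimal Monte Carlo sketch, assuming for illustration $X \sim \mathrm{U}(0,\theta)$ and $Y$ the reciprocal of an independent $\mathrm{U}(0,\theta)$ draw (the paper's exact parameterization in (4)–(5) may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_cdf(z, draw_x, draw_y, n=100_000):
    """Monte Carlo estimate of P(XY <= z) for independent X and Y."""
    return float(np.mean(draw_x(n) * draw_y(n) <= z))

theta = 2.0  # illustrative value
draw_u  = lambda n: rng.uniform(0.0, theta, n)        # X ~ U(0, theta), assumed form
draw_iu = lambda n: 1.0 / rng.uniform(0.0, theta, n)  # Y = 1/V, V ~ U(0, theta), assumed form
print(mc_cdf(0.5, draw_u, draw_iu))
```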
5.2 Case 2: Gaussian and Inverse Exponential
A more informed distributional approach for $X$ and $Y$ considers Ga & IE, respectively. The reason to use the Gaussian distribution for $X$ is to allow bell-shaped variability around a fixed location that is based on recall and ultimately $Z$. The weighting of $X$ by $Y$ uses the Inverse Exponential distribution because, with suitable selections of the rate parameter, the distribution can shift mass from left to right as well as appear uniformly distributed. This provides practitioners enough flexibility in experimenting with different weights. The following shows the assumptions for $X$ and $Y$:
Let $X$ follow a Gaussian distribution and $Y$ an Inverse Exponential distribution; then the respective densities follow as (7) and (8),
where the location in (8) is shifted by recall, following the definition in (3), and the variance captures the variability of $X$. Using both (7) and (8), the distribution for (3) is now split around an indicator as follows:
(9)
where $\Phi$ denotes the standard normal (Gaussian) distribution function evaluated at the stated value for a given mean and variance. (Refer to Appendix C for the proof.) Similar to before, the focus is on the indicator as defined in (9).
[Figure 2: CDF surfaces for the Ga & IE mixture at a lower and a higher $z$ value; panels (a)–(d) vary the variance and rate parameters.]
Since this distributional mixture has more flexibility due to more parameters, Figure 2 highlights this for several settings of the variance and rate. The probabilities are again computed at a lower $z$ value and at a higher $z$ value for comparison. For a fixed rate, varying the variance impacts the curvature of the surface, with higher values producing a flattening effect. Figure 2(a) shows this distinctly. Conversely, as the rate increases with a fixed variance, the rate at which the surface changes is very apparent. This can be seen by juxtaposing Figure 2(c) and 2(a) or Figure 2(d) and 2(b), and noticing that the increase in rate produces a clear increase in the rate of change. These observations match the intuition that the variance is linked to the shape of the bell curve and the rate is linked to the rate of change. It also serves as a basis of intuition behind the algorithm in Section 6. That is, a faster rate of change along with a curved (and/or smoother) surface provides loss penalties that adapt quickly per batch using the aggregated information from precision and recall.
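A matching Monte Carlo sketch for the Ga & IE case; the recall-based Gaussian location and the reciprocal-exponential construction are illustrative stand-ins for (7) and (8):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative settings: Gaussian location taken from recall, plus a rate choice.
r, sigma, rate, z, n = 0.8, 0.1, 2.0, 0.5, 100_000
x = rng.normal(r, sigma, n)               # X ~ Gaussian around recall (assumed)
y = 1.0 / rng.exponential(1.0 / rate, n)  # inverse exponential; numpy takes scale = 1/rate
print(np.mean(x * y <= z))                # Monte Carlo estimate of P(Z <= z)
```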
6 Knee algorithm and Weighted Cross-Entropy
6.1 Knee algorithm to find optimal $\beta$ values
Now that the probabilities $P(Z \le z)$ for some $z$ are established in sections 5.1 and 5.2, the goal is to use them to get an optimal value, $\beta^*$. There are a couple of things to consider. First, because $Z$ is grouped into $X$ and $Y$ with distributional assumptions, using maximum likelihood estimation (MLE) is not particularly suitable here. Also, the parameters in (4), (5), and (8) are set in advance and do not need to be estimated. Second, the observed data is only one data point per training batch, namely precision and recall. Given this and the natural bend of the function, a knee algorithm is applicable. From Satopaa et al. (2011), the knee of a curve is associated with good operating points in a system right before the performance levels off. This removes the need for complex system-specific analysis. Furthermore, they provide a definition of curvature that supports their method being application independent – an important property for this paper. Algorithm 1 implements (and slightly alters) the Kneedle algorithm from Satopaa et al. to detect the knee in the curve. Refer to Algorithm 1 for the formal pseudo code.
A brief explanation in plain words is as follows (a code sketch follows the list):
1. For any training batch, compute precision (p) and recall (r). Then, with a predefined maximum, set equally spaced $\beta$ values up to that maximum, and use section 5 to compute the probabilities. (This replaces step 1 from Satopaa et al.) Let these points represent a smooth curve in $\beta$.
2. When the curve decreases in $\beta$, convert it to a knee by taking the difference of the probabilities from the maximum, that is, $y_i \leftarrow \max_j y_j - y_i$ for all $i$. This is necessary because of the formulation of the $F_\beta$ metric.
3. Normalize the points to a unit square and call these $x_n$ and $y_n$.
4. Take the difference of the points and label these $x_d$ and $y_d$.
5. Find the candidate knee points by getting all local maxima; label these $x_{lm}$ and $y_{lm}$.
6. Take the average of $x_{lm}$, and this will be $\beta^*$. (This simplifies Satopaa et al.)
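A compact sketch of these six steps; the decreasing-curve check in step 2 and the global-maximum fallback are assumptions of this sketch, not of Algorithm 1:

```python
import numpy as np

def knee_beta(betas, probs):
    """Find beta* as the knee of the probability-vs-beta curve (Algorithm 1 sketch)."""
    x = np.asarray(betas, dtype=float)   # step 1: equally spaced beta grid
    y = np.asarray(probs, dtype=float)   # step 1: P(Z <= z) per beta from Section 5

    if y[0] > y[-1]:                     # step 2: flip a decreasing curve into a knee
        y = y.max() - y

    xn = (x - x.min()) / (x.max() - x.min())   # step 3: normalize to the unit square
    yn = (y - y.min()) / (y.max() - y.min())

    d = yn - xn                          # step 4: difference curve

    # step 5: candidate knees are the local maxima of the difference curve
    idx = [i for i in range(1, len(d) - 1) if d[i - 1] < d[i] >= d[i + 1]]
    if not idx:
        idx = [int(np.argmax(d))]        # fallback: global maximum

    return float(np.mean(x[idx]))        # step 6: average of the candidates
```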
6.2 Proposed Weighted Binary Cross-Entropy
The weighted binary cross-entropy loss is primarily focused on the imbalanced use case where a minority class exists. This paper posits that, from the shuffling of data observations, as is frequently done while training, relevant aggregate information is available from the batch. For instance, a fixed minority class observation is grouped among different batches of the majority class. The interaction effect of this observation among these randomly varying training batches is often overlooked. It is this interaction that can be inferred through the precision and recall aggregates, then transferred as a penalty to the loss function via $\beta^*$ in a probabilistic way. By using Algorithm 1 to get $\beta^*$, the proposed loss is,
$$L(\mathbf{w}) = -\sum_i \left[\, y_i \log \hat{y}_i(\mathbf{x}, \mathbf{w}) + \beta^*\,(1-y_i) \log\left(1-\hat{y}_i(\mathbf{x}, \mathbf{w})\right) \right] \tag{10}$$
where the function $\hat{y}_i(\mathbf{x}, \mathbf{w})$ is the $i$-th element of the prediction of a neural network using the inputs $\mathbf{x}$ and training weights $\mathbf{w}$, and $y_i$ is the $i$-th element of the true target label. When considering the majority class, or $y_i = 0$, the loss is weighted by $\beta^*$. Therefore, for correctly predicted observations, the loss has a reduction by $\beta^*$. When incorrectly predicted, the loss is magnified by the same amount. For the minority class, or $y_i = 1$, the loss is unchanged. This is intentional because under imbalanced data there are far fewer such observations, and computing precision and recall on them leads to numerical instability or frequent edge cases for Algorithm 1.
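A minimal sketch of (10), assuming the straightforward reading in which only the $(1-y_i)$ term is scaled by $\beta^*$:

```python
import numpy as np

def weighted_bce(y_true, y_pred, beta_star, eps=1e-7):
    """Weighted binary cross-entropy: label-0 (majority) terms scaled by beta*,
    label-1 (minority) terms left unchanged, per the discussion above."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    pos = y_true * np.log(y_pred)                             # minority, unweighted
    neg = beta_star * (1.0 - y_true) * np.log(1.0 - y_pred)   # majority, weighted
    return float(-np.mean(pos + neg))
```

In a training loop, beta_star would be recomputed per batch from that batch's precision and recall via Algorithm 1.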
6.3 Understanding the $\beta^*$ Surface and Weighted Cross-Entropy
[Figure 3: $\beta^*$ surfaces from Algorithm 1 over the precision–recall unit square; panel (a) the U & IU mixture, panels (b) and (c) the Ga & IE mixture at a lower and a higher rate.]
Figure 3 highlights the $\beta^*$ surface generated from Algorithm 1 leveraging probabilities from Section 5. First, the U & IU mixture in Figure 3(a) suggests that the shape of the surface remains relatively similar even when doubling $\theta$. This is an important point toward fixing the corresponding setting for the Ga & IE. Based on Figure 3(a), the U & IU mixture penalizes more on the outskirts of recall, while the immediate penalties arise on the diagonal of the unit square. This suggests that precision and recall estimates from training cause immediate penalties when they are on opposite ends of the range as well as on the diagonal when these values start to even out. For the Ga & IE mixture, Figures 3(b) and 3(c) show conclusions similar to the U & IU mixture along with additional insights. For a lower rate, or Figure 3(b), diagonal spikes as well as a precision-centric penalty for higher variance are seen. For a higher rate, or Figure 3(c), a similar diagonal is retained as in Figure 3(b). Furthermore, for increasing variance, the penalty evolves from precision-centric to a vertical separation on the unit grid around the middle of the precision range. The overall interpretation is the following: for a lower rate, increasing the variance creates a slightly more precision-based penalty; while for a higher rate, increasing the variance causes the penalty to become more balanced between recall and precision. The choice of these parameters is problem specific, but provides the practitioner flexibility in determining the best selection for their use case. On a separate note, from Figure 3, a spiky surface is obvious, which is partially explained by the default setting in the algorithm. This is a strong sign of immediate and configurable penalties.
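A surface like Figure 3 can be reproduced by sweeping the unit square; knee_beta is the sketch from section 6.1, and cdf_z below is a toy placeholder for the chosen Section 5 CDF:

```python
import numpy as np

# Toy placeholder CDF; swap in the U & IU or Ga & IE form from Section 5.
cdf_z = lambda b, p, r: 1.0 - np.exp(-b * p * r)

betas = np.linspace(0.05, 4.0, 80)    # illustrative beta grid
axis = np.linspace(0.05, 0.95, 50)    # precision and recall grid
surface = np.empty((axis.size, axis.size))
for i, p in enumerate(axis):
    for j, r in enumerate(axis):
        surface[i, j] = knee_beta(betas, [cdf_z(b, p, r) for b in betas])
```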
7 Datasets and Experimentation
7.1 Datasets
The origins of the $F_\beta$ metric come from text retrieval, so it is important to verify this method across different categories of data. In particular, image data from CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html), text data from IMDB movie sentiments (Maas et al. 2011), and structured/tabular data from the Census Income Dataset (Dua and Graff 2019) are tested. For each experiment, the primary label (i.e., label 1) is either imbalanced or forced to be imbalanced to reflect real-world scenarios. Because CIFAR-10 contains multiple image labels, the airplane label is the primary label and all others are combined. This yields a 10% class imbalance. IMDB movie sentiment reviews (positive/negative text) are not imbalanced; the positive sentiments in the training data are reduced to 1K randomly sampled sentiments, yielding a 7.4% imbalance (Table 1). The Census Income tabular data contains 14 input features (i.e., age, work class, education, occupation, etc.) with 5 numerical and 9 categorical features. The binary labels are greater than 50K salary (label 1) and less than 50K salary (label 0). By default, greater than 50K salary is already imbalanced at 6.2%. The training and validation dataset sizes for each data category are as follows: for CIFAR-10, 50K training and 10K validation; for IMDB, 13.5K training and 25K validation; and for the Census data, 200K training and 100K validation. In terms of class imbalance, this paper considers a proportion of label 1 under 10% to be significantly imbalanced and between 10% and 25% to be moderately imbalanced. As some heuristic rationale for the 10% threshold: even a model with perfect recall (100%) needs a precision well above the 10% prevalence to achieve a strong $F_1$. In practical examples, this scenario can occur with weakly discriminative features. Therefore, this paper seeks to test this algorithm in scenarios that would need an improved precision. A variety of imbalanced and balanced scenarios will be tested in this paper.
Two real-life use cases related to cylindrical tanks are also considered, providing a physical domain to test Algorithm 1. Chauhan et al. (2022) developed an arithmetic optimizer with slime mould algorithm and Chauhan et al. (2023) developed an evolutionary based algorithm with slime mould algorithm; both algorithms focus on global parameter optimization. They tested these algorithms on several benchmark problems, one of which is called the pressure vessel design. The problem is a constrained parameter optimization (i.e., material thickness, and cylinder dimensions) for minimizing a cost function. This paper focuses on using the HAOASMA algorithm by Chauhan et al. in a simulation to convert the problem into a binary classification. The second use case is derived from Underground Storage Tanks (UST) and is also inspired by Chauhan et al.’s pressure vessel problem. The physical shape (i.e., the cylindrical shape) of USTs is similar to the pressure vessel design. USTs are used to store petroleum products, chemicals, and other hazardous materials underground. These structures could deform underground and possibly explain a false positive leak. Ramdhani (2016) and Ramdhani et al. (2018) explored parameter optimization of UST dimensions changing from cylindrical to ellipsoidal. The observed data are vertical (underground) height measurements, which can contain uniformly distributed error. Ramdhani et al. used these measurements and the volumetric equations (12), (13), (14), and (15) - derived from a cross-sectional view - to develop a methodology to estimate tank dimensions and test if the shape has deformed. The cross-sectional view can be seen in Figure 5.
The conversion of both of these real-life use cases into a classification involves establishing a baseline set of parameters to simulate data for label 0; varying these parameters simulates data for label 1. For the pressure vessel design, the baseline parameters are the HAOASMA solution values for the thicknesses $T_s$ and $T_h$ and the vessel dimensions. To convert this to a classification, the thickness parameters $T_s$ and $T_h$ are changed from the baseline while the radius and length dimensions are drawn from a normal distribution. Using these values, the cost function is computed via (11). These cost values, concatenated with the radius and length arrays, serve as the input to a neural network classifier. Label 1 reflects simulated data using $T_s$ and $T_h$ values changed from the baseline; label 0 uses the HAOASMA baseline values. Appendix D provides the equations, the distributional plots seen in Figure 4, and a detailed explanation of the simulation procedure (Algorithm 2). For the UST problem, Ramdhani used a measurement error model with an error on the height measurement and another on the volume computation. The same model is used to simulate data in this paper. The baseline is a cylinder, and variations to the vertical and horizontal axes represent a cylinder deformed into an ellipse. Using the tank dimensions along with (12) and (14), or (13) and (15), the volume is computed. These volumes, concatenated with noisy height measurements, are the inputs to a neural network classifier. Label 1 reflects simulated data using the variations to the cylinder baseline; label 0 is the baseline cylinder with its radius and length. Refer to Appendix E for a detailed explanation of the simulation (Algorithm 3) along with comparison plots and volume equations.
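A hedged sketch of the pressure-vessel half of this conversion; the cost below is a commonly used benchmark form of the objective (a stand-in for (11)), and all numeric values are illustrative stand-ins for the baseline and shifted parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def pv_cost(ts, th, R, L):
    # A common form of the pressure vessel design cost (stand-in for (11)).
    return (0.6224 * ts * R * L + 1.7781 * th * R**2
            + 3.1661 * ts**2 * L + 19.84 * ts**2 * R)

def simulate(ts, th, n, r_mu=42.0, l_mu=176.0, sd=1.0):
    """Stand-in for Algorithm 2: draw R and L from normals, compute the cost,
    and return the [cost, R, L] feature array for the classifier."""
    R = rng.normal(r_mu, sd, n)
    L = rng.normal(l_mu, sd, n)
    return np.column_stack([pv_cost(ts, th, R, L), R, L])

n = 1000
X0 = simulate(0.8125, 0.4375, n)   # label 0: illustrative baseline thicknesses
X1 = simulate(0.7500, 0.5000, n)   # label 1: illustrative shifted thicknesses
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n), np.ones(n)])
```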
7.2 Model Networks
7.2.1 Image Network
For the CIFAR-10 image dataset, ResNet (He et al. 2016) version 1 is applied. The number of layers for ResNet is 20, which upon initial experimentation is adequate for speed and generalization in this case. The Adam optimizer is implemented with a fixed learning rate for a total of 30 epochs. No learning rate schedule is used because of the intentionally low number of epochs, in order to validate faster training via the proposed loss algorithm. The training batch size is 32. Modest data augmentation is done: random horizontal and vertical shifts of 10%, and horizontal and vertical flips.
7.2.2 Text Network
For the IMDB movie sentiments, a Transformer block (which applies self-attention; Vaswani et al. 2017) is used. The token embedding size is 32, and the transformer has 2 attention heads and a hidden layer size of 32, including dropout rates of 10%. A pooling layer and the two layers that follow – a dense ReLU-activated layer of size 20 and a dense sigmoid layer of size 1 – give the final output probability. As for preprocessing, a vocabulary size of 20K and a maximum sequence length of 200 are implemented. The training batch size is 32.
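A Keras sketch matching this description; collapsing the token/position handling to a plain token embedding is an assumption of this sketch:

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, maxlen, embed_dim, num_heads, ff_dim = 20_000, 200, 32, 2, 32

inputs = layers.Input(shape=(maxlen,))
x = layers.Embedding(vocab_size, embed_dim)(inputs)            # token embedding
attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(x, x)
x = layers.LayerNormalization()(x + layers.Dropout(0.1)(attn))
ff = layers.Dense(ff_dim, activation="relu")(x)                # transformer feed-forward
x = layers.LayerNormalization()(x + layers.Dropout(0.1)(layers.Dense(embed_dim)(ff)))
x = layers.GlobalAveragePooling1D()(x)                         # pooling layer
x = layers.Dense(20, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
```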
7.2.3 Structured/Tabular Network
For the Census Income Dataset, a standard encoder embedding paradigm Schmidhuber (2015) is used. Specifically, all categorical features with an embedding size of 64 are concatenated, then numerical features are concatenated to this embedding vector. Afterwards, a 25% dropout layer and the two layers that follow – a fully connected dense layer with GELU activation of size 64 and a sigmoid activated layer of size 1 – provide the final output probability. The training batch size is 256.
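A sketch of the described embedding paradigm; the categorical feature names and cardinalities are placeholders, not the actual Census schema:

```python
import tensorflow as tf
from tensorflow.keras import layers

cat_features = {"workclass": 9, "education": 16, "occupation": 15}  # placeholders
cat_inputs, cat_embeds = [], []
for name, cardinality in cat_features.items():
    inp = layers.Input(shape=(1,), name=name)
    cat_inputs.append(inp)
    cat_embeds.append(layers.Flatten()(layers.Embedding(cardinality, 64)(inp)))

num_input = layers.Input(shape=(5,), name="numeric")   # 5 numerical features
x = layers.Concatenate()(cat_embeds + [num_input])     # concat embeddings + numerics
x = layers.Dropout(0.25)(x)
x = layers.Dense(64, activation="gelu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(cat_inputs + [num_input], outputs)
```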
7.2.4 UST/Vessel Network
For the real-life use cases on simulated data, the model network is simple because of the minimal number of features. The network is a sequential set of dense layers of sizes 20, 10, and 1. The last layer of size 1 has a sigmoid activation to give the final output probability. Additionally, a dropout of 10% is added after both middle layers. The training batch size is 128.
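The corresponding Keras sketch (ReLU on the hidden layers is an assumption; the activations are not stated above):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(20, activation="relu"),
    layers.Dropout(0.10),
    layers.Dense(10, activation="relu"),
    layers.Dropout(0.10),
    layers.Dense(1, activation="sigmoid"),
])
```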
7.3 Experimental Results
The results in Table 1 compare the use of the loss function (10) for different models based on U & IU and Ga & IE to a baseline case of ordinary cross-entropy. All results shown in this table are computed on the validation datasets for each data category above (see section 7.1 for the dataset sizes). For ease of presentation, $M_1$ is Model 1: U & IU from (5.1), and $M_2$ is Model 2: Ga & IE from (5.2). The superscripts denote the parameters being explored. The baseline is the same model network trained using ordinary cross-entropy.
7.3.1 Image Results
For the image network, Table 1 shows modest improvement over the baseline for a moderately sized parameter setting. This suggests that image data trains better under constant penalties on the outskirts of the unit square toward the imbalance of high precision and low recall. High precision and low recall imply image confusion between classes in the feature embedding space. In fact, this can lead to large implications as in Grush (2015). Algorithms like DeepInspect (Tian et al. 2020) help to detect confusion and bias errors to isolate misclassified images, leading to repair-based training algorithms such as Tian (2020) and Zhang et al. (2021). But Qian et al. (2021) empirically show that such repair or de-biasing algorithms can be inaccurate with one fixed-seed training run. The importance of this result is now evident because the $\beta^*$ penalty quickly penalizes the network in a way that inherently mirrors DeepInspect's confusion/bias detection without the need for repair algorithms.
7.3.2 Text Results
The training results for the text network show by far the most improvement, with a nearly 14% boost in the $F_1$ score over the baseline for the best model. Not only is the performance notable, the model parameter selections are consistent – the parameters move in the same direction. In other words, the training shows improvement over the baseline for one parameter pair, and this improvement continues in the same direction as the parameters are varied further. This is similar to section 7.3.1 because, first, the architecture is generalizing better (seen by the $F_1$ score) for label confusion (i.e., language context) and, second, it adjusts for the intentionally configured imbalance and incorrect labeling (a known issue for this dataset). The incorrect labeling in the IMDB dataset is shown to be non-negligible – upwards of 2-3% – by Klie et al. (2022) and Northcutt et al. (2021). In particular, Northcutt et al. show that small increases in label errors often cause a destabilizing effect on machine learning models, for which the confident learning methodology was developed to detect them. Klie et al. analyze 18 methods (including confident learning) for Automated Error Detection (AED) and show the importance of AED for data cleaning. In close proximity to the AED methodology, another paradigm is Robust Training with Label Noise. Song et al. (2022) provide an exhaustive survey ranging from robust architectures (i.e., noise adaptation layers) and robust regularization (i.e., explicit and implicit regularization) to robust loss (i.e., loss correction, re-weighting, etc.) and sample selection. It is in this context that the proposed framework sits between AED and Robust Training with Label Noise on this IMDB dataset, which is known to have errors. The $\beta^*$ weighting serves two purposes: (1) as a robust loss through the re-weighting on the batch and (2) as a means to detect and down-weight possible label errors.
7.3.3 Structured/Tabular Results
The results for the structured/tabular network do not show any improvement over the baseline, nor any indication of possible improvement through the extra parameter variations. From Table 1, the best performing model for this dataset (aside from the baseline) is the configuration whose penalty concentrates where both precision and recall are low. The interpretation of this parameter configuration suggests that training tabular data is very susceptible to both low precision and recall, hence the high penalty in that area of the unit square in Figure 3. Despite embedding categories and numeric features into a richer vector space, the non-contextual nature of tabular data may not necessarily be best trained through these architectures. Furthermore, Sun et al. (2019) apply a two-dimensional embedding (i.e., simulating an image) to this Census dataset, and the results show that a decision tree (i.e., xgboost) performs similarly. It is worth mentioning that Sun et al. present these results with an accuracy measure (not $F_1$), which is misleading since the data is naturally imbalanced. However, a similar general conclusion is given by Borisov et al. (2021) for tabular data – decision trees have faster training time and generally comparable accuracy as compared with embedding-based architectures. These results are unsurprising because, as stated by Wen et al. (2022), tabular data is not contextually driven data like images or languages, which contain position-related correlations. It is heartening to notice that, after Wen et al. apply a causally-aware GAN to the census data, the resulting $F_1$ score is similar to the baseline result in Table 1. Because of these results, there is an important finding: the type of data, in particular contextual data which is the basis for the creation of the $F_\beta$ metric, plays a significant role when using the metric alongside a loss function. This hypothesis is studied further in the benchmark data in Section 7.4.
7.3.4 UST/Pressure Vessel Results
The results for the simulation of the real-life use cases can be found in Table 2. In the UST case, it is evident that this methodology outperforms the baseline cross-entropy in determining a shape change from a cylinder to an ellipse. For example, in the easier CvE scenario one model family appears to be better, while across the extra variations both families perform the same. This trend is also observed in the results for the Image and Text data presented in Table 1. The interpretation is that a slightly more recall-centric penalty may be optimal for this scenario. Interestingly, for the easier CHvEH scenario the same conclusion holds: one model family appears to be better, and the extra variations perform the same. These variations mirror CvE but in the other direction, suggesting that a balanced or slightly more precision-centric penalty is optimal. In the difficult scenario, both CvE and CHvEH are closely aligned with one model family, and for CHvEH the best performer is among the extra variations. Overall, there is between 12% and 28% improvement over the baseline or standard cross-entropy for this simulation. Regarding the PV data, for the easier scenario one family appears to be better, with the other model family not far behind. In the difficult scenario there is no improvement over the baseline cross-entropy, and the reason is likely the significant overlap in distributions seen in Figure 4. These results are impactful because the commonality between model families begins to surface: for the easier scenario, a more recall-centric penalty turns out to be better, while in the difficult scenario, a balanced or slightly precision-centric penalty is more effective. This finding is intuitive.
Table 1: Validation $F_1$ scores. Columns V1–V9 span the parameter variations from Section 6.3 followed by the extra variations.

| Dataset | Baseline | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Image¹ | 0.8161 | 0.8261 | 0.8085 | 0.8193 | 0.8232 | 0.8257 | 0.8266 | 0.8068 | 0.8087 | 0.8178 |
| Text² | 0.6749 | 0.6393 | 0.7175 | 0.6170 | 0.6547 | 0.6673 | 0.7236 | 0.5460 | 0.7666 | 0.7364 |
| Structured³ | 0.5193 | 0.4170 | 0.3917 | 0.4126 | 0.4635 | 0.3930 | 0.3824 | 0.3511 | 0.3890 | 0.4516 |

¹ The image dataset is CIFAR-10. The airplane label versus the remaining labels is the binary label basis, giving a training data imbalance of 10%. Training data size is 50K and validation is 10K.
² The text dataset for NLP is the IMDB movie sentiment data with a binary positive/negative sentiment label. The vocabulary size is 20K and the maximum review length is 200. The training set is imbalanced by choosing only 1K positive sentiments, which yields an imbalance of 7.4%. The training data size is 13.5K and validation is 25K.
³ The structured or tabular dataset is the Census Income Dataset from the UCI repository. The labels are greater than or less than 50K salary. The data is already imbalanced with a rate of 6.2% for >50K. The training data size is 200K and validation is 100K.
Table 2: Validation $F_1$ scores for the simulated UST and pressure vessel use cases. Columns V1–V9 span the parameter variations from Section 6.3 followed by the extra variations.

| Dataset | Baseline | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CvE¹ | 0.9691 | 0.9228 | 0.9983 | 0.9915 | 0.9898 | 0.9565 | 0.9966 | 0.9915 | 0.9966 | 0.9673 |
| CvE² | 0.3169 | 0.3147 | 0.3351 | 0.3469 | 0.3296 | 0.3333 | 0.3224 | 0.3401 | 0.3362 | 0.3573 |
| CHvEH³ | 0.9831 | 0.9813 | 0.9813 | 0.9898 | 0.9915 | 0.9831 | 0.9726 | 0.9882 | 0.9831 | 0.9882 |
| CHvEH⁴ | 0.2891 | 0.3345 | 0.3427 | 0.3515 | 0.3262 | 0.3159 | 0.3636 | 0.3701 | 0.3395 | 0.3425 |
| PV⁵ | 0.9967 | 0.9992 | 0.9983 | 0.9483 | 0.9831 | 0.9967 | 0.9967 | 0.9891 | 0.9727 | 0.9958 |
| PV⁶ | 0.7515 | 0.4552 | 0.5057 | 0.4893 | 0.4722 | 0.5248 | 0.4934 | 0.4861 | 0.4675 | 0.5161 |

¹ The simulation has label 0 with the baseline cylinder parameters versus label 1 with the first pair of axis variations (easier scenario).
² The simulation has label 0 with the baseline cylinder parameters versus label 1 with the second pair of axis variations (more difficult scenario).
³ The simulation has label 0 with the baseline end-cap parameters versus label 1 with the first pair of axis variations (easier scenario).
⁴ The simulation has label 0 with the baseline end-cap parameters versus label 1 with the second pair of axis variations (more difficult scenario).
⁵ The simulation has label 0 with the baseline thicknesses versus label 1 with the first pair of thickness variations (easier scenario).
⁶ The simulation has label 0 with the baseline thicknesses versus label 1 with the second pair of thickness variations (more difficult scenario).
7.4 Further Experimentation: Benchmark Analysis
Following the benchmark analysis from Aurelio et al., a similar approach is done for the Image, Text, and Tabular data. This expands the analysis from Table 1 to provide a more detailed and comprehensive view across various well-known datasets. The results can be found in Table 3, 4, and 5. The footnotes in these tables are explained as follows: the breakdown of train and test data sizes, the proportion of label 1, the labeling convention for label 1 versus label 0 (if multiple labels exist), and the location of the data, if necessary. For example, label 9 vs all means the label 9 is the label 1 and everything else is marked as label 0. Detailed explanations, links, and training details for all the datasets are provided in the footnotes for each table. At a high level, for images CIFAR-10, CIFAR-100 and Fashion MNIST are analyzed. For text, AG’s News Corpus, Reuters Corpus Volume 1, Hate Speech and Stanford Sentiment Treebank are analyzed. For the tabular data, 10 classical datasets from UCI repository are analyzed. Finally, the same model networks from Section 7.3 will be used.
7.4.1 Image Results
Comparing the CIFAR-10 result in Table 1 versus Table 3, the preferred model family changes, but the interpretation remains consistent: a recall-centric penalty is favored. The CIFAR-100 examples, with an imbalance of 1%, follow a similar recall-centric penalty under the label convention 9 vs all. However, under the labeling 39 vs all, a more precision-centric penalty is preferred. This illustrates the problem-specific nature of selecting a model family and parameters, showcasing the flexibility of this paper's methodology. Notably, there is a 14% increase in the $F_1$ score for CIFAR-100 under the 39 vs all label convention. Fashion MNIST favors a more precision-centric penalty. The most intriguing result is that, across the extra variations, the most frequent top performer corresponds to a more balanced penalty. This suggests that such a configuration could be a starting point of exploration given the balanced nature of the penalty distribution.
Table 3: Image benchmark $F_1$ scores. Columns V1–V9 span the parameter variations from Section 6.3 followed by the extra variations.

| Dataset | Baseline | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10¹ | 0.9216 | 0.9088 | 0.9204 | 0.9196 | 0.9263 | 0.9122 | 0.9194 | 0.9119 | 0.9268 | 0.9173 |
| CIFAR-100² (9 vs all) | 0.7345 | 0.7804 | 0.7273 | 0.6941 | 0.7594 | 0.7692 | 0.7501 | 0.7167 | 0.7314 | 0.7683 |
| CIFAR-100² (39 vs all) | 0.6021 | 0.6592 | 0.6778 | 0.6871 | 0.6381 | 0.6818 | 0.6509 | 0.6351 | 0.6702 | 0.6704 |
| Fashion MNIST³ (0 vs all) | 0.8651 | 0.8638 | 0.8663 | 0.8462 | 0.8651 | 0.8593 | 0.8544 | 0.8558 | 0.8638 | 0.8672 |
| Fashion MNIST³ (9 vs all) | 0.9627 | 0.9621 | 0.9656 | 0.9656 | 0.9675 | 0.9648 | 0.9615 | 0.9648 | 0.9641 | 0.9681 |

¹ Train/test 50K/10K, label 1 at 10%, labeling is 1 vs all.
² Train/test 50K/10K for both rows, label 1 at 1%, labeling is 9 vs all and 39 vs all, respectively.
³ Train/test 50K/10K for both rows, label 1 at 10%, labeling is 0 vs all and 9 vs all, respectively.
7.4.2 Text Results
Referring to Table 4, for the AG's News Corpus and Reuters Corpus Volume 1 under the labeling crude vs all, one model family is preferred, and its parameter selections suggest a slightly more precision-centric penalty. When considering Reuters Corpus Volume 1 with the labeling trade vs all and the Stanford Sentiment Treebank, there is no observed improvement. In the case of the Hate Speech data, a more distinctive context, there is roughly a 4% boost under the best model. This parameter selection is also a balanced penalty between recall and precision. Overall, similar to the Image benchmark conclusion, a balanced penalty is a frequent top performer in the extra variation set of parameters. This insight of balanced penalty selection also holds for contextual text data.
Table 4: Text benchmark $F_1$ scores. Columns V1–V9 span the parameter variations from Section 6.3 followed by the extra variations.

| Dataset | Baseline | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ag_news¹ | 0.9632 | 0.9474 | 0.9632 | 0.9639 | 0.9626 | 0.9624 | 0.9553 | 0.9404 | 0.9634 | 0.9655 |
| rcv1² (crude vs all) | 0.9333 | 0.9298 | 0.9396 | 0.9461 | 0.9211 | 0.9451 | 0.9316 | 0.927 | 0.9356 | 0.9501 |
| rcv1² (trade vs all) | 0.9324 | 0.92 | 0.9251 | 0.9189 | 0.9178 | 0.9189 | 0.9127 | 0.9139 | 0.9251 | 0.9054 |
| hate³ | 0.8671 | 0.8304 | 0.8741 | 0.9045 | 0.8621 | 0.9046 | 0.8383 | 0.7669 | 0.8655 | 0.8868 |
| sst⁴ | 0.8175 | 0.7619 | 0.7955 | 0.7909 | 0.8071 | 0.8001 | 0.7727 | 0.7494 | 0.8018 | 0.8004 |

¹ Train/test 90K/30K, label 1 at 25%, labeling is 3 vs all; AG's News Corpus data found at base-url/ag_news.
² Train/test 5485/2189 for both rows, label 1 at 4.61% and 4.57%, labeling is crude vs all and trade vs all, respectively; Reuters Corpus Volume 1 data found at base-url/yangwang825/reuters-21578.
³ Train/test 8027/2676, label 1 at 11%, labeling is 1 vs 0; Hate Speech data found at base-url/hate_speech18.
⁴ Train/test 67K/872, label 1 at 55%, labeling is 1 vs 0; Stanford Sentiment Treebank found at base-url/sst2.
7.4.3 Structured/Tabular Results
The tabular or structured benchmark results in Table 5 show that this paper's methodology outperforms the baseline for all but one dataset (the breast cancer dataset). A key insight is that, across the parameter variations from section 6.3 and the extra variations, a more recall-centric penalty is preferred. In particular, recall-centric parameter selections are favored for the datasets iono, pima, vehicle, glass, vowel, yeast and abalone. The remaining datasets – seg and sat – show modest improvement for the balanced penalty model. Compared to the Census results in Table 1, it appears that feature distinctiveness plays a major part for tabular data. This paper defines feature distinctiveness as a neural network learning better discriminative features with respect to the dependent variable. This conclusion arises from the more recall-centric penalty showing up in the result, suggesting that for tabular or structured data, the network should focus on learning strong discriminative features to enhance recall. This result underscores the hypothesis of this paper that the type of data, particularly contextual data, matters for a metric-based penalty and further supports the flexibility of this penalty methodology.
Table 5: Tabular benchmark $F_1$ scores. Columns V1–V9 span the parameter variations from Section 6.3 followed by the extra variations.

| Dataset | Baseline | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| iono¹ | 0.7845 | 0.8364 | 0.8068 | 0.8161 | 0.8092 | 0.8256 | 0.8205 | 0.8742 | 0.8114 | 0.7845 |
| pima² | 0.5253 | 0.4407 | 0.4109 | 0.3645 | 0.5454 | 0.5088 | 0.5124 | 0.2711 | 0.5058 | 0.5208 |
| breast³ | 0.9416 | 0.7985 | 0.6464 | 0.8633 | 0.7934 | 0.9387 | 0.7832 | 0.8239 | 0.7589 | 0.7832 |
| vehicle⁴ | 0.3942 | 0.4423 | 0.4000 | 0.3363 | 0.4507 | 0.3470 | 0.2105 | 0.3247 | 0.3103 | 0.3333 |
| seg⁵ | 0.6798 | 0.5099 | 0.6078 | 0.6987 | 0.5571 | 0.5295 | 0.3915 | 0.3130 | 0.6645 | 0.3247 |
| glass⁶ | 0.8695 | 0.7200 | 0.7407 | 0.7826 | 0.9473 | 0.6250 | 0.9473 | 0.7000 | 0.8333 | 0.9523 |
| sat⁷ | 0.5511 | 0.1674 | 0.3274 | 0.5571 | 0.4963 | 0.1313 | 0.2375 | 0.1714 | 0.5849 | 0.3779 |
| vowel⁸ | 0.2752 | 0.3439 | 0.3076 | 0.2926 | 0.3103 | 0.2434 | 0.2464 | 0.1851 | 0.1647 | 0.2979 |
| yeast⁹ | 0.5491 | 0.8717 | 0.7500 | 0.5079 | 0.2185 | 0.2010 | 0.6046 | 0.5084 | 0.6857 | 0.2105 |
| abalone¹⁰ | 0.9723 | 0.9723 | 0.9723 | 0.9723 | 0.9723 | 0.9723 | 0.9723 | 0.9765 | 0.9723 | 0.9723 |

¹ Train/test 235/116, label 1 at 34%; Ionosphere data found at UCI-url.
² Train/test 514/254, label 1 at 35%; Pima Indians Diabetes data found at R-url.
³ Train/test 381/188, label 1 at 38%; Breast Cancer Wisconsin data found at UCI-url.
⁴ Train/test 566/280, label 1 at 27%, labeling is opel vs all; Vehicle data found at R-url.
⁵ Train/test 210/2100, label 1 at 14%, labeling is brickface vs all; Segmentation data found at UCI-url.
⁶ Train/test 143/71, label 1 at 13%, labeling is 7 vs all; Glass data found at R-url.
⁷ Train/test 4308/1004, label 1 at 9%, labeling is 4 vs all; Satellite data found at UCI-url.
⁸ Train/test 663/327, label 1 at 9%, labeling is hYd vs all; Vowel data found at R-url.
⁹ Train/test 344/170, label 1 at 9%, labeling is CYT vs ME2; Yeast data found at UCI-url.
¹⁰ Train/test 489/242, label 1 at 6%, labeling is 18 vs 9; Abalone data found at UCI-url.
8 Conclusion
This paper proposes a weighted cross-entropy based on the van Rijsbergen $F_\beta$ measure. By assuming statistical distributions as an intermediary, an optimal $\beta^*$ can be found, which is then used as a penalty weighting in the loss function. This approach is convenient since van Rijsbergen defines $\beta$ to be a weighting parameter between recall and precision. Training guided by the $F_\beta$ rests on the hypothesis that the interaction of the many combinations between the minority and majority classes has information that can help in three ways. First, as in Vashishtha et al., it can improve feature selection. Second, model training can generalize better. Lastly, overall performance may improve. Results from Table 1 show that this methodology helps in achieving better $F_1$ scores in some cases, with the added benefit of parameter interpretation from the assumed distributions. Furthermore, when considering results from the real-life use cases in Table 2, commonalities between model families start to surface: parameter selections that yield recall-centric penalties can be observed for both families. The analyses from this paper provide the following insights: (1) the balanced penalty distribution is a good starting point for either model family, (2) feature distinctiveness impacts parameter selections between both model families, (3) non-contextual data, such as tabular or structured data, seems to benefit from a recall-centric penalty, (4) one model family may be better suited for image data and the other for text, and (5) contextual data are better positioned for embedding architectures than non-contextual data – except when the tabular data can be mapped to contextual data or the features are discriminative. These points show that $F_\beta$ as a performance metric can be integrated alongside a loss function through penalty weights by using statistical distributions.
References
- Aurelio et al. (2022) Aurelio, Y. S., de Almeida, G. M., de Castro, C. L., and Braga, A. P. (2022). Cost-Sensitive Learning based on Performance Metric for Imbalanced Data. Neural Processing Letters, 54(4), 3097-3114.
- Chauhan et al. (2022) Chauhan, S., Vashishtha, G., and Kumar, A. (2022). A symbiosis of arithmetic optimizer with slime mould algorithm for improving global optimization and conventional design problem. The Journal of Supercomputing, 78(5), 6234-6274.
- Chauhan et al. (2023) Chauhan, S., and Vashishtha, G. (2023). A synergy of an evolutionary algorithm with slime mould algorithm through series and parallel construction for improving global optimization and conventional design problem. Engineering Applications of Artificial Intelligence, 118, 105650.
- Chawla et al. (2002) Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
- Fujino et al. (2008) Fujino, A., Isozaki, H., and Suzuki, J. (2008). Multi-label text categorization with model combination based on f1-score maximization. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II.
- Hasanin et al. (2019) Hasanin, T., Khoshgoftaar, T. M., Leevy, J. L., and Seliya, N. (2019). Examining characteristics of predictive models with imbalanced big data. Journal of Big Data, 6(1), 1-21.
- Oksuz et al. (2018) Oksuz, K., Cam, B. C., Akbas, E., and Kalkan, S. (2018). Localization recall precision (LRP): A new performance metric for object detection. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 504-519).
- Li et al. (2019) Li, X., Sun, X., Meng, Y., Liang, J., Wu, F., and Li, J. (2019). Dice loss for data-imbalanced NLP tasks. arXiv preprint arXiv:1911.02855.
- Ho and Wookey (2019) Ho, Y., and Wookey, S. (2019). The real-world-weight cross-entropy loss function: Modeling the costs of mislabeling. IEEE Access, 8, 4806-4813.
- Lipton et al. (2014) Lipton, Z. C., Elkan, C., and Narayanaswamy, B. (2014). Thresholding classifiers to maximize F1 score. arXiv preprint arXiv:1402.1892.
- Bénédict et al. (2021) Bénédict, G., Koops, V., Odijk, D., and de Rijke, M. (2021). sigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. arXiv preprint arXiv:2108.10566.
- Borisov et al. (2021) Borisov, V., Leemann, T., Seßler, K., Haug, J., Pawelczyk, M., and Kasneci, G. (2021). Deep neural networks and tabular data: A survey. arXiv preprint arXiv:2110.01889.
- Dua and Graff (2019) Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
- Dudewicz and Mishra (1988) Dudewicz, E. J., and Mishra, S. (1988). Modern mathematical statistics. John Wiley & Sons, Inc.
- Grush (2015) Grush, L. (2015). Google engineer apologizes after Photos app tags two black people as gorillas. The Verge, 1.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
- Hogg and Craig (1995) Hogg, R. V., and Craig, A. T. (1995). Introduction to mathematical statistics (5th edition). Englewood Cliffs, New Jersey.
- Jansche (2005) Jansche, M. (2005, October). Maximum expected F-measure training of logistic regression models. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (pp. 692-699).
- Klie et al. (2022) Klie, J. C., Webber, B., and Gurevych, I. (2022). Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future. arXiv preprint arXiv:2206.02280.
- Lee et al. (2021) Lee, N., Yang, H., and Yoo, H. (2021). A surrogate loss function for optimization of $F_\beta$ score in binary classification with imbalanced data. arXiv preprint arXiv:2104.01459.
- Lin et al. (2017) Lin, T. Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980-2988).
- Maas et al. (2011) Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C., (June 2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (pp. 142-150).
- Mohit et al. (2012) Mohit, B., Schneider, N., Bhowmick, R., Oflazer, K., and Smith, N. A. (2012, April). Recall-oriented learning of named entities in Arabic Wikipedia. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (pp. 162-173).
- Northcutt et al. (2021) Northcutt, C. G., Athalye, A., and Mueller, J. (2021). Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749.
- Oksuz et al. (2020) Oksuz, K., Cam, B. C., Akbas, E., and Kalkan, S. (2020). A ranking-based, balanced loss function unifying classification and localisation in object detection. Advances in Neural Information Processing Systems, 33, 15534-15545.
- Qian et al. (2021) Qian, S., Pham, V. H., Lutellier, T., Hu, Z., Kim, J., Tan, L., … and Shah, S. (2021). Are my deep learning systems fair? An empirical study of fixed-seed training. Advances in Neural Information Processing Systems, 34, 30211-30227.
- Ramdhani (2016) Ramdhani, S. (2016). Some contributions to underground storage tank calibration models, leak detection and shape deformation (Doctoral dissertation, The University of Texas at San Antonio).
- Ramdhani et al. (2018) Ramdhani, S., Tripathi, R., Keating, J., and Balakrishnan, N. (2018). Underground storage tanks (UST): A closer investigation statistical implications to changing the shape of a UST. Communications in Statistics-Simulation and Computation, 47(9), 2612-2623.
- Sandler et al. (2018) Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L. C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4510-4520).
- Sasaki (2007) Sasaki, Y. (2007). The truth of the F-measure. University of Manchester Technical Report.
- Satopaa et al. (2011) Satopaa, V., Albrecht, J., Irwin, D., and Raghavan, B. (2011). Finding a kneedle in a haystack: Detecting knee points in system behavior. In 2011 31st International Conference on Distributed Computing Systems Workshops (pp. 166-171). IEEE.
- Schmidhuber (2015) Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural networks, 61, 85-117.
- Song et al. (2022) Song, H., Kim, M., Park, D., Shin, Y., and Lee, J. G. (2022). Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on Neural Networks and Learning Systems.
- Sun et al. (2019) Sun, B., Yang, L., Zhang, W., Lin, M., Dong, P., Young, C., and Dong, J. (2019). SuperTML: Two-dimensional word embedding for the precognition on structured tabular data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
- Tian et al. (2022) Tian, J., Mithun, N. C., Seymour, Z., Chiu, H. P., and Kira, Z. (2022, May). Striking the Right Balance: Recall Loss for Semantic Segmentation. In 2022 International Conference on Robotics and Automation (ICRA) (pp. 5063-5069). IEEE.
- Tian et al. (2020) Tian, Y., Zhong, Z., Ordonez, V., Kaiser, G., and Ray, B. (2020, June). Testing DNN image classifiers for confusion & bias errors. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (pp. 1122-1134).
- Tian (2020) Tian, Y. (2020, November). Repairing confusion and bias errors for DNN-based image classifiers. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 1699-1700).
- Rijsbergen (1979) Van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). Newton, MA: Butterworths.
- Vashishtha and Kumar (2022a) Vashishtha, G., and Kumar, R. (2022). Pelton wheel bucket fault diagnosis using improved Shannon entropy and expectation maximization principal component analysis. Journal of Vibration Engineering & Technologies, 1-15.
- Vashishtha and Kumar (2022b) Vashishtha, G., and Kumar, R. (2022). Unsupervised learning model of sparse filtering enhanced using Wasserstein distance for intelligent fault diagnosis. Journal of Vibration Engineering & Technologies, 1-18.
- Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
- Wen et al. (2022) Wen, B., Cao, Y., Yang, F., Subbalakshmi, K., and Chandramouli, R. (2022, March). Causal-TGAN: Modeling tabular data using causally-aware GAN. In ICLR Workshop on Deep Generative Models for Highly Structured Data.
- Yan et al. (2022) Yan, B. C., Wang, H. W., Jiang, S. W. F., Chao, F. A., and Chen, B. (2022, July). Maximum F1-score training for end-to-end mispronunciation detection and diagnosis of L2 English speech. In 2022 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1-5). IEEE.
- Zhang et al. (2021) Zhang, X., Zhai, J., Ma, S., and Shen, C. (2021, May). AUTOTRAINER: An Automatic DNN Training Problem Detection and Repair System. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (pp. 359-371). IEEE.
Appendix A General form of F-Beta: n-th derivative
The derivation pattern using Sasaki's (2007) steps is straightforward for any partial derivative beyond the first. To set the stage, a few equations are listed.
- From (1), it can easily be shown that $F_\beta = \left(\alpha \frac{1}{P} + (1-\alpha)\frac{1}{R}\right)^{-1}$ with $\alpha = \frac{1}{1+\beta^2}$.
- Keeping the notation similar to Sasaki (2007), let $u = \frac{1}{P}$ and $v = \frac{1}{R}$, then $F = \left(\alpha u + (1-\alpha)v\right)^{-1}$ and $\frac{\partial u}{\partial P} = -\frac{1}{P^2}$, $\frac{\partial v}{\partial R} = -\frac{1}{R^2}$.
- Taking the first derivative of (1) via the chain rule yields the following: $\frac{\partial F}{\partial P} = \left(-\alpha F^2\right)\left(-\frac{1}{P^2}\right)$ and $\frac{\partial F}{\partial R} = \left(-(1-\alpha)F^2\right)\left(-\frac{1}{R^2}\right)$.
- After simplifying, $\frac{\partial F}{\partial P} = \frac{\alpha F^2}{P^2}$ and $\frac{\partial F}{\partial R} = \frac{(1-\alpha)F^2}{R^2}$.
After setting $\frac{\partial F}{\partial P} = \frac{\partial F}{\partial R}$ for $n = 1$, it is easy to see that $\frac{\alpha}{P^2} = \frac{1-\alpha}{R^2}$, and using $\beta = \frac{R}{P}$ yields $\alpha = \frac{1}{1+\beta^2}$, which pertains to the original measure or (2). With the same steps, for $n = 2$ the equality becomes $\frac{\alpha}{P^3} = \frac{1-\alpha}{R^3}$ or $\alpha\beta^3 = 1 - \alpha$, implying $\alpha = \frac{1}{1+\beta^3}$. With each successive differentiation where $n \ge 1$, the pattern is as follows: $c\,\frac{\alpha}{P^{n+1}} = c\,\frac{1-\alpha}{R^{n+1}}$, where $c$ is the same constant on both sides. Using $\beta = \frac{R}{P}$ will then give the generalized equality $\alpha = \frac{1}{1+\beta^{n+1}}$.
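As a quick symbolic check of this pattern, the following sketch recovers the generalized equality with SymPy. It assumes, consistent with the steps above, that the $n$-th partials are taken on the reciprocal form $1/F = \alpha/P + (1-\alpha)/R$; the variable names are illustrative.

```python
# Sketch: equating n-th partials of 1/F in P and R, evaluated at
# R = beta * P, should force alpha = 1 / (1 + beta**(n+1)).
import sympy as sp

P, R, alpha, beta = sp.symbols("P R alpha beta", positive=True)
G = alpha / P + (1 - alpha) / R  # reciprocal of the F measure

for n in range(1, 5):
    eq = sp.Eq(sp.diff(G, P, n).subs(R, beta * P),
               sp.diff(G, R, n).subs(R, beta * P))
    print(n, sp.simplify(sp.solve(eq, alpha)[0]))  # 1/(beta**(n+1) + 1)
```

Running this prints $\frac{1}{\beta^2+1}$, $\frac{1}{\beta^3+1}$, and so on, matching the pattern above.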
Appendix B Case 1: Joint Probability Distribution for U and IU
To prove (6a) it is sufficient to set up both integrals and explain the bounds; the computation itself is straightforward. The domain of the probability below follows from (4) and (5). With a slight rearrangement, we can say the following:
where $f_U$ is the probability density of $U$ and $F_{IU}$ is the cumulative distribution of $IU$. These are standard and can be found in Dudewicz and Mishra (1988) or Hogg and Craig (1995). The bounds come from (4). Using these bounds, the existence requirements restrict the range of the variable of integration, and from the two resulting intervals we define condition 1 and condition 2 as the corresponding restrictions. We need to consider separately the following scenarios: conditions 1 and 2 both true, conditions 1 and 2 both false, and condition 1 false with condition 2 true. The scenario of condition 1 being true and condition 2 being false does not occur.
Proof: For conditions 1 and 2 both true, we get the following:
For conditions 1 and 2 both false, we get the following:
For condition 1 false and condition 2 true, we get the following:
For the scenario of condition 1 true and condition 2 false, we need to show that it never occurs. Rearranging condition 2 and recalling the allowable range leads to a contradiction in both of the resulting cases, so this scenario never occurs.
Appendix C Case 2: Joint Probability Distribution for Ga and IE
The derivation of (9) is similar to Case 1 in that the integral will be broken into pieces and the probability distribution proof technique will be used again. As before, using the same rearrangement, we can say the following:
Before moving forward, the marginal distribution of the Inverse Exponential, or (8), will be given. If $W \sim \text{Exponential}(\lambda)$ and $Y = 1/W$, then $y = 1/w$ is a strictly decreasing function, so that
$$F_Y(y) = P(1/W \le y) = P(W \ge 1/y) = e^{-\lambda/y}, \qquad f_Y(y) = \frac{\lambda}{y^2}\,e^{-\lambda/y}, \qquad y > 0.$$
Now, we can see that $Y$ has a valid distribution on $[0, \infty)$. Using this property we complete the proof.
Proof: For $z < 0$:
To be clear, the bounds of the integral arise from the directionally based inequality on the Inverse Exponential component.
For $z = 0$: it can be seen that the entire mass is summarized by the Gaussian distribution, since the Inverse Exponential is non-negative.
For $z > 0$: this is a bit different because, though the Inverse Exponential redistributes the mass of the Gaussian as before, its non-negativity needs to be adjusted for. Consider the interval that represents the domain of this probability mass, and decompose the mass of interest into three pieces over subintervals. For two of the pieces, the separation of the integral is similar to before but with different bounds. So we have the following:
By the Inverse Exponential's redistribution, the mass over the negative values is accounted for. The proof is now simplified to solving the remaining expression, and by using some of the results from the $z < 0$ case we have the following:
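A short Monte Carlo sketch can sanity-check the $z = 0$ boundary above. It assumes the joint variable is the product of the Gaussian and Inverse Exponential components, consistent with the sign argument above; the parameter values are illustrative only.

```python
# Sketch: if Z = X * (1/W) with X Gaussian and W Exponential, then
# 1/W > 0 means sign(Z) = sign(X), so P(Z <= 0) should collapse to
# the Gaussian mass below zero, Phi(-mu/sigma).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, lam, n = 0.5, 1.0, 2.0, 1_000_000

X = rng.normal(mu, sigma, n)        # Gaussian component
W = rng.exponential(1.0 / lam, n)   # Exponential with rate lam
Z = X / W                           # X * (1/W); 1/W is Inverse Exponential

print(np.mean(Z <= 0), norm.cdf(-mu / sigma))  # should roughly agree
```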
Appendix D Pressure Vessel Design
This is borrowed from Chauhan et al. (2022); see Figure 7 in Chauhan et al. for the structural design of the pressure vessel, which looks similar to the underground storage tanks discussed earlier.
D.1 Problem Statement
The pressure vessel design objective is to minimize total cost, which includes material, forming, and welding. The design variables are the thickness of the shell ($T_s$), the thickness of the head ($T_h$), the inner radius ($R$), and the length of the cylinder ($L$). The mathematical formulation is given in (11).
$$\begin{aligned}
\min_{T_s, T_h, R, L}\; & f = 0.6224\,T_s R L + 1.7781\,T_h R^2 + 3.1661\,T_s^2 L + 19.84\,T_s^2 R \\
\text{s.t. }\; & g_1 = -T_s + 0.0193R \le 0, \\
& g_2 = -T_h + 0.00954R \le 0, \\
& g_3 = -\pi R^2 L - \tfrac{4}{3}\pi R^3 + 1{,}296{,}000 \le 0, \\
& g_4 = L - 240 \le 0. \qquad (11)
\end{aligned}$$
For this paper, the HAOASMA algorithm results of Chauhan et al. will be used as the baseline parameters; these serve as the best known minimum. Specifically, the reported optimal values of $T_s$, $T_h$, $R$, and $L$ are used. The next section provides a couple of variations to convert this problem into a classification problem.
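For reference, a minimal Python sketch of the objective and constraints in (11) is given below. The constants reflect the formulation that is standard for this benchmark in the literature and should be checked against Chauhan et al. (2022); the function names are illustrative.

```python
# Sketch of the pressure vessel cost and constraints from (11).
# Variables: Ts (shell thickness), Th (head thickness), R, L.
import math

def pv_cost(Ts, Th, R, L):
    """Total cost: material, forming, and welding."""
    return (0.6224 * Ts * R * L
            + 1.7781 * Th * R**2
            + 3.1661 * Ts**2 * L
            + 19.84 * Ts**2 * R)

def pv_feasible(Ts, Th, R, L):
    """All four inequality constraints g_i <= 0."""
    g = [
        -Ts + 0.0193 * R,                                          # g1
        -Th + 0.00954 * R,                                         # g2
        -math.pi * R**2 * L - (4/3) * math.pi * R**3 + 1_296_000,  # g3
        L - 240,                                                   # g4
    ]
    return all(gi <= 0 for gi in g)
```

For instance, pv_feasible can be used to screen simulated designs before assigning class labels.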
D.2 Varying Design Parameter Plots
The simulation can be carried out using Algorithm 2. Two simulated realizations from this algorithm can be seen in Figure 4; the left and right panels use different variation values (the superscript stands for variation). The variations are intended to reflect two scenarios: the first is a clear separation between distributions (the left panel), hence an easier classification; the second has significant overlap (the right panel), or a tougher classification.
[Figure 4: Two simulated realizations of the varied pressure vessel design parameters; left: clear separation (easier classification), right: significant overlap (tougher classification).]
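Because Algorithm 2 is not reproduced here, the following hypothetical sketch shows one way the two scenarios could be generated, reusing pv_cost from the sketch in D.1; the baseline, shift, and noise-scale values are placeholders rather than the paper's actual settings.

```python
# Sketch: two classes of perturbed designs, where the shift and noise
# scale control how separable the resulting cost distributions are.
import numpy as np

def simulate_costs(base, shift, scale, n=5000, seed=0):
    """Draw cost samples for a baseline class and a shifted class."""
    rng = np.random.default_rng(seed)
    c0 = np.array([pv_cost(*(base + rng.normal(0.0, scale, 4)))
                   for _ in range(n)])          # class 0: near baseline
    c1 = np.array([pv_cost(*(base + shift + rng.normal(0.0, scale, 4)))
                   for _ in range(n)])          # class 1: shifted design
    return c0, c1

base = np.array([0.8, 0.4, 40.0, 200.0])        # placeholder Ts, Th, R, L
easy = simulate_costs(base, shift=np.array([0.2, 0.1, 5.0, 20.0]), scale=0.01)
tough = simulate_costs(base, shift=np.array([0.02, 0.01, 0.5, 2.0]), scale=0.05)
```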
Appendix E Underground Storage Tank (UST)
This section is borrowed from Ramdhani (2016), and all equations, derivations, and further explorations can be found there.
E.1 Problem Statement
The UST problem deals with estimating tank dimensions using only vertical height measurements. It is also possible that a cylindrical UST has hemispherical endcaps appended on the ends, which also contain volume. The equations for the volume based on cross-sectional measurements for the cylindrical, cylindrical with hemispherical endcaps, ellipsoidal, and ellipsoidal with hemi-ellipsoidal endcap shapes are given in (12), (13), (14), and (15), respectively.
The equation for the Cylindrical shape is:
$$V_C(h) = L\left[r^2 \cos^{-1}\!\left(\frac{r-h}{r}\right) - (r-h)\sqrt{2rh - h^2}\,\right], \quad 0 \le h \le 2r, \qquad (12)$$
where $r$ is the tank radius, $L$ the cylinder length, and $h$ the measured liquid height.
If one were to add hemispherical endcaps to the cylinder ends the subsequent volume would be:
$$V_{CH}(h) = V_C(h) + \frac{\pi h^2}{3}\,(3r - h). \qquad (13)$$
The equation for the Elliptical shape for a deformed Cylinder is:
$$V_E(h) = \frac{a}{b}\,L\left[b^2 \cos^{-1}\!\left(\frac{b-h}{b}\right) - (b-h)\sqrt{2bh - h^2}\,\right], \quad 0 \le h \le 2b, \qquad (14)$$
with horizontal semi-axis $a$ and vertical semi-axis $b$.
If one were to add hemispherical endcaps to the cylinder, which deform to hemi-ellipsoidal endcaps, the subsequent volume would be:
$$V_{EH}(h) = V_E(h) + \frac{\pi a c}{3b^2}\,h^2(3b - h), \qquad (15)$$
where $c$ is the axial semi-axis of the endcaps ($c = r$ recovers the hemispherical case).
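As an illustration, the four volume equations can be sketched in Python under the parametrization above; treat this as a sketch to be checked against Ramdhani (2016), with the endcap axial semi-axis $c$ an added symbol.

```python
# Sketch of the UST partial-volume equations (12)-(15); h is the
# measured liquid height, r the cylinder radius, length the cylinder
# length, a/b the horizontal/vertical ellipse semi-axes, and c the
# axial semi-axis of the hemi-ellipsoidal endcaps.
import numpy as np

def v_cyl(h, r, length):  # (12)
    return length * (r**2 * np.arccos((r - h) / r)
                     - (r - h) * np.sqrt(2 * r * h - h**2))

def v_cyl_hemi(h, r, length):  # (13): plus two hemispherical endcaps
    return v_cyl(h, r, length) + (np.pi * h**2 / 3) * (3 * r - h)

def v_ell(h, a, b, length):  # (14): elliptical cross-section
    return (a / b) * length * (b**2 * np.arccos((b - h) / b)
                               - (b - h) * np.sqrt(2 * b * h - h**2))

def v_ell_hemi(h, a, b, c, length):  # (15): plus hemi-ellipsoidal endcaps
    return v_ell(h, a, b, length) + (np.pi * a * c / (3 * b**2)) * h**2 * (3 * b - h)
```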
E.2 Varying Tank Dimension
The parameters used in Algorithm 3 are borrowed from Ramdhani (2016); the baseline will be the cylindrical case with radius $r$ and length $L$, and the parameter variations will be on the semi-axes $a$ and $b$ for an ellipse. Ramdhani used a measurement-error based model for simulation, which will also be used here; the measurement errors will be on the heights $h$. Similar to the pressure vessel, we consider an easy and a tough simulation scenario for classification. This is seen in Figure 6, where the left panel is easier to distinguish between cylinder and ellipse than the right panel. The same interpretation applies for the endcap-based equations, shown in Figure 7.
[Figure 6: Simulated cylindrical versus ellipsoidal (CvE) volumes; left: easy separation, right: tough separation.]
[Figure 7: Simulated cylindrical with hemispherical endcaps versus ellipsoidal with hemi-ellipsoidal endcaps (CHvEH) volumes; left: easy separation, right: tough separation.]
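Finally, a hypothetical sketch of the measurement-error model from E.2 is given below, reusing v_cyl and v_ell from the sketch after (15); all numeric values are placeholders, and the actual parameters appear in Ramdhani (2016).

```python
# Sketch: noisy height readings are pushed through the cylinder and
# ellipse volume equations, and the two volume samples form the two
# classes (CvE) for the classification task.
import numpy as np

rng = np.random.default_rng(0)
r, length = 4.0, 32.0            # placeholder cylinder radius and length
a, b = 4.2, 3.8                  # placeholder ellipse semi-axes

h_true = rng.uniform(0.5, 2.0 * min(r, b), 5000)        # true fill heights
h_meas = h_true + rng.normal(0.0, 0.05, h_true.shape)   # measurement error

vol_c = v_cyl(np.clip(h_meas, 0.0, 2.0 * r), r, length)     # class 0: cylinder
vol_e = v_ell(np.clip(h_meas, 0.0, 2.0 * b), a, b, length)  # class 1: ellipse
X = np.concatenate([vol_c, vol_e])
y = np.concatenate([np.zeros_like(vol_c), np.ones_like(vol_e)])
```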