
Adaptive Activation Functions for Predictive Modeling with Sparse Experimental Data

Farhad Pourkamali-Anaraki [1] (farhad.pourkamali@ucdenver.edu), Tahamina Nasrin (tahamina_nasrin@student.uml.edu), Robert E. Jensen (robert.e.jensen.civ@army.mil), Amy M. Peterson (amy_peterson@uml.edu), Christopher J. Hansen (christopher_hansen@uml.edu)

[1] Department of Mathematical and Statistical Sciences, University of Colorado Denver, 1201 Larimer St, Denver, CO 80204, USA

[2] Department of Plastics Engineering, University of Massachusetts Lowell, 1 University Ave, Lowell, MA 01854, USA

[3] DEVCOM Army Research Laboratory, Aberdeen Proving Ground, MD 21005, USA

[4] Department of Mechanical and Industrial Engineering, University of Massachusetts Lowell, 1 University Ave, Lowell, MA 01854, USA
Abstract

A pivotal aspect in the design of neural networks lies in selecting activation functions, crucial for introducing nonlinear structures that capture intricate input-output patterns. While the effectiveness of adaptive or trainable activation functions has been studied in domains with ample data, like image classification problems, significant gaps persist in understanding their influence on classification accuracy and predictive uncertainty in settings characterized by limited data availability. This research aims to address these gaps by investigating the use of two types of adaptive activation functions. These functions incorporate shared and individual trainable parameters per hidden layer and are examined in three testbeds derived from additive manufacturing problems containing fewer than one hundred training instances. Our investigation reveals that adaptive activation functions, such as Exponential Linear Unit (ELU) and Softplus, with individual trainable parameters, result in accurate and confident prediction models that outperform fixed-shape activation functions and the less flexible method of using identical trainable activation functions in a hidden layer. Therefore, this work presents an elegant way of facilitating the design of adaptive neural networks in scientific and engineering problems.

Keywords: Predictive modeling, Neural networks, Activation functions, Conformal prediction, Small data

1 Introduction

Neural networks have made impressive strides in predictive modeling tasks due to their ability to learn nested or composite functions [1, 2]. Therefore, the resulting expressive power to discern complex input-output relationships makes them suitable for a wide range of scientific and engineering disciplines, including biomedical engineering [3, 4, 5], structural engineering [6, 7], mechanical engineering [8, 9, 10, 11], and additive manufacturing [12, 13]. Neural networks are made up of neurons that are interconnected units that process and transmit information. They do this by performing two pivotal operations: computing a weighted sum of inputs and then applying a nonlinear activation function. This interplay among units equips neural networks to learn hierarchical features and multilayered representations to navigate the inherent complexities of diverse problems. In particular, neural networks excel in tasks that require simultaneous feature extraction and predictive modeling within high-dimensional feature spaces.

Among the various hyperparameters that users must set in advance, the selection of activation functions is a key element because of their role in encapsulating nonlinear patterns in the data. Typically, activation functions are predetermined and remain constant throughout the training process in the majority of scenarios. Classical examples of fixed activation functions include Sigmoid and hyperbolic tangent (tanh) functions [14], which squash input values into a specific range. However, fixed activation functions may suffer from problems such as the vanishing gradient problem [15], where gradients become extremely small during backpropagation, hindering the learning process. Therefore, a wide range of activation functions with varying forms of nonlinearity have been introduced in recent years, including Rectified Linear Unit (ReLU) [16], Exponential Linear Unit (ELU) [17], Softplus [18], and Swish [19], to name a few. For instance, the widely-used deep learning library Keras [20], as of version 2.14.0, offers a collection of 17 built-in activation functions.

Although the rapid increase in the number of fixed-shape activation functions offers potential benefits, a significant challenge arises from the intensive computational demands of conducting an exhaustive search to identify optimal functions for specific tasks. This challenge becomes more pronounced in dynamic and diverse scientific problems, necessitating frequent restarts of the search process to maintain the relevance of chosen activation functions. Consequently, the inflexibility inherent in fixed activation functions poses a hindrance to the streamlined development of neural network models within scientific and engineering disciplines.

To address this problem, one promising area of research is the use of adaptive or trainable activation functions [21, 22, 23]. Unlike traditional fixed activation functions such as ReLU or Sigmoid, which maintain a static form throughout training, adaptive functions evolve in response to the data distribution and the model’s learning progress. These functions contain training parameters themselves, allowing us to optimize and tailor the shape or behavior of activation functions during the learning process alongside the parameters of the neural network. Therefore, this adaptability has the potential to enhance the ability to learn complex representations and patterns, while reducing the reliance on costly search procedures and detailed domain expertise.

Although the effectiveness of adaptive activation functions has been extensively studied in fields with abundant annotated data, such as image classification tasks highlighted in Tables 2 and 3 of a recent survey [24], a significant gap exists in understanding their applicability in domains characterized by sparse labeled data sets. A key concern in such scenarios, where data availability is limited, is the potential adverse impact on performance due to the increase in trainable parameters. Furthermore, current research has primarily relied on the standard classification accuracy score to evaluate the success of adaptive activation functions. This score, defined as the ratio of correct predictions to the total number of test samples, poses limitations when applied to scientific and engineering problems with a small number of samples. This metric offers limited assurance regarding the confidence and stability in neural network predictions. Hence, it is imperative to move towards metrics that quantify the uncertainty in the predictions of neural networks to provide a more nuanced understanding of the utility of adaptive activation functions in scientific problems.

Therefore, this paper aims to address the shortcomings mentioned above by systematically investigating the use of adaptive activation functions in three distinct additive manufacturing problems, each limited to fewer than 100 training samples. Our goal is to provide new insights into the possible application of adaptive activation functions, which introduce additional training parameters into data-austere environments. This study involves a comprehensive evaluation of the selection and adaptability of activation functions, coupled with an analysis of their impact on the predictive accuracy and confidence. In particular, we summarize our main contributions as follows.

  1.

    We study the effectiveness of adaptive activation functions derived from ELU, Softplus, and Swish compared to their fixed-shape counterparts. In our investigation, we explore the common practice of sharing activation functions within a hidden layer and the less explored terrain of using activation functions with individual training parameters. Although individual parameter allocation increases the total number of training parameters, which may be a concern in small data settings, we show that it improves the adaptability of neural networks and their predictive performance.

  2.

    To the best of our knowledge, we are exploring for the first time the effectiveness of adaptive activation functions in applications containing small data sets with fewer than 100 training samples. To conduct a comprehensive investigation, we consider three additive manufacturing problems, with different characteristics such as the number and type of features, as representative of data-austere problems. These problems include the selection of filament materials and 3D printers, as well as the prediction of printability in a complex additive manufacturing problem. We demonstrate that the use of adaptive activation functions is beneficial and justified for these different problems, eliminating the need for predetermined functions.

  3.

    Another distinctive feature of our research is the exploration of the effectiveness of adaptive activation functions through the generation of prediction sets using conformal inference [25, 26], as opposed to relying solely on point predictions. This approach enables us to assess how adaptive activation functions influence the predictive uncertainty of neural network models. In pursuit of this objective, we utilize two metrics: empirical coverage and the average size of the prediction set. In conformal inference, empirical coverage refers to the proportion of actual observed outcomes that fall within the constructed prediction set and the average size gives an indication of the typical range of uncertainty associated with the predictions. A smaller average size suggests more precise and narrow prediction sets, whereas a larger average size indicates a wider range of uncertainty that can be a major consideration in the deployment stage. Hence, this work involves a comprehensive assessment of the reliability of neural networks equipped with adaptive activation functions. This evaluation is particularly crucial in addressing scientific challenges characterized by limited data availability.

  4.

    We provide source code for the implementation of adaptive activation functions with shared and individual trainable parameters in a hidden layer. The source code, available on GitHub https://github.com/farhad-pourkamali/AdaptiveActivation, contains a Keras-compatible implementation of three trainable activation functions derived from ELU, Softplus, and Swish. The code provided and our insights will enable practitioners to use adaptive activation functions in a wider range of data-limited problems while eliminating the need to manually set them ahead of time.

The remainder of this paper is organized as follows. In Section 2, we discuss some mathematical notations and foundations of neural networks together with popular fixed-shape activation functions. Section 3 provides an overview of the existing work on the use of adaptive activation functions. In Section 4, we present the systematic approach we use to rigorously evaluate the predictive performance of adaptive activation functions with shared and individual parameters in small data settings. This evaluation is carried out in three testbeds, each characterized by sparse experimental data originating from additive manufacturing problems. The last section of this paper provides concluding remarks and outlines potential areas of future study.

2 Notations and Background Information

An extensively used neural network type for structured or tabular data problems is the fully connected network, commonly called a multilayer perceptron (MLP) [27]. This architecture consists of densely interconnected layers organized in a sequential manner, presenting an elegant way to learn nested or composite functions crucial for capturing nonlinear input-output relationships. To be formal, imagine a network with $L$ hidden layers, located between the input and output layers. The $l$-th hidden layer consists of $N_{l}$ neurons or units. Also, assume that this network takes an input vector $x\in\mathbb{R}^{D}$, where $D$ is the number of given attributes or features. Furthermore, the weight matrix and the bias vector can be written as $W^{(l)}\in\mathbb{R}^{N_{l}\times N_{l-1}}$ and $b^{(l)}\in\mathbb{R}^{N_{l}}$, for each layer indexed by $l=1,\ldots,L+1$. The predicted output $f(x)$ is then defined from the input $x$ according to the following equations:

$$
\begin{aligned}
\text{Input layer: }\quad & x^{(0)}=x,\\
\text{Hidden layers: }\quad & x^{(l)}=g^{(l)}\big(\underbrace{W^{(l)}x^{(l-1)}+b^{(l)}}_{z^{(l)}:\text{ weighted sum}}\big),\; l=1,\ldots,L,\\
\text{Output layer: }\quad & f(x)=g^{(L+1)}\big(W^{(L+1)}x^{(L)}+b^{(L+1)}\big).
\end{aligned}
\tag{1}
$$

Therefore, the prediction model $f(x)$ takes on a composite or nested form. The number of units in the output layer $N_{L+1}$ and the corresponding activation function $g^{(L+1)}$ depend on the problem at hand. For example, in binary classification problems, a single neuron is typically placed, i.e., $N_{L+1}=1$, and the Sigmoid activation function is used to find the probability that the data point $x$ belongs to each class. For a given input $z$, the Sigmoid activation function takes the form of: $\text{Sigmoid}(z)=1/(1+e^{-z})$. However, when treating classification problems of more than two classes, the last layer contains one neuron for each class. In this case, we employ the Softmax function to find the categorical distribution for all classes. That is, if we have $C$ classes, then the Softmax function accepts a set of $C$ real-valued numbers $z_{1},\ldots,z_{C}$ and converts them into a valid probability distribution by returning $e^{z_{c}}/\sum_{c^{\prime}}e^{z_{c^{\prime}}}$, for $c=1,\ldots,C$ [28].
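As an illustration of Eq. (1), the following minimal sketch evaluates a small MLP in plain Python. The helper names (`dense`, `forward`), the toy weights, and the choice of tanh for the hidden layer are assumptions made for this example and are not taken from the paper's code.

```python
import math

def softmax(z):
    """Convert real-valued scores z_1..z_C into a probability distribution."""
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dense(W, b, x):
    """Weighted sum z = W x + b for one layer (W is a list of rows)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def forward(x, hidden_layers, output_layer, g=math.tanh):
    """Evaluate f(x): hidden layers with activation g, then a Softmax output."""
    for W, b in hidden_layers:
        x = [g(z) for z in dense(W, b, x)]
    W_out, b_out = output_layer
    return softmax(dense(W_out, b_out, x))

# Hypothetical toy network: D = 3 features, one hidden layer with 2 units, C = 3 classes.
hidden = [([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1])]
output = ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0])
probs = forward([1.0, 2.0, 3.0], hidden, output)
```

The returned list `probs` holds one Softmax score per class, so its entries are positive and sum to one.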

On the other hand, we have greater flexibility when selecting activation functions for intermediate or hidden layers because they provide “latent” representations. Note that in (1), the activation function is applied in the element-wise form, so the standard implementation of dense layers in popular deep learning libraries, such as Keras, applies the same activation function to all neurons in a particular layer. In other words, all neurons placed in a hidden layer apply the same nonlinear function to transfer information to the next layer. Consequently, significant efforts are required to conduct a comprehensive search to select appropriate levels of nonlinearity in order to capture input-output relationships.

Initial efforts to train neural networks mainly focused on the hyperbolic tangent (tanh) function to activate hidden layers through $g^{(l)}$, $l=1,\ldots,L$. In the remainder of this paper, we omit the $(l)$ superscript to simplify the notation. The tanh activation function has the following form:

$$g(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}. \tag{2}$$

This function has a range of $(-1,1)$, which has an advantage due to the zero-centered structure. However, because this function has two different horizontal asymptotes, the derivatives of this function become very small as we move further away from the origin, which is problematic for gradient optimization techniques. Rectified Linear Unit or ReLU has been widely used in recent years to address the vanishing gradient problem [15]. ReLU is defined as $g(z)=\max(z,0)$. Hence, when $z\geq 0$, we have $g(z)=z$ and its derivative is always $1$. On the other hand, we get $g(z)=0$ for negative input values $z$, so the derivative is equal to $0$. Despite the simplicity of ReLU, it exhibits a saturating region, which can be problematic for gradient descent optimization. Specifically, ReLU discards negative values, leading to the known problem of dying ReLU [29]. Consequently, several activation functions have been devised to address these concerns while maintaining the fundamental structure of ReLU.

Exponential Linear Unit (ELU) shares a structure similar to ReLU for nonnegative inputs $z$, yet it facilitates the flow of information to some degree for negative values of $z$ through the expression $(e^{z}-1)$. Additionally, Softplus is another popular activation function that provides a smooth approximation of ReLU in the form of $g(z)=\log(e^{z}+1)$ for all values of $z$. Consequently, for sufficiently large positive $z$ values, this function mimics a linear behavior, i.e., $g(z)\approx z$, owing to the inverse relationship between the logarithmic and exponential functions. Another way of approximating the ReLU activation function was proposed in [19], named the Swish activation function, which takes the form of $g(z)=z\cdot\text{Sigmoid}(z)=z/(1+e^{-z})$. Similar to ReLU, Swish acts as a linear function for large positive $z$ values because the denominator becomes very close to $1$. However, what sets Swish apart is its departure from the widely-used monotonicity property. Building on the empirical success of Swish in various computer vision problems, several Swish variants have been proposed in recent years, as highlighted in [30].
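The properties described above can be checked numerically. The sketch below writes the fixed-shape functions directly from their formulas in plain Python; it is an illustration only, uses no library implementation, and is not numerically hardened (for example, `math.exp` overflows for very large arguments).

```python
import math

def relu(z):
    """g(z) = max(z, 0)."""
    return max(z, 0.0)

def elu(z):
    """z for z >= 0, e^z - 1 for z < 0."""
    return z if z >= 0 else math.exp(z) - 1.0

def softplus(z):
    """Smooth approximation of ReLU: log(e^z + 1)."""
    return math.log(math.exp(z) + 1.0)

def swish(z):
    """z * Sigmoid(z) = z / (1 + e^{-z})."""
    return z / (1.0 + math.exp(-z))

# Softplus and Swish are close to the identity for large positive inputs.
softplus_gap = abs(softplus(20.0) - 20.0)
swish_gap = abs(swish(20.0) - 20.0)

# Swish is not monotonic: the smaller input -5 maps to a larger output than -1.
swish_not_monotonic = swish(-5.0) > swish(-1.0)
```

Unlike ReLU, both ELU and Swish return nonzero values for negative inputs, which is the property the text refers to as letting information flow for $z<0$.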

3 Existing Work on Adaptive Activation Functions

In recent advancements in neural networks, there has been a growing interest in exploring trainable or adaptive activation functions to enhance their flexibility and adaptability. Traditional activation functions that we discussed in the previous section have fixed shapes, and their performance may be suboptimal for certain tasks. Thus, users must make significant efforts to choose an appropriate activation function from the collection of existing built-in functions. By introducing adaptive activation functions, neural networks can progressively learn the optimal form of input-output relationship for each unit or layer and accelerate the training process. To elucidate this concept, consider the standard empirical risk minimization problem for model fitting with fixed activation functions:

$$\theta^{*}\in\operatorname*{arg\,min}_{\theta}\sum_{n=1}^{N_{\text{train}}}\text{loss}\big(y_{n},f_{\theta}(x_{n})\big),\;\theta:=\big\{W^{(1)},b^{(1)},\ldots,W^{(L+1)},b^{(L+1)}\big\}, \tag{3}$$

where $\mathcal{D}_{\text{train}}=\{(x_{n},y_{n})\}_{n=1}^{N_{\text{train}}}$ represents the training data set comprising feature and response pairs. The loss function serves as a metric to evaluate the accuracy of predictions, employing specific formulations like the quadratic loss function in regression problems and cross-entropy in classification tasks. In the optimization process, the variable $\theta$ serves as a container for weight matrices $W^{(l)}$ and bias vectors $b^{(l)}$, $l=1,\ldots,L+1$. These values are obtained through an initialization step, followed by an iterative refinement of their values. The refinement entails computing the gradient of the loss function with respect to the elements of $\theta$ and applying gradient descent to minimize the loss function over iterations.

The core concept of adaptive activation functions lies in parameterizing activation functions to learn the optimal form of nonlinearity introduced in (1). To this end, let $g(z;\alpha)$ be a parameterized activation function and assume that we utilize $N_{h}$ distinct activation functions in a neural network model. In this case, we can modify the optimization variable in (3) by adding the new parameters, i.e., $\theta=\{W^{(1)},b^{(1)},\ldots,W^{(L+1)},b^{(L+1)},\alpha_{1},\ldots,\alpha_{N_{h}}\}$. As a result, the total number of parameters to be learned through gradient descent increases, prompting the key question of whether the set of enhanced parameters can improve predictive power compared to fixed-shape activation functions characterized by predefined values of $\alpha$.

In this work, we implement and examine some popular parameterized activation functions, starting with the trainable Exponential Linear Unit (ELU) function, which has the following form:

$$\text{ELU}(z;\alpha)=\begin{cases}z&\text{if }z\geq 0\\ \alpha(e^{z}-1)&\text{if }z<0\end{cases}. \tag{4}$$

In many deep learning libraries, the default value for $\alpha$ is set to $1$. The derivative of the parameterized ELU with respect to $z$ is given by $\alpha e^{z}$ for negative values of $z$, making the selection of $\alpha$ a crucial consideration during the training process. For instance, for strongly negative values of $z$, where $e^{z}$ is very small, a larger value of $\alpha$ helps to prevent the derivative from approaching zero.

Similarly, the parameterized version of Softplus can be written as $\text{Softplus}(z;\alpha)=\log(e^{z}+\alpha^{2})$. While the default value of $1$ results in a smooth approximation of ReLU, choosing very small values for $\alpha$ renders this activation function nearly linear. This attribute plays a crucial role in finely regulating the complexity of the mapping accomplished by every hidden layer. Lastly, we can introduce an additional tuning parameter for the Swish activation function in the form of $\text{Swish}(z;\alpha)=z\cdot\text{Sigmoid}(\alpha z)=z/(1+e^{-\alpha z})$. Notably, when $\alpha=0$, this activation function behaves as the scaled linear function $z/2$, and it gradually converges to ReLU as $\alpha$ increases. Furthermore, we can show that the derivative of the parameterized Swish activation function has the following form:

$$\frac{d}{dz}\text{Swish}(z;\alpha)=(1+\alpha z)\cdot\text{Sigmoid}(\alpha z)-\alpha z\cdot\big(\text{Sigmoid}(\alpha z)\big)^{2}. \tag{5}$$

Thus, setting $\alpha$ to $0$ results in a constant derivative of $0.5$. Conversely, when $\alpha$ takes nonzero values, the derivative of Swish around the origin becomes significantly dependent on the chosen $\alpha$.
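The three parameterized functions and the Swish derivative in (5) can be written out and checked numerically. The following is a minimal plain-Python sketch; the function names, the test point, and the finite-difference helper `central_diff` are choices made for this illustration rather than part of the paper's implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def elu(z, alpha=1.0):
    """Eq. (4): identity for z >= 0, alpha * (e^z - 1) for z < 0."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

def softplus(z, alpha=1.0):
    """log(e^z + alpha^2); alpha = 1 recovers the standard Softplus."""
    return math.log(math.exp(z) + alpha ** 2)

def swish(z, alpha=1.0):
    """z * Sigmoid(alpha * z); alpha = 1 recovers the standard Swish."""
    return z * sigmoid(alpha * z)

def swish_dz(z, alpha=1.0):
    """Eq. (5): derivative of the parameterized Swish with respect to z."""
    s = sigmoid(alpha * z)
    return (1.0 + alpha * z) * s - alpha * z * s ** 2

def central_diff(f, z, h=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

z, alpha = 0.7, 1.5  # arbitrary test point
gap = abs(swish_dz(z, alpha) - central_diff(lambda t: swish(t, alpha), z))
```

Here `gap` measures the disagreement between the closed-form derivative and the numerical estimate, and it should be close to zero. The limiting cases in the text can be read off the same functions: `swish(z, 0.0)` equals `z / 2`, `swish_dz(z, 0.0)` equals `0.5`, and `elu(z, 0.0)` coincides with ReLU.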

Figure 1 depicts the three parameterized activation functions, showcasing their distinctions across a range of $\alpha$ values to provide a visual understanding of their differences. Setting $\alpha=0$ in ELU yields the well-known ReLU activation function. However, by increasing $\alpha$, ELU enables the information flow for negative values of $z$. It should be noted that for nonnegative input values $z$, the parameter $\alpha$ does not alter the structure of ELU. In contrast, $\alpha$ plays a pivotal role in shaping the overall characteristics of Softplus and Swish for all values of $z$.

Figure 1: Visualizing the impact of the “trainable” parameter $\alpha$ on modifying the structure of three widely-used activation functions: Exponential Linear Unit (ELU), Softplus, and Swish. The default value for $\alpha$ in fixed activation functions is commonly set to $1$.

Recent research dedicated to investigating the effectiveness of adaptive activation functions has predominantly focused on the analysis of image data sets within the domain of computer vision [31]. A comprehensive survey by Apicella et al. [24] synthesized findings from around 20 studies, extensively utilizing established benchmarks such as MNIST, CIFAR10, CIFAR100, and ImageNet. For example, on the CIFAR10 data set, the median classification accuracy scores using the fixed ReLU and ELU activation functions were approximately $0.91$ and $0.93$, respectively. However, employing adaptive ELU and Swish resulted in slightly higher accuracy scores of $0.94$ and $0.95$, respectively. Additionally, recent papers by Dubey et al. [23] and Emanuel et al. [32] delved into the exploration of adaptive activation functions, specifically for tasks related to language translation, speech recognition, and text classification.

Wang et al. [33] broadened the scope by applying adaptive activation functions to six benchmark data sets in the social and e-commerce domains. Moreover, Klopries et al. [34] investigated adaptive activation functions in the context of unsupervised learning for training autoencoders. In addition, Jagtap et al. [35] and another recent work [36] studied adaptive activation functions for solving various forward and inverse problems within the framework of physics-informed neural networks, where the goal is to combine the expressive power of neural networks with the physical constraints provided by partial differential equations.

Despite the abundance of research in the area of adaptive activation functions, there is a notable gap in understanding the effectiveness of adaptive activation functions in addressing sparse data challenges, particularly those arising from intricate experimental scenarios within scientific and engineering domains. To bridge this gap, this paper concentrates on three distinct experimental data sets derived from various additive manufacturing problems. Improving the predictive performance of neural networks and reducing the need for predetermined activation functions are indispensable to accelerate the design and discovery of novel materials with enhanced flexibility and efficiency.

In our study, we specifically evaluate the efficacy of adaptive activation functions in scenarios with fewer than $100$ training samples. This focus provides a unique perspective compared to existing research, which has predominantly considered data sets that are larger by multiple orders of magnitude. To illustrate this point, consider the CIFAR10 data set, a collection of 60,000 color images of size $32\times 32$ distributed across 10 classes, with 6,000 images per class. In contrast, in many experimental setups, the average sample size per class is on the order of a few tens of samples. It is worth noting that the expense of gathering a few tens of samples in material experiments may significantly surpass the cost of collecting and labeling image data sets several orders of magnitude larger, due to machine, maintenance, material, and labor costs. As a result, there is a compelling need for new research to understand the trade-off between performance gains and the increase in trainable parameters associated with the utilization of adaptive activation functions.

4 Adaptive Activation Functions in Small Data Settings

In this section, we undertake a comprehensive evaluation of adaptive activation functions using three distinct experimental data sets. We explore three activation functions (ELU, Softplus, and Swish) with both fixed and trainable parameters, denoted as $\alpha$. As mentioned earlier, fixed activation functions correspond to the specific choice of $\alpha=1$, while adaptive activation functions learn the value of $\alpha$ through gradient descent optimization during the training process. To ensure a fair comparison, we initialize the optimization process with a value of $1$ for $\alpha$ and then report the optimized value of $\alpha$ after completing the training stage. Furthermore, we set the number of epochs to 100 and employ the Adam optimization algorithm with a learning rate of $0.05$.
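Learning $\alpha$ by gradient descent requires the partial derivative of each activation function with respect to $\alpha$. In practice a deep learning library obtains these through automatic differentiation; the sketch below instead uses hand-derived expressions (derived for this illustration, not quoted from the paper) and compares them with finite differences at the initial value $\alpha=1$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Parameterized activations g(z; alpha) as defined in the text.
ACTIVATIONS = {
    "elu": lambda z, a: z if z >= 0 else a * (math.exp(z) - 1.0),
    "softplus": lambda z, a: math.log(math.exp(z) + a ** 2),
    "swish": lambda z, a: z * sigmoid(a * z),
}

# Hand-derived partial derivatives dg/dalpha (an assumption of this sketch).
GRAD_ALPHA = {
    "elu": lambda z, a: 0.0 if z >= 0 else math.exp(z) - 1.0,
    "softplus": lambda z, a: 2.0 * a / (math.exp(z) + a ** 2),
    "swish": lambda z, a: z * z * sigmoid(a * z) * (1.0 - sigmoid(a * z)),
}

def numeric_grad_alpha(g, z, a, h=1e-6):
    """Central finite difference of g(z; alpha) with respect to alpha."""
    return (g(z, a + h) - g(z, a - h)) / (2.0 * h)

# Largest disagreement between analytic and numeric gradients at alpha = 1.
max_gap = max(
    abs(GRAD_ALPHA[name](z, 1.0) - numeric_grad_alpha(g, z, 1.0))
    for name, g in ACTIVATIONS.items()
    for z in (-2.0, -0.5, 0.5, 2.0)
)
```

One consequence visible in these expressions is that the ELU gradient with respect to $\alpha$ vanishes for $z\geq 0$, so only negative weighted sums contribute to updating $\alpha$ for ELU, whereas Softplus and Swish receive updates from inputs of either sign.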

In terms of the neural network architecture, our focus in this paper is on MLPs consisting of a single hidden layer, a deliberate choice driven by the constraint of a small sample size. This choice of $L=1$ also enables us to explore a critical aspect of adaptive activation functions that has been overlooked in previous research. To elucidate, we examine the common scenario of identical activation functions, where all units in a layer share the same $\alpha$ parameter. However, we go a step further and investigate the impact of allocating individual parameters for units within a hidden layer. This exploration aims to understand the trade-off between the number of trainable parameters and the adaptability of neural networks, particularly in small data settings.

Figure 2 depicts our systematic investigation, where the model M1 represents networks with fixed activation functions, M2 corresponds to networks that share the same trainable parameter $\alpha$, and M3 allows for individual trainable parameters for each activation function. In our investigation, we set the number of units in the hidden layer to be $N_{h}=2$ due to the limited availability of data. However, for completeness, we will explore the impact of $N_{h}$ on the performance gap between neural networks with adaptive and fixed activation functions at the end of this section.

Figure 2: Demonstrating the spectrum of flexibility within the neural network models under examination. M1 denotes the conventional fixed activation functions, while M2 permits a single trainable parameter for the hidden layer. Conversely, M3 offers the utmost flexibility by assigning an individual trainable parameter to each unit in the hidden layer. To implement M3, we employ the Keras Functional API, connecting each hidden unit to the input layer and subsequently concatenating their outputs to form a unified hidden layer. While we fix the number of hidden units $N_{h}=2$, the number of units in the input and output layers are determined by the characteristics of the labeled data specific to each additive manufacturing problem that we consider in this paper.
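To make the trade-off between M1, M2, and M3 concrete, the sketch below counts trainable parameters for a single-hidden-layer MLP. The function name and the example sizes (4 features, 3 classes) are hypothetical; the count assumes one output unit per class, so a binary problem with a single Sigmoid output unit would pass `n_classes=1`.

```python
def count_parameters(n_features, n_hidden, n_classes, mode):
    """Trainable parameters of a one-hidden-layer MLP under M1, M2, or M3."""
    weights_and_biases = (n_features * n_hidden + n_hidden      # hidden layer
                          + n_hidden * n_classes + n_classes)   # output layer
    # M1: fixed alpha, M2: one shared alpha, M3: one alpha per hidden unit.
    extra_alphas = {"M1": 0, "M2": 1, "M3": n_hidden}[mode]
    return weights_and_biases + extra_alphas

# Hypothetical problem with D = 4 features, N_h = 2 hidden units, C = 3 classes.
counts = {m: count_parameters(4, 2, 3, m) for m in ("M1", "M2", "M3")}
```

For these sizes the dictionary evaluates to 19, 20, and 21 parameters for M1, M2, and M3, so with $N_{h}=2$ the individual parameters of M3 add only two parameters relative to the fixed-shape baseline.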

To gauge the influence of employing trainable activation functions on the confidence of prediction models, we propose a shift from the conventional evaluation method, which relies on singular point predictions, to the creation and assessment of prediction sets. Formally, consider a classification problem with $C$ classes, denoting the output space as $\mathcal{Y}=\{1,\ldots,C\}$. Adopting the conformal inference framework [25, 37], our objective is to construct a prediction set for a new or unseen input vector $x_{\text{test}}$, expressed as $\mathcal{I}(x_{\text{test}})\subseteq\mathcal{Y}$, adhering to the following probabilistic rationale:

$$\text{Prob}\big(y_{\text{test}}\in\mathcal{I}(x_{\text{test}})\big)\geq 1-\delta,\;\delta\in(0,1), \tag{6}$$

where the desired coverage level $1-\delta$ is set by users. In this paper, we opt for the conventional selection of $\delta=0.1$, which sets the desired coverage level to $0.9$.

Next, we discuss an efficient algorithm to find the prediction set in our numerical experiments. Using the training set, we define the score function as one minus the softmax score of the true class. That is, we have $s(x_{n},y_{n})=1-f(x_{n})_{y_{n}}$, where $f(x_{n})_{y_{n}}$ represents the softmax score of the true class indicated by the ground-truth label $y_{n}$. Then, we find the empirical quantile of $\{s(x_{n},y_{n})\}_{n=1}^{N_{\text{train}}}$ at the adjusted level $\lceil(1-\delta)(N_{\text{train}}+1)\rceil/N_{\text{train}}$ [37]. Once we obtain the quantile $\hat{q}$, the prediction set for the test vector $x_{\text{test}}$ can be found as follows:

$$\mathcal{I}(x_{\text{test}})=\big\{y:f(x_{\text{test}})_{y}\geq 1-\hat{q}\big\}. \tag{7}$$

This implies that the prediction set encompasses all classes with softmax scores that are at least the threshold $1-\hat{q}$.
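The construction of the prediction set can be sketched in a few lines of plain Python. This illustration assumes the standard finite-sample quantile level $\lceil(1-\delta)(N_{\text{train}}+1)\rceil/N_{\text{train}}$ (capped at 1), realized as an order statistic of the sorted scores; the function names, the 0-based class indices, and the toy scores are hypothetical.

```python
import math

def nonconformity_scores(probs, labels):
    """s(x_n, y_n) = 1 - softmax score of the true class."""
    return [1.0 - p[y] for p, y in zip(probs, labels)]

def conformal_quantile(scores, delta=0.1):
    """Empirical quantile at level ceil((1 - delta)(n + 1)) / n, capped at 1."""
    n = len(scores)
    k = min(math.ceil((1.0 - delta) * (n + 1)), n)  # rank of the order statistic
    return sorted(scores)[k - 1]

def prediction_set(prob_test, q_hat):
    """All classes whose softmax score is at least 1 - q_hat, as in Eq. (7)."""
    return [y for y, p in enumerate(prob_test) if p >= 1.0 - q_hat]

# Hypothetical scores for 20 training points: 0.05, 0.10, ..., 1.00.
scores = [i / 20 for i in range(1, 21)]
q_hat = conformal_quantile(scores, delta=0.1)      # 19th smallest score
pred = prediction_set([0.90, 0.06, 0.04], q_hat)   # classes indexed from 0
```

With these toy scores the quantile is the 19th smallest value, so the threshold $1-\hat{q}$ is about 0.05 and the prediction set contains the first two classes. A well-fitted model produces small scores, a small $\hat{q}$, and therefore a high threshold and small prediction sets.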

Given these prediction sets, we evaluate the impact of adaptive activation functions using two metrics: empirical coverage and uncertainty score. Let $\mathcal{D}_{\text{test}}$ represent a distinct testing set with $N_{\text{test}}$ input-output pairs for evaluation. Empirical coverage denotes the proportion of data points with true outputs within the corresponding prediction sets. Similar to classification accuracy, higher values are preferable, and we aim for a minimum coverage level of $0.9$. On the other hand, the uncertainty score in this context is the average size of prediction sets. A perfect score is $1$ for confident and accurate classifiers that predict a single class while adhering to the probabilistic argument. However, higher scores are less desirable, indicating a classifier with less confidence in its predictions.
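Both metrics follow directly from the prediction sets and the true test labels. The sketch below uses hypothetical helper names and a toy example with four test points.

```python
def empirical_coverage(pred_sets, labels):
    """Fraction of test points whose true label lies in its prediction set."""
    return sum(y in s for s, y in zip(pred_sets, labels)) / len(labels)

def average_set_size(pred_sets):
    """Uncertainty score: mean number of classes per prediction set."""
    return sum(len(s) for s in pred_sets) / len(pred_sets)

# Hypothetical prediction sets and true labels for four test points.
pred_sets = [[0], [0, 1], [2], [1]]
labels = [0, 1, 1, 1]
coverage = empirical_coverage(pred_sets, labels)   # 3 of 4 sets contain the label
uncertainty = average_set_size(pred_sets)          # (1 + 2 + 1 + 1) / 4
```

In this toy example the coverage is 0.75 and the uncertainty score is 1.25. The two metrics pull in opposite directions: enlarging the prediction sets raises coverage but also raises the uncertainty score, which is why they are reported together.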

In the following sections, we partition 30% of the data at random to construct the testing set $\mathcal{D}_{\text{test}}$, while the remaining portion forms the training data set $\mathcal{D}_{\text{train}}$. Recognizing the sensitivity of neural networks to data splits, we conduct 20 independent splits and visualize the distribution of evaluation metrics across these experiments. Our chosen visualization method is the violin plot, which combines key elements from box plots and kernel density plots. The width of the plot at a given height signifies the estimated density of data points at that value, and the vertical axis represents the value of the evaluation metric. Additionally, the tick in a violin plot denotes the median of each evaluation metric.
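The repeated-split protocol can be sketched as follows. The helper names `random_splits` and `summarize`, the seed, and the rounding of the test-set size are assumptions made for illustration; the sketch only generates index sets and summary statistics and does not train any model.

```python
import numpy as np

def random_splits(n_samples, n_splits=20, test_fraction=0.3, seed=0):
    """Yield (train_idx, test_idx) pairs for independent random splits."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_fraction * n_samples))  # assumed rounding rule
    for _ in range(n_splits):
        perm = rng.permutation(n_samples)
        yield perm[n_test:], perm[:n_test]

def summarize(scores):
    """Median, minimum, and maximum of a metric collected across splits."""
    scores = np.asarray(scores, dtype=float)
    return {"median": float(np.median(scores)),
            "min": float(scores.min()),
            "max": float(scores.max())}

# Example with 70 samples, the size of the filament selection data set.
for train_idx, test_idx in random_splits(70):
    pass  # train a model on train_idx and evaluate it on test_idx here
print(len(train_idx), len(test_idx))
```

Reporting the median alongside the minimum mirrors the discussion of typical and worst-case performance in the sections that follow.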

As a final note before delving into our three testbeds, it is important to highlight that the version of conformal prediction discussed here differs slightly from the split conformal prediction methods. The key distinction lies in the fact that split conformal prediction typically employs a separate calibration data set for quantile determination. However, in this paper, we opt not to divide the training set for two reasons. First, across all testbeds, our training samples are fewer than 100, and the objective is to utilize as many data points as possible for training neural networks, including optimizing trainable parameters for each activation function. Second, utilizing the training data set enables us to indirectly assess the risk of overfitting—a significant concern in predictive modeling with neural networks. The discussed approach allows us to account for artificially small score functions due to overfitting, negatively impacting empirical coverage on the test set, a factor considered in our comprehensive analysis.

4.1 Filament Selection

Fused filament fabrication, a form of material extrusion additive manufacturing, has gained popularity due to its cost effectiveness, ease of operation, and compatibility with a wide range of thermoplastics [38, 39, 40]. One of the challenges associated with fused filament fabrication is that the quality of the final product depends on many design and processing parameters, including but not limited to layer height, extrusion temperature, print bed temperature, infill density, and infill pattern [41]. Therefore, the selection of appropriate materials and processing and design conditions for a given application is a significant challenge. For example, trial-and-error can be used to determine the upper and lower bounds of adjustment for a parameter. Once these bounds are known, a linear iterative optimization approach can be used to navigate between them. However, machine learning-based surrogates offer the potential to significantly accelerate this search process.

To demonstrate the effectiveness of our classification models in addressing a comparable issue of selecting an appropriate material for fused filament fabrication, we opted for a data set comprising 11 input features. These features encompass four design parameters: layer height, wall thickness, infill density, and infill pattern. Four process parameters (extrusion temperature, print bed temperature, print speed, and fan speed) were also considered. Lastly, three material parameters (roughness, tensile strength, and elongation at break) were included in the data set. The objective of the models is to accurately characterize the material as polylactic acid (PLA) or acrylonitrile butadiene styrene (ABS). This data set contains 70 labeled data points, which can be accessed at [42].

In Figure 3, we assess the performance of the three discussed models (M1, M2, and M3) using three evaluation metrics: classification accuracy, empirical coverage, and uncertainty score (i.e., the average size of the prediction set). Based on the first row of Figure 3 that reports the classification accuracy results using plain point predictions, we see that the median scores for M1 with fixed ELU, Softplus, and Swish activation functions are 0.80, 0.76, and 0.71, respectively. Although the best median score is achieved using fixed ELU when M1 is under investigation, a significant drawback of ELU is that its worst-case performance or the minimum score in 20 data splits is 0.57, which is comparable to that of the Swish activation function. Furthermore, the second and third rows of this figure reveal that empirical coverage obtained using the test set $\mathcal{D}_{\text{test}}$ is consistently less than the target coverage level of 0.9. For example, the median value of empirical coverage for fixed ELU is 0.66. It is also observed that M1 with the fixed Swish activation function exhibits suboptimal empirical coverage, reaching as low as 0.38. Therefore, in this case study, we conclude that the three fixed activation functions do not provide accurate and reliable classification models to predict the filament type.

Figure 3: Employing the filament selection problem as a benchmark, we assess the performance of M1, M2, and M3 using three evaluation metrics. Classification accuracy denotes the fraction of correct predictions on the test data set, while empirical coverage and uncertainty score are derived from prediction sets within the conformal prediction framework using $\delta=0.1$. We note that M3 demonstrates superior performance compared to M1 and M2. Notably, the worst-case classification accuracy score produced by M3 is comparable to the median score attained by both M1 and M2.

Surprisingly, we observe that the conventional practice of sharing the same trainable parameter across all units in a hidden layer, as implemented in M2, does not produce noticeable improvements compared to M1 with fixed activation functions. For example, the median classification accuracy scores achieved by M2 using trainable ELU, Softplus, and Swish are 0.71, 0.85, and 0.76, respectively. This suggests that the most significant positive impact is observed for adaptive Softplus, while the introduction of the trainable parameter $\alpha$ decreases the overall accuracy score of the ELU activation function. We notice similar patterns for the second and third rows of Figure 3 when M2 is under investigation. Although the empirical coverage for adaptive ELU is marginally better than that for fixed ELU, it is accompanied by a higher uncertainty score. This indicates that the classifier employing adaptive ELU experiences elevated levels of uncertainty compared to its fixed counterpart. In addition, M2 does not result in notable enhancements in empirical coverage or reductions in the uncertainty score for Softplus and Swish.

The next step involves examining the performance of M3, which incorporates adaptive activation functions with individual trainable parameters. This model shows notable improvements compared to both M1 and M2 from various perspectives. To begin with, as illustrated in the first row of Figure 3, the median classification accuracy scores hover around 0.9 for the three selected activation functions. Also, it is important to note that the worst-case performance of M3 is 0.76, which is comparable to the median scores obtained by M1 and M2. On the other hand, M3 achieves the best classification accuracy scores of 1, 0.95, and 1 when employing ELU, Softplus, and Swish, respectively. Consequently, the integration of adaptive activation functions with individual parameters in M3 represents a substantial improvement in terms of the number of correct predictions. Therefore, our investigation indicates that introducing additional parameters in the training stage is beneficial, even with the limited availability of labeled data.

Shifting focus to conformal prediction for assessing predictive uncertainty beyond plain point predictions, M3 consistently provides superior empirical coverage compared to M1 and M2. For example, the median values of empirical coverage obtained by M3 for ELU, Softplus, and Swish are 0.80, 0.83, and 0.80, respectively. Although the lower-than-expected values of empirical coverage can be attributed to utilizing training samples as calibration data points, it is noteworthy that M3 consistently delivers the highest-quality prediction sets. Importantly, we observe that the constructed prediction sets contain only a single class, indicating that the use of adaptive activation functions with individual trainable parameters results in accurate and reliable classifiers in this example.

4.2 Printer Selection

In addition to the choice of design, material, and printing parameters, the final properties of the parts produced by fused filament fabrication also depend on the specific 3D printer used [43]. This dependence is due to variations in hardware and firmware among different printer models, which can affect material deposition rate, movement precision, and temperature control. Understanding these printer-to-printer variations requires insight into the printing process. Fused filament fabrication begins with slicing three-dimensional computer-aided designs into two-dimensional layers. The printer nozzle then deposits material onto the build surface sequentially, layer-by-layer.

Despite using similar printing parameters (e.g., nozzle temperature, print bed temperature, layer height, infill pattern, and infill density) and the same print geometry, different printers may follow distinct toolpaths, as dictated by both the slicing software and the printer’s own embedded systems. The resulting differences in layer time and deposition pattern significantly impact interlayer adhesion, a key factor for the mechanical properties of the printed object, leading to variations in the strength and surface properties of the final product [44]. Consequently, the process-structure-property correlations also vary across different printers.

In this section, we assess the performance of our classification models, which are designed to identify the specific 3D printers used to manufacture parts. This evaluation is based on a data set collected by Braconnier et al. [43] that contains a total of 104 input-output pairs. This data set comprises tensile properties (specifically, tensile strength, elastic modulus, and elongation at break) of parts fabricated using three different 3D printers: MakerBot Replicator 2X, Ultimaker 3, and Zortrax M200. These parts were produced with variations in extrusion temperature, layer height, print bed temperature, and print speed. Therefore, our classification models aim to predict the type of printer used for each print, considering 7 input features: tensile strength, elastic modulus, elongation at break, extrusion temperature, layer height, print bed temperature, and print speed.

In Figure 4, we show the comparison results for M1, M2, and M3 with varying levels of adaptability using the printer selection data set. According to the first row of this figure, which presents the conventional classification accuracy score, both M1 and M2 exhibit median scores of approximately 0.9. Notably, in this comparison, when utilizing Softplus and Swish, the worst-case scores for M2 are noticeably superior to those of M1. For example, the lowest accuracy score for Softplus with a shared trainable parameter is 0.84, while the corresponding value of the fixed Softplus activation function is 0.75. Furthermore, we find consistent results within the conformal prediction framework. In general, M2 provides better empirical coverage compared to M1. For example, the maximum empirical coverage values obtained by M2 using ELU, Softplus, and Swish are 0.90, 0.93, and 0.96, respectively. At the same time, the prediction sets generated using M2 contain only a single class, further strengthening the confidence in the prediction models. Therefore, in this case study, adaptive activation functions with a shared trainable parameter can achieve the target coverage level of 0.9.

Figure 4: Employing the printer selection problem as a benchmark, we assess the performance of M1, M2, and M3 using three evaluation metrics. Using the trainable ELU activation function with individual parameters in M3 yields the highest classification accuracy and empirical coverage scores. According to the information from the third row, all prediction sets consist of a single class, except for the fixed Softplus activation function in M1.

Furthermore, Figure 4 indicates that the utilization of individual trainable parameters in M3 leads to higher classification accuracy scores when compared to M2. In particular, the median accuracy score for all three activation function choices in M3 is about 0.93. Additionally, the worst-case accuracy score in M3 reaches 0.84, which is comparable or superior to that of M2. Consequently, our investigation demonstrates that the increased flexibility of activation functions in M3 improves the classification accuracy score, even with a small sample size.

Within the conformal prediction framework, we also observe substantial improvements in empirical coverage when using individual trainable parameters along with ELU and Softplus. For example, the maximum empirical coverage scores for ELU and Softplus are 0.96 and 1, respectively. Therefore, the prediction models in M3 can achieve the desired coverage level of 0.9. As a result, this experiment demonstrates that the use of fully adaptive activation functions in M3 provides more accurate and reliable prediction models to determine the type of 3D printer compared to M1 and M2.

Lastly, this experiment highlights the importance of employing the conformal prediction framework for a thorough evaluation of classification models. Although the conventional point predictions indicate that the use of Swish in M3 yields high classification accuracy, a closer look at the second row of Figure 4 reveals that the maximum and minimum empirical coverage scores for Swish in M3 fall short compared to ELU and Softplus. For example, the minimum empirical coverage score in M3 using Swish is 0.68, while ELU and Softplus enjoy values of 0.75 and 0.71 that are closer to the target coverage level. In light of our uncertainty-aware evaluation approach, it becomes evident that ELU delivers the best overall performance.

4.3 Printability Prediction

Vat photopolymerization additive manufacturing fabricates parts by selectively curing a photopolymer feedstock using a light source, typically ultraviolet light. One exciting use of vat photopolymerization is in the creation of highly filled polymer composites [45, 46, 47]. Increasing the solid filler content in the feedstock suspension can enhance properties of printed parts, including compressive and tensile strength [48]. However, increasing the solid content beyond 35 vol.% makes the printing process challenging due to increases in viscosity and light scattering. Increasing the filler amount in the suspension causes an exponential increase in viscosity, which can hinder the recoating process between layers, resulting in weak interlayer adhesion, defective prints, or even print failure [49].

To address the viscosity issue corresponding to highly filled (50 vol.% to 70 vol.%) suspensions, a bimodal system can be utilized [50]. This approach involves mixing two distinct sizes of solid particles, each with a different particle size distribution, to enhance packing density. However, experimentally optimizing the blend ratios of fine to coarse particles for minimal viscosity is a resource-intensive process. Additionally, increased solid content can reduce the cure depth, the extent to which the light effectively cures the suspension, due to light scattering [51, 52]. This scatter will lead to prolonged printing times. Therefore, the energy to cure the print layers often needs to be increased. However, excessive curing energy introduces the risks of overcuring and of parts adhering to the vat, causing print failures. Conversely, insufficient curing energy may lead to incomplete prints. Hence, cure energies must be optimized to obtain successful prints. Nevertheless, optimizing the cure energy for highly filled suspensions typically relies on a trial-and-error approach.

In this section, we demonstrate the efficacy of our predictive models in predicting the printability of highly filled (50 vol.% to 70 vol.%) bimodal suspensions using a digital light processing-based vat photopolymerization technique. The initial data set includes two input features related to the suspension formulation: the solid loading and blend ratio. Additionally, the data set includes the first layer cure energy, which is the energy used to cure the first five layers onto the build head, and the model layer cure energy, which is the energy input for curing the subsequent layers. The printability of these formulations is labeled as either a “fail” or “success” for 63 data points.

The left column of Figure 5, which compares the fixed and trainable versions of ELU, demonstrates that the default value of $\alpha=1$ in M1 is inappropriate because the minimum classification accuracy score is about 0.47, which is below 0.5, a value that can be achieved by a random classifier. Opting for an adaptive ELU with shared trainable parameters in M2 shows a slight improvement in overall performance since the median classification accuracy score is approximately 0.68. However, the worst-case performance of ELU in M2 aligns more closely with that of a random classifier.

Figure 5: Employing the printability prediction problem as a benchmark, we assess the performance of M1, M2, and M3 using three evaluation metrics. Using the trainable ELU and Softplus activation functions with individual parameters in M3 yields the highest classification accuracy and empirical coverage scores. However, it is worth noting that the minimum classification accuracy score for Swish in M3 is 0.37, which does not meet the threshold for a random classifier.

Interestingly, we note that the use of fully adaptive ELU activation functions in M3 provides further enhancements. For example, the maximum, median, and minimum classification accuracy scores are 0.84, 0.74, and 0.58, respectively. Therefore, the worst-case performance of M3 across 20 data splits is considerably more reasonable. Additionally, the median empirical coverage score of adaptive ELU in M3 stands at 0.89, closely aligning with the desired coverage level of 0.9. These data suggest that the predictions made by ELU in M3 are more accurate and reliable compared to M1 and M2.

Furthermore, the middle column of Figure 5 illustrates that the fully adaptive Softplus activation function in M3 surpasses the performance of M1 and M2. In particular, the median classification accuracy and empirical coverage values for Softplus in M3 are 0.73 and 0.89, respectively. Consequently, similar to ELU, Softplus produces accurate and reliable prediction models when each hidden unit is given the flexibility to optimize the structure of its activation function. However, the right column of this figure highlights a significant drawback of fully adaptive Swish in M3, as its minimum classification accuracy score falls below the threshold of 0.5 that a random classifier can achieve. On the other hand, the maximum classification scores for Swish in M1 and M2 are below the scores obtained by ELU and Softplus in M3. Therefore, we can conclude that the fully adaptive ELU and Softplus in M3 offer the best overall performance.

4.4 Exploring the Impact of Optimized $\alpha$ and $N_{h}$

In this section, our objective is to delve into a more comprehensive understanding of the inner mechanisms governing adaptive activation functions, utilizing the filament selection testbed. The rationale for choosing this particular data set lies in its inclusion of 11 features, representing the highest count among the three additive manufacturing problems we have examined. In Figure 6, we present histogram plots that show the optimized or learned values of $\alpha$ in M3. It is essential to recall that the parameter $\alpha$ is initially set to 1, ensuring that the initial structure of all activation functions is aligned with their fixed counterparts. However, we iteratively update the $\alpha$ parameter of each activation function during the training process, in conjunction with other weight matrices and bias terms. Also, note that in this scenario, we have 2 individual trainable parameters per run. We consider 20 independent random data splits, resulting in a total of 40 optimized values.

Figure 6: Reporting the learned values of the parameter $\alpha$ used to regulate the structure of each activation function in M3, as derived from the filament selection data set.

The histogram plot for adaptive ELU reveals that the optimal value of $\alpha$ is approximately centered around 1, mirroring the default value for its fixed counterpart. However, given the substantial increase in the classification accuracy score in Figure 3 when using the trainable ELU activation function compared to the fixed ELU, we observe instances where adjusting the information flow for negative inputs $z$ becomes beneficial. For example, the optimized value of $\alpha$ can reach as high as 4. Notably, we find that the optimized value of $\alpha$ may even assume negative values. This aligns with the observation that the introduction of nonmonotonicity into the activation function can enhance overall predictive performance, facilitating a more nuanced capture of input-output relationships.
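To see why a negative $\alpha$ makes ELU nonmonotonic, the following sketch evaluates the function for $\alpha=1$ and $\alpha=-1$. It assumes the standard ELU form, $z$ for $z>0$ and $\alpha(e^{z}-1)$ otherwise; the function name `elu` and the sample inputs are chosen for illustration only.

```python
import numpy as np

def elu(z, alpha=1.0):
    """Assumed standard ELU: z for z > 0, alpha * (exp(z) - 1) otherwise."""
    z = np.asarray(z, dtype=float)
    # expm1 is applied to min(z, 0) so large positive z cannot overflow.
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

z = np.array([-2.0, -0.01, 1.0])
print(elu(z, alpha=1.0))   # negative inputs map to negative outputs
print(elu(z, alpha=-1.0))  # negative inputs map to positive outputs
```

With $\alpha<0$, the output decreases toward zero as $z$ approaches zero from the left and then increases for $z>0$, so the activation is no longer monotone.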

The middle graph in Figure 6 for the parameterized Softplus activation function, i.e., $\text{Softplus}(z;\alpha)=\log(e^{z}+\alpha^{2})$, demonstrates that the optimized values of $\alpha^{2}$ are typically in the interval between 0 and 1. As depicted in Figure 1, it is noteworthy that $\alpha^{2}=0$ corresponds to the linear or identity activation function, while $\alpha^{2}=1$ offers a smooth approximation of ReLU. This analysis indicates that the optimal degree of nonlinearity lies between these two extremes. Interestingly, we observe similar trends in Figure 6 for the parameterized Swish activation function. In this instance, the majority of optimized $\alpha$ values fall within the range from 0 to 1, where $\alpha=0$ corresponds to the linear activation function, and $\alpha=1$ resembles ReLU. Consequently, our findings for the filament selection data set reaffirm that the suitable degree of nonlinearity for the trainable Swish activation function lies between the linear and ReLU functions.
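The two limiting cases of the parameterized Softplus can be checked numerically. The sketch below implements $\log(e^{z}+\alpha^{2})$ as stated in the text; the function name `softplus` and the use of `np.logaddexp` for numerical stability are implementation choices made here, not details taken from the experiments.

```python
import numpy as np

def softplus(z, alpha=1.0):
    """Parameterized Softplus: log(exp(z) + alpha**2)."""
    z = np.asarray(z, dtype=float)
    with np.errstate(divide="ignore"):
        log_a2 = np.log(alpha ** 2)  # equals -inf when alpha == 0
    # logaddexp(z, log_a2) = log(exp(z) + alpha**2), computed stably.
    return np.logaddexp(z, log_a2)

z = np.linspace(-3.0, 3.0, 7)
print(softplus(z, alpha=0.0))  # identity: returns z
print(softplus(z, alpha=1.0))  # standard Softplus, a smooth ReLU
print(softplus(z, alpha=0.5))  # intermediate degree of nonlinearity
```

Values of $\alpha^{2}$ between 0 and 1 interpolate between these two shapes, which is the regime where most learned values fall in Figure 6.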

In the final experiment within this section, we explore the influence of neural network architecture on the accuracy trade-off between fixed and adaptive activation functions. Recall that we initially set the number of hidden layers to 1, and up to this point, our focus has been on the scenario with 2 hidden units, i.e., $N_{h}=2$, as illustrated in Figure 2. The main motivation for this choice was the small sample size in many scientific and engineering applications, including additive manufacturing problems. Thus, our objective was to determine the optimal structure of activation functions for a constrained number of hidden units. To broaden this analysis in the context of the filament selection problem, we now explore varying values of hidden units, specifically $N_{h}\in\{2,4,6,8\}$.
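To make the three levels of adaptability concrete, the sketch below runs a forward pass through a single hidden layer in which the ELU parameter is either a scalar (fixed in M1, shared and trainable in M2) or a vector with one entry per hidden unit (M3). The function names, the two-unit softmax output layer, and the random weights are assumptions for illustration; only the number of activation parameters per model follows from the text.

```python
import numpy as np

def elu(z, alpha):
    """Assumed standard ELU; alpha broadcasts over the hidden units."""
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def forward(x, W1, b1, alpha, W2, b2):
    """One hidden layer followed by softmax (assumed output layer)."""
    h = elu(x @ W1 + b1, alpha)  # alpha: scalar or array of shape (N_h,)
    logits = h @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def n_activation_params(model, n_hidden):
    """Trainable activation parameters for one hidden layer."""
    return {"M1": 0, "M2": 1, "M3": n_hidden}[model]

rng = np.random.default_rng(0)
n_features, n_hidden = 11, 2  # filament selection setting with N_h = 2
x = rng.normal(size=(3, n_features))
W1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, 2)), np.zeros(2)
print(forward(x, W1, b1, np.array([1.0, -0.5]), W2, b2))
print([n_activation_params("M3", n) for n in (2, 4, 6, 8)])
```

Even for the largest network considered, M3 adds only $N_{h}=8$ parameters on top of the weights and biases, so the extra flexibility comes at a small cost in model size.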

Figure 7 presents the classification accuracy scores achieved by ELU in M1, M2, and M3, with respect to the number of hidden units $N_{h}$. For the smallest neural network model with $N_{h}=2$, we observe that the adaptive ELU activation function with individual trainable parameters in M3 reaches the maximum accuracy score of 1, outperforming both M1 and M2 by a significant margin. It is worth highlighting that the worst-case performance of M3 when $N_{h}=2$ is much more reasonable compared to M1 and M2. As we increase the number of hidden units to $N_{h}=4$, we see improvements in accuracy in the three models M1, M2, and M3. However, M3 still outperforms the other two models, especially in terms of the worst-case classification accuracy score.

Figure 7: Investigating the impact of the number of hidden units $N_{h}$ on the performance of neural networks with fixed and adaptive ELU activation functions, shown in Figure 2.

Moreover, we observe that higher values of $N_{h}$ do not provide substantial accuracy improvements for either fixed or adaptive activation functions. As previously mentioned, this aligns with expectations due to the limited number of training points, which increases the risk of overfitting. This observation reinforces the rationale for employing adaptive activation functions with individual trainable parameters, as implemented in M3. Such an approach allows the training of small yet flexible and accurate neural networks, making them suitable for predictive modeling with sparse experimental data.

5 Conclusion and Future Work

In this study, we investigated the adaptability of neural network models in scenarios with limited data, leveraging parameterized activation functions. Employing three real-world testbeds derived from additive manufacturing problems, we demonstrated that neural network models equipped with individual trainable parameters—going beyond the conventional practice of employing identical activation functions for a given hidden layer—resulted in improved prediction models. We evaluated these enhancements through the lens of point predictions, gauged by the standard classification accuracy score, and further delved into the realm of prediction sets through conformal inference. Specifically, we noted the following key observations.

  1. When dealing with sparse scientific data sets, opting for ELU and Softplus activation functions with individual trainable parameters proved advantageous over fixed and parameterized activation functions shared across all units in a hidden layer. The adoption of individual trainable activation functions in M3 demonstrated remarkable flexibility, allowing each unit to discern the optimal degree of nonlinearity introduced when conveying information to the next layer.

  2. Leveraging conformal inference emerged as a versatile and crucial approach to assess the confidence in the predictions of neural network models with trainable activation functions. Because trainable activation functions increase the number of parameters that must be inferred during training, the empirical coverage score and the average prediction set size were informative measures for quantifying predictive uncertainty.

  3. The performance trade-offs between fixed and adaptive activation functions were heavily contingent on the neural network architecture. Consequently, in scenarios with limited data, striking the right balance between the number of hidden units and their adaptability becomes crucial.

  4. As automated machine learning (AutoML) methods gain popularity to democratize the utilization of prediction models among practitioners and engineers [53], incorporating adaptive activation functions can yield substantial improvements. This approach mitigates the dependence on predetermined and fixed-shape activation functions, which can significantly impact predictive performance.

In future work, we plan to explore several extensions of our study. In particular, we aim to investigate the performance of adaptive activation functions in more complex neural network models, such as convolutional neural networks (CNNs). This exploration may find application in the development of adaptive CNNs for constructing accurate prediction models for in situ monitoring of manufacturing technologies with minimal reliance on human supervision. Another extension will revolve around the incorporation of ensemble activation functions. Here, the objective is to leverage richer parameterized activation functions, offering greater flexibility compared to the activation functions employed in this study.

Acknowledgment

Research was sponsored by DEVCOM Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-19-2-0100. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DEVCOM Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  • Lu and Lu [2020] Lu, Y., Lu, J.: A universal approximation theorem of deep neural networks for expressing probability distributions. Advances in Neural Information Processing Systems 33, 3094–3105 (2020)
  • Talaei Khoei et al. [2023] Talaei Khoei, T., Ould Slimane, H., Kaabouch, N.: Deep learning: systematic review, models, challenges, and research directions. Neural Computing and Applications, 1–22 (2023)
  • Abdou [2022] Abdou, M.: Literature review: Efficient deep neural networks techniques for medical image analysis. Neural Computing and Applications 34(8), 5791–5812 (2022)
  • Weiss et al. [2022] Weiss, R., Karimijafarbigloo, S., Roggenbuck, D., Rödiger, S.: Applications of neural networks in biomedical data analysis. Biomedicines 10(7), 1469 (2022)
  • Liu et al. [2023] Liu, X., Miramini, S., Patel, M., Ebeling, P., Liao, J., Zhang, L.: Development of numerical model-based machine learning algorithms for different healing stages of distal radius fracture healing. Computer Methods and Programs in Biomedicine 233, 107464 (2023)
  • Pourkamali-Anaraki and Hariri-Ardebili [2021] Pourkamali-Anaraki, F., Hariri-Ardebili, M.: Neural networks and imbalanced learning for data-driven scientific computing with uncertainties. IEEE Access 9, 15334–15350 (2021)
  • Khodadadi Koodiani et al. [2023] Khodadadi Koodiani, H., Majlesi, A., Shahriar, A., Matamoros, A.: Non-linear modeling parameters for new construction RC columns. Frontiers in Built Environment 9, 1108319 (2023)
  • Olivier et al. [2021] Olivier, A., Shields, M., Graham-Brady, L.: Bayesian neural networks for uncertainty quantification in data-driven materials modeling. Computer methods in applied mechanics and engineering 386, 114079 (2021)
  • Stuckner et al. [2021] Stuckner, J., Piekenbrock, M., Arnold, S., Ricks, T.: Optimal experimental design with fast neural network surrogate models. Computational Materials Science 200, 110747 (2021)
  • Brunton et al. [2020] Brunton, S., Hemati, M., Taira, K.: Special issue on machine learning and data-driven methods in fluid dynamics. Theoretical and Computational Fluid Dynamics 34(4), 333–337 (2020)
  • Erichson et al. [2020] Erichson, B., Mathelin, L., Yao, Z., Brunton, S., Mahoney, M., Kutz, N.: Shallow neural networks for fluid flow reconstruction with limited sensors. Proceedings of the Royal Society A 476(2238), 20200097 (2020)
  • Johnson et al. [2020] Johnson, N., Vulimiri, P., To, A., Zhang, X., Brice, C., Kappes, B., Stebner, A.: Invited review: Machine learning for materials developments in metals additive manufacturing. Additive Manufacturing 36, 101641 (2020)
  • Pourkamali-Anaraki et al. [2023] Pourkamali-Anaraki, F., Nasrin, T., Jensen, R., Peterson, A., Hansen, C.: Evaluation of classification models in limited data scenarios with application to additive manufacturing. Engineering Applications of Artificial Intelligence 126, 106983 (2023)
  • Hayou et al. [2019] Hayou, S., Doucet, A., Rousseau, J.: On the impact of the activation function on deep neural networks training. In: International Conference on Machine Learning, pp. 2672–2680 (2019)
  • Hu et al. [2021] Hu, Z., Zhang, J., Ge, Y.: Handling vanishing gradient problem using artificial derivative. IEEE Access 9, 22371–22377 (2021)
  • Shen et al. [2022] Shen, S., Zhang, N., Zhou, A., Yin, Z.: Enhancement of neural networks with an alternative activation function tanhLU. Expert Systems with Applications 199, 117181 (2022)
  • Clevert et al. [2015] Clevert, D., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015)
  • Zheng et al. [2015] Zheng, H., Yang, Z., Liu, W., Liang, J., Li, Y.: Improving deep neural networks using softplus units. In: International Joint Conference on Neural Networks, pp. 1–4 (2015)
  • Ramachandran et al. [2017] Ramachandran, P., Zoph, B., Le, Q.: Searching for activation functions. arXiv preprint arXiv:1710.05941 (2017)
  • Chollet [2021] Chollet, F.: Deep Learning with Python. Simon and Schuster (2021)
  • Agostinelli et al. [2014] Agostinelli, F., Hoffman, M., Sadowski, P., Baldi, P.: Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830 (2014)
  • Lee et al. [2022] Lee, K., Yang, J., Lee, H., Hwang, J.: Stochastic adaptive activation function. Advances in Neural Information Processing Systems, 13787–13799 (2022)
  • Dubey et al. [2022] Dubey, S., Singh, S., Chaudhuri, B.: Activation functions in deep learning: A comprehensive survey and benchmark. Neurocomputing (2022)
  • Apicella et al. [2021] Apicella, A., Donnarumma, F., Isgrò, F., Prevete, R.: A survey on modern trainable activation functions. Neural Networks 138, 14–32 (2021)
  • Shafer and Vovk [2008] Shafer, G., Vovk, V.: A tutorial on conformal prediction. Journal of Machine Learning Research 9(3), 371–421 (2008)
  • Barber et al. [2023] Barber, R., Candes, E., Ramdas, A., Tibshirani, R.: Conformal prediction beyond exchangeability. The Annals of Statistics 51(2), 816–845 (2023)
  • Ke and Huang [2020] Ke, K., Huang, M.: Quality prediction for injection molding by using a multilayer perceptron neural network. Polymers 12(8), 1812 (2020)
  • Ren et al. [2020] Ren, J., Yu, C., Ma, X., Zhao, H., Yi, S.: Balanced meta-softmax for long-tailed visual recognition. Advances in Neural Information Processing Systems 33, 4175–4186 (2020)
  • Yang et al. [2023] Yang, D., Ngoc, K., Shin, I., Hwang, M.: DPReLU: Dynamic parametric rectified linear unit and its proper weight initialization method. International Journal of Computational Intelligence Systems 16(1), 11 (2023)
  • Zhu et al. [2021] Zhu, H., Zeng, H., Liu, J., Zhang, X.: Logish: A new nonlinear nonmonotonic activation function for convolutional neural network. Neurocomputing 458, 490–499 (2021)
  • Çatalbaş and Morgül [2023] Çatalbaş, B., Morgül, Ö.: Deep learning with ExtendeD Exponential Linear Unit (DELU). Neural Computing and Applications, 22705–22724 (2023)
  • Emanuel et al. [2023] Emanuel, R., Docherty, P., Lunt, H., Möller, K.: The effect of activation functions on accuracy, convergence speed, and misclassification confidence in CNN text classification: a comprehensive exploration. The Journal of Supercomputing, 1–21 (2023)
  • Wang et al. [2022] Wang, Z., Liu, H., Liu, F., Gao, D.: Why KDAC? A general activation function for knowledge discovery. Neurocomputing 501, 343–358 (2022)
  • Klopries and Schwung [2023] Klopries, H., Schwung, A.: Flexible activation bag: Learning activation functions in autoencoder networks. In: IEEE International Conference on Industrial Technology (ICIT), pp. 1–7 (2023)
  • Jagtap and Karniadakis [2023] Jagtap, A., Karniadakis, G.: How important are activation functions in regression and classification? A survey, performance comparison, and future directions. Journal of Machine Learning for Modeling and Computing 4(1) (2023)
  • Gnanasambandam et al. [2023] Gnanasambandam, R., Shen, B., Chung, J., Yue, X., Kong, Z.: Self-scalable Tanh (Stan): Multi-scale solutions for physics-informed neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(12), 15588–15603 (2023)
  • Angelopoulos and Bates [2023] Angelopoulos, A., Bates, S.: Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning 16(4), 494–591 (2023)
  • Lee et al. [2019] Lee, J., Lee, H., Cheon, K., Park, C., Jang, T., Kim, H., Jung, H.: Fabrication of poly (lactic acid)/Ti composite scaffolds with enhanced mechanical properties and biocompatibility via fused filament fabrication (FFF)–based 3D printing. Additive Manufacturing 30, 100883 (2019)
  • Wu et al. [2018] Wu, H., Sulkis, M., Driver, J., Saade-Castillo, A., Thompson, A., Koo, J.: Multi-functional ULTEM1010 composite filaments for additive manufacturing using fused filament fabrication (FFF). Additive Manufacturing 24, 298–306 (2018)
  • Pei et al. [2022] Pei, H., Shi, S., Chen, Y., Xiong, Y., Lv, Q.: Combining solid-state shear milling and FFF 3D-printing strategy to fabricate high-performance biomimetic wearable fish-scale PVDF-based piezoelectric energy harvesters. ACS Applied Materials & Interfaces 14(13), 15346–15359 (2022)
  • Goh et al. [2020] Goh, G., Yap, Y., Tan, H., Sing, S., Goh, G., Yeong, W.: Process–structure–properties in polymer additive manufacturing via material extrusion: A review. Critical Reviews in Solid State and Materials Sciences 45(2), 113–133 (2020)
  • [42] Additive Manufacturing. https://apmonitor.com/pds/index.php/Main/AdditiveManufacturing
  • Braconnier et al. [2020] Braconnier, D., Jensen, R., Peterson, A.: Processing parameter correlations in material extrusion additive manufacturing. Additive Manufacturing 31, 100924 (2020)
  • Gao et al. [2021] Gao, X., Qi, S., Kuang, X., Su, Y., Li, J., Wang, D.: Fused filament fabrication of polymer materials: A review of interlayer bond. Additive Manufacturing 37, 101658 (2021)
  • Shah et al. [2021] Shah, D., Morris, J., Plaisted, T., Amirkhizi, A., Hansen, C.: Highly filled resins for DLP-based printing of low density, high modulus materials. Additive Manufacturing 37, 101736 (2021)
  • Zakeri et al. [2020] Zakeri, S., Vippola, M., Levänen, E.: A comprehensive review of the photopolymerization of ceramic resins used in stereolithography. Additive Manufacturing 35, 101177 (2020)
  • Wang et al. [2020] Wang, W., Sun, J., Guo, B., Chen, X., Ananth, K., Bai, J.: Fabrication of piezoelectric nano-ceramics via stereolithography of low viscous and non-aqueous suspensions. Journal of the European Ceramic Society 40(3), 682–688 (2020)
  • Al Rashid et al. [2021] Al Rashid, A., Ahmed, W., Khalid, M., Koc, M.: Vat photopolymerization of polymers and polymer composites: Processes and applications. Additive Manufacturing 47, 102279 (2021)
  • Konijn et al. [2014] Konijn, B., Sanderink, O., Kruyt, N.: Experimental study of the viscosity of suspensions: Effect of solid fraction, particle size and suspending liquid. Powder Technology 266, 61–69 (2014)
  • Delarue et al. [2023] Delarue, A., McAninch, I., Peterson, A., Hansen, C.: Increasing printable solid loading in digital light processing using a bimodal particle size distribution. 3D Printing and Additive Manufacturing (2023)
  • Tomeckova and Halloran [2010a] Tomeckova, V., Halloran, J.: Critical energy for photopolymerization of ceramic suspensions in acrylate monomers. Journal of the European Ceramic Society 30(16), 3273–3282 (2010)
  • Tomeckova and Halloran [2010b] Tomeckova, V., Halloran, J.: Cure depth for photopolymerization of ceramic suspensions. Journal of the European Ceramic Society 30(15), 3023–3033 (2010)
  • Jin et al. [2023] Jin, H., Chollet, F., Song, Q., Hu, X.: AutoKeras: An AutoML library for deep learning. Journal of Machine Learning Research 24(6), 1–6 (2023)