
Fuzzy Knowledge Distillation from
High-Order TSK to Low-Order TSK

Xiongtao Zhang, Zezong Yin, Yunliang Jiang, Yizhang Jiang, Danfeng Sun and Yong Liu. This work was supported in part by the National Natural Science Foundation of China (U22A20102, 62171203), in part by the "Pioneer" and "Leading Goose" R&D Program of Zhejiang Province (2023C01150), and in part by the open project fund of the Key Laboratory of Image Processing and Intelligent Control (Huazhong University of Science and Technology), Ministry of Education (Corresponding author: Yunliang Jiang). Xiongtao Zhang and Zezong Yin are with the Zhejiang Province Key Laboratory of Smart Management & Application of Modern Agricultural Resources, Huzhou University, Huzhou 313000, China, and also with the School of Information Engineering, Huzhou University, Huzhou 313000, China. Yunliang Jiang is with the School of Computer Science and Technology, Zhejiang Normal University, Jinhua 321004, China, and also with the School of Information Engineering, Huzhou University, Huzhou 313000, China. Yizhang Jiang is with the School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China, and also with the Key Laboratory of Image Processing and Intelligent Control (Huazhong University of Science and Technology), Ministry of Education, Wuhan 430074, China. Danfeng Sun is with the School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China. Yong Liu is with the College of Control Science and Engineering, Zhejiang University, Hangzhou 310007, China.
Abstract

High-order Takagi-Sugeno-Kang (TSK) fuzzy classifiers possess powerful classification performance with fewer fuzzy rules, but they are impaired by exponentially growing training time and poorer interpretability owing to the High-order polynomial used in the consequent part of each fuzzy rule, while Low-order TSK fuzzy classifiers run quickly with high interpretability, yet they usually require more fuzzy rules and perform relatively poorly. To address this issue, a novel TSK fuzzy classifier embedded with knowledge distillation in deep learning, called HTSK-LLM-DKD, is proposed in this study. HTSK-LLM-DKD has the following distinctive characteristics: 1) It takes a High-order TSK fuzzy classifier as the teacher model and a Low-order TSK fuzzy classifier as the student model, and leverages the proposed LLM-DKD (Least Learning Machine based Decoupling Knowledge Distillation) to distill fuzzy dark knowledge from the High-order TSK fuzzy classifier to the Low-order TSK fuzzy classifier, endowing the Low-order TSK fuzzy classifier with enhanced performance surpassing, or at least comparable to, the High-order TSK fuzzy classifier, as well as high interpretability; 2) The Negative Euclidean distance between the output of the teacher model and each class is employed to obtain the teacher logits, and the teacher/student soft labels are then computed by the softmax function with a distillation temperature parameter; 3) By reformulating the Kullback-Leibler divergence, it decouples fuzzy dark knowledge into target class knowledge and non-target class knowledge, and transfers them to the student model. The advantages of HTSK-LLM-DKD are verified on benchmarking UCI datasets and the real-world Cleveland heart disease dataset, in terms of classification performance and model interpretability.

Index Terms:
Deep learning, Fuzzy dark knowledge, High-order Takagi-Sugeno-Kang (TSK) fuzzy classifier, Knowledge distillation, Least Learning Machine.

I Introduction

Takagi-Sugeno-Kang (TSK) fuzzy classifiers are among the most famous fuzzy classifiers; each fuzzy rule consists of an antecedent part and a consequent part. The antecedent part divides the input space into several fuzzy areas, and the consequent part describes the logic of the classifier in those areas. TSK fuzzy classifiers have been deeply combined with deep learning [1, 2, 3], and successfully applied in many fields, including epileptic seizure detection [4, 5, 6], subway fare pricing [7], and vehicle path planning [8].

Among the wide variety of TSK fuzzy classifiers, Zero-order [9] and First-order TSK fuzzy classifiers [10] have attracted the most widespread attention, owing to their high interpretability and fast training speed. However, they are prone to rule explosion [40], which means the number of fuzzy rules grows rapidly when enhanced classification performance is desired. To address this issue, many methods have been developed, which can be classified into three categories:

  1. Hierarchical TSK fuzzy classifiers organize each layer in a stacked way based on the stacked generalization principle [11] to obtain improved classification performance. Xiongtao Zhang et al. [3] proposed a novel hierarchical structure of TSK fuzzy subclassifiers called EP-TSK-FK, which quickly builds subclassifiers in parallel to obtain augmented validation data, and then obtains predictions by fuzzy clustering and KNN. Wang et al. [40] invented a hierarchical TSK fuzzy classifier with shared linguistic fuzzy rules by opening the manifold structure of the original input space, in which the input of each layer includes the output of the previous layer in addition to the original data.

  2. Deep learning based TSK fuzzy classifiers integrate deep learning strategies with the TSK fuzzy classifier. Wu et al. [12] extended Dropout to fuzzy rules, where the model drops some rules at random during training and uses all rules during testing. Cui et al. [13] avoided the excessive influence of strong rules by adjusting the membership degrees of fuzzy rules with normalization.

  3. High-order TSK fuzzy classifiers use a High-order polynomial in the consequent part to escape rule explosion and achieve better classification performance with fewer rules. Ren et al. [41] built a High-order type-2 TSK system, in which the membership functions in the antecedent part are type-2 fuzzy sets and the consequent part is a High-order polynomial function. However, the High-order TSK fuzzy classifier still has a serious shortcoming: its interpretability suffers substantial damage since the parameters of the High-order polynomial in the consequent part are overly complex.

In deep learning, a large model usually achieves strong performance owing to the huge amount of computation used to extract structure from data, but its complex parameters make it difficult to apply in various real-world fields. In contrast, a small model is easy to deploy but does not perform very well in practice. Inspired by this, Hinton et al. [14] proposed the model compression technique known as knowledge distillation, which compresses a student model (usually a small or simple model) from a teacher model (usually a large or complex model); it guides the student model by transferring dark knowledge, i.e., by minimizing the Kullback-Leibler divergence (KL divergence) between the prediction logits of the teacher model and the student model, thus improving the performance of the student model. As research continues, the directions of knowledge distillation are growing increasingly broad. Various forms of knowledge and distillation methods have been proposed [15, 16, 17, 18, 19, 20, 21]. Knowledge distillation is also widely applied to object recognition [22, 23, 24], defect detection [25, 26] and other fields [27, 28, 29]. Recent studies have found that knowledge distillation can greatly improve the performance of TSK fuzzy classifiers. Gu et al. [30] transferred knowledge from a CNN to a TSK fuzzy classifier and then explained how the TSK fuzzy classifier made decisions. Erdem et al. [31] used a CNN to distill an interval type-2 fuzzy classifier, which improved the classification performance on large datasets. Our previous study [1] built a born-again TSK fuzzy classifier with a CNN, which took dark knowledge from the CNN as the parameters of the antecedent part and consequent part of the TSK fuzzy classifier, respectively, and then expressed the dark knowledge in an interpretable manner.

As is well known, the High-order TSK fuzzy classifier exhibits outstanding classification performance because of its powerful fitting ability, while Low-order TSK fuzzy classifiers have demonstrated strength in concise interpretability due to their interpretable consequent parts. In this paper, we extend our previous study [1] from binary classification only to multi-class classification by born-again training of a TSK fuzzy classifier with fuzzy knowledge distillation, and propose a novel TSK fuzzy classifier called HTSK-LLM-DKD, which decouples and transfers fuzzy dark knowledge from a High-order TSK fuzzy classifier (teacher model) to a Low-order TSK fuzzy classifier (student model), obtaining better classification performance as well as high interpretability. Our study advances the combination of knowledge distillation and TSK fuzzy classifiers to a deeper level. The contributions of this study can be summarized as follows:

  1. HTSK-LLM-DKD proposes a novel Least Learning Machine based decoupling knowledge distillation, denoted as LLM-DKD. Compared with the gradient descent approach, HTSK-LLM-DKD can quickly solve the consequent part of the teacher model by the Least Learning Machine (LLM) [2]. Furthermore, LLM-DKD uses the Negative Euclidean distance between the output and each class to obtain the logits. According to [32], the logits in LLM-DKD can represent more comprehensive class information and transfer fuzzy dark knowledge better, with a higher semantic level.

  2. HTSK-LLM-DKD employs the softmax function with a distillation temperature parameter to obtain the soft labels of the teacher model and the student model. By reformulating the Kullback-Leibler divergence (KL divergence), HTSK-LLM-DKD decouples fuzzy dark knowledge into target class knowledge and non-target class knowledge, providing a more flexible and efficient logits distillation perspective, so as to further improve the performance of the TSK fuzzy classifier.

  3. Experimental results on benchmarking datasets and the real-world Cleveland heart disease dataset demonstrate the effectiveness of the proposed HTSK-LLM-DKD. In terms of classification performance, HTSK-LLM-DKD achieves the best performance on most datasets and performs better than the High-order TSK fuzzy classifier, showing powerful generalization ability; in terms of interpretability, HTSK-LLM-DKD is inherently comparable with highly interpretable Low-order TSK fuzzy classifiers.

Table S1 in the supplementary file summarizes the notations used in our study.

II Related Work

II-A TSK fuzzy classifier

TSK fuzzy classifier can be described by fuzzy IF-THEN rules, which indicate the input-output relationship of the classifier. For TSK fuzzy classifiers, the fuzzy rules can be expressed as follows:

$$
\begin{aligned}
&{\rm IF}\ x_{1}\ {\rm is}\ A_{1}^{k}\ \wedge\ x_{2}\ {\rm is}\ A_{2}^{k}\ \wedge\ \ldots\ \wedge\ x_{m}\ {\rm is}\ A_{m}^{k}\\
&{\rm THEN}\ y^{k}=f^{k}(\mathbf{x})
\end{aligned}
\tag{1a}
$$

where $k=1,2,\ldots,K$, $K$ is the total number of fuzzy rules, the input is denoted as $\mathbf{x}=(x_{1},x_{2},\ldots,x_{i},\ldots,x_{m})^{T}$, $A_{i}^{k}$ is the fuzzy set of the $k$-th rule for $x_{i}$, $y^{k}$ is the output of the $k$-th rule, and $\wedge$ is the fuzzy conjunction operation. If $f^{k}(\mathbf{x})=p_{0}^{k}$, we term the TSK fuzzy classifier as the Zero-order TSK fuzzy classifier [9]. If $f^{k}(\mathbf{x})=p_{0}^{k}+x_{1}p_{1}^{k}+x_{2}p_{2}^{k}+\ldots+x_{m}p_{m}^{k}$, the TSK fuzzy classifier is termed as the First-order TSK fuzzy classifier [10]. If $f^{k}(\mathbf{x})$ is described as:

$$f^{k}(\mathbf{x})=\sum_{\substack{j_{1}+j_{2}+\ldots+j_{m}\leq n\\ j_{1},j_{2},\ldots,j_{m}\geq 0}}a_{j_{1},j_{2},\ldots,j_{m}}^{k}\,x_{1}^{j_{1}}x_{2}^{j_{2}}\ldots x_{m}^{j_{m}} \tag{1b}$$

it is termed as the High-order TSK fuzzy classifier [11], where $n$ ($n\geq 2$) is the order of the highest polynomial of the TSK fuzzy classifier, $j_{i}$ is the order of $x_{i}$, and $a_{j_{1},j_{2},\ldots,j_{m}}^{k}$ represents the coefficient of the highest polynomial with $m$ independent variables in the linear combination constituting the $k$-th rule.
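For concreteness, the following minimal sketch (Python/NumPy; not part of the original formulation) evaluates the consequent of a single rule for the Zero-order, First-order, and High-order cases; the flat coefficient layout used for the High-order case is a hypothetical choice for illustration only.

```python
import numpy as np
from itertools import combinations_with_replacement

def zero_order_consequent(x, p0):
    # Zero-order: f^k(x) = p_0^k, a constant per rule.
    return p0

def first_order_consequent(x, p):
    # First-order: f^k(x) = p_0^k + p_1^k x_1 + ... + p_m^k x_m, with p of length m+1.
    return p[0] + float(np.dot(p[1:], x))

def high_order_consequent(x, coeffs, n):
    # High-order: sum of all monomials x_1^{j_1}...x_m^{j_m} with j_1+...+j_m <= n, as in (1b).
    # coeffs is assumed to be a flat list with one entry per monomial, in enumeration order.
    m = len(x)
    value, idx = 0.0, 0
    for degree in range(n + 1):
        for combo in combinations_with_replacement(range(m), degree):
            monomial = float(np.prod([x[i] for i in combo])) if combo else 1.0
            value += coeffs[idx] * monomial
            idx += 1
    return value
```

For $m$ inputs and order $n$, the number of coefficients per rule is $\binom{m+n}{n}$, which is exactly what gives the High-order consequent its fitting power and costs it its interpretability.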

Clearly, the consequent parameters of the Zero-order and First-order TSK fuzzy classifiers are relatively simple; they are the coefficients of the fuzzy membership degree and the input variables, which ensures high interpretability but requires more fuzzy rules to achieve satisfactory performance. In contrast, the High-order TSK fuzzy classifier can perform better with fewer fuzzy rules, but the consequent part of its fuzzy rules is extremely complex, which sharply reduces interpretability. As a result, how to combine the strong performance of the High-order TSK fuzzy classifier with the high interpretability of the Low-order TSK fuzzy classifier becomes a very valuable research direction.

II-B Knowledge Distillation

Knowledge distillation [14] transfers dark knowledge from a complex/large model, namely the teacher model, to a simple/small model, namely the student model, and hence improves the performance of the student model. Specifically, knowledge distillation first introduces the hyper-parameter temperature into the softmax function to obtain soft labels, then calculates the KL divergence between the soft labels and the cross-entropy between the student model output and the ground-truth label, and finally transfers the dark knowledge via the soft labels from the teacher model to the student model, improving the performance of the student model, as shown in Fig. 1.

There are many strategies for constructing the loss function in knowledge distillation, such as the KL divergence [14], the mean squared error [37] and the Jensen-Shannon divergence [38]. Traditional knowledge distillation transfers dark knowledge in a highly coupled way, which limits the flexibility of knowledge transfer. Zhao et al. [32] pointed out that dark knowledge can be decoupled into target class knowledge and non-target class knowledge and transferred to the student model in a more flexible way by reconstructing the KL divergence. Furlanello et al. [33] demonstrated that the decoupled dark knowledge of the teacher model can guide the student model to achieve stronger generalization ability than the teacher model. In this paper, we attempt to distill fuzzy dark knowledge from the High-order TSK fuzzy classifier, and propose a novel born-again TSK fuzzy classifier endowed with powerful classification performance as well as high interpretability.
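For reference, and as a baseline for the decoupled variant developed in Section III, the sketch below illustrates the classical coupled KD loss for a single sample; the temperature value and the balancing weight `alpha` are generic illustrative choices rather than settings taken from this paper.

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    z -= z.max()                                  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def coupled_kd_loss(teacher_logits, student_logits, label_onehot, tau=2.0, alpha=0.5):
    # Soft labels of the teacher and the student at temperature tau.
    u_teacher = softmax(teacher_logits, tau)
    u_student = softmax(student_logits, tau)
    # Coupled dark-knowledge term: KL(teacher soft labels || student soft labels).
    kl = float(np.sum(u_teacher * np.log(u_teacher / u_student)))
    # Hard-label cross-entropy of the student prediction (temperature 1).
    ce = float(-np.sum(label_onehot * np.log(softmax(student_logits))))
    # alpha balances the two terms; a common tau^2 rescaling of the KL term is omitted here.
    return alpha * kl + (1.0 - alpha) * ce
```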

Figure 1: Architecture of knowledge distillation. Taking Convolutional Neural Network (CNN) as teacher model and Fully Connected Neural Network (FCNN) as student model for example.

III HTSK-LLM-DKD

In this section, we propose a novel TSK fuzzy classifier called HTSK-LLM-DKD, which distills knowledge from a High-order TSK fuzzy classifier (teacher model) to a Low-order TSK fuzzy classifier (student model) using the proposed LLM-DKD. Specifically, LLM-DKD first trains the teacher model quickly and takes the Negative Euclidean distance between the output value and each class to obtain the teacher logits. Then, it uses the softmax function with temperature parameter $\tau$ to obtain the soft labels of the teacher model and the student model, respectively. Finally, LLM-DKD decouples fuzzy dark knowledge into target class knowledge and non-target class knowledge, and transfers them to the student model efficiently, so as to obtain better classification performance as well as high interpretability. The overall architecture of HTSK-LLM-DKD is shown in Fig. 2.

Figure 2: Architecture of HTSK-LLM-DKD. We distill fuzzy dark knowledge from the High-order TSK fuzzy classifier to the Low-order TSK fuzzy classifier; specifically, the KL divergence is decoupled into the Target Class KL Loss (TCKL) and the Non-Target Class KL Loss (NCKL), and the proposed HTSK-LLM-DKD is optimized with TCKL, NCKL and the Cross-Entropy (H) as the loss function.

III-A Specific Architecture of Teacher Model and Student Model

III-A1 Constructing Process of Teacher Model

Based on the definition of the High-order polynomial, we prove that a High-order polynomial can be composed of several Low-order polynomials; the mathematical proof is presented in the supplementary file. In this paper, a Third-order TSK fuzzy classifier is employed as the teacher model, which is built by stacking multiple Zero-order TSK fuzzy classifiers [34], i.e., the consequent part of the teacher model can be expressed as the superposition of the consequent parts of several Low-order TSK fuzzy classifiers:

$$
\begin{aligned}
y=\ & y_{0}+x_{1}\left(y_{0}^{(1,0)}+x_{1}y_{1}^{(1,1)}+x_{2}y_{1}^{(1,2)}+\ldots+x_{m}y_{1}^{(1,m)}\right)\\
&+x_{2}\left(y_{0}^{(2,0)}+x_{1}y_{1}^{(2,1)}+x_{2}y_{1}^{(2,2)}+\ldots+x_{m}y_{1}^{(2,m)}\right)\\
&+\ldots+x_{m}\left(y_{0}^{(m,0)}+x_{1}y_{1}^{(m,1)}+x_{2}y_{1}^{(m,2)}+\ldots+x_{m}y_{1}^{(m,m)}\right)
\end{aligned}
\tag{1c}
$$

where $y_{n}^{(i,j)}$ represents the output of the $j$-th $n$-order TSK fuzzy classifier, which is used to construct the $i$-th $(n+1)$-order TSK fuzzy classifier. Specifically, $y_{0}$ indicates the output of the Zero-order TSK fuzzy classifier.

III-A2 Calculation Process of Teacher Model and Student Model

The output in (1c) is usually calculated by weighted summation:

$$y=\sum_{k=1}^{K}\frac{\mu^{k}(\mathbf{x})}{\sum_{k^{\prime}=1}^{K}\mu^{k^{\prime}}(\mathbf{x})}\,y^{k}=\sum_{k=1}^{K}\tilde{\mu}^{k}(\mathbf{x})\,y^{k} \tag{2a}$$

where $\mu^{k}(\mathbf{x})$ and $\tilde{\mu}^{k}(\mathbf{x})$ are the membership degree and the normalized membership degree of the input $\mathbf{x}$ in the $k$-th fuzzy rule, respectively. The Gaussian membership function is widely used to calculate the fuzzy membership degree:

$$\mu^{k}(\mathbf{x})=\prod_{i=1}^{m}\mu^{k}(x_{i}),\qquad \mu^{k}(x_{i})=\exp\left(\frac{-\left(x_{i}-v_{i}^{k}\right)^{2}}{2\delta_{i}^{k}}\right) \tag{2b}$$

where $\mu^{k}(x_{i})$ is the membership degree of $x_{i}$, $v_{i}^{k}$ is the center of each fuzzy rule, which is randomly selected from $\{0,0.25,0.5,0.75,1\}$ and may have a linguistic explanation: {very low, low, medium, high, very high}, thus ensuring the interpretable antecedent part of the proposed HTSK-LLM-DKD, and $\delta_{i}^{k}$ is the kernel width, which is set to a positive value.
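A minimal sketch of (2a)-(2b) in Python/NumPy, computing the normalized firing strengths of $K$ rules for one input; the example centers and widths at the bottom are arbitrary illustrative values.

```python
import numpy as np

def firing_strengths(x, centers, widths):
    """Normalized rule firing strengths tilde{mu}^k(x) of (2a) for one input x (shape (m,)).
    centers and widths have shape (K, m); widths are the positive delta_i^k of (2b)."""
    mu = np.exp(-((x[None, :] - centers) ** 2) / (2.0 * widths))  # per-feature Gaussian memberships
    rule_mu = mu.prod(axis=1)                                     # product conjunction -> mu^k(x)
    return rule_mu / rule_mu.sum()                                # normalization of (2a)

# Example: K = 3 rules, m = 2 features, centers drawn from {0, 0.25, 0.5, 0.75, 1}.
rng = np.random.default_rng(0)
centers = rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=(3, 2))
widths = np.full((3, 2), 0.5)
print(firing_strengths(np.array([0.3, 0.8]), centers, widths))
```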

Several mathematical transformations from the fuzzy rules to a linear function exist, as follows.

$$\mathbf{x}_{e}=\left(1,\mathbf{x}^{T}\right)^{T},\qquad \mathbf{x}_{f}=\left(1,x_{i}\mathbf{x}_{e}^{T}\right)^{T},\qquad \mathbf{x}_{v}=\left(1,x_{i}\mathbf{x}_{f}^{T}\right)^{T} \tag{3a}$$
$$\tilde{\mathbf{x}}_{g}^{k}=\tilde{\mu}^{k}(\mathbf{x})\,\mathbf{x}_{v},\qquad \mathbf{x}_{g}=\left[\left(\tilde{\mathbf{x}}_{g}^{1}\right)^{T},\left(\tilde{\mathbf{x}}_{g}^{2}\right)^{T},\ldots,\left(\tilde{\mathbf{x}}_{g}^{K}\right)^{T}\right]^{T} \tag{3b}$$
$$\tilde{\mathbf{x}}_{h}^{k}=\tilde{\mu}^{k}(\mathbf{x})\,\mathbf{x}_{e},\qquad \mathbf{x}_{h}=\left[\left(\tilde{\mathbf{x}}_{h}^{1}\right)^{T},\left(\tilde{\mathbf{x}}_{h}^{2}\right)^{T},\ldots,\left(\tilde{\mathbf{x}}_{h}^{K}\right)^{T}\right]^{T} \tag{3c}$$

$\mathcal{M}$ and $\mathcal{S}$ denote the teacher model and the student model, respectively. Our teacher model is the Third-order TSK fuzzy classifier, which performs well but usually spends much time on training; in this paper, we use the Least Learning Machine (LLM) [2] to quickly solve the consequent part of the teacher model:

$$y^{\mathcal{M}}=\mathbf{q}_{g}^{T}\mathbf{x}_{g},\qquad \mathbf{q}_{g}=\left((1/L)\mathbf{I}+\mathbf{X}_{g}^{T}\mathbf{X}_{g}\right)^{-1}\mathbf{X}_{g}^{T}\overline{\mathbf{Y}} \tag{4}$$

where $y^{\mathcal{M}}$ is the output of the teacher model, $\mathbf{q}_{g}$ is the consequent parameter vector of the teacher model, $L$ is the regularization parameter, $\mathbf{X}_{g}=\left[\mathbf{x}_{g}^{1},\mathbf{x}_{g}^{2},\ldots,\mathbf{x}_{g}^{N}\right]^{T}$, $\mathbf{I}$ is the identity matrix, $N$ is the number of samples, and $\overline{\mathbf{Y}}=\left[\overline{Y}_{1},\overline{Y}_{2},\ldots,\overline{Y}_{N}\right]^{T}$ is the ground-truth label vector.
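A minimal sketch of the closed-form LLM solution (4); it assumes the stacked matrix $\mathbf{X}_g$ has already been assembled via (3a)-(3b), and the identity matrix is sized to match $\mathbf{X}_{g}^{T}\mathbf{X}_{g}$ so that the expression is well defined.

```python
import numpy as np

def llm_solve(X_g, Y_bar, L=100.0):
    """Closed-form consequent parameters of the teacher model, eq. (4):
    q_g = ((1/L) I + X_g^T X_g)^{-1} X_g^T Y_bar.
    X_g: (N, D) matrix whose rows are the stacked vectors x_g of (3a)-(3b);
    Y_bar: (N,) ground-truth labels."""
    D = X_g.shape[1]
    A = np.eye(D) / L + X_g.T @ X_g
    return np.linalg.solve(A, X_g.T @ Y_bar)   # solving the linear system is more stable than an explicit inverse
```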

The First-order TSK fuzzy classifier is utilized as the student model in this paper, which runs fast with high interpretability but performs relatively poorly compared with the High-order TSK fuzzy classifier. In HTSK-LLM-DKD, the consequent parameters of the student model can be updated by the gradient descent approach derived by minimizing the cross-entropy error criterion:

$$\mathbf{z}^{\mathcal{S}}=\mathbf{Q}_{h}^{T}\mathbf{x}_{h},\qquad H=-\sum_{i=1}^{N}\sum_{t=1}^{C}\overline{Y}_{i,t}\log\left(z_{i,t}^{\mathcal{S}}\right) \tag{5}$$
$$\mathbf{Q}_{h}(d+1)=\mathbf{Q}_{h}(d)-\eta\frac{\partial H}{\partial\mathbf{Q}_{h}(d)} \tag{6}$$

where $\mathbf{z}^{\mathcal{S}}$ is the student logits as shown in Fig. 2, $\mathbf{Q}_{h}$ is the consequent parameter matrix of the student model, $H$ is the cross-entropy loss, and $\eta$ is the given learning rate.
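The following sketch mirrors the gradient-descent update (5)-(6) for the student's consequent matrix $\mathbf{Q}_h$; it assumes a softmax is applied to the student logits before the cross-entropy (a standard choice that the text leaves implicit), and its stopping rule follows the spirit of Algorithm 1.

```python
import numpy as np

def train_student(X_h, Y_onehot, lr=0.01, max_epoch=30, tol=1e-5):
    """Gradient-descent update (5)-(6) of the student's consequent matrix Q_h.
    X_h: (N, D) rows are the antecedent-weighted vectors x_h of (3c); Y_onehot: (N, C)."""
    N, D = X_h.shape
    C = Y_onehot.shape[1]
    Q = np.zeros((D, C))
    prev_H = np.inf
    for _ in range(max_epoch):
        Z = X_h @ Q                                      # student logits z^S = Q_h^T x_h
        E = np.exp(Z - Z.max(axis=1, keepdims=True))
        P = E / E.sum(axis=1, keepdims=True)             # softmax of the student logits
        H = -np.sum(Y_onehot * np.log(P + 1e-12))        # cross-entropy H of (5)
        if prev_H - H <= tol:                            # stop when the improvement falls below the threshold
            break
        prev_H = H
        grad = X_h.T @ (P - Y_onehot)                    # dH/dQ_h for softmax cross-entropy
        Q -= lr * grad                                   # update rule (6)
    return Q
```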

III-B Least Learning Machine based Decoupling Knowledge Distillation (LLM-DKD)

III-B1 Teacher Logits

LLM-DKD takes the Negative Euclidean distance [39] between the teacher model output and each class as the teacher logits:

$$z_{t}^{\mathcal{M}}=-\sqrt{\left(y^{\mathcal{M}}-\hat{y}_{t}\right)^{2}} \tag{7}$$

where $\hat{y}_{t}$ is the label of the $t$-th class, $\mathbf{z}^{\mathcal{M}}=\left[z^{\mathcal{M}}_{1},z^{\mathcal{M}}_{2},\ldots,z^{\mathcal{M}}_{t},\ldots,z^{\mathcal{M}}_{C}\right]\in\mathrm{R}^{1\times C}$ is the teacher logits as shown in Fig. 2, and $C$ is the number of classes.
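Since $\sqrt{(y^{\mathcal{M}}-\hat{y}_{t})^{2}}=|y^{\mathcal{M}}-\hat{y}_{t}|$, the teacher logits of (7) reduce to a negative absolute distance; a minimal sketch with hypothetical class labels:

```python
import numpy as np

def teacher_logits(y_teacher, class_labels):
    """Teacher logits of (7): the logit of each class is the negative absolute distance
    between the scalar teacher output and that class label, so the closest class gets
    the largest logit."""
    return -np.abs(y_teacher - np.asarray(class_labels, dtype=float))

# Example with three classes labelled 0, 1, 2 and a teacher output of 1.7.
print(teacher_logits(1.7, [0, 1, 2]))   # approximately [-1.7, -0.7, -0.3]
```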

III-B2 Target Class Knowledge and Non-target Class Knowledge

For a given datum from the $t$-th class, the soft labels can be denoted as $\mathbf{u}=[u_{1},u_{2},\ldots,u_{t},\ldots,u_{C}]\in\mathrm{R}^{1\times C}$, where $u_{i}$ is the soft label of the $i$-th class. Each element in $\mathbf{u}$ can be obtained by the softmax function with temperature $\tau$:

$$u_{i}=\frac{\exp\left(z_{i}/\tau\right)}{\sum_{j=1}^{C}\exp\left(z_{j}/\tau\right)} \tag{8a}$$

where $z_{i}$ represents the logit of the $i$-th class.

LLM-DKD decouples fuzzy dark knowledge into target class knowledge and non-target class knowledge by:

$$u_{t}=\frac{\exp\left(z_{t}/\tau\right)}{\sum_{j=1}^{C}\exp\left(z_{j}/\tau\right)},\qquad u_{\backslash t}=\frac{\sum_{j^{\prime}=1,j^{\prime}\neq t}^{C}\exp\left(z_{j^{\prime}}/\tau\right)}{\sum_{j=1}^{C}\exp\left(z_{j}/\tau\right)} \tag{8b}$$

where $u_{t}$ represents the soft label of the target class, containing knowledge about the “difficulty” of the data, and $u_{\backslash t}$ represents the soft label of the non-target classes, containing the knowledge that makes knowledge distillation work [32].

We use $\mathbf{\hat{u}}=\left[\hat{u}_{1},\hat{u}_{2},\ldots,\hat{u}_{t-1},\hat{u}_{t+1},\ldots,\hat{u}_{C}\right]\in\mathrm{R}^{1\times(C-1)}$ to independently model the probabilities among the non-target classes:

$$\hat{u}_{i}=\frac{\exp\left(z_{i}/\tau\right)}{\sum_{j=1,j\neq t}^{C}\exp\left(z_{j}/\tau\right)} \tag{8c}$$
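A minimal sketch of (8a)-(8c) for one sample, splitting the temperature-softened probabilities into the target soft label $u_{t}$, the aggregated non-target soft label $u_{\backslash t}$, and the renormalized non-target distribution $\hat{\mathbf{u}}$:

```python
import numpy as np

def decouple_soft_labels(z, t, tau=2.0):
    """Split the temperature-softened probabilities of (8a) into the target soft label u_t,
    the aggregated non-target soft label of (8b), and the renormalized non-target
    distribution of (8c). z: logits of one sample, t: index of the target class."""
    e = np.exp((np.asarray(z, dtype=float) - np.max(z)) / tau)
    u = e / e.sum()                     # eq. (8a)
    u_t = u[t]                          # target class probability
    u_not_t = 1.0 - u_t                 # total non-target probability, eq. (8b)
    mask = np.arange(len(u)) != t
    u_hat = u[mask] / u[mask].sum()     # eq. (8c), defined over the non-target classes only
    return u_t, u_not_t, u_hat
```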
Algorithm 1 Teacher model and Student model.

Input: Training dataset $\mathbf{X}=\{\mathbf{x}_{i},\mathbf{x}_{i}\in\mathrm{R}^{m},i=1,2,\ldots,N\}$ and the ground-truth label $\overline{\mathbf{Y}}=\{\overline{Y}_{i},\overline{Y}_{i}\in\mathrm{R},i=1,2,\ldots,N\}$, the number of fuzzy rules $K$, the regularization parameter $L$, the maximum iteration epoch $\theta$, the threshold parameter $\xi$, the learning rate $\eta$.
Output: the outputs of the teacher model and the student model.
Procedure:
Step 1: Randomly select the center $v_{i}^{k}$ of the Gaussian membership function in (2b) from the five fixed fuzzy partitions $\{0,0.25,0.5,0.75,1\}$, set the width $\delta_{i}^{k}$ to a positive value, and compute the normalized fuzzy membership degree by (2a)-(2b).
Step 2: Calculate the antecedent parameter matrices of the teacher model and the student model by (3a)-(3c).
Step 3: The consequent parameter of the teacher model $\mathbf{q}_{g}$ can be determined by $\mathbf{q}_{g}=\left((1/L)\mathbf{I}+\mathbf{X}_{g}^{T}\mathbf{X}_{g}\right)^{-1}\mathbf{X}_{g}^{T}\overline{\mathbf{Y}}$.
Step 4: The consequent parameter of the student model $\mathbf{Q}_{h}$ can be updated using the gradient descent approach:
     Step 4(a): Initialize the consequent parameter $\mathbf{Q}_{h}$ and set $d=1$.
     Repeat
     Step 4(b): Use (6) to compute $\mathbf{Q}_{h}(d+1)$.
     Step 4(c): $d=d+1$.
     Until $H(d)-H(d-1)\leq\xi$ or $d\geq\theta$
Step 5: Calculate the output of the teacher model $y^{\mathcal{M}}=\mathbf{q}_{g}^{T}\mathbf{x}_{g}$ and the output of the student model $\mathbf{z}^{\mathcal{S}}=\mathbf{Q}_{h}^{T}\mathbf{x}_{h}$.

Algorithm 2 HTSK-LLM-DKD.

Input: Training dataset $\mathbf{X}=\{\mathbf{x}_{i},\mathbf{x}_{i}\in\mathrm{R}^{m},i=1,2,\ldots,N\}$ and the ground-truth label $\overline{\mathbf{Y}}=\{\overline{Y}_{i},\overline{Y}_{i}\in\mathrm{R},i=1,2,\ldots,N\}$, the outputs of the teacher model $\mathbf{y}^{\mathcal{M}}=\{y^{\mathcal{M}}_{i},y^{\mathcal{M}}_{i}\in\mathrm{R},i=1,2,\ldots,N\}$ and the outputs of the student model $\mathbf{Z}^{\mathcal{S}}=\{\mathbf{z}^{\mathcal{S}}_{i},\mathbf{z}^{\mathcal{S}}_{i}\in\mathrm{R}^{C},i=1,2,\ldots,N\}$, the maximum iteration epoch $\theta$, the threshold parameter $\xi$, the learning rate $\eta$, the distillation parameters $\tau$, $\zeta$, $\lambda$, $\varphi$.
Output: the output of HTSK-LLM-DKD.
Procedure:
Step 1: Calculate the logits of the teacher model $\mathbf{z}^{\mathcal{M}}$ with the Negative Euclidean distance between $y^{\mathcal{M}}$ and the label of each class by (7).
Step 2: Calculate the soft labels of the teacher model $\mathbf{u}^{\mathcal{M}}$ and the student model $\mathbf{u}^{\mathcal{S}}$ with the softmax function by (8a).
Step 3: Decouple fuzzy dark knowledge into target class knowledge $\mathbf{r}$ and non-target class knowledge $\mathbf{\hat{u}}$ by (8a)-(8c).
Step 4: Use $\mathbf{r}$ and $\mathbf{\hat{u}}$ to rephrase $\operatorname{KL}(\mathbf{u}^{\mathcal{M}}\|\mathbf{u}^{\mathcal{S}})$ by (9a)-(9e).
Step 5: Calculate the new loss function of HTSK-LLM-DKD: $Loss=\zeta\operatorname{KL}\left(\mathbf{r}^{\mathcal{M}}\|\mathbf{r}^{\mathcal{S}}\right)+\lambda\operatorname{KL}\left(\widehat{\mathbf{u}}^{\mathcal{M}}\|\widehat{\mathbf{u}}^{\mathcal{S}}\right)+\varphi H$.
Step 6: Calculate the consequent parameter $\mathbf{Q}_{h}$ of HTSK-LLM-DKD using the gradient descent approach with the new loss function:
     Step 6(a): Initialize the consequent parameter $\mathbf{Q}_{h}$ and set $d=1$.
     Repeat
     Step 6(b): $\mathbf{Q}_{h}(d+1)=\mathbf{Q}_{h}(d)-\eta\frac{\partial Loss}{\partial\mathbf{Q}_{h}(d)}$.
     Step 6(c): $d=d+1$.
     Until $Loss(d)-Loss(d-1)\leq\xi$ or $d\geq\theta$
Step 7: Calculate the output of HTSK-LLM-DKD.

III-B3 Fuzzy Dark Knowledge Decoupling Process of LLM-DKD

The widely used KL divergence [14] is employed to decouple fuzzy dark knowledge, which can be expressed as:

$$KD=\operatorname{KL}\left(\mathbf{u}^{\mathcal{M}}\|\mathbf{u}^{\mathcal{S}}\right)=u_{t}^{\mathcal{M}}\log\left(\frac{u_{t}^{\mathcal{M}}}{u_{t}^{\mathcal{S}}}\right)+\sum_{i=1,i\neq t}^{C}u_{i}^{\mathcal{M}}\log\left(\frac{u_{i}^{\mathcal{M}}}{u_{i}^{\mathcal{S}}}\right) \tag{9a}$$

According to (8a)-(8c), we can obtain $\hat{u}_{i}=u_{i}/u_{\backslash t}$, and (9a) can be reformulated as:

$$KD=u_{t}^{\mathcal{M}}\log\left(\frac{u_{t}^{\mathcal{M}}}{u_{t}^{\mathcal{S}}}\right)+u_{\backslash t}^{\mathcal{M}}\sum_{i=1,i\neq t}^{C}\hat{u}_{i}^{\mathcal{M}}\left(\log\left(\frac{\hat{u}_{i}^{\mathcal{M}}}{\hat{u}_{i}^{\mathcal{S}}}\right)+\log\left(\frac{u_{\backslash t}^{\mathcal{M}}}{u_{\backslash t}^{\mathcal{S}}}\right)\right) \tag{9b}$$

Since $u_{\backslash t}^{\mathcal{M}}$ and $u_{\backslash t}^{\mathcal{S}}$ are irrelevant to the class index $i$, (9b) can be further expressed as:

$$KD=u_{t}^{\mathcal{M}}\log\left(\frac{u_{t}^{\mathcal{M}}}{u_{t}^{\mathcal{S}}}\right)+u_{\backslash t}^{\mathcal{M}}\log\left(\frac{u_{\backslash t}^{\mathcal{M}}}{u_{\backslash t}^{\mathcal{S}}}\right)+u_{\backslash t}^{\mathcal{M}}\sum_{i=1,i\neq t}^{C}\hat{u}_{i}^{\mathcal{M}}\log\left(\frac{\hat{u}_{i}^{\mathcal{M}}}{\hat{u}_{i}^{\mathcal{S}}}\right) \tag{9c}$$

We define the binary prediction $\mathbf{r}=[u_{t},u_{\backslash t}]\in\mathrm{R}^{1\times 2}$ to represent the soft labels of the target class and the non-target classes. The new expression of knowledge distillation can be described as:

$$KD=\operatorname{KL}\left(\mathbf{r}^{\mathcal{M}}\|\mathbf{r}^{\mathcal{S}}\right)+\left(1-u_{t}^{\mathcal{M}}\right)\operatorname{KL}\left(\widehat{\mathbf{u}}^{\mathcal{M}}\|\widehat{\mathbf{u}}^{\mathcal{S}}\right) \tag{9d}$$

where $\mathbf{r}^{\mathcal{M}}$ and $\mathbf{r}^{\mathcal{S}}$ represent the soft labels of the teacher model and the student model, respectively.

As shown in (9d), the loss function of knowledge distillation is reformulated as the weighted summation of two terms. $\operatorname{KL}\left(\mathbf{r}^{\mathcal{M}}\|\mathbf{r}^{\mathcal{S}}\right)$ measures the similarity between the binary probabilities of the teacher and student models on the target class prediction, and is called target class knowledge; $\operatorname{KL}\left(\widehat{\mathbf{u}}^{\mathcal{M}}\|\widehat{\mathbf{u}}^{\mathcal{S}}\right)$ measures the similarity between the teacher and student models on the internal relations among the non-target classes, and is called non-target class knowledge. The transfer efficiency of the non-target knowledge is negatively correlated with $u_{t}^{\mathcal{M}}$. We introduce $\zeta$ and $\lambda$ to reweight them, and obtain the following expression of decoupled knowledge distillation:

$$DKD=\zeta\operatorname{KL}\left(\mathbf{r}^{\mathcal{M}}\|\mathbf{r}^{\mathcal{S}}\right)+\lambda\operatorname{KL}\left(\widehat{\mathbf{u}}^{\mathcal{M}}\|\widehat{\mathbf{u}}^{\mathcal{S}}\right) \tag{9e}$$

We integrate the cross-entropy loss $H$ in (5) to obtain the total loss function of HTSK-LLM-DKD:

$$Loss=\zeta\operatorname{KL}\left(\mathbf{r}^{\mathcal{M}}\|\mathbf{r}^{\mathcal{S}}\right)+\lambda\operatorname{KL}\left(\widehat{\mathbf{u}}^{\mathcal{M}}\|\widehat{\mathbf{u}}^{\mathcal{S}}\right)+\varphi H \tag{10}$$
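Putting (9d), (9e) and (10) together, the sketch below computes the total HTSK-LLM-DKD loss for a single sample; the default values of $\tau$, $\zeta$, $\lambda$ and $\varphi$ are placeholders rather than tuned settings from the experiments.

```python
import numpy as np

def _soften(z, tau):
    """Temperature-softened probabilities, eq. (8a)."""
    e = np.exp((np.asarray(z, dtype=float) - np.max(z)) / tau)
    return e / e.sum()

def dkd_loss(z_teacher, z_student, t, y_onehot, tau=2.0, zeta=1.0, lam=2.0, phi=1.0):
    """One-sample total loss of eq. (10): zeta*TCKL + lambda*NCKL + phi*H,
    following the decomposition (9a)-(9e); t is the target class index."""
    eps = 1e-12
    uM, uS = _soften(z_teacher, tau), _soften(z_student, tau)
    # Binary predictions r = [u_t, u_\t] for teacher and student.
    rM = np.array([uM[t], 1.0 - uM[t]])
    rS = np.array([uS[t], 1.0 - uS[t]])
    tckl = np.sum(rM * np.log((rM + eps) / (rS + eps)))           # target class knowledge (TCKL)
    # Renormalized non-target distributions u_hat of (8c).
    mask = np.arange(len(uM)) != t
    uhatM = uM[mask] / uM[mask].sum()
    uhatS = uS[mask] / uS[mask].sum()
    nckl = np.sum(uhatM * np.log((uhatM + eps) / (uhatS + eps)))  # non-target class knowledge (NCKL)
    # Hard-label cross-entropy of the student prediction, as in (5).
    pS = _soften(z_student, 1.0)
    H = -np.sum(np.asarray(y_onehot) * np.log(pS + eps))
    return zeta * tckl + lam * nckl + phi * H
```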

III-C HTSK-LLM-DKD Algorithm

Here, we summarize the learning algorithm of teacher model and student model in Algorithm 1. The proposed HTSK-LLM-DKD is given in Algorithm 2.

IV Experiments

Our HTSK-LLM-DKD is an interpretable model; to keep the comparison consistent with this interpretability, nine fuzzy classifiers are selected for comparative experiments with HTSK-LLM-DKD. In Section IV-A, we describe the experimental setups. In Section IV-B, we report the experimental results and analysis on the UCI datasets. In Section IV-C, a case study of HTSK-LLM-DKD on the Cleveland heart disease dataset is given in detail. Section IV-D discusses the effectiveness of decoupled knowledge distillation. In Section IV-E, we explain the interpretability of HTSK-LLM-DKD.

IV-A Experiment Setups and Performance Indicators

IV-A1 Datasets

Since we focus on classification tasks, in this experiment we randomly select sixteen widely used classification datasets from the UCI repository [35] and a real-world dataset, Cleveland heart disease. All the adopted datasets are normalized and the categorical features are converted into numerical features. Table S2 in the supplementary file describes all the adopted UCI datasets.

IV-A2 Comparative Methods

As stated in Section III, HTSK-LLM-DKD is a novel TSK fuzzy classifier obtained by distilling fuzzy dark knowledge from a High-order TSK fuzzy classifier to a Low-order TSK fuzzy classifier. Therefore, we adopt $n$-order TSK fuzzy classifiers ($n=0,1,2,3$) with several variants as comparative methods, i.e., TSK$_{v1}^{n}$ ($n$ is the order, $n=0,1,2,3$) uses LLM to solve the consequent part; TSK$_{v2}^{n}$ ($n$ is the order, $n=0,1,2,3$) takes the gradient descent approach to update the consequent parameters; LSSVFS$^{n}$ ($n$ is the order, $n=3$) utilizes the Least Squares Support Vector Machine (LSSVM) [36] with a third-order polynomial to solve the consequent part. On the other hand, we realize CNN-based distillation models in two versions, i.e., CNN-TSK-KD and CNN-TSK-DKD, which both take a CNN as the teacher model and a First-order TSK fuzzy classifier as the student model, employing the traditional distillation method (denoted as KD) and the decoupled knowledge distillation method proposed in this paper (denoted as DKD), respectively. In addition, in order to further evaluate the effectiveness of the decoupled knowledge distillation method in HTSK-LLM-DKD, we implement HTSK-LLM-KD as a comparative method, which uses the traditional distillation method in HTSK-LLM-DKD. Table S3 in the supplementary file describes all comparative methods.

IV-A3 Parameters setting

All methods are evaluated by ten-fold cross-validation on the adopted datasets, and all adjustable parameters are optimized using a grid search strategy. The number of fuzzy rules is searched from $K=\{1,2,\ldots,20\}$; the regularization parameter $L$ is set to $100$; the distillation parameters $\tau$, $\zeta$, $\lambda$ and $\varphi$ are searched from $\{1,2,5,10,20,100\}$; the maximum iteration $\theta$ is set to 30; the threshold parameter $\xi$ is set to $10^{-5}$; the learning rate $\eta$ is set to 0.01; other parameters are set to their default values.

IV-A4 Performance Indicators

To evaluate the classification performance of all the adopted methods, four commonly used performance indices are used in the experiments, i.e., Accuracy (Acc for short), Weighted-F (W-F for short), the running time, and the average number of fuzzy rules. Accuracy is the most intuitive performance metric, which gives the ratio of properly predicted samples to total samples. Weighted-F is a weighted average of Precision and Recall, which takes into account both false positives and false negatives. The best results are marked in bold; please note that '-' means that the adopted method cannot work out its results within 3 hours.

IV-B Experimental Results and Analysis on UCI Datasets

Table I shows the experimental results of the proposed HTSK-LLM-DKD and HTSK-LLM-KD compared with the nine adopted fuzzy classifiers. We draw the following conclusions:

  1. Our HTSK-LLM-DKD and HTSK-LLM-KD achieve the first and the second best performance on eleven out of sixteen datasets in terms of Accuracy and Weighted-F, respectively, which shows that knowledge distillation can improve the performance of the TSK fuzzy classifier very well and that HTSK-LLM-DKD is much better than HTSK-LLM-KD. Additionally, HTSK-LLM-DKD keeps comparable performance on the TIT, WIL, PHO, MAG and ADU datasets.

  2. Evidently, HTSK-LLM-DKD performs better with fewer fuzzy rules than TSK$_{v2}^{1}$, which is the student model of HTSK-LLM-DKD. At the same time, HTSK-LLM-DKD performs better with shorter running time than the High-order TSK fuzzy classifiers on most datasets. This shows that the fuzzy dark knowledge greatly improves the generalization ability of HTSK-LLM-DKD, and that the proposed LLM-DKD can speed up the running time of HTSK-LLM-DKD.

Table II displays the experimental results of HTSK-LLM-DKD, HTSK-LLM-KD, CNN-TSK-DKD and CNN-TSK-KD compared with their corresponding student models in terms of Accuracy and Weighted-F, respectively. We draw the following conclusions:

  1. Our HTSK-LLM-DKD achieves the best performance improvement, with 1.41% in Accuracy and 1.56% in Weighted-F, which indicates that knowledge distillation can effectively improve the performance of the student model by transferring fuzzy dark knowledge from the teacher model.

  2. The comparative experiments in Table II can be divided into two parts. The first part is the comparison between decoupled knowledge distillation and traditional knowledge distillation. It can be seen that HTSK-LLM-DKD and CNN-TSK-DKD perform better than HTSK-LLM-KD and CNN-TSK-KD. We believe the reason lies in the following: in traditional knowledge distillation, the transfer efficiency of non-target knowledge is negatively correlated with the confidence of the teacher model on the target class, and decoupled knowledge distillation can better transfer non-target knowledge by re-weighting, thus improving the generalization ability of the student model.

  3. The second part is the comparison between the CNN and the TSK fuzzy classifier as teacher models. We can observe that the average Accuracy and Weighted-F of HTSK-LLM-KD and HTSK-LLM-DKD are usually higher than those of CNN-TSK-KD and CNN-TSK-DKD, which indicates that the fuzzy dark knowledge transferred from the High-order TSK fuzzy classifier is more conducive to improving the performance of the Low-order TSK fuzzy classifier than that from the CNN.

TABLE I: Rule Number, Average Time, Average Accuracy (%), Average Weighted-F (%) and Standard Deviation (%) on UCI Datasets
Datasets TSK$_{v1}^{0}$ TSK$_{v1}^{1}$ TSK$_{v1}^{2}$ TSK$_{v1}^{3}$ TSK$_{v2}^{0}$ TSK$_{v2}^{1}$ TSK$_{v2}^{2}$ TSK$_{v2}^{3}$ LSSVFS$^{3}$ HTSK-LLM-KD HTSK-LLM-DKD
Acc±Std Acc±Std Acc±Std Acc±Std Acc±Std Acc±Std Acc±Std Acc±Std Acc±Std Acc±Std Acc±Std \bigstrut[t]
W-F±Std W-F±Std W-F±Std W-F±Std W-F±Std W-F±Std W-F±Std W-F±Std W-F±Std W-F±Std W-F±Std
Rules Time Rules Time Rules Time Rules Time Rules Time Rules Time Rules Time Rules Time Rules Time Rules Time Rules Time
IRI 82.59±4.05 85.33±4.23 83.33±4.67 80.06±4.89 90.44±4.72 97.08±3.46 96.80±3.49 96.33±4.68 85.40±4.25 98.00±3.22 98.66±2.81 \bigstrut[t]
83.28±4.04 85.48±4.93 83.87±4.95 81.12±4.74 90.31±4.94 96.96±3.70 96.39±4.07 95.84±5.42 86.15±4.52 98.06±3.16 98.56±3.01
18.1 0.0050 10.3 0.0063 2.6 0.0087 1.1 0.0133 17.4 0.1806 8.5 0.1421 6.2 0.1583 2.8 0.1825 2.4 0.5021 7.9 0.1462 6.3 0.1511 \bigstrut[b]
WIN 80.48±4.98 88.02±4.10 83.43±5.47 75.71±4.94 93.32±3.69 98.39±2.66 97.20±3.07 96.27±3.70 89.50±3.53 98.88±2.34 99.44±1.75 \bigstrut[t]
80.52±4.91 88.04±4.06 83.40±5.08 75.45±4.91 93.36±3.69 98.08±3.29 97.01±3.39 96.30±3.67 89.49±3.61 98.48±3.22 99.15±2.67
18.2 0.0049 4.4 0.0130 8.1 0.1034 7.8 0.5818 18.3 0.1656 9.4 0.1622 8.2 0.3519 4.6 1.6877 5 0.7188 8.5 0.1698 8.1 0.1532 \bigstrut[b]
TIT 79.05±3.85 79.08±3.82 78.87±4.02 79.12±3.75 77.82±3.62 77.90±3.75 78.82±3.69 79.00±3.85 78.46±3.99 78.41±4.09 78.60±3.87 \bigstrut[t]
76.00±4.85 76.02±4.72 75.71±4.89 76.21±4.67 75.20±4.44 75.42±4.25 75.42±4.02 75.35±4.42 75.39±4.38 75.34±4.91 75.41±4.50
16.3 0.0103 6.2 0.0118 4.2 0.0135 3.2 0.0424 14.7 0.8405 7.5 0.8386 6.8 0.9367 5.9 1.1095 2.4 93.4825 5.7 1.1072 4.8 0.9931 \bigstrut[b]
SEE 92.04±5.30 92.90±5.10 91.90±5.04 82.85±5.75 92.32±4.56 92.50±4.64 93.76±4.95 91.66±5.76 94.00±4.12 94.85±4.61 96.19±3.85 \bigstrut[t]
92.05±5.41 92.91±5.12 91.92±5.09 82.96±5.69 92.30±4.66 92.19±4.98 93.66±4.19 91.47±5.23 94.03±4.09 94.32±4.37 96.06±3.97
17.5 0.0039 6.9 0.0112 1 0.0100 1 0.0594 19.1 0.1724 8.4 0.1733 6.5 0.2075 3.8 0.3060 2.2 1.1360 8.9 0.2588 6.2 0.2152 \bigstrut[b]
ION 82.59±5.99 86.43±5.06 80.57±5.31 79.34±5.34 81.91±5.33 92.40±4.77 90.53±4.68 90.10±5.36 80.88±5.11 92.53±5.12 92.87±4.31 \bigstrut[t]
81.75±5.70 85.76±5.49 79.49±5.25 78.13±5.29 81.12±5.42 91.44±5.26 89.32±5.17 88.97±5.15 79.48±5.17 91.58±4.74 92.11±4.40
17.6 0.0152 2.5 0.0339 7.1 0.7432 6.5 20.3081 17.3 0.3414 13.4 0.3944 6.9 2.8994 4.6 44.3618 5.2 2.7954 8.8 0.3069 7.3 0.3158 \bigstrut[b]
WIL 95.05±1.09 97.89±0.81 97.54±0.94 97.99±0.79 94.60±1.03 95.36±0.95 96.94±1.52 97.18±1.11 97.46±0.70 95.44±1.11 95.57±0.98 \bigstrut[t]
94.10±1.77 97.74±0.93 97.44±1.06 97.83±0.88 93.98±1.51 94.45±1.56 96.36±2.02 96.72±2.15 97.18±0.79 94.04±1.73 94.16±1.40
19.1 0.0231 18.4 0.0758 8.8 0.1498 4.1 0.5598 17.3 1.5435 10.1 2.2452 8.7 2.8232 7.7 5.6222 1.6 438.6728 9.4 3.0981 9.3 3.6865 \bigstrut[b]
WIS 96.87±1.64 96.89±1.96 96.77±2.05 89.60±2.21 96.44±1.64 96.99±1.51 96.55±1.88 96.47±2.04 95.81±1.65 97.07±1.38 97.80±1.24 \bigstrut[t]
96.87±1.65 96.89±1.97 96.78±2.06 89.28±2.48 96.43±1.65 96.64±1.61 96.18±2.07 96.08±2.25 95.77±1.67 97.01±1.45 97.52±1.40
19.3 0.0235 7.3 0.0475 1 0.0539 1 0.4966 18.7 0.3595 10.7 0.3736 7.8 0.6556 2.9 1.4914 9.1 10.8646 7.6 0.3545 6.1 0.3974 \bigstrut[b]
QSA 77.80±4.88 86.21±3.19 79.30±4.30 78.57±5.57 77.88±4.42 88.02±2.62 86.72±2.59 87.00±3.19 79.15±4.72 88.03±2.06 88.05±3.33 \bigstrut[t]
77.00±5.29 86.15±3.23 79.19±4.06 78.56±5.43 77.19±4.81 86.51±2.66 85.05±2.69 85.36±3.33 77.39±5.23 86.51±2.08 86.53±3.41
19.3 0.0179 7.4 0.3647 6.7 6.7294 1 20.6744 19.2 0.8045 14.6 1.3554 8.8 14.9634 8.3 533.1391 6.5 22.6447 10.6 0.9893 8.7 0.7985 \bigstrut[b]
PHO 77.11±1.38 84.36±1.56 85.58±1.71 87.26±1.75 74.68±1.68 80.79±1.33 83.38±1.64 83.14±1.42 84.62±1.91 80.45±1.12 81.51±1.20 \bigstrut[t]
76.34±1.48 84.32±1.55 85.51±1.71 87.17±1.78 72.46±2.30 80.39±1.23 83.00±2.26 82.94±2.15 84.11±2.49 80.27±1.86 81.23±1.71
18.5 0.0267 17.9 0.0887 9.6 0.1892 9.2 3.3785 16.1 2.2788 15.1 2.9822 9.3 3.2500 6.5 5.4895 1.3 443.6744 14.3 2.4684 14.5 2.6239 \bigstrut[b]
SON 68.39±4.15 78.63±3.44 79.56±3.75 79.36±3.94 69.40±3.82 85.66±2.92 84.55±1.88 87.08±1.97 81.79±3.18 87.15±2.41 88.02±2.13 \bigstrut[t]
68.42±4.18 78.59±3.42 79.32±3.98 78.56±3.77 69.58±3.93 84.74±3.24 83.30±2.55 86.09±1.53 81.65±2.24 86.11±2.36 86.88±2.35
14.4 0.0114 13.8 0.1089 6.9 0.9521 5 42.6555 16.6 0.2687 10.2 0.2974 6.8 4.8772 5.5 208.4717 2.8 0.9349 9.8 0.3065 9.3 0.2782 \bigstrut[b]
SEG 92.49±3.90 99.54±0.50 99.43±0.50 98.53±0.81 86.51±2.67 99.61±0.43 99.52±0.43 99.56±0.53 99.35±0.42 99.64±0.50 99.69±0.45 \bigstrut[t]
91.20±3.41 99.52±0.46 99.40±0.58 98.66±0.74 84.31±2.61 99.13±0.97 99.01±0.88 99.11±1.10 98.65±0.88 99.19±1.03 99.40±0.86
18.6 0.0198 9.2 0.0610 1 0.1318 1 9.7983 16.5 1.1173 16.2 1.8406 8.1 5.3774 7.8 78.6618 7.9 107.0428 13.6 1.5924 8.1 1.1664 \bigstrut[b]
VOT 90.89±4.88 94.02±3.64 93.53±4.23 93.22±3.82 90.42±4.00 94.81±3.61 93.53±3.41 93.82±3.58 94.38±3.18 94.93±3.04 96.07±3.09 \bigstrut[t]
90.93±4.83 94.06±3.61 93.55±4.20 93.18±3.85 90.44±4.00 94.83±3.86 93.81±3.85 93.01±4.08 93.91±3.16 94.90±3.31 95.72±3.44
18.1 0.0142 6.5 0.0558 1 0.1160 3.1 1.2698 18.2 0.2776 9.6 0.2836 8.4 0.8319 4.9 7.0024 7.1 4.1051 4.1 0.3036 5.4 0.3140 \bigstrut[b]
CAR 79.35±3.12 88.67±2.64 88.13±2.72 88.80±2.51 82.61±2.35 90.84±2.20 90.75±2.22 90.89±2.34 89.79±1.41 90.92±2.47 90.96±1.83 \bigstrut[t]
78.79±3.16 88.86±2.54 88.28±2.58 88.99±2.37 81.08±3.61 90.33±2.55 90.79±2.04 90.84±2.41 89.23±1.52 90.65±2.51 90.66±1.93
17.7 0.0207 16.6 0.1723 2.1 0.9202 1 12.3888 17.9 1.1823 16.5 1.9752 8.1 6.4973 7.9 113.7304 8.2 87.4473 12.4 1.7951 9.3 1.3367 \bigstrut[b]
MAG 79.48±1.23 86.39±0.72 86.23±1.31 86.94±1.63 76.65±1.98 85.39±0.99 84.97±1.32 82.86±0.52 86.61±0.98 85.41±0.76 85.59±0.91 \bigstrut[t]
78.75±1.75 85.32±0.53 84.82±1.64 85.88±1.36 75.87±2.63 84.57±1.02 84.06±1.68 81.74±0.86 85.52±1.05 84.22±1.58 84.58±0.96
18.3 0.1075 15.6 0.4689 6.9 34.7426 2.7 184.4161 18.6 8.9985 17.3 13.8653 5.7 13.5862 4.3 60.2334 1 6870.9871 16.9 12.2797 16.3 12.8768 \bigstrut[b]
BRA 94.92±0.74 95.27±0.41 95.38±0.37 95.43±0.47 95.17±0.48 95.54±0.58 95.51±0.53 95.55±0.67 95.51±0.18 95.58±0.44 95.61±0.32 \bigstrut[t]
94.85±0.68 95.23±0.48 95.28±0.38 95.33±0.45 95.04±0.54 95.48±0.54 95.42±0.57 95.50±0.62 95.46±0.17 95.51±0.46 95.52±0.36
18.2 0.0653 13.3 0.1503 5.6 0.6842 2.9 0.7094 17.4 8.1934 10.7 8.5122 4.3 8.8453 3.7 9.7481 6.8 8.8453 11.7 8.5681 11.1 8.9657 \bigstrut[b]
ADU 80.32±1.57 84.55±0.78 84.95±0.68 85.28±0.84 78.99±1.54 84.36±0.84 - - - 84.44±0.83 84.46±0.87 \bigstrut[t]
77.25±2.58 83.58±0.87 84.16±0.82 84.48±0.95 74.24±2.82 83.92±0.74 83.99±0.87 84.04±1.83
19.6 0.2658 17.1 1.7375 7.2 7.8395 1.3 55.6825 19.3 22.8843 17.5 36.8652 16.4 36.5892 16.3 37.9834 \bigstrut[b]
TABLE II: Comparison of Accuracy (%) and Weighted-F (%) Improvements of HTSK-LLM-DKD and CNN-TSK-DKD Models on UCI Datasets
Datasets CNN-TSK-KD CNN-TSK-DKD HTSK-LLM-KD HTSK-LLM-DKD    \bigstrut[t]
Distillation Student Promotion Distillation Student Promotion Distillation Student Promotion Distillation Student Promotion \bigstrut[t]
Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F \bigstrut[t]
IRI 97.66/97.24 95.99/95.14 1.67/2.10 98.33/98.06 95.66/94.80 2.67/3.26 98.00/98.06 95.96/95.71 2.04/2.35 98.66/98.56 95.66/94.98 3.00/3.58 \bigstrut[t]
WIN 99.18/98.85 97.62/97.21 1.56/1.64 99.24/98.93 97.54/97.15 1.70/1.78 98.88/98.48 97.73/97.31 1.15/1.17 99.44/99.15 97.63/97.33 1.81/1.82 \bigstrut[t]
TIT 78.23/75.20 77.41/74.12 0.82/1.08 78.28/75.10 77.28/73.51 1.00/1.59 78.41/75.34 77.50/73.84 0.91/1.50 78.60/75.41 77.60/73.82 1.00/1.59 \bigstrut[t]
SEE 93.90/93.79 92.00/91.42 1.90/2.37 95.76/95.61 91.00/90.50 4.76/5.11 94.85/94.32 91.99/91.10 2.86/3.22 96.19/96.06 91.42/91.02 4.77/5.04 \bigstrut[t]
ION 92.62/91.89 91.89/91.05 0.73/0.84 92.78/91.95 91.35/90.37 1.43/1.58 92.53/91.58 91.97/90.95 0.56/0.63 92.87/92.11 91.73/90.78 1.14/1.33 \bigstrut[t]
WIL 95.52/93.89 93.79/91.83 1.73/2.06 95.78/94.36 93.82/91.99 1.96/2.37 95.44/93.46 93.86/91.75 1.58/1.71 95.57/94.16 93.80/91.92 1.77/2.24 \bigstrut[t]
WIS 96.95/96.74 96.53/96.26 0.42/0.48 97.35/97.06 96.48/96.11 0.87/0.95 97.07/97.01 96.63/96.50 0.44/0.51 97.80/97.52 96.92/96.53 0.88/0.99 \bigstrut[t]
QSA 87.79/86.27 87.03/85.30 0.76/0.97 87.94/86.41 87.09/85.32 0.85/1.09 88.03/86.51 87.17/85.51 0.86/1.00 88.05/86.53 87.10/85.41 0.95/1.12 \bigstrut[t]
PHO 80.32/80.15 79.98/79.79 0.34/0.36 81.16/80.96 79.59/79.24 1.57/1.72 80.45/80.27 79.89/79.66 0.56/0.61 81.51/81.23 79.73/79.36 1.78/1.87 \bigstrut[t]
SON 87.23/86.33 85.18/84.24 2.05/2.09 88.08/86.97 85.13/83.99 2.95/2.98 87.15/86.11 85.19/84.23 1.96/1.88 88.02/86.88 85.12/83.96 2.90/2.92 \bigstrut[t]
SEG 99.63/99.17 99.56/99.05 0.07/0.12 99.67/99.37 99.37/98.96 0.30/0.41 99.64/99.19 99.55/99.04 0.09/0.15 99.69/99.40 99.30/98.92 0.39/0.48 \bigstrut[t]
VOT 94.85/94.65 94.45/94.44 0.40/0.21 95.89/95.54 94.59/94.52 1.30/1.02 94.93/94.90 94.41/94.67 0.52/0.23 96.07/95.72 94.54/94.52 1.53/1.20 \bigstrut[t]
CAR 90.88/90.56 90.66/90.26 0.22/0.30 90.93/90.66 90.65/90.26 0.28/0.40 90.92/90.65 90.68/90.31 0.24/0.34 90.96/90.66 90.68/90.30 0.28/0.36 \bigstrut[t]
MAG 84.75/82.22 84.68/82.11 0.07/0.11 85.50/84.46 85.31/84.25 0.19/0.21 85.41/84.22 85.36/84.15 0.05/0.07 85.59/84.58 85.39/84.37 0.20/0.21 \bigstrut[t]
BRA 95.56/95.49 95.49/95.42 0.07/0.07 95.59/95.52 95.51/95.44 0.08/0.08 95.58/95.51 95.51/95.43 0.07/0.08 95.61/95.52 95.53/95.44 0.08/0.08 \bigstrut[t]
ADU 84.39/83.90 84.30/83.81 0.09/0.09 84.43/83.96 84.31/83.83 0.12/0.13 84.44/83.99 84.33/83.87 0.11/0.12 84.46/84.04 84.31/83.87 0.15/0.17 \bigstrut[t]
Average 91.21/90.39 90.41/89.46 0.80/0.93 91.66/90.93 90.29/89.39 1.37/1.54 91.35/90.60 90.48/89.62 0.87/0.97 91.81/91.09 90.40/89.53 1.41/1.56 \bigstrut[t]
TABLE III: Average Accuracy (%), Weighted-F (%) with Corresponding Standard Deviation (%), Rule Number and Running Time on CLE
Methods Acc±Std W-F±Std Rules Time
TSK$_{v1}^{0}$ 70.28±2.65 69.87±3.25 16.6 0.0079
TSK$_{v1}^{1}$ 73.28±3.72 73.54±3.38 9.3 0.0472
TSK$_{v1}^{2}$ 70.98±3.63 70.96±3.83 1 0.0168
TSK$_{v1}^{3}$ 70.25±4.68 70.24±4.34 1.2 0.2073
TSK$_{v2}^{0}$ 73.68±3.77 72.38±3.24 17.5 0.2475
TSK$_{v2}^{1}$ 75.85±3.67 75.68±3.35 10.6 0.3275
TSK$_{v2}^{2}$ 71.35±2.64 71.12±3.36 2.6 1.7883
TSK$_{v2}^{3}$ 71.75±2.35 71.83±2.57 1 1.0838
LSSVFS$^{3}$ 69.68±4.35 69.38±4.79 6.6 2.1436
HTSK-LLM-KD 78.63±3.83 78.52±3.96 8.6 0.2782
HTSK-LLM-DKD 79.10±3.52 79.08±3.35 8.1 0.3168
TABLE IV: Comparison of The Accuracy (%) and Weighted-F (%) Improvement of HTSK-LLM-DKD and CNN-TSK-DKD Models on CLE
Methods Accuracy/Weighted-F
 Distillation Student Promotion
CNN-TSK-KD 78.56/78.50 75.19/75.15 3.37/3.35
CNN-TSK-DKD 78.80/78.69 75.35/75.25 3.45/3.44
HTSK-LLM-KD 78.63/78.52 75.21/75.12 3.42/3.40
HTSK-LLM-DKD 79.10/79.08 75.60/75.62 3.50/3.46

From the experimental results above, we can conclude that knowledge distillation can improve the generalization ability of the model. Next, we explain how knowledge distillation affects the performance of the student model on the real-world Cleveland heart disease dataset.

IV-C Experimental Results and Analysis on Cleveland Heart Disease Dataset

We use a real-world dataset, Cleveland heart disease (CLE), to demonstrate the performance of our model in detail. CLE concerns heart disease in the city of Cleveland, USA, and was built by the University Hospital Zurich and the Cleveland Clinic Foundation. CLE is composed of 303 instances with 13 features, including chest pain type, resting blood pressure, resting electrocardiographic results, etc., and the output indicates the patient's risk of cardiac disease, an integer ranging from 0 (not present) to 4: 0 for nil risk; 1 for low risk; 2 for potential risk; 3 for high risk; 4 for very high risk [42]. In order to clearly demonstrate the effect of knowledge distillation on the proposed model, we divide CLE into three classes: 0 for zero-risk (nil risk), 1 for low-risk (low risk, potential risk and high risk) and 2 for high-risk (very high risk). The experiments in Table III show that HTSK-LLM-DKD and HTSK-LLM-KD defeat the other comparative methods and obtain the first and second best performance on CLE, respectively. In addition, HTSK-LLM-DKD and HTSK-LLM-KD require fewer fuzzy rules (8.1 & 8.6) and less running time (0.3168 & 0.2782), which shows once again that the fuzzy dark knowledge greatly improves the generalization ability of the model. In Table IV, HTSK-LLM-DKD achieves the best performance improvement, with 3.50% in Accuracy and 3.46% in Weighted-F, better than HTSK-LLM-KD with 3.42% in Accuracy and 3.40% in Weighted-F, illustrating once again the performance improvement brought by decoupled knowledge distillation, in accordance with the conclusions obtained in Section IV-B.

Figure 3: The effects of distillation parameters on Cleveland heart disease dataset.

IV-D Effectiveness of Decoupled Knowledge Distillation

Fig. 3 shows the effects of the distillation parameters of the proposed HTSK-LLM-DKD on the CLE dataset. All distillation parameters are selected from $\{1,2,5,10,20,100\}$, and it can be clearly seen that, for most parameters, the classification performance first improves and then degrades. The distillation temperature $\tau$ has an important effect on the softness of the class labels. Fig. 3(a) shows that a distillation temperature $\tau$ in the range $\{2,20\}$ may be a better choice. It means that a lower temperature cannot effectively distill the similar information between classes, while a higher temperature destroys the predictions of the various classes, so a temperature in the middle is the most appropriate. $\zeta$ and $\lambda$ give the proportions of target class knowledge and non-target class knowledge, respectively, as shown in Figs. 3(b) & (c). We use $\lambda/\zeta$ to show their ratio. Fig. 3(d) reveals that $\lambda/\zeta$ within $\{2,5\}$ may be a better option, which means that non-target class knowledge can better improve the efficiency of knowledge distillation, thus improving the performance of the model, but too much non-target class knowledge will also reduce the classification performance of the model, because the target class knowledge transfers the fuzzy dark knowledge about the sample difficulty. $\varphi$ in Fig. 3(e) gives the proportion of the knowledge contained in the ground-truth label; we use $(\lambda+\zeta)/\varphi$ to show the ratio of fuzzy dark knowledge to the knowledge contained in the ground-truth label. Fig. 3(f) displays that $(\lambda+\zeta)/\varphi$ within $\{0.3,3\}$ may be a better choice, which means that an appropriate amount of fuzzy dark knowledge can improve the classification performance. We can conclude that when too little fuzzy dark knowledge is transferred, the student model cannot be effectively guided by the teacher model, and when too much fuzzy dark knowledge is transferred, the student model will be misled by the mistakes made by the teacher model.

TABLE V: Fuzzy Rule of HTSK-LLM-DKD on CLE Dataset
Fuzzy Rule of HTSK-LLM-DKD
Rule 1:
IF: the 1st feature is very high, and
the 2nd feature is medium, and
……, and
the 13th feature is very high.
THEN: the 1st output is 0.3429+0.3453$x_{1}$+…+0.1834$x_{13}$ = 0.1201,
the 2nd output is 0.6039+0.2608$x_{1}$+…+0.5618$x_{13}$ = 0.0863,
the 3rd output is -0.3570-0.0543$x_{1}$+…-0.4343$x_{13}$ = -0.1403.
Rule 2:
IF: the 1st feature is very high, and
the 2nd feature is very low, and
……, and
the 13th feature is low.
THEN: the 1st output is 0.4320+0.1222$x_{1}$+…-0.1354$x_{13}$ = 1.6256,
the 2nd output is 0.4737-0.1269$x_{1}$+…+0.1912$x_{13}$ = -5.2402,
the 3rd output is -0.5363+0.4600$x_{1}$+…+0.0088$x_{13}$ = 2.3668.
Rule 3:
IF: the 1st feature is very low, and
the 2nd feature is very low, and
……, and
the 13th feature is medium.
THEN: the 1st output is 0.8216+0.1555$x_{1}$+…+0.6766$x_{13}$ = -0.7301,
the 2nd output is 0.1020+0.1342$x_{1}$+…+0.1751$x_{13}$ = -0.1926,
the 3rd output is -0.5423-0.1844$x_{1}$+…-1.1100$x_{13}$ = 1.4580.

IV-E Interpretability of HTSK-LLM-DKD on Cleveland Heart Disease Dataset

The benefit of fuzzy classifiers is that they may be articulated in terms of linguistic explanations. Here, we choose the trial with the best Accuracy on CLE among all experimental trials and list in Table V the corresponding antecedent parts and consequent parts of all fuzzy rules. Due to space limitations, we only present three features of the CLE dataset, i.e., the first, the second, and the last (13th) features of all three fuzzy rules. For a randomly given datum $\mathbf{x}=(-0.2730,-0.6799,\ldots,-16.4620)^{T}$, we can observe from Table V that each fuzzy set $A_{i}^{k}$ for rule $k$ and feature $i$ in the IF-part of HTSK-LLM-DKD can be interpreted with a possible linguistic phrase. In the THEN-part of HTSK-LLM-DKD, the absolute values of the outputs in Rules 2 & 3 are much larger than those of Rule 1, which means that Rules 2 & 3 play a more important role in HTSK-LLM-DKD. In Rules 2 & 3, the 3rd output value is obviously the largest; after comprehensive calculation, we can conclude that the final prediction of HTSK-LLM-DKD is the 3rd class, which means high-risk.

V Conclusion

In this study, we mainly focus on how to integrate knowledge distillation and TSK fuzzy classifiers at a deeper level. Therefore, we propose a novel TSK fuzzy classifier based on decoupling knowledge distillation, denoted as HTSK-LLM-DKD, which transfers fuzzy dark knowledge from a High-order TSK fuzzy classifier to a Low-order TSK fuzzy classifier. HTSK-LLM-DKD uses LLM-DKD to decouple the fuzzy dark knowledge from the High-order TSK fuzzy classifier into target class knowledge and non-target class knowledge, and then transfers them to the Low-order TSK fuzzy classifier more efficiently, obtaining better classification performance with high interpretability. Experimental results on benchmarking datasets and the real-world Cleveland heart disease dataset demonstrate its effectiveness.

HTSK-LLM-DKD still has aspects that deserve further study. First, we will conduct in-depth research on practical applications, including epilepsy detection and movement prediction. Second, other state-of-the-art knowledge distillation methods can be employed to distill fuzzy dark knowledge into TSK fuzzy classifiers, which will also be a focus of our future research.
