
Fuzzy Knowledge Distillation from
High-Order TSK to Low-Order TSK

Xiongtao Zhang, Zezong Yin, Yunliang Jiang, Yizhang Jiang, Danfeng Sun and Yong Liu. This work was supported in part by the National Natural Science Foundation of China (U22A20102, 62171203), in part by the "Pioneer" and "Leading Goose" R&D Program of Zhejiang Province (2023C01150), and in part by the open project fund of the Key Laboratory of Image Processing and Intelligent Control (Huazhong University of Science and Technology), Ministry of Education (Corresponding author: Yunliang Jiang). Xiongtao Zhang and Zezong Yin are with the Zhejiang Province Key Laboratory of Smart Management & Application of Modern Agricultural Resources, Huzhou University, Huzhou 313000, China, and also with the School of Information Engineering, Huzhou University, Huzhou 313000, China. Yunliang Jiang is with the School of Computer Science and Technology, Zhejiang Normal University, Jinhua 321004, China, and also with the School of Information Engineering, Huzhou University, Huzhou 313000, China. Yizhang Jiang is with the School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China, and also with the Key Laboratory of Image Processing and Intelligent Control (Huazhong University of Science and Technology), Ministry of Education, Wuhan 430074, China. Danfeng Sun is with the School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China. Yong Liu is with the College of Control Science and Engineering, Zhejiang University, Hangzhou 310007, China.
Abstract

High-order Takagi-Sugeno-Kang (TSK) fuzzy classifiers possess powerful classification performance with fewer fuzzy rules, but they are impaired by exponentially growing training time and poorer interpretability owing to the High-order polynomial used in the consequent part of each fuzzy rule, while Low-order TSK fuzzy classifiers run quickly with high interpretability, yet they usually require more fuzzy rules and perform relatively poorly. To address this issue, a novel TSK fuzzy classifier embedded with knowledge distillation in deep learning, called HTSK-LLM-DKD, is proposed in this study. HTSK-LLM-DKD has the following distinctive characteristics: 1) It takes a High-order TSK fuzzy classifier as the teacher model and a Low-order TSK fuzzy classifier as the student model, and leverages the proposed LLM-DKD (Least Learning Machine based Decoupling Knowledge Distillation) to distill fuzzy dark knowledge from the High-order TSK fuzzy classifier to the Low-order TSK fuzzy classifier, endowing the Low-order TSK fuzzy classifier with enhanced performance surpassing, or at least comparable to, the High-order TSK fuzzy classifier, as well as high interpretability; 2) The Negative Euclidean distance between the output of the teacher model and each class is employed to obtain the teacher logits, and the teacher/student soft labels are then computed by the softmax function with a distillation temperature parameter; 3) By reformulating the Kullback-Leibler divergence, it decouples fuzzy dark knowledge into target class knowledge and non-target class knowledge, and transfers them to the student model. The advantages of HTSK-LLM-DKD are verified on benchmarking UCI datasets and the real-world Cleveland heart disease dataset, in terms of classification performance and model interpretability.

Index Terms:
Deep learning, Fuzzy dark knowledge, High-order Takagi-Sugeno-Kang (TSK) fuzzy classifier, Knowledge distillation, Least Learning Machine.

I Introduction

Takagi-Sugeno-Kang (TSK) fuzzy classifiers are among the most famous fuzzy classifiers; each fuzzy rule consists of an antecedent part and a consequent part. The antecedent part divides the input space into several fuzzy areas, and the consequent part describes the logic of the classifier in those areas. TSK fuzzy classifiers have been deeply combined with deep learning [1, 2, 3], and successfully applied in many fields, including epileptic seizure detection [4, 5, 6], subway fare pricing [7], and vehicle path planning [8].

Among the wide variety of TSK fuzzy classifiers, Zero-order [9] and First-order TSK fuzzy classifiers [10] have attracted the most widespread attention, owing to their high interpretability and fast training speed. However, they are prone to rule explosion [40], which means the number of fuzzy rules grows rapidly when enhanced classification performance is desired. To address this issue, many methods have been developed, which can be classified into three categories:

  1. Hierarchical TSK fuzzy classifiers organize each layer in a stacked way based on the stacked generalization principle [11] to obtain improved classification performance. Xiongtao Zhang et al. [3] proposed a novel hierarchical structure of TSK fuzzy subclassifiers called EP-TSK-FK, which quickly builds subclassifiers in parallel to obtain augmented validation data, and then obtains predictions by fuzzy clustering and KNN. Wang et al. [40] invented a hierarchical TSK fuzzy classifier with shared linguistic fuzzy rules by opening the manifold structure of the original input space, in which the input of each layer includes the output of the previous layer in addition to the original data.

  2. Deep learning based TSK fuzzy classifiers integrate deep learning strategies with the TSK fuzzy classifier. Wu et al. [12] extended Dropout to fuzzy rules, where the model drops some rules at random during training and uses all rules during testing. Cui et al. [13] avoided the excessive influence of strong rules by adjusting the membership degrees of fuzzy rules with normalization.

  3. High-order TSK fuzzy classifiers use a High-order polynomial in the consequent part to escape rule explosion and achieve better classification performance with fewer rules. Ren et al. [41] built a High-order type-2 TSK system, in which the membership functions in the antecedent part are type-2 fuzzy sets and the consequent part is a High-order polynomial function. However, the High-order TSK fuzzy classifier still has a serious shortcoming: its interpretability suffers substantial damage since the parameters of the High-order polynomial in the consequent part are overly complex.

In deep learning, a large model usually achieves strong performance owing to the huge amount of computation used to extract structure from data, but its complex parameters make it difficult to apply in various real-world fields. In contrast, a small model is easy to deploy but does not perform very well in practice. Inspired by this, Hinton et al. [14] proposed the model compression technique known as knowledge distillation, which compresses a student model (usually a small or simple model) from a teacher model (usually a large or complex model); it guides the student model by transferring dark knowledge, i.e., by minimizing the Kullback-Leibler divergence (KL divergence) between the prediction logits of the teacher model and the student model, thus improving the performance of the student model. As research continues, the directions of knowledge distillation are growing increasingly broad. Various forms of knowledge and distillation methods have been proposed [15, 16, 17, 18, 19, 20, 21]. Knowledge distillation is also widely applied to object recognition [22, 23, 24], defect detection [25, 26] and other fields [27, 28, 29]. Recent studies have found that knowledge distillation can greatly improve the performance of TSK fuzzy classifiers. Gu et al. [30] transferred knowledge from a CNN to a TSK fuzzy classifier and then explained how the TSK fuzzy classifier made decisions. Erdem et al. [31] used a CNN to distill an interval type-2 fuzzy classifier, which improved the classification performance on large datasets. Our previous study [1] built a born-again TSK fuzzy classifier with a CNN, which took dark knowledge from the CNN as the parameters of the antecedent part and consequent part of the TSK fuzzy classifier, respectively, and then expressed the dark knowledge in an interpretable manner.

As is well known, the High-order TSK fuzzy classifier exhibits outstanding classification performance because of its powerful fitting ability, while Low-order TSK fuzzy classifiers have demonstrated strength in concise interpretability due to their interpretable consequent parts. In this paper, we extend our previous study [1] from binary classification only to multi-class classification by born-again training of a TSK fuzzy classifier with fuzzy knowledge distillation, and propose a novel TSK fuzzy classifier called HTSK-LLM-DKD, which decouples and transfers fuzzy dark knowledge from a High-order TSK fuzzy classifier (teacher model) to a Low-order TSK fuzzy classifier (student model), obtaining better classification performance as well as high interpretability. Our study advances the combination of knowledge distillation and TSK fuzzy classifiers to a deeper level. The contributions of this study can be summarized as follows:

  1. HTSK-LLM-DKD proposes a novel Least Learning Machine based decoupling knowledge distillation, denoted as LLM-DKD. Compared with the gradient descent approach, HTSK-LLM-DKD can quickly solve the consequent part of the teacher model by the Least Learning Machine (LLM) [2]. Furthermore, LLM-DKD uses the Negative Euclidean distance between the output and each class to obtain the logits. According to [32], the logits in LLM-DKD can represent more comprehensive class information and transfer fuzzy dark knowledge better, with a higher semantic level.

  2. HTSK-LLM-DKD employs the softmax function with a distillation temperature parameter to obtain the soft labels of the teacher model and the student model. By reformulating the Kullback-Leibler divergence (KL divergence), HTSK-LLM-DKD decouples fuzzy dark knowledge into target class knowledge and non-target class knowledge, providing a more flexible and efficient logits distillation perspective, so as to further improve the performance of the TSK fuzzy classifier.

  3. Experimental results on benchmarking datasets and the real-world Cleveland heart disease dataset demonstrate the effectiveness of the proposed HTSK-LLM-DKD. In terms of classification performance, HTSK-LLM-DKD achieves the best performance on most datasets and performs better than the High-order TSK fuzzy classifier, showing powerful generalization ability; in terms of interpretability, HTSK-LLM-DKD is inherently comparable with highly interpretable Low-order TSK fuzzy classifiers.

Table S1 in the supplementary file summarizes the notations used in our study.

II Related Work

II-A TSK fuzzy classifier

TSK fuzzy classifier can be described by fuzzy IF-THEN rules, which indicate the input-output relationship of the classifier. For TSK fuzzy classifiers, the fuzzy rules can be expressed as follows:

$$
\begin{aligned}
&{\rm IF}\ x_{1}\ {\rm is}\ A_{1}^{k}\ \wedge\ x_{2}\ {\rm is}\ A_{2}^{k}\ \wedge\ \ldots\ \wedge\ x_{m}\ {\rm is}\ A_{m}^{k}\\
&{\rm THEN}\ y^{k}=f^{k}(\mathbf{x})
\end{aligned}
\tag{1a}
$$

where $k=1,2,\ldots,K$, $K$ is the total number of fuzzy rules, the input is denoted as $\mathbf{x}=(x_{1},x_{2},\ldots,x_{i},\ldots,x_{m})^{T}$, $A_{i}^{k}$ is the fuzzy set of the $k$-th rule for $x_{i}$, $y^{k}$ is the output of the $k$-th rule, and $\wedge$ is the fuzzy conjunction operation. If $f^{k}(\mathbf{x})=p_{0}^{k}$, we term the TSK fuzzy classifier as the Zero-order TSK fuzzy classifier [9]. If $f^{k}(\mathbf{x})=p_{0}^{k}+x_{1}p_{1}^{k}+x_{2}p_{2}^{k}+\ldots+x_{m}p_{m}^{k}$, the TSK fuzzy classifier is termed as the First-order TSK fuzzy classifier [10]. If $f^{k}(\mathbf{x})$ is described as:

$$f^{k}(\mathbf{x})=\sum_{\substack{j_{1}+j_{2}+\ldots+j_{m}\leq n\\ j_{1},j_{2},\ldots,j_{m}\geq 0}}a_{j_{1},j_{2},\ldots,j_{m}}^{k}\,x_{1}^{j_{1}}x_{2}^{j_{2}}\ldots x_{m}^{j_{m}} \tag{1b}$$

it is termed as the High-order TSK fuzzy classifier [11], where $n$ ($n\geq 2$) is the order of the highest polynomial of the TSK fuzzy classifier, $j_{i}$ is the order of $x_{i}$, and $a_{j_{1},j_{2},\ldots,j_{m}}^{k}$ represents the coefficient of the highest polynomial with $m$ independent variables in the linear combination constituting the $k$-th rule.
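For concreteness, the following minimal sketch (Python/NumPy; not part of the original formulation) evaluates the consequent of a single rule for the Zero-order, First-order, and High-order cases; the flat coefficient layout used for the High-order case is a hypothetical choice for illustration only.

```python
import numpy as np
from itertools import combinations_with_replacement

def zero_order_consequent(x, p0):
    # Zero-order: f^k(x) = p_0^k, a constant per rule.
    return p0

def first_order_consequent(x, p):
    # First-order: f^k(x) = p_0^k + p_1^k x_1 + ... + p_m^k x_m, with p of length m+1.
    return p[0] + float(np.dot(p[1:], x))

def high_order_consequent(x, coeffs, n):
    # High-order: sum of all monomials x_1^{j_1}...x_m^{j_m} with j_1+...+j_m <= n, as in (1b).
    # coeffs is assumed to be a flat list with one entry per monomial, in enumeration order.
    m = len(x)
    value, idx = 0.0, 0
    for degree in range(n + 1):
        for combo in combinations_with_replacement(range(m), degree):
            monomial = float(np.prod([x[i] for i in combo])) if combo else 1.0
            value += coeffs[idx] * monomial
            idx += 1
    return value
```

For $m$ inputs and order $n$, the number of coefficients per rule is $\binom{m+n}{n}$, which is exactly what gives the High-order consequent its fitting power and costs it its interpretability.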

Clearly, the consequent parameters of the Zero-order and First-order TSK fuzzy classifiers are relatively simple; they are the coefficients of the fuzzy membership degree and the input variables, which ensures high interpretability but requires more fuzzy rules to achieve satisfactory performance. In contrast, the High-order TSK fuzzy classifier can perform better with fewer fuzzy rules, but the consequent part of its fuzzy rules is extremely complex, which sharply reduces interpretability. As a result, how to combine the strong performance of the High-order TSK fuzzy classifier with the high interpretability of the Low-order TSK fuzzy classifier becomes a very valuable research direction.

II-B Knowledge Distillation

Knowledge distillation [14] transfers dark knowledge from a complex/large model, namely the teacher model, to a simple/small model, namely the student model, and hence improves the performance of the student model. Specifically, knowledge distillation first introduces the hyper-parameter temperature into the softmax function to obtain soft labels, then calculates the KL divergence between the soft labels and the cross-entropy between the student model output and the ground-truth label, and finally transfers the dark knowledge via the soft labels from the teacher model to the student model, improving the performance of the student model, as shown in Fig. 1.

There are many strategies for constructing the loss function in knowledge distillation, such as the KL divergence [14], the mean squared error [37] and the Jensen-Shannon divergence [38]. Traditional knowledge distillation transfers dark knowledge in a highly coupled way, which limits the flexibility of knowledge transfer. Zhao et al. [32] pointed out that dark knowledge can be decoupled into target class knowledge and non-target class knowledge and transferred to the student model in a more flexible way by reconstructing the KL divergence. Furlanello et al. [33] demonstrated that the decoupled dark knowledge of the teacher model can guide the student model to achieve stronger generalization ability than the teacher model. In this paper, we attempt to distill fuzzy dark knowledge from the High-order TSK fuzzy classifier, and propose a novel born-again TSK fuzzy classifier endowed with powerful classification performance as well as high interpretability.
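For reference, and as a baseline for the decoupled variant developed in Section III, the sketch below illustrates the classical coupled KD loss for a single sample; the temperature value and the balancing weight `alpha` are generic illustrative choices rather than settings taken from this paper.

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    z -= z.max()                                  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def coupled_kd_loss(teacher_logits, student_logits, label_onehot, tau=2.0, alpha=0.5):
    # Soft labels of the teacher and the student at temperature tau.
    u_teacher = softmax(teacher_logits, tau)
    u_student = softmax(student_logits, tau)
    # Coupled dark-knowledge term: KL(teacher soft labels || student soft labels).
    kl = float(np.sum(u_teacher * np.log(u_teacher / u_student)))
    # Hard-label cross-entropy of the student prediction (temperature 1).
    ce = float(-np.sum(label_onehot * np.log(softmax(student_logits))))
    # alpha balances the two terms; a common tau^2 rescaling of the KL term is omitted here.
    return alpha * kl + (1.0 - alpha) * ce
```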

Figure 1: Architecture of knowledge distillation. Taking Convolutional Neural Network (CNN) as teacher model and Fully Connected Neural Network (FCNN) as student model for example.

III HTSK-LLM-DKD

In this section, we propose a novel TSK fuzzy classifier called HTSK-LLM-DKD, which distills knowledge from a High-order TSK fuzzy classifier (teacher model) to a Low-order TSK fuzzy classifier (student model) using the proposed LLM-DKD. Specifically, LLM-DKD first trains the teacher model quickly and takes the Negative Euclidean distance between the output value and each class to obtain the teacher logits. Then, it uses the softmax function with temperature parameter $\tau$ to obtain the soft labels of the teacher model and the student model, respectively. Finally, LLM-DKD decouples fuzzy dark knowledge into target class knowledge and non-target class knowledge, and transfers them to the student model efficiently, so as to obtain better classification performance as well as high interpretability. The overall architecture of HTSK-LLM-DKD is shown in Fig. 2.

Figure 2: Architecture of HTSK-LLM-DKD. We distill fuzzy dark knowledge from the High-order TSK fuzzy classifier to the Low-order TSK fuzzy classifier; specifically, the KL divergence is decoupled into the Target Class KL Loss (TCKL) and the Non-Target Class KL Loss (NCKL), and the proposed HTSK-LLM-DKD is optimized with TCKL, NCKL and the Cross-Entropy (H) as the loss function.

III-A Specific Architecture of Teacher Model and Student Model

III-A1 Constructing Process of Teacher Model

Based on the definition of the High-order polynomial, we prove that a High-order polynomial can be composed of several Low-order polynomials; the mathematical proof is presented in the supplementary file. In this paper, a Third-order TSK fuzzy classifier is employed as the teacher model, which is built by stacking multiple Zero-order TSK fuzzy classifiers [34], i.e., the consequent part of the teacher model can be expressed as the superposition of the consequent parts of several Low-order TSK fuzzy classifiers:

$$
\begin{aligned}
y=\ & y_{0}+x_{1}\left(y_{0}^{(1,0)}+x_{1}y_{1}^{(1,1)}+x_{2}y_{1}^{(1,2)}+\ldots+x_{m}y_{1}^{(1,m)}\right)\\
&+x_{2}\left(y_{0}^{(2,0)}+x_{1}y_{1}^{(2,1)}+x_{2}y_{1}^{(2,2)}+\ldots+x_{m}y_{1}^{(2,m)}\right)\\
&+\ldots+x_{m}\left(y_{0}^{(m,0)}+x_{1}y_{1}^{(m,1)}+x_{2}y_{1}^{(m,2)}+\ldots+x_{m}y_{1}^{(m,m)}\right)
\end{aligned}
\tag{1c}
$$

where $y_{n}^{(i,j)}$ represents the output of the $j$-th $n$-order TSK fuzzy classifier, which is used to construct the $i$-th $(n+1)$-order TSK fuzzy classifier. Specifically, $y_{0}$ indicates the output of the Zero-order TSK fuzzy classifier.

III-A2 Calculation Process of Teacher Model and Student Model

The output in (1c) is usually calculated by weighted summation:

$$y=\sum_{k=1}^{K}\frac{\mu^{k}(\mathbf{x})}{\sum_{k^{\prime}=1}^{K}\mu^{k^{\prime}}(\mathbf{x})}\,y^{k}=\sum_{k=1}^{K}\tilde{\mu}^{k}(\mathbf{x})\,y^{k} \tag{2a}$$

where $\mu^{k}(\mathbf{x})$ and $\tilde{\mu}^{k}(\mathbf{x})$ are the membership degree and the normalized membership degree of the input $\mathbf{x}$ in the $k$-th fuzzy rule, respectively. The Gaussian membership function is widely used to calculate the fuzzy membership degree:

$$\mu^{k}(\mathbf{x})=\prod_{i=1}^{m}\mu^{k}(x_{i}),\qquad \mu^{k}(x_{i})=\exp\left(\frac{-\left(x_{i}-v_{i}^{k}\right)^{2}}{2\delta_{i}^{k}}\right) \tag{2b}$$

where $\mu^{k}(x_{i})$ is the membership degree of $x_{i}$, $v_{i}^{k}$ is the center of each fuzzy rule, which is randomly selected from $\{0,0.25,0.5,0.75,1\}$ and may have a linguistic explanation: {very low, low, medium, high, very high}, thus ensuring the interpretable antecedent part of the proposed HTSK-LLM-DKD, and $\delta_{i}^{k}$ is the kernel width, which is set to a positive value.
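A minimal sketch of (2a)-(2b) in Python/NumPy, computing the normalized firing strengths of $K$ rules for one input; the example centers and widths at the bottom are arbitrary illustrative values.

```python
import numpy as np

def firing_strengths(x, centers, widths):
    """Normalized rule firing strengths tilde{mu}^k(x) of (2a) for one input x (shape (m,)).
    centers and widths have shape (K, m); widths are the positive delta_i^k of (2b)."""
    mu = np.exp(-((x[None, :] - centers) ** 2) / (2.0 * widths))  # per-feature Gaussian memberships
    rule_mu = mu.prod(axis=1)                                     # product conjunction -> mu^k(x)
    return rule_mu / rule_mu.sum()                                # normalization of (2a)

# Example: K = 3 rules, m = 2 features, centers drawn from {0, 0.25, 0.5, 0.75, 1}.
rng = np.random.default_rng(0)
centers = rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=(3, 2))
widths = np.full((3, 2), 0.5)
print(firing_strengths(np.array([0.3, 0.8]), centers, widths))
```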

Several mathematical transformations from the fuzzy rules to a linear function exist, as follows.

$$\mathbf{x}_{e}=\left(1,\mathbf{x}^{T}\right)^{T},\qquad \mathbf{x}_{f}=\left(1,x_{i}\mathbf{x}_{e}^{T}\right)^{T},\qquad \mathbf{x}_{v}=\left(1,x_{i}\mathbf{x}_{f}^{T}\right)^{T} \tag{3a}$$
$$\tilde{\mathbf{x}}_{g}^{k}=\tilde{\mu}^{k}(\mathbf{x})\,\mathbf{x}_{v},\qquad \mathbf{x}_{g}=\left[\left(\tilde{\mathbf{x}}_{g}^{1}\right)^{T},\left(\tilde{\mathbf{x}}_{g}^{2}\right)^{T},\ldots,\left(\tilde{\mathbf{x}}_{g}^{K}\right)^{T}\right]^{T} \tag{3b}$$
$$\tilde{\mathbf{x}}_{h}^{k}=\tilde{\mu}^{k}(\mathbf{x})\,\mathbf{x}_{e},\qquad \mathbf{x}_{h}=\left[\left(\tilde{\mathbf{x}}_{h}^{1}\right)^{T},\left(\tilde{\mathbf{x}}_{h}^{2}\right)^{T},\ldots,\left(\tilde{\mathbf{x}}_{h}^{K}\right)^{T}\right]^{T} \tag{3c}$$

$\mathcal{M}$ and $\mathcal{S}$ denote the teacher model and the student model, respectively. Our teacher model is the Third-order TSK fuzzy classifier, which performs well but usually spends much time on training; in this paper, we use the Least Learning Machine (LLM) [2] to quickly solve the consequent part of the teacher model:

$$y^{\mathcal{M}}=\mathbf{q}_{g}^{T}\mathbf{x}_{g},\qquad \mathbf{q}_{g}=\left((1/L)\mathbf{I}+\mathbf{X}_{g}^{T}\mathbf{X}_{g}\right)^{-1}\mathbf{X}_{g}^{T}\overline{\mathbf{Y}} \tag{4}$$

where $y^{\mathcal{M}}$ is the output of the teacher model, $\mathbf{q}_{g}$ is the consequent parameter vector of the teacher model, $L$ is the regularization parameter, $\mathbf{X}_{g}=\left[\mathbf{x}_{g}^{1},\mathbf{x}_{g}^{2},\ldots,\mathbf{x}_{g}^{N}\right]^{T}$, $\mathbf{I}$ is the identity matrix, $N$ is the number of samples, and $\overline{\mathbf{Y}}=\left[\overline{Y}_{1},\overline{Y}_{2},\ldots,\overline{Y}_{N}\right]^{T}$ is the ground-truth label vector.
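A minimal sketch of the closed-form LLM solution (4); it assumes the stacked matrix $\mathbf{X}_g$ has already been assembled via (3a)-(3b), and the identity matrix is sized to match $\mathbf{X}_{g}^{T}\mathbf{X}_{g}$ so that the expression is well defined.

```python
import numpy as np

def llm_solve(X_g, Y_bar, L=100.0):
    """Closed-form consequent parameters of the teacher model, eq. (4):
    q_g = ((1/L) I + X_g^T X_g)^{-1} X_g^T Y_bar.
    X_g: (N, D) matrix whose rows are the stacked vectors x_g of (3a)-(3b);
    Y_bar: (N,) ground-truth labels."""
    D = X_g.shape[1]
    A = np.eye(D) / L + X_g.T @ X_g
    return np.linalg.solve(A, X_g.T @ Y_bar)   # solving the linear system is more stable than an explicit inverse
```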

The First-order TSK fuzzy classifier is utilized as the student model in this paper, which runs fast with high interpretability but performs relatively poorly compared with the High-order TSK fuzzy classifier. In HTSK-LLM-DKD, the consequent parameters of the student model can be updated by the gradient descent approach derived by minimizing the cross-entropy error criterion:

$$\mathbf{z}^{\mathcal{S}}=\mathbf{Q}_{h}^{T}\mathbf{x}_{h},\qquad H=-\sum_{i=1}^{N}\sum_{t=1}^{C}\overline{Y}_{i,t}\log\left(z_{i,t}^{\mathcal{S}}\right) \tag{5}$$
$$\mathbf{Q}_{h}(d+1)=\mathbf{Q}_{h}(d)-\eta\frac{\partial H}{\partial\mathbf{Q}_{h}(d)} \tag{6}$$

where $\mathbf{z}^{\mathcal{S}}$ is the student logits as shown in Fig. 2, $\mathbf{Q}_{h}$ is the consequent parameter matrix of the student model, $H$ is the cross-entropy loss, and $\eta$ is the given learning rate.
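The following sketch mirrors the gradient-descent update (5)-(6) for the student's consequent matrix $\mathbf{Q}_h$; it assumes a softmax is applied to the student logits before the cross-entropy (a standard choice that the text leaves implicit), and its stopping rule follows the spirit of Algorithm 1.

```python
import numpy as np

def train_student(X_h, Y_onehot, lr=0.01, max_epoch=30, tol=1e-5):
    """Gradient-descent update (5)-(6) of the student's consequent matrix Q_h.
    X_h: (N, D) rows are the antecedent-weighted vectors x_h of (3c); Y_onehot: (N, C)."""
    N, D = X_h.shape
    C = Y_onehot.shape[1]
    Q = np.zeros((D, C))
    prev_H = np.inf
    for _ in range(max_epoch):
        Z = X_h @ Q                                      # student logits z^S = Q_h^T x_h
        E = np.exp(Z - Z.max(axis=1, keepdims=True))
        P = E / E.sum(axis=1, keepdims=True)             # softmax of the student logits
        H = -np.sum(Y_onehot * np.log(P + 1e-12))        # cross-entropy H of (5)
        if prev_H - H <= tol:                            # stop when the improvement falls below the threshold
            break
        prev_H = H
        grad = X_h.T @ (P - Y_onehot)                    # dH/dQ_h for softmax cross-entropy
        Q -= lr * grad                                   # update rule (6)
    return Q
```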

III-B Least Learning Machine based Decoupling Knowledge Distillation (LLM-DKD)

III-B1 Teacher Logits

LLM-DKD takes the Negative Euclidean distance [39] between the teacher model output and each class as the teacher logits:

$$z_{t}^{\mathcal{M}}=-\sqrt{\left(y^{\mathcal{M}}-\hat{y}_{t}\right)^{2}} \tag{7}$$

where $\hat{y}_{t}$ is the label of the $t$-th class, $\mathbf{z}^{\mathcal{M}}=\left[z^{\mathcal{M}}_{1},z^{\mathcal{M}}_{2},\ldots,z^{\mathcal{M}}_{t},\ldots,z^{\mathcal{M}}_{C}\right]\in\mathrm{R}^{1\times C}$ is the teacher logits as shown in Fig. 2, and $C$ is the number of classes.
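Since $\sqrt{(y^{\mathcal{M}}-\hat{y}_{t})^{2}}=|y^{\mathcal{M}}-\hat{y}_{t}|$, the teacher logits of (7) reduce to a negative absolute distance; a minimal sketch with hypothetical class labels:

```python
import numpy as np

def teacher_logits(y_teacher, class_labels):
    """Teacher logits of (7): the logit of each class is the negative absolute distance
    between the scalar teacher output and that class label, so the closest class gets
    the largest logit."""
    return -np.abs(y_teacher - np.asarray(class_labels, dtype=float))

# Example with three classes labelled 0, 1, 2 and a teacher output of 1.7.
print(teacher_logits(1.7, [0, 1, 2]))   # approximately [-1.7, -0.7, -0.3]
```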

III-B2 Target Class Knowledge and Non-target Class Knowledge

For a given datum from the $t$-th class, the soft labels can be denoted as $\mathbf{u}=[u_{1},u_{2},\ldots,u_{t},\ldots,u_{C}]\in\mathrm{R}^{1\times C}$, where $u_{i}$ is the soft label of the $i$-th class. Each element in $\mathbf{u}$ can be obtained by the softmax function with temperature $\tau$:

$$u_{i}=\frac{\exp\left(z_{i}/\tau\right)}{\sum_{j=1}^{C}\exp\left(z_{j}/\tau\right)} \tag{8a}$$

where $z_{i}$ represents the logit of the $i$-th class.

LLM-DKD decouples fuzzy dark knowledge into target class knowledge and non-target class knowledge by:

$$u_{t}=\frac{\exp\left(z_{t}/\tau\right)}{\sum_{j=1}^{C}\exp\left(z_{j}/\tau\right)},\qquad u_{\backslash t}=\frac{\sum_{j^{\prime}=1,j^{\prime}\neq t}^{C}\exp\left(z_{j^{\prime}}/\tau\right)}{\sum_{j=1}^{C}\exp\left(z_{j}/\tau\right)} \tag{8b}$$

where $u_{t}$ represents the soft label of the target class, containing knowledge about the “difficulty” of the data, and $u_{\backslash t}$ represents the soft label of the non-target classes, containing the knowledge that makes knowledge distillation work [32].

We use $\mathbf{\hat{u}}=\left[\hat{u}_{1},\hat{u}_{2},\ldots,\hat{u}_{t-1},\hat{u}_{t+1},\ldots,\hat{u}_{C}\right]\in\mathrm{R}^{1\times(C-1)}$ to independently model the probabilities among the non-target classes:

$$\hat{u}_{i}=\frac{\exp\left(z_{i}/\tau\right)}{\sum_{j=1,j\neq t}^{C}\exp\left(z_{j}/\tau\right)} \tag{8c}$$
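A minimal sketch of (8a)-(8c) for one sample, splitting the temperature-softened probabilities into the target soft label $u_{t}$, the aggregated non-target soft label $u_{\backslash t}$, and the renormalized non-target distribution $\hat{\mathbf{u}}$:

```python
import numpy as np

def decouple_soft_labels(z, t, tau=2.0):
    """Split the temperature-softened probabilities of (8a) into the target soft label u_t,
    the aggregated non-target soft label of (8b), and the renormalized non-target
    distribution of (8c). z: logits of one sample, t: index of the target class."""
    e = np.exp((np.asarray(z, dtype=float) - np.max(z)) / tau)
    u = e / e.sum()                     # eq. (8a)
    u_t = u[t]                          # target class probability
    u_not_t = 1.0 - u_t                 # total non-target probability, eq. (8b)
    mask = np.arange(len(u)) != t
    u_hat = u[mask] / u[mask].sum()     # eq. (8c), defined over the non-target classes only
    return u_t, u_not_t, u_hat
```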
Algorithm 1 Teacher model and Student model.

Input: Training dataset $\mathbf{X}=\{\mathbf{x}_{i},\mathbf{x}_{i}\in\mathrm{R}^{m},i=1,2,\ldots,N\}$ and the ground-truth label $\overline{\mathbf{Y}}=\{\overline{Y}_{i},\overline{Y}_{i}\in\mathrm{R},i=1,2,\ldots,N\}$, the number of fuzzy rules $K$, the regularization parameter $L$, the maximum iteration epoch $\theta$, the threshold parameter $\xi$, the learning rate $\eta$.
Output: the outputs of the teacher model and the student model.
Procedure:
Step 1: Randomly select the center $v_{i}^{k}$ of the Gaussian membership function in (2b) from the five fixed fuzzy partitions $\{0,0.25,0.5,0.75,1\}$, set the width $\delta_{i}^{k}$ to a positive value, and compute the normalized fuzzy membership degree by (2a)-(2b).
Step 2: Calculate the antecedent parameter matrices of the teacher model and the student model by (3a)-(3c).
Step 3: The consequent parameter of the teacher model $\mathbf{q}_{g}$ can be determined by $\mathbf{q}_{g}=\left((1/L)\mathbf{I}+\mathbf{X}_{g}^{T}\mathbf{X}_{g}\right)^{-1}\mathbf{X}_{g}^{T}\overline{\mathbf{Y}}$.
Step 4: The consequent parameter of the student model $\mathbf{Q}_{h}$ can be updated using the gradient descent approach:
     Step 4(a): Initialize the consequent parameter $\mathbf{Q}_{h}$ and set $d=1$.
     Repeat
     Step 4(b): Use (6) to compute $\mathbf{Q}_{h}(d+1)$.
     Step 4(c): $d=d+1$.
     Until $H(d)-H(d-1)\leq\xi$ or $d\geq\theta$
Step 5: Calculate the output of the teacher model $y^{\mathcal{M}}=\mathbf{q}_{g}^{T}\mathbf{x}_{g}$ and the output of the student model $\mathbf{z}^{\mathcal{S}}=\mathbf{Q}_{h}^{T}\mathbf{x}_{h}$.

Algorithm 2 HTSK-LLM-DKD.

Input: Training dataset $\mathbf{X}=\{\mathbf{x}_{i},\mathbf{x}_{i}\in\mathrm{R}^{m},i=1,2,\ldots,N\}$ and the ground-truth label $\overline{\mathbf{Y}}=\{\overline{Y}_{i},\overline{Y}_{i}\in\mathrm{R},i=1,2,\ldots,N\}$, the outputs of the teacher model $\mathbf{y}^{\mathcal{M}}=\{y^{\mathcal{M}}_{i},y^{\mathcal{M}}_{i}\in\mathrm{R},i=1,2,\ldots,N\}$ and the outputs of the student model $\mathbf{Z}^{\mathcal{S}}=\{\mathbf{z}^{\mathcal{S}}_{i},\mathbf{z}^{\mathcal{S}}_{i}\in\mathrm{R}^{C},i=1,2,\ldots,N\}$, the maximum iteration epoch $\theta$, the threshold parameter $\xi$, the learning rate $\eta$, the distillation parameters $\tau$, $\zeta$, $\lambda$, $\varphi$.
Output: the output of HTSK-LLM-DKD.
Procedure:
Step 1: Calculate the logits of the teacher model $\mathbf{z}^{\mathcal{M}}$ with the Negative Euclidean distance between $y^{\mathcal{M}}$ and the label of each class by (7).
Step 2: Calculate the soft labels of the teacher model $\mathbf{u}^{\mathcal{M}}$ and the student model $\mathbf{u}^{\mathcal{S}}$ with the softmax function by (8a).
Step 3: Decouple fuzzy dark knowledge into target class knowledge $\mathbf{r}$ and non-target class knowledge $\mathbf{\hat{u}}$ by (8a)-(8c).
Step 4: Use $\mathbf{r}$ and $\mathbf{\hat{u}}$ to rephrase $\operatorname{KL}(\mathbf{u}^{\mathcal{M}}\|\mathbf{u}^{\mathcal{S}})$ by (9a)-(9e).
Step 5: Calculate the new loss function of HTSK-LLM-DKD: $Loss=\zeta\operatorname{KL}\left(\mathbf{r}^{\mathcal{M}}\|\mathbf{r}^{\mathcal{S}}\right)+\lambda\operatorname{KL}\left(\widehat{\mathbf{u}}^{\mathcal{M}}\|\widehat{\mathbf{u}}^{\mathcal{S}}\right)+\varphi H$.
Step 6: Calculate the consequent parameter $\mathbf{Q}_{h}$ of HTSK-LLM-DKD using the gradient descent approach with the new loss function:
     Step 6(a): Initialize the consequent parameter $\mathbf{Q}_{h}$ and set $d=1$.
     Repeat
     Step 6(b): $\mathbf{Q}_{h}(d+1)=\mathbf{Q}_{h}(d)-\eta\frac{\partial Loss}{\partial\mathbf{Q}_{h}(d)}$.
     Step 6(c): $d=d+1$.
     Until $Loss(d)-Loss(d-1)\leq\xi$ or $d\geq\theta$
Step 7: Calculate the output of HTSK-LLM-DKD.

III-B3 Fuzzy Dark Knowledge Decoupling Process of LLM-DKD

The widely used KL divergence [14] is employed to decouple fuzzy dark knowledge, which can be expressed as:

$$KD=\operatorname{KL}\left(\mathbf{u}^{\mathcal{M}}\|\mathbf{u}^{\mathcal{S}}\right)=u_{t}^{\mathcal{M}}\log\left(\frac{u_{t}^{\mathcal{M}}}{u_{t}^{\mathcal{S}}}\right)+\sum_{i=1,i\neq t}^{C}u_{i}^{\mathcal{M}}\log\left(\frac{u_{i}^{\mathcal{M}}}{u_{i}^{\mathcal{S}}}\right) \tag{9a}$$

According to (8a)-(8c), we can obtain $\hat{u}_{i}=u_{i}/u_{\backslash t}$, and (9a) can be reformulated as:

$$KD=u_{t}^{\mathcal{M}}\log\left(\frac{u_{t}^{\mathcal{M}}}{u_{t}^{\mathcal{S}}}\right)+u_{\backslash t}^{\mathcal{M}}\sum_{i=1,i\neq t}^{C}\hat{u}_{i}^{\mathcal{M}}\left(\log\left(\frac{\hat{u}_{i}^{\mathcal{M}}}{\hat{u}_{i}^{\mathcal{S}}}\right)+\log\left(\frac{u_{\backslash t}^{\mathcal{M}}}{u_{\backslash t}^{\mathcal{S}}}\right)\right) \tag{9b}$$

Since $u_{\backslash t}^{\mathcal{M}}$ and $u_{\backslash t}^{\mathcal{S}}$ are irrelevant to the class index $i$, (9b) can be further expressed as:

$$KD=u_{t}^{\mathcal{M}}\log\left(\frac{u_{t}^{\mathcal{M}}}{u_{t}^{\mathcal{S}}}\right)+u_{\backslash t}^{\mathcal{M}}\log\left(\frac{u_{\backslash t}^{\mathcal{M}}}{u_{\backslash t}^{\mathcal{S}}}\right)+u_{\backslash t}^{\mathcal{M}}\sum_{i=1,i\neq t}^{C}\hat{u}_{i}^{\mathcal{M}}\log\left(\frac{\hat{u}_{i}^{\mathcal{M}}}{\hat{u}_{i}^{\mathcal{S}}}\right) \tag{9c}$$

We define the binary prediction $\mathbf{r}=[u_{t},u_{\backslash t}]\in\mathrm{R}^{1\times 2}$ to represent the soft labels of the target class and the non-target classes. The new expression of knowledge distillation can be described as:

$$KD=\operatorname{KL}\left(\mathbf{r}^{\mathcal{M}}\|\mathbf{r}^{\mathcal{S}}\right)+\left(1-u_{t}^{\mathcal{M}}\right)\operatorname{KL}\left(\widehat{\mathbf{u}}^{\mathcal{M}}\|\widehat{\mathbf{u}}^{\mathcal{S}}\right) \tag{9d}$$

where $\mathbf{r}^{\mathcal{M}}$ and $\mathbf{r}^{\mathcal{S}}$ represent the soft labels of the teacher model and the student model, respectively.

As shown in (9d), the loss function of knowledge distillation is reformulated as the weighted summation of two terms. $\operatorname{KL}\left(\mathbf{r}^{\mathcal{M}}\|\mathbf{r}^{\mathcal{S}}\right)$ measures the similarity between the binary probabilities of the teacher and student models on the target class prediction, and is called target class knowledge; $\operatorname{KL}\left(\widehat{\mathbf{u}}^{\mathcal{M}}\|\widehat{\mathbf{u}}^{\mathcal{S}}\right)$ measures the similarity between the teacher and student models on the internal relations among the non-target classes, and is called non-target class knowledge. The transfer efficiency of the non-target knowledge is negatively correlated with $u_{t}^{\mathcal{M}}$. We introduce $\zeta$ and $\lambda$ to reweight them, and obtain the following expression of decoupled knowledge distillation:

$$DKD=\zeta\operatorname{KL}\left(\mathbf{r}^{\mathcal{M}}\|\mathbf{r}^{\mathcal{S}}\right)+\lambda\operatorname{KL}\left(\widehat{\mathbf{u}}^{\mathcal{M}}\|\widehat{\mathbf{u}}^{\mathcal{S}}\right) \tag{9e}$$

We integrate the cross-entropy loss $H$ in (5) to obtain the total loss function of HTSK-LLM-DKD:

$$Loss=\zeta\operatorname{KL}\left(\mathbf{r}^{\mathcal{M}}\|\mathbf{r}^{\mathcal{S}}\right)+\lambda\operatorname{KL}\left(\widehat{\mathbf{u}}^{\mathcal{M}}\|\widehat{\mathbf{u}}^{\mathcal{S}}\right)+\varphi H \tag{10}$$
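Putting (9d), (9e) and (10) together, the sketch below computes the total HTSK-LLM-DKD loss for a single sample; the default values of $\tau$, $\zeta$, $\lambda$ and $\varphi$ are placeholders rather than tuned settings from the experiments.

```python
import numpy as np

def _soften(z, tau):
    """Temperature-softened probabilities, eq. (8a)."""
    e = np.exp((np.asarray(z, dtype=float) - np.max(z)) / tau)
    return e / e.sum()

def dkd_loss(z_teacher, z_student, t, y_onehot, tau=2.0, zeta=1.0, lam=2.0, phi=1.0):
    """One-sample total loss of eq. (10): zeta*TCKL + lambda*NCKL + phi*H,
    following the decomposition (9a)-(9e); t is the target class index."""
    eps = 1e-12
    uM, uS = _soften(z_teacher, tau), _soften(z_student, tau)
    # Binary predictions r = [u_t, u_\t] for teacher and student.
    rM = np.array([uM[t], 1.0 - uM[t]])
    rS = np.array([uS[t], 1.0 - uS[t]])
    tckl = np.sum(rM * np.log((rM + eps) / (rS + eps)))           # target class knowledge (TCKL)
    # Renormalized non-target distributions u_hat of (8c).
    mask = np.arange(len(uM)) != t
    uhatM = uM[mask] / uM[mask].sum()
    uhatS = uS[mask] / uS[mask].sum()
    nckl = np.sum(uhatM * np.log((uhatM + eps) / (uhatS + eps)))  # non-target class knowledge (NCKL)
    # Hard-label cross-entropy of the student prediction, as in (5).
    pS = _soften(z_student, 1.0)
    H = -np.sum(np.asarray(y_onehot) * np.log(pS + eps))
    return zeta * tckl + lam * nckl + phi * H
```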

III-C HTSK-LLM-DKD Algorithm

Here, we summarize the learning algorithm of teacher model and student model in Algorithm 1. The proposed HTSK-LLM-DKD is given in Algorithm 2.

IV Experiments

Our HTSK-LLM-DKD is an interpretable model; to keep the comparison consistent with this interpretability, nine fuzzy classifiers are selected for comparative experiments with HTSK-LLM-DKD. In Section IV-A, we describe the experimental setups. In Section IV-B, we report the experimental results and analysis on the UCI datasets. In Section IV-C, a case study of HTSK-LLM-DKD on the Cleveland heart disease dataset is given in detail. Section IV-D discusses the effectiveness of decoupled knowledge distillation. In Section IV-E, we explain the interpretability of HTSK-LLM-DKD.

IV-A Experiment Setups and Performance Indicators

IV-A1 Datasets

Since we focus on classification tasks, in this experiment we randomly select sixteen widely used classification datasets from the UCI repository [35] and a real-world dataset, Cleveland heart disease. All the adopted datasets are normalized and the categorical features are converted into numerical features. Table S2 in the supplementary file describes all the adopted UCI datasets.

IV-A2 Comparative Methods

As stated in Section III, HTSK-LLM-DKD is a novel TSK fuzzy classifier obtained by distilling fuzzy dark knowledge from a High-order TSK fuzzy classifier to a Low-order TSK fuzzy classifier. Therefore, we adopt $n$-order TSK fuzzy classifiers ($n=0,1,2,3$) with several variants as comparative methods, i.e., TSK$_{v1}^{n}$ ($n$ is the order, $n=0,1,2,3$) uses LLM to solve the consequent part; TSK$_{v2}^{n}$ ($n$ is the order, $n=0,1,2,3$) takes the gradient descent approach to update the consequent parameters; LSSVFS$^{n}$ ($n$ is the order, $n=3$) utilizes the Least Squares Support Vector Machine (LSSVM) [36] with a third-order polynomial to solve the consequent part. On the other hand, we realize CNN-based distillation models in two versions, i.e., CNN-TSK-KD and CNN-TSK-DKD, which both take a CNN as the teacher model and a First-order TSK fuzzy classifier as the student model, employing the traditional distillation method (denoted as KD) and the decoupled knowledge distillation method proposed in this paper (denoted as DKD), respectively. In addition, in order to further evaluate the effectiveness of the decoupled knowledge distillation method in HTSK-LLM-DKD, we implement HTSK-LLM-KD as a comparative method, which uses the traditional distillation method in HTSK-LLM-DKD. Table S3 in the supplementary file describes all comparative methods.

IV-A3 Parameters setting

All methods are evaluated by ten-fold cross-validation on the adopted datasets, and all adjustable parameters are optimized using a grid search strategy. The number of fuzzy rules is searched from $K=\{1,2,\ldots,20\}$; the regularization parameter $L$ is set to $100$; the distillation parameters $\tau$, $\zeta$, $\lambda$ and $\varphi$ are searched from $\{1,2,5,10,20,100\}$; the maximum iteration $\theta$ is set to 30; the threshold parameter $\xi$ is set to $10^{-5}$; the learning rate $\eta$ is set to 0.01; other parameters are set to their default values.

IV-A4 Performance Indicators

To evaluate the classification performance of all the adopted methods, four commonly used performance indices are used in the experiments, i.e., Accuracy (Acc for short), Weighted-F (W-F for short), the running time, and the average number of fuzzy rules. Accuracy is the most intuitive performance metric, which gives the ratio of properly predicted samples to total samples. Weighted-F is a weighted average of Precision and Recall, which takes into account both false positives and false negatives. The best results are marked in bold; please note that '-' means that the adopted method cannot work out its results within 3 hours.

IV-B Experimental Results and Analysis on UCI Datasets

Table I shows the experimental results of the proposed HTSK-LLM-DKD and HTSK-LLM-KD compared with the nine adopted fuzzy classifiers. We draw the following conclusions:

  1. Our HTSK-LLM-DKD and HTSK-LLM-KD achieve the first and the second best performance on eleven out of sixteen datasets in terms of Accuracy and Weighted-F, respectively, which shows that knowledge distillation can improve the performance of the TSK fuzzy classifier very well and that HTSK-LLM-DKD is much better than HTSK-LLM-KD. Additionally, HTSK-LLM-DKD keeps comparable performance on the TIT, WIL, PHO, MAG and ADU datasets.

  2. Evidently, HTSK-LLM-DKD performs better with fewer fuzzy rules than TSK$_{v2}^{1}$, which is the student model of HTSK-LLM-DKD. At the same time, HTSK-LLM-DKD performs better with shorter running time than the High-order TSK fuzzy classifiers on most datasets. This shows that the fuzzy dark knowledge greatly improves the generalization ability of HTSK-LLM-DKD, and that the proposed LLM-DKD can speed up the running time of HTSK-LLM-DKD.

Table II displays the experimental results of HTSK-LLM-DKD, HTSK-LLM-KD, CNN-TSK-DKD and CNN-TSK-KD compared with their corresponding student models in terms of Accuracy and Weighted-F, respectively. We draw the following conclusions:

  1. Our HTSK-LLM-DKD achieves the best performance improvement, with 1.41% in Accuracy and 1.56% in Weighted-F, which indicates that knowledge distillation can effectively improve the performance of the student model by transferring fuzzy dark knowledge from the teacher model.

  2. The comparative experiments in Table II can be divided into two parts. The first part is the comparison between decoupled knowledge distillation and traditional knowledge distillation. It can be seen that HTSK-LLM-DKD and CNN-TSK-DKD perform better than HTSK-LLM-KD and CNN-TSK-KD. We believe the reason lies in the following: in traditional knowledge distillation, the transfer efficiency of non-target knowledge is negatively correlated with the confidence of the teacher model on the target class, and decoupled knowledge distillation can better transfer non-target knowledge by re-weighting, thus improving the generalization ability of the student model.

  3. The second part is the comparison between the CNN and the TSK fuzzy classifier as teacher models. We can observe that the average Accuracy and Weighted-F of HTSK-LLM-KD and HTSK-LLM-DKD are usually higher than those of CNN-TSK-KD and CNN-TSK-DKD, which indicates that the fuzzy dark knowledge transferred from the High-order TSK fuzzy classifier is more conducive to improving the performance of the Low-order TSK fuzzy classifier than that from the CNN.

TABLE I: Rule Number, Average Time, Average Accuracy (%), Average Weighted-F (%) and Standard Deviation (%) on UCI Datasets
Datasets TSK$_{v1}^{0}$ TSK$_{v1}^{1}$ TSK$_{v1}^{2}$ TSK$_{v1}^{3}$ TSK$_{v2}^{0}$ TSK$_{v2}^{1}$ TSK$_{v2}^{2}$ TSK$_{v2}^{3}$ LSSVFS$^{3}$ HTSK-LLM-KD HTSK-LLM-DKD
Acc±Std Acc±Std Acc±Std Acc±Std Acc±Std Acc±Std Acc±Std Acc±Std Acc±Std Acc±Std Acc±Std \bigstrut[t]
W-F±Std W-F±Std W-F±Std W-F±Std W-F±Std W-F±Std W-F±Std W-F±Std W-F±Std W-F±Std W-F±Std
Rules Time Rules Time Rules Time Rules Time Rules Time Rules Time Rules Time Rules Time Rules Time Rules Time Rules Time
IRI 82.59±4.05 85.33±4.23 83.33±4.67 80.06±4.89 90.44±4.72 97.08±3.46 96.80±3.49 96.33±4.68 85.40±4.25 98.00±3.22 98.66±2.81 \bigstrut[t]
83.28±4.04 85.48±4.93 83.87±4.95 81.12±4.74 90.31±4.94 96.96±3.70 96.39±4.07 95.84±5.42 86.15±4.52 98.06±3.16 98.56±3.01
18.1 0.0050 10.3 0.0063 2.6 0.0087 1.1 0.0133 17.4 0.1806 8.5 0.1421 6.2 0.1583 2.8 0.1825 2.4 0.5021 7.9 0.1462 6.3 0.1511 \bigstrut[b]
WIN 80.48±4.98 88.02±4.10 83.43±5.47 75.71±4.94 93.32±3.69 98.39±2.66 97.20±3.07 96.27±3.70 89.50±3.53 98.88±2.34 99.44±1.75 \bigstrut[t]
80.52±4.91 88.04±4.06 83.40±5.08 75.45±4.91 93.36±3.69 98.08±3.29 97.01±3.39 96.30±3.67 89.49±3.61 98.48±3.22 99.15±2.67
18.2 0.0049 4.4 0.0130 8.1 0.1034 7.8 0.5818 18.3 0.1656 9.4 0.1622 8.2 0.3519 4.6 1.6877 5 0.7188 8.5 0.1698 8.1 0.1532 \bigstrut[b]
TIT 79.05±3.85 79.08±3.82 78.87±4.02 79.12±3.75 77.82±3.62 77.90±3.75 78.82±3.69 79.00±3.85 78.46±3.99 78.41±4.09 78.60±3.87 \bigstrut[t]
76.00±4.85 76.02±4.72 75.71±4.89 76.21±4.67 75.20±4.44 75.42±4.25 75.42±4.02 75.35±4.42 75.39±4.38 75.34±4.91 75.41±4.50
16.3 0.0103 6.2 0.0118 4.2 0.0135 3.2 0.0424 14.7 0.8405 7.5 0.8386 6.8 0.9367 5.9 1.1095 2.4 93.4825 5.7 1.1072 4.8 0.9931 \bigstrut[b]
SEE 92.04±5.30 92.90±5.10 91.90±5.04 82.85±5.75 92.32±4.56 92.50±4.64 93.76±4.95 91.66±5.76 94.00±4.12 94.85±4.61 96.19±3.85 \bigstrut[t]
92.05±5.41 92.91±5.12 91.92±5.09 82.96±5.69 92.30±4.66 92.19±4.98 93.66±4.19 91.47±5.23 94.03±4.09 94.32±4.37 96.06±3.97
17.5 0.0039 6.9 0.0112 1 0.0100 1 0.0594 19.1 0.1724 8.4 0.1733 6.5 0.2075 3.8 0.3060 2.2 1.1360 8.9 0.2588 6.2 0.2152 \bigstrut[b]
ION 82.59±5.99 86.43±5.06 80.57±5.31 79.34±5.34 81.91±5.33 92.40±4.77 90.53±4.68 90.10±5.36 80.88±5.11 92.53±5.12 92.87±4.31 \bigstrut[t]
81.75±5.70 85.76±5.49 79.49±5.25 78.13±5.29 81.12±5.42 91.44±5.26 89.32±5.17 88.97±5.15 79.48±5.17 91.58±4.74 92.11±4.40
17.6 0.0152 2.5 0.0339 7.1 0.7432 6.5 20.3081 17.3 0.3414 13.4 0.3944 6.9 2.8994 4.6 44.3618 5.2 2.7954 8.8 0.3069 7.3 0.3158 \bigstrut[b]
WIL 95.05±1.09 97.89±0.81 97.54±0.94 97.99±0.79 94.60±1.03 95.36±0.95 96.94±1.52 97.18±1.11 97.46±0.70 95.44±1.11 95.57±0.98 \bigstrut[t]
94.10±1.77 97.74±0.93 97.44±1.06 97.83±0.88 93.98±1.51 94.45±1.56 96.36±2.02 96.72±2.15 97.18±0.79 94.04±1.73 94.16±1.40
19.1 0.0231 18.4 0.0758 8.8 0.1498 4.1 0.5598 17.3 1.5435 10.1 2.2452 8.7 2.8232 7.7 5.6222 1.6 438.6728 9.4 3.0981 9.3 3.6865 \bigstrut[b]
WIS 96.87±1.64 96.89±1.96 96.77±2.05 89.60±2.21 96.44±1.64 96.99±1.51 96.55±1.88 96.47±2.04 95.81±1.65 97.07±1.38 97.80±1.24 \bigstrut[t]
96.87±1.65 96.89±1.97 96.78±2.06 89.28±2.48 96.43±1.65 96.64±1.61 96.18±2.07 96.08±2.25 95.77±1.67 97.01±1.45 97.52±1.40
19.3 0.0235 7.3 0.0475 1 0.0539 1 0.4966 18.7 0.3595 10.7 0.3736 7.8 0.6556 2.9 1.4914 9.1 10.8646 7.6 0.3545 6.1 0.3974 \bigstrut[b]
QSA 77.80±4.88 86.21±3.19 79.30±4.30 78.57±5.57 77.88±4.42 88.02±2.62 86.72±2.59 87.00±3.19 79.15±4.72 88.03±2.06 88.05±3.33 \bigstrut[t]
77.00±5.29 86.15±3.23 79.19±4.06 78.56±5.43 77.19±4.81 86.51±2.66 85.05±2.69 85.36±3.33 77.39±5.23 86.51±2.08 86.53±3.41
19.3 0.0179 7.4 0.3647 6.7 6.7294 1 20.6744 19.2 0.8045 14.6 1.3554 8.8 14.9634 8.3 533.1391 6.5 22.6447 10.6 0.9893 8.7 0.7985 \bigstrut[b]
PHO 77.11±1.38 84.36±1.56 85.58±1.71 87.26±1.75 74.68±1.68 80.79±1.33 83.38±1.64 83.14±1.42 84.62±1.91 80.45±1.12 81.51±1.20 \bigstrut[t]
76.34±1.48 84.32±1.55 85.51±1.71 87.17±1.78 72.46±2.30 80.39±1.23 83.00±2.26 82.94±2.15 84.11±2.49 80.27±1.86 81.23±1.71
18.5 0.0267 17.9 0.0887 9.6 0.1892 9.2 3.3785 16.1 2.2788 15.1 2.9822 9.3 3.2500 6.5 5.4895 1.3 443.6744 14.3 2.4684 14.5 2.6239 \bigstrut[b]
SON 68.39±4.15 78.63±3.44 79.56±3.75 79.36±3.94 69.40±3.82 85.66±2.92 84.55±1.88 87.08±1.97 81.79±3.18 87.15±2.41 88.02±2.13 \bigstrut[t]
68.42±4.18 78.59±3.42 79.32±3.98 78.56±3.77 69.58±3.93 84.74±3.24 83.30±2.55 86.09±1.53 81.65±2.24 86.11±2.36 86.88±2.35
14.4 0.0114 13.8 0.1089 6.9 0.9521 5 42.6555 16.6 0.2687 10.2 0.2974 6.8 4.8772 5.5 208.4717 2.8 0.9349 9.8 0.3065 9.3 0.2782 \bigstrut[b]
SEG 92.49±3.90 99.54±0.50 99.43±0.50 98.53±0.81 86.51±2.67 99.61±0.43 99.52±0.43 99.56±0.53 99.35±0.42 99.64±0.50 99.69±0.45 \bigstrut[t]
91.20±3.41 99.52±0.46 99.40±0.58 98.66±0.74 84.31±2.61 99.13±0.97 99.01±0.88 99.11±1.10 98.65±0.88 99.19±1.03 99.40±0.86
18.6 0.0198 9.2 0.0610 1 0.1318 1 9.7983 16.5 1.1173 16.2 1.8406 8.1 5.3774 7.8 78.6618 7.9 107.0428 13.6 1.5924 8.1 1.1664 \bigstrut[b]
VOT 90.89±4.88 94.02±3.64 93.53±4.23 93.22±3.82 90.42±4.00 94.81±3.61 93.53±3.41 93.82±3.58 94.38±3.18 94.93±3.04 96.07±3.09 \bigstrut[t]
90.93±4.83 94.06±3.61 93.55±4.20 93.18±3.85 90.44±4.00 94.83±3.86 93.81±3.85 93.01±4.08 93.91±3.16 94.90±3.31 95.72±3.44
18.1 0.0142 6.5 0.0558 1 0.1160 3.1 1.2698 18.2 0.2776 9.6 0.2836 8.4 0.8319 4.9 7.0024 7.1 4.1051 4.1 0.3036 5.4 0.3140 \bigstrut[b]
CAR 79.35±3.12 88.67±2.64 88.13±2.72 88.80±2.51 82.61±2.35 90.84±2.20 90.75±2.22 90.89±2.34 89.79±1.41 90.92±2.47 90.96±1.83 \bigstrut[t]
78.79±3.16 88.86±2.54 88.28±2.58 88.99±2.37 81.08±3.61 90.33±2.55 90.79±2.04 90.84±2.41 89.23±1.52 90.65±2.51 90.66±1.93
17.7 0.0207 16.6 0.1723 2.1 0.9202 1 12.3888 17.9 1.1823 16.5 1.9752 8.1 6.4973 7.9 113.7304 8.2 87.4473 12.4 1.7951 9.3 1.3367 \bigstrut[b]
MAG 79.48±1.23 86.39±0.72 86.23±1.31 86.94±1.63 76.65±1.98 85.39±0.99 84.97±1.32 82.86±0.52 86.61±0.98 85.41±0.76 85.59±0.91 \bigstrut[t]
78.75±1.75 85.32±0.53 84.82±1.64 85.88±1.36 75.87±2.63 84.57±1.02 84.06±1.68 81.74±0.86 85.52±1.05 84.22±1.58 84.58±0.96
18.3 0.1075 15.6 0.4689 6.9 34.7426 2.7 184.4161 18.6 8.9985 17.3 13.8653 5.7 13.5862 4.3 60.2334 1 6870.9871 16.9 12.2797 16.3 12.8768 \bigstrut[b]
BRA 94.92±0.74 95.27±0.41 95.38±0.37 95.43±0.47 95.17±0.48 95.54±0.58 95.51±0.53 95.55±0.67 95.51±0.18 95.58±0.44 95.61±0.32 \bigstrut[t]
94.85±0.68 95.23±0.48 95.28±0.38 95.33±0.45 95.04±0.54 95.48±0.54 95.42±0.57 95.50±0.62 95.46±0.17 95.51±0.46 95.52±0.36
18.2 0.0653 13.3 0.1503 5.6 0.6842 2.9 0.7094 17.4 8.1934 10.7 8.5122 4.3 8.8453 3.7 9.7481 6.8 8.8453 11.7 8.5681 11.1 8.9657 \bigstrut[b]
ADU 80.32±1.57 84.55±0.78 84.95±0.68 85.28±0.84 78.99±1.54 84.36±0.84 - - - 84.44±0.83 84.46±0.87 \bigstrut[t]
77.25±2.58 83.58±0.87 84.16±0.82 84.48±0.95 74.24±2.82 83.92±0.74 83.99±0.87 84.04±1.83
19.6 0.2658 17.1 1.7375 7.2 7.8395 1.3 55.6825 19.3 22.8843 17.5 36.8652 16.4 36.5892 16.3 37.9834 \bigstrut[b]
TABLE II: Comparison of Accuracy (%) and Weighted-F (%) Improvements of HTSK-LLM-DKD and CNN-TSK-DKD Models on UCI Datasets
Datasets CNN-TSK-KD CNN-TSK-DKD HTSK-LLM-KD HTSK-LLM-DKD    \bigstrut[t]
Distillation Student Promotion Distillation Student Promotion Distillation Student Promotion Distillation Student Promotion \bigstrut[t]
Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F Acc/W-F \bigstrut[t]
IRI 97.66/97.24 95.99/95.14 1.67/2.10 98.33/98.06 95.66/94.80 2.67/3.26 98.00/98.06 95.96/95.71 2.04/2.35 98.66/98.56 95.66/94.98 3.00/3.58 \bigstrut[t]
WIN 99.18/98.85 97.62/97.21 1.56/1.64 99.24/98.93 97.54/97.15 1.70/1.78 98.88/98.48 97.73/97.31 1.15/1.17 99.44/99.15 97.63/97.33 1.81/1.82 \bigstrut[t]
TIT 78.23/75.20 77.41/74.12 0.82/1.08 78.28/75.10 77.28/73.51 1.00/1.59 78.41/75.34 77.50/73.84 0.91/1.50 78.60/75.41 77.60/73.82 1.00/1.59 \bigstrut[t]
SEE 93.90/93.79 92.00/91.42 1.90/2.37 95.76/95.61 91.00/90.50 4.76/5.11 94.85/94.32 91.99/91.10 2.86/3.22 96.19/96.06 91.42/91.02 4.77/5.04 \bigstrut[t]
ION 92.62/91.89 91.89/91.05 0.73/0.84 92.78/91.95 91.35/90.37 1.43/1.58 92.53/91.58 91.97/90.95 0.56/0.63 92.87/92.11 91.73/90.78 1.14/1.33 \bigstrut[t]
WIL 95.52/93.89 93.79/91.83 1.73/2.06 95.78/94.36 93.82/91.99 1.96/2.37 95.44/93.46 93.86/91.75 1.58/1.71 95.57/94.16 93.80/91.92 1.77/2.24 \bigstrut[t]
WIS 96.95/96.74 96.53/96.26 0.42/0.48 97.35/97.06 96.48/96.11 0.87/0.95 97.07/97.01 96.63/96.50 0.44/0.51 97.80/97.52 96.92/96.53 0.88/0.99 \bigstrut[t]
QSA 87.79/86.27 87.03/85.30 0.76/0.97 87.94/86.41 87.09/85.32 0.85/1.09 88.03/86.51 87.17/85.51 0.86/1.00 88.05/86.53 87.10/85.41 0.95/1.12 \bigstrut[t]
PHO 80.32/80.15 79.98/79.79 0.34/0.36 81.16/80.96 79.59/79.24 1.57/1.72 80.45/80.27 79.89/79.66 0.56/0.61 81.51/81.23 79.73/79.36 1.78/1.87 \bigstrut[t]
SON 87.23/86.33 85.18/84.24 2.05/2.09 88.08/86.97 85.13/83.99 2.95/2.98 87.15/86.11 85.19/84.23 1.96/1.88 88.02/86.88 85.12/83.96 2.90/2.92 \bigstrut[t]
SEG 99.63/99.17 99.56/99.05 0.07/0.12 99.67/99.37 99.37/98.96 0.30/0.41 99.64/99.19 99.55/99.04 0.09/0.15 99.69/99.40 99.30/98.92 0.39/0.48 \bigstrut[t]
VOT 94.85/94.65 94.45/94.44 0.40/0.21 95.89/95.54 94.59/94.52 1.30/1.02 94.93/94.90 94.41/94.67 0.52/0.23 96.07/95.72 94.54/94.52 1.53/1.20 \bigstrut[t]
CAR 90.88/90.56 90.66/90.26 0.22/0.30 90.93/90.66 90.65/90.26 0.28/0.40 90.92/90.65 90.68/90.31 0.24/0.34 90.96/90.66 90.68/90.30 0.28/0.36 \bigstrut[t]
MAG 84.75/82.22 84.68/82.11 0.07/0.11 85.50/84.46 85.31/84.25 0.19/0.21 85.41/84.22 85.36/84.15 0.05/0.07 85.59/84.58 85.39/84.37 0.20/0.21 \bigstrut[t]
BRA 95.56/95.49 95.49/95.42 0.07/0.07 95.59/95.52 95.51/95.44 0.08/0.08 95.58/95.51 95.51/95.43 0.07/0.08 95.61/95.52 95.53/95.44 0.08/0.08 \bigstrut[t]
ADU 84.39/83.90 84.30/83.81 0.09/0.09 84.43/83.96 84.31/83.83 0.12/0.13 84.44/83.99 84.33/83.87 0.11/0.12 84.46/84.04 84.31/83.87 0.15/0.17 \bigstrut[t]
Average 91.21/90.39 90.41/89.46 0.80/0.93 91.66/90.93 90.29/89.39 1.37/1.54 91.35/90.60 90.48/89.62 0.87/0.97 91.81/91.09 90.40/89.53 1.41/1.56 \bigstrut[t]
TABLE III: Average Accuracy (%), Weighted-F (%) with Corresponding Standard Deviation (%), Rule Number and Running Time on CLE
Methods Acc±Std W-F±Std Rules Time
TSK$_{v1}^{0}$ 70.28±2.65 69.87±3.25 16.6 0.0079
TSK$_{v1}^{1}$ 73.28±3.72 73.54±3.38 9.3 0.0472
TSK$_{v1}^{2}$ 70.98±3.63 70.96±3.83 1 0.0168
TSK$_{v1}^{3}$ 70.25±4.68 70.24±4.34 1.2 0.2073
TSK$_{v2}^{0}$ 73.68±3.77 72.38±3.24 17.5 0.2475
TSK$_{v2}^{1}$ 75.85±3.67 75.68±3.35 10.6 0.3275
TSK$_{v2}^{2}$ 71.35±2.64 71.12±3.36 2.6 1.7883
TSK$_{v2}^{3}$ 71.75±2.35 71.83±2.57 1 1.0838
LSSVFS$^{3}$ 69.68±4.35 69.38±4.79 6.6 2.1436
HTSK-LLM-KD 78.63±3.83 78.52±3.96 8.6 0.2782
HTSK-LLM-DKD 79.10±3.52 79.08±3.35 8.1 0.3168
TABLE IV: Comparison of The Accuracy (%) and Weighted-F (%) Improvement of HTSK-LLM-DKD and CNN-TSK-DKD Models on CLE
Methods Accuracy/Weighted-F
 Distillation Student Promotion
CNN-TSK-KD 78.56/78.50 75.19/75.15 3.37/3.35
CNN-TSK-DKD 78.80/78.69 75.35/75.25 3.45/3.44
HTSK-LLM-KD 78.63/78.52 75.21/75.12 3.42/3.40
HTSK-LLM-DKD 79.10/79.08 75.60/75.62 3.50/3.46

From the experimental results above, we can conclude that knowledge distillation can improve the generalization ability of the model. Next, we explain how knowledge distillation affects the performance of the student model on the real-world Cleveland heart disease dataset.

IV-C Experimental Results and Analysis on Cleveland Heart Disease Dataset

We use a real-world dataset, Cleveland heart disease (CLE), to demonstrate the performance of our model in detail. CLE concerns heart disease in the city of Cleveland, USA, and was built by the University Hospital Zurich and the Cleveland Clinic Foundation. CLE is composed of 303 instances with 13 features, including chest pain type, resting blood pressure, resting electrocardiographic results, etc., and the output indicates the patient's risk of cardiac disease, an integer ranging from 0 (not present) to 4: 0 for nil risk; 1 for low risk; 2 for potential risk; 3 for high risk; 4 for very high risk [42]. In order to clearly demonstrate the effect of knowledge distillation on the proposed model, we divide CLE into three classes: 0 for zero-risk (nil risk), 1 for low-risk (low risk, potential risk and high risk) and 2 for high-risk (very high risk). The experiments in Table III show that HTSK-LLM-DKD and HTSK-LLM-KD defeat the other comparative methods and obtain the first and second best performance on CLE, respectively. In addition, HTSK-LLM-DKD and HTSK-LLM-KD require fewer fuzzy rules (8.1 & 8.6) and less running time (0.3168 & 0.2782), which shows once again that the fuzzy dark knowledge greatly improves the generalization ability of the model. In Table IV, HTSK-LLM-DKD achieves the best performance improvement, with 3.50% in Accuracy and 3.46% in Weighted-F, better than HTSK-LLM-KD with 3.42% in Accuracy and 3.40% in Weighted-F, illustrating once again the performance improvement brought by decoupled knowledge distillation, in accordance with the conclusions obtained in Section IV-B.

Figure 3: The effects of distillation parameters on Cleveland heart disease dataset.

IV-D Effectiveness of Decoupled Knowledge Distillation

Fig. 3 shows the effects of the distillation parameters of the proposed HTSK-LLM-DKD on the CLE dataset. All distillation parameters are selected from $\{1,2,5,10,20,100\}$, and it can be clearly seen that, for most parameters, the classification performance first improves and then degrades. The distillation temperature $\tau$ has an important effect on the softness of the class labels. Fig. 3(a) shows that a distillation temperature $\tau$ in the range $\{2,20\}$ may be a better choice. It means that a lower temperature cannot effectively distill the similar information between classes, while a higher temperature destroys the predictions of the various classes, so a temperature in the middle is the most appropriate. $\zeta$ and $\lambda$ give the proportions of target class knowledge and non-target class knowledge, respectively, as shown in Figs. 3(b) & (c). We use $\lambda/\zeta$ to show their ratio. Fig. 3(d) reveals that $\lambda/\zeta$ within $\{2,5\}$ may be a better option, which means that non-target class knowledge can better improve the efficiency of knowledge distillation, thus improving the performance of the model, but too much non-target class knowledge will also reduce the classification performance of the model, because the target class knowledge transfers the fuzzy dark knowledge about the sample difficulty. $\varphi$ in Fig. 3(e) gives the proportion of the knowledge contained in the ground-truth label; we use $(\lambda+\zeta)/\varphi$ to show the ratio of fuzzy dark knowledge to the knowledge contained in the ground-truth label. Fig. 3(f) displays that $(\lambda+\zeta)/\varphi$ within $\{0.3,3\}$ may be a better choice, which means that an appropriate amount of fuzzy dark knowledge can improve the classification performance. We can conclude that when too little fuzzy dark knowledge is transferred, the student model cannot be effectively guided by the teacher model, and when too much fuzzy dark knowledge is transferred, the student model will be misled by the mistakes made by the teacher model.

TABLE V: Fuzzy Rule of HTSK-LLM-DKD on CLE Dataset
Fuzzy Rule of HTSK-LLM-DKD
Rule 1:
IF: the 1st feature is very high, and
the 2nd feature is medium, and
……, and
the 13th feature is very high.
THEN: the 1st output is 0.3429+0.3453$x_{1}$+…+0.1834$x_{13}$ = 0.1201,
the 2nd output is 0.6039+0.2608$x_{1}$+…+0.5618$x_{13}$ = 0.0863,
the 3rd output is -0.3570-0.0543$x_{1}$+…-0.4343$x_{13}$ = -0.1403.
Rule 2:
IF: the 1st feature is very high, and
the 2nd feature is very low, and
……, and
the 13th feature is low.
THEN: the 1st output is 0.4320+0.1222$x_{1}$+…-0.1354$x_{13}$ = 1.6256,
the 2nd output is 0.4737-0.1269$x_{1}$+…+0.1912$x_{13}$ = -5.2402,
the 3rd output is -0.5363+0.4600$x_{1}$+…+0.0088$x_{13}$ = 2.3668.
Rule 3:
IF: the 1st feature is very low, and
the 2nd feature is very low, and
……, and
the 13th feature is medium.
THEN: the 1st output is 0.8216+0.1555$x_{1}$+…+0.6766$x_{13}$ = -0.7301,
the 2nd output is 0.1020+0.1342$x_{1}$+…+0.1751$x_{13}$ = -0.1926,
the 3rd output is -0.5423-0.1844$x_{1}$+…-1.1100$x_{13}$ = 1.4580.

IV-E Interpretability of HTSK-LLM-DKD on Cleveland Heart Disease Dataset

The benefit of fuzzy classifiers is that they may be articulated in terms of linguistic explanations. Here, we choose the trial with the best Accuracy on CLE among all experimental trials and list in Table V the corresponding antecedent parts and consequent parts of all fuzzy rules. Due to space limitations, we only present three features of the CLE dataset, i.e., the first, the second, and the last (13th) features of all three fuzzy rules. For a randomly given datum $\mathbf{x}=(-0.2730,-0.6799,\ldots,-16.4620)^{T}$, we can observe from Table V that each fuzzy set $A_{i}^{k}$ for rule $k$ and feature $i$ in the IF-part of HTSK-LLM-DKD can be interpreted with a possible linguistic phrase. In the THEN-part of HTSK-LLM-DKD, the absolute values of the outputs in Rules 2 & 3 are much larger than those of Rule 1, which means that Rules 2 & 3 play a more important role in HTSK-LLM-DKD. In Rules 2 & 3, the 3rd output value is obviously the largest; after comprehensive calculation, we can conclude that the final prediction of HTSK-LLM-DKD is the 3rd class, which means high-risk.

V Conclusion

In this study, we mainly focus on how to integrate knowledge distillation and TSK fuzzy classifiers at a deeper level. Therefore, we propose a novel TSK fuzzy classifier based on decoupling knowledge distillation, denoted as HTSK-LLM-DKD, which transfers fuzzy dark knowledge from a High-order TSK fuzzy classifier to a Low-order TSK fuzzy classifier. HTSK-LLM-DKD uses LLM-DKD to decouple the fuzzy dark knowledge from the High-order TSK fuzzy classifier into target class knowledge and non-target class knowledge, and then transfers them to the Low-order TSK fuzzy classifier more efficiently, obtaining better classification performance with high interpretability. Experimental results on benchmarking datasets and the real-world Cleveland heart disease dataset demonstrate its effectiveness.

HTSK-LLM-DKD still has aspects that deserve further study. First, we will conduct in-depth research on practical applications, including epilepsy detection and movement prediction. Second, other state-of-the-art knowledge distillation methods can be employed to distill fuzzy dark knowledge into TSK fuzzy classifiers, which will also be a focus of our future research.
